[2025-06-19 13:13:31,582] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2025-06-19 13:13:35,082] [INFO] [comm.py:652:init_distributed] cdb=None [2025-06-19 13:13:35,082] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 06/19/2025 13:13:35 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 06/19/2025 13:13:35 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=1000, eval_strategy=steps, 
eval_use_gather_object=False, evaluation_strategy=steps, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=None, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=weights/st1/mos0_st1/runs/Jun19_13-13-35_amax, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=weights/st1/mos0_st1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=weights/st1/mos0_st1, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=70000000, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, 
split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 06/19/2025 13:13:35 - INFO - __main__ - Loading Tokenizer: /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:13:35,223 >> loading file ./tokenizer.model [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:13:35,223 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:13:35,223 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:13:35,223 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:13:35,223 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:13:35,223 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2304] 2025-06-19 13:13:35,595 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 06/19/2025 13:13:35 - INFO - __main__ - Loading InternVLChatModel... 
[INFO|configuration_utils.py:694] 2025-06-19 13:13:35,898 >> loading configuration file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/config.json [INFO|configuration_utils.py:768] 2025-06-19 13:13:35,900 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "internlm/internlm2_5-7b-chat", "add_cross_attention": false, "architectures": [ "InternLM2ForCausalLM" ], "attn_implementation": "flash_attention_2", "auto_map": { "AutoConfig": "configuration_internlm2.InternLM2Config", "AutoModel": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForSequenceClassification": "modeling_internlm2.InternLM2ForSequenceClassification" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false, "bos_token_id": 1, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 14336, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, 
"num_hidden_layers": 32, "num_key_value_heads": 8, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.48.3", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internvl2_5", "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": false, "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, 
"is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.48.3", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } 06/19/2025 13:13:35 - INFO - __main__ - Using flash_attention_2 for InternLM [INFO|modeling_utils.py:3901] 2025-06-19 13:13:35,901 >> loading weights file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/model.safetensors.index.json [INFO|modeling_utils.py:1582] 2025-06-19 13:13:35,901 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1140] 2025-06-19 13:13:35,903 >> Generate config GenerationConfig {} this model [WARNING|logging.py:328] 2025-06-19 13:13:35,975 >> InternLM2ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. 
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. [INFO|configuration_utils.py:1140] 2025-06-19 13:13:35,976 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 } Setting backbone: fragments_backbone Loading checkpoint shards: 0%| | 0/4 [00:00> All model checkpoint weights were used when initializing InternVLChatModel. [WARNING|modeling_utils.py:4890] 2025-06-19 13:13:38,396 >> Some weights of InternVLChatModel were not initialized from the model checkpoint at /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B and are newly initialized: ['evaluator.fragments_backbone.layers.0.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.norm2.bias', 
'evaluator.fragments_backbone.layers.0.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.0.downsample.norm.bias', 'evaluator.fragments_backbone.layers.0.downsample.norm.weight', 'evaluator.fragments_backbone.layers.0.downsample.reduction.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc2.bias', 
'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.1.downsample.norm.bias', 'evaluator.fragments_backbone.layers.1.downsample.norm.weight', 'evaluator.fragments_backbone.layers.1.downsample.reduction.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.relative_position_bias_table', 
'evaluator.fragments_backbone.layers.2.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.qkv.weight', 
'evaluator.fragments_backbone.layers.2.blocks.2.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.qkv.bias', 
'evaluator.fragments_backbone.layers.2.blocks.4.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.norm2.weight', 'evaluator.fragments_backbone.layers.2.downsample.norm.bias', 'evaluator.fragments_backbone.layers.2.downsample.norm.weight', 'evaluator.fragments_backbone.layers.2.downsample.reduction.weight', 
'evaluator.fragments_backbone.layers.3.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.norm2.weight', 'evaluator.fragments_backbone.norm.bias', 'evaluator.fragments_backbone.norm.weight', 'evaluator.fragments_backbone.patch_embed.norm.bias', 
'evaluator.fragments_backbone.patch_embed.norm.weight', 'evaluator.fragments_backbone.patch_embed.proj.bias', 'evaluator.fragments_backbone.patch_embed.proj.weight', 'fast_mlp.0.bias', 'fast_mlp.0.weight', 'fast_mlp.1.bias', 'fast_mlp.1.weight', 'fast_mlp.3.bias', 'fast_mlp.3.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [INFO|configuration_utils.py:1093] 2025-06-19 13:13:38,405 >> loading configuration file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/generation_config.json [INFO|configuration_utils.py:1140] 2025-06-19 13:13:38,405 >> Generate config GenerationConfig { "eos_token_id": [ 92542, 92543 ] } 06/19/2025 13:13:38 - INFO - __main__ - Finished 06/19/2025 13:13:38 - INFO - __main__ - model.config.force_image_size: 448 06/19/2025 13:13:38 - INFO - __main__ - data_args.force_image_size: 448 06/19/2025 13:13:38 - INFO - __main__ - model.config.vision_config.image_size: 448 06/19/2025 13:13:38 - INFO - __main__ - [Dataset] num_image_token: 256 06/19/2025 13:13:38 - INFO - __main__ - [Dataset] dynamic_image_size: True 06/19/2025 13:13:38 - INFO - __main__ - [Dataset] use_thumbnail: True 06/19/2025 13:13:38 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 06/19/2025 13:13:38 - INFO - __main__ - Formatting inputs...Skip in lazy mode 06/19/2025 13:13:38 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 4000 06/19/2025 13:13:38 - INFO - __main__ - [Dataset] num_image_token: 256 06/19/2025 13:13:38 - INFO - __main__ - [Dataset] dynamic_image_size: True 06/19/2025 13:13:38 - INFO - __main__ - [Dataset] use_thumbnail: True 06/19/2025 13:13:38 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 06/19/2025 13:13:38 - INFO - __main__ - Formatting inputs...Skip in lazy mode 06/19/2025 13:13:39 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 500 eval_dataset 06/19/2025 13:13:39 
- INFO - __main__ - mlp1.0.weight 06/19/2025 13:13:39 - INFO - __main__ - mlp1.0.bias 06/19/2025 13:13:39 - INFO - __main__ - mlp1.1.weight 06/19/2025 13:13:39 - INFO - __main__ - mlp1.1.bias 06/19/2025 13:13:39 - INFO - __main__ - mlp1.3.weight 06/19/2025 13:13:39 - INFO - __main__ - mlp1.3.bias 06/19/2025 13:13:39 - INFO - __main__ - fast_mlp.0.weight 06/19/2025 13:13:39 - INFO - __main__ - fast_mlp.0.bias 06/19/2025 13:13:39 - INFO - __main__ - fast_mlp.1.weight 06/19/2025 13:13:39 - INFO - __main__ - fast_mlp.1.bias 06/19/2025 13:13:39 - INFO - __main__ - fast_mlp.3.weight 06/19/2025 13:13:39 - INFO - __main__ - fast_mlp.3.bias training_args TrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=1000, eval_strategy=steps, eval_use_gather_object=False, evaluation_strategy=steps, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, 
gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=None, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=weights/st1/mos0_st1/runs/Jun19_13-13-35_amax, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=weights/st1/mos0_st1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=weights/st1/mos0_st1, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=70000000, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.03, 
warmup_steps=0, weight_decay=0.01, )
[INFO|trainer.py:741] 2025-06-19 13:13:39,113 >> Using auto half precision backend
[WARNING|trainer.py:803] 2025-06-19 13:13:39,275 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:803] 2025-06-19 13:13:39,275 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[2025-06-19 13:13:39,283] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
[2025-06-19 13:13:39,284] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-06-19 13:13:59,273] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/wangjiarui/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangjiarui/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.6348514556884766 seconds
[2025-06-19 13:13:59,911] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-06-19 13:13:59,911] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-06-19 13:13:59,916] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-06-19 13:13:59,917] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=
[2025-06-19 13:13:59,917] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2025-06-19 13:13:59,917] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2025-06-19 13:13:59,917] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2025-06-19 13:13:59,917] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-06-19 13:13:59,917] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2025-06-19 13:14:00,311] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-06-19 13:14:00,312] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 15.95 GB CA 16.08 GB Max_CA 16 GB
[2025-06-19 13:14:00,312] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 61.55 GB, percent = 24.5%
[2025-06-19 13:14:00,477] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-06-19 13:14:00,477] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 16.05 GB CA 16.28 GB Max_CA 16 GB
[2025-06-19 13:14:00,477] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 61.56 GB, percent = 24.5%
[2025-06-19 13:14:00,478] [INFO] [stage_1_and_2.py:545:__init__] optimizer state initialized
[2025-06-19 13:14:00,639] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-06-19 13:14:00,640] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 15.85 GB CA 16.28 GB Max_CA 16 GB
[2025-06-19 13:14:00,640] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 61.56 GB, percent = 24.5%
[2025-06-19 13:14:00,641] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2025-06-19 13:14:00,641] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2025-06-19 13:14:00,641] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2025-06-19 13:14:00,641] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2025-06-19 13:14:00,643] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2025-06-19 13:14:00,644] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2025-06-19 13:14:00,644] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-06-19 13:14:00,644] [INFO] [config.py:1003:print] amp_enabled .................. False
[2025-06-19 13:14:00,644] [INFO] [config.py:1003:print] amp_params ................... False
[2025-06-19 13:14:00,644] [INFO] [config.py:1003:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 }
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] bfloat16_enabled ............. True
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] comms_config .................
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] communication_data_type ...... None
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] dataloader_drop_last ......... False
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] disable_allgather ............ False
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] dump_state ................... False
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0
[2025-06-19 13:14:00,645] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] elasticity_enabled ........... False
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] fp16_auto_cast ............... None
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] fp16_enabled ................. False
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] global_rank .................. 0
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] grad_accum_dtype ............. None
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 1
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] graph_harvesting ............. False
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] load_universal_checkpoint .... False
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] loss_scale ................... 1.0
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] memory_breakdown ............. False
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False
[2025-06-19 13:14:00,646] [INFO] [config.py:1003:print] mics_shard_size .............. -1
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null }
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] optimizer_name ............... adamw
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01}
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] pld_enabled .................. False
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] pld_params ................... False
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] prescale_gradients ........... False
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] scheduler_name ............... None
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] scheduler_params ............. None
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] sparse_attention ............. None
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] steps_per_print .............. inf
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] train_batch_size ............. 4
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 4
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] use_node_local_storage ....... False
[2025-06-19 13:14:00,647] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True
[2025-06-19 13:14:00,648] [INFO] [config.py:1003:print] weight_quantization_config ... None
[2025-06-19 13:14:00,648] [INFO] [config.py:1003:print] world_size ................... 1
[2025-06-19 13:14:00,648] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False
[2025-06-19 13:14:00,648] [INFO] [config.py:1003:print] zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2025-06-19 13:14:00,648] [INFO] [config.py:1003:print] zero_enabled ................. True
[2025-06-19 13:14:00,648] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True
[2025-06-19 13:14:00,648] [INFO] [config.py:1003:print] zero_optimization_stage ...... 1
[2025-06-19 13:14:00,648] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 1, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 4, "train_micro_batch_size_per_gpu": 4, "wall_clock_breakdown": true }
[INFO|trainer.py:2369] 2025-06-19 13:14:00,650 >> ***** Running training *****
[INFO|trainer.py:2370] 2025-06-19 13:14:00,650 >> Num examples = 4,000
[INFO|trainer.py:2371] 2025-06-19 13:14:00,650 >> Num Epochs = 10
[INFO|trainer.py:2372] 2025-06-19 13:14:00,650 >> Instantaneous batch size per device = 4
[INFO|trainer.py:2375] 2025-06-19 13:14:00,650 >> Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:2376] 2025-06-19 13:14:00,650 >> Gradient Accumulation steps = 1
[INFO|trainer.py:2377] 2025-06-19 13:14:00,650 >> Total optimization steps = 10,000
[INFO|trainer.py:2378] 2025-06-19 13:14:00,651 >> Number of trainable parameters = 53,503,488
0%| | 0/10000 [00:00, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=weights/st1/mos0_st1/runs/Jun19_13-15-32_amax, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=weights/st1/mos0_st1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=weights/st1/mos0_st1, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=70000000, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False,
use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, )
06/19/2025 13:15:32 - INFO - __main__ - Loading Tokenizer: /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B
[INFO|tokenization_utils_base.py:2032] 2025-06-19 13:15:32,724 >> loading file ./tokenizer.model
[INFO|tokenization_utils_base.py:2032] 2025-06-19 13:15:32,724 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2032] 2025-06-19 13:15:32,724 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2032] 2025-06-19 13:15:32,724 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2032] 2025-06-19 13:15:32,724 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2032] 2025-06-19 13:15:32,724 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2304] 2025-06-19 13:15:33,016 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
06/19/2025 13:15:33 - INFO - __main__ - Loading InternVLChatModel...
[INFO|configuration_utils.py:694] 2025-06-19 13:15:33,271 >> loading configuration file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/config.json [INFO|configuration_utils.py:768] 2025-06-19 13:15:33,273 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "internlm/internlm2_5-7b-chat", "add_cross_attention": false, "architectures": [ "InternLM2ForCausalLM" ], "attn_implementation": "flash_attention_2", "auto_map": { "AutoConfig": "configuration_internlm2.InternLM2Config", "AutoModel": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForSequenceClassification": "modeling_internlm2.InternLM2ForSequenceClassification" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false, "bos_token_id": 1, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 14336, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, 
"num_hidden_layers": 32, "num_key_value_heads": 8, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.48.3", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internvl2_5", "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": false, "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, 
"is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.48.3", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } }
06/19/2025 13:15:33 - INFO - __main__ - Using flash_attention_2 for InternLM
[INFO|modeling_utils.py:3901] 2025-06-19 13:15:33,273 >> loading weights file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/model.safetensors.index.json
[INFO|modeling_utils.py:1582] 2025-06-19 13:15:33,274 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1140] 2025-06-19 13:15:33,276 >> Generate config GenerationConfig {} this model
[WARNING|logging.py:328] 2025-06-19 13:15:33,343 >> InternLM2ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
- If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
- If you are not the owner of the model architecture class, please contact the model code owner to update it.
[INFO|configuration_utils.py:1140] 2025-06-19 13:15:33,343 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 }
Setting backbone: fragments_backbone
Loading checkpoint shards: 0%| | 0/4 [00:00> All model checkpoint weights were used when initializing InternVLChatModel.
[WARNING|modeling_utils.py:4890] 2025-06-19 13:15:35,239 >> Some weights of InternVLChatModel were not initialized from the model checkpoint at /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B and are newly initialized: ['evaluator.fragments_backbone.layers.0.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.norm2.bias',
'evaluator.fragments_backbone.layers.0.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.0.downsample.norm.bias', 'evaluator.fragments_backbone.layers.0.downsample.norm.weight', 'evaluator.fragments_backbone.layers.0.downsample.reduction.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc2.bias', 
'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.1.downsample.norm.bias', 'evaluator.fragments_backbone.layers.1.downsample.norm.weight', 'evaluator.fragments_backbone.layers.1.downsample.reduction.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.relative_position_bias_table', 
'evaluator.fragments_backbone.layers.2.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.qkv.weight', 
'evaluator.fragments_backbone.layers.2.blocks.2.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.qkv.bias', 
'evaluator.fragments_backbone.layers.2.blocks.4.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.norm2.weight', 'evaluator.fragments_backbone.layers.2.downsample.norm.bias', 'evaluator.fragments_backbone.layers.2.downsample.norm.weight', 'evaluator.fragments_backbone.layers.2.downsample.reduction.weight', 
'evaluator.fragments_backbone.layers.3.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.norm2.weight', 'evaluator.fragments_backbone.norm.bias', 'evaluator.fragments_backbone.norm.weight', 'evaluator.fragments_backbone.patch_embed.norm.bias', 
'evaluator.fragments_backbone.patch_embed.norm.weight', 'evaluator.fragments_backbone.patch_embed.proj.bias', 'evaluator.fragments_backbone.patch_embed.proj.weight', 'fast_mlp.0.bias', 'fast_mlp.0.weight', 'fast_mlp.1.bias', 'fast_mlp.1.weight', 'fast_mlp.3.bias', 'fast_mlp.3.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [INFO|configuration_utils.py:1093] 2025-06-19 13:15:35,247 >> loading configuration file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/generation_config.json [INFO|configuration_utils.py:1140] 2025-06-19 13:15:35,248 >> Generate config GenerationConfig { "eos_token_id": [ 92542, 92543 ] } 06/19/2025 13:15:35 - INFO - __main__ - Finished 06/19/2025 13:15:35 - INFO - __main__ - model.config.force_image_size: 448 06/19/2025 13:15:35 - INFO - __main__ - data_args.force_image_size: 448 06/19/2025 13:15:35 - INFO - __main__ - model.config.vision_config.image_size: 448 06/19/2025 13:15:35 - INFO - __main__ - [Dataset] num_image_token: 256 06/19/2025 13:15:35 - INFO - __main__ - [Dataset] dynamic_image_size: True 06/19/2025 13:15:35 - INFO - __main__ - [Dataset] use_thumbnail: True 06/19/2025 13:15:35 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 06/19/2025 13:15:35 - INFO - __main__ - Formatting inputs...Skip in lazy mode 06/19/2025 13:15:35 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 4000 06/19/2025 13:15:35 - INFO - __main__ - [Dataset] num_image_token: 256 06/19/2025 13:15:35 - INFO - __main__ - [Dataset] dynamic_image_size: True 06/19/2025 13:15:35 - INFO - __main__ - [Dataset] use_thumbnail: True 06/19/2025 13:15:35 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 06/19/2025 13:15:35 - INFO - __main__ - Formatting inputs...Skip in lazy mode 06/19/2025 13:15:35 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 500 eval_dataset 06/19/2025 13:15:35 
- INFO - __main__ - mlp1.0.weight 06/19/2025 13:15:35 - INFO - __main__ - mlp1.0.bias 06/19/2025 13:15:35 - INFO - __main__ - mlp1.1.weight 06/19/2025 13:15:35 - INFO - __main__ - mlp1.1.bias 06/19/2025 13:15:35 - INFO - __main__ - mlp1.3.weight 06/19/2025 13:15:35 - INFO - __main__ - mlp1.3.bias 06/19/2025 13:15:35 - INFO - __main__ - fast_mlp.0.weight 06/19/2025 13:15:35 - INFO - __main__ - fast_mlp.0.bias 06/19/2025 13:15:35 - INFO - __main__ - fast_mlp.1.weight 06/19/2025 13:15:35 - INFO - __main__ - fast_mlp.1.bias 06/19/2025 13:15:35 - INFO - __main__ - fast_mlp.3.weight 06/19/2025 13:15:35 - INFO - __main__ - fast_mlp.3.bias training_args TrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=1000, eval_strategy=steps, eval_use_gather_object=False, evaluation_strategy=steps, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, 
gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=None, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=weights/st1/mos0_st1/runs/Jun19_13-15-32_amax, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=weights/st1/mos0_st1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=weights/st1/mos0_st1, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=70000000, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.03, 
warmup_steps=0, weight_decay=0.01, ) [INFO|trainer.py:741] 2025-06-19 13:15:35,906 >> Using auto half precision backend [WARNING|trainer.py:803] 2025-06-19 13:15:36,063 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:803] 2025-06-19 13:15:36,064 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [2025-06-19 13:15:36,072] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown [2025-06-19 13:15:36,072] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1 [2025-06-19 13:15:41,885] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /home/wangjiarui/.cache/torch_extensions/py39_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/wangjiarui/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... 
Time to load fused_adam op: 0.3843038082122803 seconds [2025-06-19 13:15:42,274] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2025-06-19 13:15:42,274] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2025-06-19 13:15:42,279] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2025-06-19 13:15:42,279] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2025-06-19 13:15:42,279] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2025-06-19 13:15:42,279] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000 [2025-06-19 13:15:42,279] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000 [2025-06-19 13:15:42,279] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False [2025-06-19 13:15:42,279] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False [2025-06-19 13:15:42,619] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2025-06-19 13:15:42,620] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 15.95 GB CA 16.08 GB Max_CA 16 GB [2025-06-19 13:15:42,620] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 60.99 GB, percent = 24.2% [2025-06-19 13:15:42,778] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2025-06-19 13:15:42,779] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 16.05 GB CA 16.28 GB Max_CA 16 GB [2025-06-19 13:15:42,779] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 60.99 GB, percent = 24.2% [2025-06-19 13:15:42,779] [INFO] [stage_1_and_2.py:545:__init__] optimizer state initialized [2025-06-19 13:15:42,928] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-06-19 13:15:42,929] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 
15.85 GB CA 16.28 GB Max_CA 16 GB [2025-06-19 13:15:42,929] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 60.99 GB, percent = 24.2% [2025-06-19 13:15:42,930] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2025-06-19 13:15:42,930] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2025-06-19 13:15:42,930] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2025-06-19 13:15:42,931] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2025-06-19 13:15:42,933] [INFO] [config.py:999:print] DeepSpeedEngine configuration: [2025-06-19 13:15:42,933] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-06-19 13:15:42,933] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-06-19 13:15:42,933] [INFO] [config.py:1003:print] amp_enabled .................. False [2025-06-19 13:15:42,933] [INFO] [config.py:1003:print] amp_params ................... False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] comms_config ................. [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] dump_state ................... 
False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-06-19 13:15:42,934] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 1 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] graph_harvesting ............. 
False [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-06-19 13:15:42,935] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] optimizer_name ............... adamw [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] pld_params ................... False [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-06-19 13:15:42,936] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] train_batch_size ............. 4 [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 4 [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] world_size ................... 1 [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] zero_config .................. 
stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-06-19 13:15:42,937] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1 [2025-06-19 13:15:42,937] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 1, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 4, "train_micro_batch_size_per_gpu": 4, "wall_clock_breakdown": true } [INFO|trainer.py:2369] 2025-06-19 13:15:42,939 >> ***** Running training ***** [INFO|trainer.py:2370] 2025-06-19 13:15:42,939 >> Num examples = 4,000 [INFO|trainer.py:2371] 2025-06-19 13:15:42,939 >> Num Epochs = 10 [INFO|trainer.py:2372] 2025-06-19 13:15:42,939 >> Instantaneous batch size per device = 4 [INFO|trainer.py:2375] 2025-06-19 13:15:42,939 >> Total train batch size (w. 
parallel, distributed & accumulation) = 4 [INFO|trainer.py:2376] 2025-06-19 13:15:42,939 >> Gradient Accumulation steps = 1 [INFO|trainer.py:2377] 2025-06-19 13:15:42,939 >> Total optimization steps = 10,000 [INFO|trainer.py:2378] 2025-06-19 13:15:42,941 >> Number of trainable parameters = 53,503,488 0%| | 0/10000 [00:00, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=weights/st1/mos0_st1/runs/Jun19_13-16-11_amax, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=weights/st1/mos0_st1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=weights/st1/mos0_st1, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=70000000, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, 
use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 06/19/2025 13:16:12 - INFO - __main__ - Loading Tokenizer: /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:16:12,032 >> loading file ./tokenizer.model [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:16:12,032 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:16:12,032 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:16:12,032 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:16:12,032 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:16:12,032 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2304] 2025-06-19 13:16:12,326 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 06/19/2025 13:16:12 - INFO - __main__ - Loading InternVLChatModel... 
[INFO|configuration_utils.py:694] 2025-06-19 13:16:12,585 >> loading configuration file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/config.json [INFO|configuration_utils.py:768] 2025-06-19 13:16:12,588 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "internlm/internlm2_5-7b-chat", "add_cross_attention": false, "architectures": [ "InternLM2ForCausalLM" ], "attn_implementation": "flash_attention_2", "auto_map": { "AutoConfig": "configuration_internlm2.InternLM2Config", "AutoModel": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForSequenceClassification": "modeling_internlm2.InternLM2ForSequenceClassification" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false, "bos_token_id": 1, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 14336, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, 
"num_hidden_layers": 32, "num_key_value_heads": 8, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.48.3", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internvl2_5", "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": false, "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, 
"is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.48.3", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } 06/19/2025 13:16:12 - INFO - __main__ - Using flash_attention_2 for InternLM [INFO|modeling_utils.py:3901] 2025-06-19 13:16:12,588 >> loading weights file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/model.safetensors.index.json [INFO|modeling_utils.py:1582] 2025-06-19 13:16:12,589 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1140] 2025-06-19 13:16:12,591 >> Generate config GenerationConfig {} this model [WARNING|logging.py:328] 2025-06-19 13:16:12,659 >> InternLM2ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. 
- If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. [INFO|configuration_utils.py:1140] 2025-06-19 13:16:12,660 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 } Setting backbone: fragments_backbone Loading checkpoint shards: 0%| | 0/4 [00:00> All model checkpoint weights were used when initializing InternVLChatModel. [WARNING|modeling_utils.py:4890] 2025-06-19 13:16:14,519 >> Some weights of InternVLChatModel were not initialized from the model checkpoint at /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B and are newly initialized: ['evaluator.fragments_backbone.layers.0.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.norm2.bias', 
'evaluator.fragments_backbone.layers.0.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.0.downsample.norm.bias', 'evaluator.fragments_backbone.layers.0.downsample.norm.weight', 'evaluator.fragments_backbone.layers.0.downsample.reduction.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc2.bias', 
'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.1.downsample.norm.bias', 'evaluator.fragments_backbone.layers.1.downsample.norm.weight', 'evaluator.fragments_backbone.layers.1.downsample.reduction.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.relative_position_bias_table', 
'evaluator.fragments_backbone.layers.2.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.qkv.weight', 
'evaluator.fragments_backbone.layers.2.blocks.2.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.qkv.bias', 
'evaluator.fragments_backbone.layers.2.blocks.4.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.norm2.weight', 'evaluator.fragments_backbone.layers.2.downsample.norm.bias', 'evaluator.fragments_backbone.layers.2.downsample.norm.weight', 'evaluator.fragments_backbone.layers.2.downsample.reduction.weight', 
'evaluator.fragments_backbone.layers.3.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.norm2.weight', 'evaluator.fragments_backbone.norm.bias', 'evaluator.fragments_backbone.norm.weight', 'evaluator.fragments_backbone.patch_embed.norm.bias', 
'evaluator.fragments_backbone.patch_embed.norm.weight', 'evaluator.fragments_backbone.patch_embed.proj.bias', 'evaluator.fragments_backbone.patch_embed.proj.weight', 'fast_mlp.0.bias', 'fast_mlp.0.weight', 'fast_mlp.1.bias', 'fast_mlp.1.weight', 'fast_mlp.3.bias', 'fast_mlp.3.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [INFO|configuration_utils.py:1093] 2025-06-19 13:16:14,527 >> loading configuration file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/generation_config.json [INFO|configuration_utils.py:1140] 2025-06-19 13:16:14,527 >> Generate config GenerationConfig { "eos_token_id": [ 92542, 92543 ] } 06/19/2025 13:16:14 - INFO - __main__ - Finished 06/19/2025 13:16:14 - INFO - __main__ - model.config.force_image_size: 448 06/19/2025 13:16:14 - INFO - __main__ - data_args.force_image_size: 448 06/19/2025 13:16:14 - INFO - __main__ - model.config.vision_config.image_size: 448 06/19/2025 13:16:14 - INFO - __main__ - [Dataset] num_image_token: 256 06/19/2025 13:16:14 - INFO - __main__ - [Dataset] dynamic_image_size: True 06/19/2025 13:16:14 - INFO - __main__ - [Dataset] use_thumbnail: True 06/19/2025 13:16:14 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 06/19/2025 13:16:14 - INFO - __main__ - Formatting inputs...Skip in lazy mode 06/19/2025 13:16:14 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 4000 06/19/2025 13:16:14 - INFO - __main__ - [Dataset] num_image_token: 256 06/19/2025 13:16:14 - INFO - __main__ - [Dataset] dynamic_image_size: True 06/19/2025 13:16:14 - INFO - __main__ - [Dataset] use_thumbnail: True 06/19/2025 13:16:14 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 06/19/2025 13:16:14 - INFO - __main__ - Formatting inputs...Skip in lazy mode 06/19/2025 13:16:15 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 500 eval_dataset 06/19/2025 13:16:15 
- INFO - __main__ - mlp1.0.weight 06/19/2025 13:16:15 - INFO - __main__ - mlp1.0.bias 06/19/2025 13:16:15 - INFO - __main__ - mlp1.1.weight 06/19/2025 13:16:15 - INFO - __main__ - mlp1.1.bias 06/19/2025 13:16:15 - INFO - __main__ - mlp1.3.weight 06/19/2025 13:16:15 - INFO - __main__ - mlp1.3.bias 06/19/2025 13:16:15 - INFO - __main__ - fast_mlp.0.weight 06/19/2025 13:16:15 - INFO - __main__ - fast_mlp.0.bias 06/19/2025 13:16:15 - INFO - __main__ - fast_mlp.1.weight 06/19/2025 13:16:15 - INFO - __main__ - fast_mlp.1.bias 06/19/2025 13:16:15 - INFO - __main__ - fast_mlp.3.weight 06/19/2025 13:16:15 - INFO - __main__ - fast_mlp.3.bias training_args TrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=1.0, eval_strategy=steps, eval_use_gather_object=False, evaluation_strategy=steps, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, 
gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=None, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=weights/st1/mos0_st1/runs/Jun19_13-16-11_amax, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=weights/st1/mos0_st1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=weights/st1/mos0_st1, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=70000000, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.03, 
warmup_steps=0, weight_decay=0.01, )
[INFO|trainer.py:741] 2025-06-19 13:16:15,200 >> Using auto half precision backend
[WARNING|trainer.py:803] 2025-06-19 13:16:15,357 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[WARNING|trainer.py:803] 2025-06-19 13:16:15,357 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
[2025-06-19 13:16:15,365] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
[2025-06-19 13:16:15,365] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-06-19 13:16:20,078] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /home/wangjiarui/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangjiarui/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.3961606025695801 seconds
[2025-06-19 13:16:20,478] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-06-19 13:16:20,478] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-06-19 13:16:20,482] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-06-19 13:16:20,483] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=
[2025-06-19 13:16:20,483] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer
[2025-06-19 13:16:20,483] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000
[2025-06-19 13:16:20,483] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000
[2025-06-19 13:16:20,483] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2025-06-19 13:16:20,483] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2025-06-19 13:16:20,821] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2025-06-19 13:16:20,821] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 15.95 GB CA 16.08 GB Max_CA 16 GB
[2025-06-19 13:16:20,822] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 61.37 GB, percent = 24.4%
[2025-06-19 13:16:20,974] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2025-06-19 13:16:20,975] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 16.05 GB CA 16.28 GB Max_CA 16 GB
[2025-06-19 13:16:20,975] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 61.38 GB, percent = 24.4%
[2025-06-19 13:16:20,975] [INFO] [stage_1_and_2.py:545:__init__] optimizer state initialized
[2025-06-19 13:16:21,123] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2025-06-19 13:16:21,124] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 15.85 GB CA 16.28 GB Max_CA 16 GB
[2025-06-19 13:16:21,124] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 61.38 GB, percent = 24.4%
[2025-06-19 13:16:21,124] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2025-06-19 13:16:21,125] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2025-06-19 13:16:21,125] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler =
[2025-06-19 13:16:21,125] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]]
[2025-06-19 13:16:21,127] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false }
[2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] amp_enabled .................. False
[2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] amp_params ................... False
[2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] autotuning_config ............
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] comms_config ................. [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-06-19 13:16:21,128] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] dump_state ................... 
False [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 1 [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-06-19 13:16:21,129] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] graph_harvesting ............. 
False [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] optimizer_name ............... adamw [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] pld_params ................... False [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-06-19 13:16:21,130] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] train_batch_size ............. 4 [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 4 [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] world_size ................... 1 [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] zero_config .................. 
stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-06-19 13:16:21,131] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1 [2025-06-19 13:16:21,131] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 1, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 4, "train_micro_batch_size_per_gpu": 4, "wall_clock_breakdown": true } [INFO|trainer.py:2369] 2025-06-19 13:16:21,133 >> ***** Running training ***** [INFO|trainer.py:2370] 2025-06-19 13:16:21,133 >> Num examples = 4,000 [INFO|trainer.py:2371] 2025-06-19 13:16:21,133 >> Num Epochs = 10 [INFO|trainer.py:2372] 2025-06-19 13:16:21,133 >> Instantaneous batch size per device = 4 [INFO|trainer.py:2375] 2025-06-19 13:16:21,133 >> Total train batch size (w. parallel, distributed & accumulation) = 4 [INFO|trainer.py:2376] 2025-06-19 13:16:21,133 >> Gradient Accumulation steps = 1 [INFO|trainer.py:2377] 2025-06-19 13:16:21,133 >> Total optimization steps = 10,000 [INFO|trainer.py:2378] 2025-06-19 13:16:21,135 >> Number of trainable parameters = 53,503,488 0%| | 0/10000 [00:00> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 13:17:11,561 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 13:17:11,561 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 13:17:58,987 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. 
You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 13:17:58,988 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 13:17:58,988 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 13:17:58,988 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json [2025-06-19 13:18:06,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 13:18:06,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.58 | bwd_microstep: 3270.66 | bwd_inner_microstep: 3269.75 | bwd_allreduce_microstep: 0.83 | step_microstep: 8.22 [2025-06-19 13:18:06,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.54 | bwd: 3270.70 | bwd_inner: 3269.75 | bwd_allreduce: 0.87 | step: 8.24 0%| | 2/10000 [01:45<153:14:35, 55.18s/it] {'loss': 1.389, 'grad_norm': 10.968062400817871, 'learning_rate': 2.666666666666667e-07, 'epoch': 0.0} 0%| | 2/10000 [01:45<153:14:35, 55.18s/it]evaluate! [2025-06-19 13:29:05,231] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
[2025-06-19 13:29:08,566] [INFO] [comm.py:652:init_distributed] cdb=None [2025-06-19 13:29:08,566] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 06/19/2025 13:29:08 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 06/19/2025 13:29:08 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=1000, eval_strategy=steps, eval_use_gather_object=False, evaluation_strategy=steps, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=None, hub_strategy=every_save, hub_token=, 
ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=weights/st1/mos0_st1/runs/Jun19_13-29-08_amax, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=weights/st1/mos0_st1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=weights/st1/mos0_st1, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=70000000, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 06/19/2025 13:29:08 - INFO - __main__ - Loading Tokenizer: /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:08,693 >> loading file ./tokenizer.model 
[INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:08,693 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:08,693 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:08,693 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:08,693 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:08,693 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2304] 2025-06-19 13:29:08,990 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 06/19/2025 13:29:09 - INFO - __main__ - Loading InternVLChatModel... [INFO|configuration_utils.py:694] 2025-06-19 13:29:09,260 >> loading configuration file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/config.json [INFO|configuration_utils.py:768] 2025-06-19 13:29:09,262 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "internlm/internlm2_5-7b-chat", "add_cross_attention": false, "architectures": [ "InternLM2ForCausalLM" ], "attn_implementation": "flash_attention_2", "auto_map": { "AutoConfig": "configuration_internlm2.InternLM2Config", "AutoModel": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForSequenceClassification": "modeling_internlm2.InternLM2ForSequenceClassification" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false, "bos_token_id": 1, "chunk_size_feed_forward": 0, 
"cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 14336, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 32, "num_key_value_heads": 8, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.48.3", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internvl2_5", "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { "_attn_implementation_autoset": false, 
"_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.48.3", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } 06/19/2025 13:29:09 - INFO - __main__ - Using flash_attention_2 for InternLM 
[INFO|modeling_utils.py:3901] 2025-06-19 13:29:09,262 >> loading weights file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/model.safetensors.index.json [INFO|modeling_utils.py:1582] 2025-06-19 13:29:09,263 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1140] 2025-06-19 13:29:09,265 >> Generate config GenerationConfig {} this model [WARNING|logging.py:328] 2025-06-19 13:29:09,331 >> InternLM2ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. [INFO|configuration_utils.py:1140] 2025-06-19 13:29:09,331 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 } [2025-06-19 13:29:26,142] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! 
petrel_client is not installed. Using PIL to load images. [2025-06-19 13:29:29,465] [INFO] [comm.py:652:init_distributed] cdb=None [2025-06-19 13:29:29,465] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl 06/19/2025 13:29:29 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False 06/19/2025 13:29:29 - INFO - __main__ - Training/evaluation parameters TrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=1000, eval_strategy=steps, eval_use_gather_object=False, evaluation_strategy=steps, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, 
hub_private_repo=None, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=weights/st1/mos0_st1/runs/Jun19_13-29-29_amax, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=weights/st1/mos0_st1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=weights/st1/mos0_st1, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=70000000, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) 06/19/2025 13:29:29 - INFO - __main__ - Loading Tokenizer: /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:29,584 >> 
loading file ./tokenizer.model [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:29,584 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:29,585 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:29,585 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:29,585 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2032] 2025-06-19 13:29:29,585 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2304] 2025-06-19 13:29:29,884 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 06/19/2025 13:29:30 - INFO - __main__ - Loading InternVLChatModel... [INFO|configuration_utils.py:694] 2025-06-19 13:29:30,144 >> loading configuration file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/config.json [INFO|configuration_utils.py:768] 2025-06-19 13:29:30,146 >> Model config InternVLChatConfig { "_commit_hash": null, "architectures": [ "InternVLChatModel" ], "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" }, "downsample_ratio": 0.5, "dynamic_image_size": true, "force_image_size": 448, "llm_config": { "_attn_implementation_autoset": false, "_name_or_path": "internlm/internlm2_5-7b-chat", "add_cross_attention": false, "architectures": [ "InternLM2ForCausalLM" ], "attn_implementation": "flash_attention_2", "auto_map": { "AutoConfig": "configuration_internlm2.InternLM2Config", "AutoModel": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForCausalLM": "modeling_internlm2.InternLM2ForCausalLM", "AutoModelForSequenceClassification": "modeling_internlm2.InternLM2ForSequenceClassification" }, "bad_words_ids": null, "begin_suppress_tokens": null, "bias": false, "bos_token_id": 1, 
"chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 2, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 4096, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 14336, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "min_length": 0, "model_type": "internlm2", "no_repeat_ngram_size": 0, "num_attention_heads": 32, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 32, "num_key_value_heads": 8, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": 2, "prefix": null, "pretraining_tp": 1, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-05, "rope_scaling": { "factor": 2.0, "type": "dynamic" }, "rope_theta": 1000000, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": false, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.48.3", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "vocab_size": 92553 }, "max_dynamic_patch": 12, "min_dynamic_patch": 1, "model_type": "internvl_chat", "pad2square": false, "ps_version": "v2", "select_layer": -1, "template": "internvl2_5", "torch_dtype": "bfloat16", "transformers_version": null, "use_backbone_lora": 0, "use_llm_lora": 0, "use_thumbnail": true, "vision_config": { 
"_attn_implementation_autoset": false, "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.48.3", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true } } 06/19/2025 13:29:30 - INFO - __main__ - Using 
flash_attention_2 for InternLM [INFO|modeling_utils.py:3901] 2025-06-19 13:29:30,147 >> loading weights file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/model.safetensors.index.json [INFO|modeling_utils.py:1582] 2025-06-19 13:29:30,147 >> Instantiating InternVLChatModel model under default dtype torch.bfloat16. [INFO|configuration_utils.py:1140] 2025-06-19 13:29:30,149 >> Generate config GenerationConfig {} this model [WARNING|logging.py:328] 2025-06-19 13:29:30,214 >> InternLM2ForCausalLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. [INFO|configuration_utils.py:1140] 2025-06-19 13:29:30,214 >> Generate config GenerationConfig { "bos_token_id": 1, "eos_token_id": 2, "pad_token_id": 2 } Setting backbone: fragments_backbone Loading checkpoint shards: 0%| | 0/4 [00:00> All model checkpoint weights were used when initializing InternVLChatModel. 
[WARNING|modeling_utils.py:4890] 2025-06-19 13:29:32,218 >> Some weights of InternVLChatModel were not initialized from the model checkpoint at /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B and are newly initialized: ['evaluator.fragments_backbone.layers.0.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.0.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.0.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.0.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc2.bias', 
'evaluator.fragments_backbone.layers.0.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.0.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.0.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.0.downsample.norm.bias', 'evaluator.fragments_backbone.layers.0.downsample.norm.weight', 'evaluator.fragments_backbone.layers.0.downsample.reduction.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.1.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.1.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.attn.relative_position_bias_table', 
'evaluator.fragments_backbone.layers.1.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.1.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.1.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.1.downsample.norm.bias', 'evaluator.fragments_backbone.layers.1.downsample.norm.weight', 'evaluator.fragments_backbone.layers.1.downsample.reduction.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.proj.bias', 
'evaluator.fragments_backbone.layers.2.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.1.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.2.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.2.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.2.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.fragment_position_bias_table', 
'evaluator.fragments_backbone.layers.2.blocks.3.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.3.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.3.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.3.norm2.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.4.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.4.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.4.norm2.weight', 
'evaluator.fragments_backbone.layers.2.blocks.5.attn.fragment_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.proj.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.proj.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.qkv.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.qkv.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.2.blocks.5.attn.relative_position_index', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.norm1.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.norm1.weight', 'evaluator.fragments_backbone.layers.2.blocks.5.norm2.bias', 'evaluator.fragments_backbone.layers.2.blocks.5.norm2.weight', 'evaluator.fragments_backbone.layers.2.downsample.norm.bias', 'evaluator.fragments_backbone.layers.2.downsample.norm.weight', 'evaluator.fragments_backbone.layers.2.downsample.reduction.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.proj.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.proj.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.qkv.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.qkv.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.3.blocks.0.attn.relative_position_index', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.norm1.bias', 
'evaluator.fragments_backbone.layers.3.blocks.0.norm1.weight', 'evaluator.fragments_backbone.layers.3.blocks.0.norm2.bias', 'evaluator.fragments_backbone.layers.3.blocks.0.norm2.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.proj.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.proj.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.qkv.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.qkv.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.relative_position_bias_table', 'evaluator.fragments_backbone.layers.3.blocks.1.attn.relative_position_index', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc1.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc1.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc2.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.mlp.fc2.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.norm1.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.norm1.weight', 'evaluator.fragments_backbone.layers.3.blocks.1.norm2.bias', 'evaluator.fragments_backbone.layers.3.blocks.1.norm2.weight', 'evaluator.fragments_backbone.norm.bias', 'evaluator.fragments_backbone.norm.weight', 'evaluator.fragments_backbone.patch_embed.norm.bias', 'evaluator.fragments_backbone.patch_embed.norm.weight', 'evaluator.fragments_backbone.patch_embed.proj.bias', 'evaluator.fragments_backbone.patch_embed.proj.weight', 'fast_mlp.0.bias', 'fast_mlp.0.weight', 'fast_mlp.1.bias', 'fast_mlp.1.weight', 'fast_mlp.3.bias', 'fast_mlp.3.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. 
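A minimal sketch, not part of the training script that produced this log: the warning above lists every parameter that was missing from the checkpoint and therefore newly initialized. Grouping those names by their leading module path is one way to confirm that only the expected new modules (here `evaluator.fragments_backbone` and `fast_mlp`) were affected. The function name `group_new_weights` and its `depth` parameter are hypothetical, and `sample` holds only a handful of names copied from the warning, not the full list.

```python
from collections import Counter


def group_new_weights(names, depth=1):
    """Count parameter names by their first `depth` dot-separated components."""
    counts = Counter()
    for name in names:
        prefix = ".".join(name.split(".")[:depth])
        counts[prefix] += 1
    return dict(counts)


# A few names copied from the warning above (illustrative subset only).
sample = [
    "evaluator.fragments_backbone.norm.bias",
    "evaluator.fragments_backbone.norm.weight",
    "evaluator.fragments_backbone.patch_embed.proj.weight",
    "fast_mlp.0.bias",
    "fast_mlp.0.weight",
    "fast_mlp.3.weight",
]

if __name__ == "__main__":
    # Print one line per top-level module with the number of new tensors.
    for prefix, n in sorted(group_new_weights(sample).items()):
        print(f"{prefix}: {n}")
```

Raising `depth` (for example to 2) splits the counts one level further, which helps when a single top-level module contains both pretrained and newly added submodules.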
[INFO|configuration_utils.py:1093] 2025-06-19 13:29:32,226 >> loading configuration file /home/wangjiarui/AIGI_2025/OpenGVLab/InternVL2_5-8B/generation_config.json [INFO|configuration_utils.py:1140] 2025-06-19 13:29:32,227 >> Generate config GenerationConfig { "eos_token_id": [ 92542, 92543 ] } 06/19/2025 13:29:32 - INFO - __main__ - Finished 06/19/2025 13:29:32 - INFO - __main__ - model.config.force_image_size: 448 06/19/2025 13:29:32 - INFO - __main__ - data_args.force_image_size: 448 06/19/2025 13:29:32 - INFO - __main__ - model.config.vision_config.image_size: 448 06/19/2025 13:29:32 - INFO - __main__ - [Dataset] num_image_token: 256 06/19/2025 13:29:32 - INFO - __main__ - [Dataset] dynamic_image_size: True 06/19/2025 13:29:32 - INFO - __main__ - [Dataset] use_thumbnail: True 06/19/2025 13:29:32 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 06/19/2025 13:29:32 - INFO - __main__ - Formatting inputs...Skip in lazy mode 06/19/2025 13:29:32 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 4000 06/19/2025 13:29:32 - INFO - __main__ - [Dataset] num_image_token: 256 06/19/2025 13:29:32 - INFO - __main__ - [Dataset] dynamic_image_size: True 06/19/2025 13:29:32 - INFO - __main__ - [Dataset] use_thumbnail: True 06/19/2025 13:29:32 - INFO - __main__ - [Dataset] min_dynamic_patch: 1, max_dynamic_patch: 6 06/19/2025 13:29:32 - INFO - __main__ - Formatting inputs...Skip in lazy mode 06/19/2025 13:29:32 - INFO - __main__ - Add dataset: sharegpt4v_instruct_gpt4-vision_cap100k with length: 500 eval_dataset 06/19/2025 13:29:32 - INFO - __main__ - mlp1.0.weight 06/19/2025 13:29:32 - INFO - __main__ - mlp1.0.bias 06/19/2025 13:29:32 - INFO - __main__ - mlp1.1.weight 06/19/2025 13:29:32 - INFO - __main__ - mlp1.1.bias 06/19/2025 13:29:32 - INFO - __main__ - mlp1.3.weight 06/19/2025 13:29:32 - INFO - __main__ - mlp1.3.bias 06/19/2025 13:29:32 - INFO - __main__ - fast_mlp.0.weight 06/19/2025 13:29:32 - INFO - __main__ - 
fast_mlp.0.bias 06/19/2025 13:29:32 - INFO - __main__ - fast_mlp.1.weight 06/19/2025 13:29:32 - INFO - __main__ - fast_mlp.1.bias 06/19/2025 13:29:32 - INFO - __main__ - fast_mlp.3.weight 06/19/2025 13:29:32 - INFO - __main__ - fast_mlp.3.bias training_args TrainingArguments( _n_gpu=1, accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False}, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, average_tokens_across_devices=False, batch_eval_metrics=False, bf16=True, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=4, dataloader_persistent_workers=False, dataloader_pin_memory=True, dataloader_prefetch_factor=None, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=zero_stage1_config.json, disable_tqdm=False, dispatch_batches=None, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_do_concat_batches=True, eval_on_start=False, eval_steps=1000, eval_strategy=steps, eval_use_gather_object=False, evaluation_strategy=steps, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_always_push=False, hub_model_id=None, hub_private_repo=None, hub_strategy=every_save, hub_token=, ignore_data_skip=False, include_for_metrics=[], include_inputs_for_metrics=False, include_num_input_tokens_seen=False, 
include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=4e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=weights/st1/mos0_st1/runs/Jun19_13-29-29_amax, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1.0, logging_strategy=steps, lr_scheduler_kwargs={}, lr_scheduler_type=cosine, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, neftune_noise_alpha=None, no_cuda=False, num_train_epochs=10.0, optim=adamw_torch, optim_args=None, optim_target_modules=None, output_dir=weights/st1/mos0_st1, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=4, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=, ray_scope=last, remove_unused_columns=True, report_to=['tensorboard'], restore_callback_states_from_checkpoint=False, resume_from_checkpoint=None, run_name=weights/st1/mos0_st1, save_on_each_node=False, save_only_model=False, save_safetensors=True, save_steps=70000000, save_strategy=steps, save_total_limit=1, seed=42, skip_memory_metrics=True, split_batches=None, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torch_empty_cache_steps=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_cpu=False, use_ipex=False, use_legacy_prediction_loop=False, use_liger_kernel=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.01, ) [INFO|trainer.py:741] 2025-06-19 13:29:32,885 >> Using auto half precision backend [WARNING|trainer.py:803] 2025-06-19 13:29:33,041 >> Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead. [WARNING|trainer.py:803] 2025-06-19 13:29:33,041 >> Trainer.tokenizer is now deprecated. 
You should use Trainer.processing_class instead. [2025-06-19 13:29:33,050] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown [2025-06-19 13:29:33,050] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1 [2025-06-19 13:29:37,828] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False Using /home/wangjiarui/.cache/torch_extensions/py39_cu121 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /home/wangjiarui/.cache/torch_extensions/py39_cu121/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.6892125606536865 seconds [2025-06-19 13:29:38,521] [INFO] [logging.py:128:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer [2025-06-19 13:29:38,521] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2025-06-19 13:29:38,524] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam [2025-06-19 13:29:38,524] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type= [2025-06-19 13:29:38,524] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 1 optimizer [2025-06-19 13:29:38,525] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 1000000000 [2025-06-19 13:29:38,525] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 1000000000 [2025-06-19 13:29:38,525] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False [2025-06-19 13:29:38,525] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False [2025-06-19 13:29:38,885] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states [2025-06-19 
13:29:38,886] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 15.95 GB CA 16.08 GB Max_CA 16 GB [2025-06-19 13:29:38,887] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 62.09 GB, percent = 24.7% [2025-06-19 13:29:39,041] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2025-06-19 13:29:39,041] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 16.05 GB CA 16.28 GB Max_CA 16 GB [2025-06-19 13:29:39,042] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 62.09 GB, percent = 24.7% [2025-06-19 13:29:39,042] [INFO] [stage_1_and_2.py:545:__init__] optimizer state initialized [2025-06-19 13:29:39,189] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-06-19 13:29:39,190] [INFO] [utils.py:782:see_memory_usage] MA 15.85 GB Max_MA 15.85 GB CA 16.28 GB Max_CA 16 GB [2025-06-19 13:29:39,190] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 62.09 GB, percent = 24.7% [2025-06-19 13:29:39,191] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer [2025-06-19 13:29:39,191] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler [2025-06-19 13:29:39,191] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed LR Scheduler = [2025-06-19 13:29:39,191] [INFO] [logging.py:128:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[[0.9, 0.999]] [2025-06-19 13:29:39,193] [INFO] [config.py:999:print] DeepSpeedEngine configuration: [2025-06-19 13:29:39,194] [INFO] [config.py:1003:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-06-19 13:29:39,194] [INFO] [config.py:1003:print] aio_config ................... 
{'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-06-19 13:29:39,194] [INFO] [config.py:1003:print] amp_enabled .................. False [2025-06-19 13:29:39,194] [INFO] [config.py:1003:print] amp_params ................... False [2025-06-19 13:29:39,194] [INFO] [config.py:1003:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-06-19 13:29:39,194] [INFO] [config.py:1003:print] bfloat16_enabled ............. True [2025-06-19 13:29:39,194] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False [2025-06-19 13:29:39,194] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False [2025-06-19 13:29:39,194] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True [2025-06-19 13:29:39,194] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] comms_config ................. [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] communication_data_type ...... None [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] dataloader_drop_last ......... False [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] disable_allgather ............ False [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] dump_state ................... 
False [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... None [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1 [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0 [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100 [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06 [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01 [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] elasticity_enabled ........... False [2025-06-19 13:29:39,195] [INFO] [config.py:1003:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] fp16_auto_cast ............... None [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] fp16_enabled ................. False [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] global_rank .................. 0 [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] grad_accum_dtype ............. None [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 1 [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0 [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0 [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] graph_harvesting ............. 
False [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 1 [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] load_universal_checkpoint .... False [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] loss_scale ................... 1.0 [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] memory_breakdown ............. False [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] mics_shard_size .............. -1 [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] optimizer_name ............... adamw [2025-06-19 13:29:39,196] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 4e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.01} [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] pipeline ..................... 
{'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] pld_enabled .................. False [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] pld_params ................... False [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] prescale_gradients ........... False [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] scheduler_name ............... None [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] scheduler_params ............. None [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32 [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] sparse_attention ............. None [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] steps_per_print .............. inf [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] train_batch_size ............. 4 [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 4 [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] use_node_local_storage ....... False [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] wall_clock_breakdown ......... True [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] weight_quantization_config ... None [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] world_size ................... 1 [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] zero_config .................. 
stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=1000000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=1000000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] zero_enabled ................. True [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True [2025-06-19 13:29:39,197] [INFO] [config.py:1003:print] zero_optimization_stage ...... 
1 [2025-06-19 13:29:39,198] [INFO] [config.py:989:print_user_config] json = { "zero_optimization": { "stage": 1, "allgather_partitions": true, "allgather_bucket_size": 1.000000e+09, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 1.000000e+09, "contiguous_gradients": true }, "fp16": { "enabled": false, "auto_cast": true, "loss_scale": 0, "initial_scale_power": 32, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": 4e-05, "betas": [0.9, 0.999], "eps": 1e-08, "weight_decay": 0.01 } }, "gradient_accumulation_steps": 1, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 4, "train_micro_batch_size_per_gpu": 4, "wall_clock_breakdown": true } [INFO|trainer.py:2369] 2025-06-19 13:29:39,199 >> ***** Running training ***** [INFO|trainer.py:2370] 2025-06-19 13:29:39,199 >> Num examples = 4,000 [INFO|trainer.py:2371] 2025-06-19 13:29:39,199 >> Num Epochs = 10 [INFO|trainer.py:2372] 2025-06-19 13:29:39,199 >> Instantaneous batch size per device = 4 [INFO|trainer.py:2375] 2025-06-19 13:29:39,199 >> Total train batch size (w. 
parallel, distributed & accumulation) = 4 [INFO|trainer.py:2376] 2025-06-19 13:29:39,199 >> Gradient Accumulation steps = 1 [INFO|trainer.py:2377] 2025-06-19 13:29:39,199 >> Total optimization steps = 10,000 [INFO|trainer.py:2378] 2025-06-19 13:29:39,201 >> Number of trainable parameters = 53,503,488 0%| | 0/10000 [00:00= 5 [h264 @ 0x1abc5a80] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x195cc180] left block unavailable for requested intra mode [h264 @ 0x195cc180] error while decoding MB 0 25, bytestream 45493 [h264 @ 0x1ab42e40] Reference 5 >= 5 [h264 @ 0x1ab42e40] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x1ab42e40] left block unavailable for requested intra mode [h264 @ 0x1ab42e40] error while decoding MB 0 25, bytestream 45493 [2025-06-19 14:35:37,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 14:35:37,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.68 | bwd_microstep: 3324.20 | bwd_inner_microstep: 3323.20 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.28 [2025-06-19 14:35:37,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.68 | bwd: 3324.23 | bwd_inner: 3323.20 | bwd_allreduce: 0.94 | step: 7.29 7%|▋ | 711/10000 [1:05:58<14:09:57, 5.49s/it] {'loss': 0.0754, 'grad_norm': 0.4358844459056854, 'learning_rate': 3.9823071191493565e-05, 'epoch': 0.71} 7%|▋ | 711/10000 [1:05:58<14:09:57, 5.49s/it][2025-06-19 14:35:43,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 14:35:43,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.85 | bwd_microstep: 3343.18 | bwd_inner_microstep: 3342.09 | bwd_allreduce_microstep: 1.00 | step_microstep: 8.67 [2025-06-19 14:35:43,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.85 | bwd: 3343.21 | bwd_inner: 3342.09 
| bwd_allreduce: 1.04 | step: 8.68 7%|▋ | 712/10000 [1:06:03<14:12:05, 5.50s/it] {'loss': 0.2873, 'grad_norm': 0.9227029085159302, 'learning_rate': 3.982221045606437e-05, 'epoch': 0.71} 7%|▋ | 712/10000 [1:06:03<14:12:05, 5.50s/it][2025-06-19 14:35:48,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:35:48,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.02 | bwd_microstep: 3335.93 | bwd_inner_microstep: 3335.06 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.28 [2025-06-19 14:35:48,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.02 | bwd: 3335.96 | bwd_inner: 3335.06 | bwd_allreduce: 0.83 | step: 7.29 7%|▋ | 713/10000 [1:06:09<14:13:34, 5.51s/it] {'loss': 0.0724, 'grad_norm': 0.4255593419075012, 'learning_rate': 3.9821347641377305e-05, 'epoch': 0.71} 7%|▋ | 713/10000 [1:06:09<14:13:34, 5.51s/it][2025-06-19 14:35:54,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-19 14:35:54,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.44 | bwd_microstep: 3318.18 | bwd_inner_microstep: 3316.96 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.87 [2025-06-19 14:35:54,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.44 | bwd: 3318.21 | bwd_inner: 3316.96 | bwd_allreduce: 0.97 | step: 7.87 7%|▋ | 714/10000 [1:06:15<14:13:35, 5.52s/it] {'loss': 0.1683, 'grad_norm': 0.8366997241973877, 'learning_rate': 3.982048274752285e-05, 'epoch': 0.71} 7%|▋ | 714/10000 [1:06:15<14:13:35, 5.52s/it][2025-06-19 14:35:59,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:35:59,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2159.31 | bwd_microstep: 3380.48 | 
bwd_inner_microstep: 3379.46 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.27 [2025-06-19 14:35:59,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2159.31 | bwd: 3380.49 | bwd_inner: 3379.46 | bwd_allreduce: 0.99 | step: 7.28 7%|▋ | 715/10000 [1:06:20<14:16:44, 5.54s/it] {'loss': 0.1519, 'grad_norm': 1.0115703344345093, 'learning_rate': 3.981961577459176e-05, 'epoch': 0.71} 7%|▋ | 715/10000 [1:06:20<14:16:44, 5.54s/it][2025-06-19 14:36:05,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 14:36:05,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.52 | bwd_microstep: 3389.26 | bwd_inner_microstep: 3388.02 | bwd_allreduce_microstep: 1.16 | step_microstep: 8.00 [2025-06-19 14:36:05,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.52 | bwd: 3389.28 | bwd_inner: 3388.02 | bwd_allreduce: 1.20 | step: 8.00 7%|▋ | 716/10000 [1:06:26<14:19:07, 5.55s/it] {'loss': 0.0565, 'grad_norm': 0.38729092478752136, 'learning_rate': 3.981874672267495e-05, 'epoch': 0.72} 7%|▋ | 716/10000 [1:06:26<14:19:07, 5.55s/it][2025-06-19 14:36:10,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-19 14:36:10,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.69 | bwd_microstep: 3373.96 | bwd_inner_microstep: 3373.06 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.03 [2025-06-19 14:36:10,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.69 | bwd: 3373.97 | bwd_inner: 3373.06 | bwd_allreduce: 0.87 | step: 7.03 7%|▋ | 717/10000 [1:06:31<14:19:20, 5.55s/it] {'loss': 0.0834, 'grad_norm': 0.4633166193962097, 'learning_rate': 3.9817875591863596e-05, 'epoch': 0.72} 7%|▋ | 717/10000 [1:06:31<14:19:20, 5.55s/it][2025-06-19 14:36:16,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:36:16,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.23 | bwd_microstep: 3368.72 | bwd_inner_microstep: 3367.83 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.78 [2025-06-19 14:36:16,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.23 | bwd: 3368.73 | bwd_inner: 3367.83 | bwd_allreduce: 0.87 | step: 6.78 7%|▋ | 718/10000 [1:06:37<14:19:16, 5.55s/it] {'loss': 0.0736, 'grad_norm': 0.41098159551620483, 'learning_rate': 3.981700238224907e-05, 'epoch': 0.72} 7%|▋ | 718/10000 [1:06:37<14:19:16, 5.55s/it][2025-06-19 14:36:22,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:36:22,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.16 | bwd_microstep: 3364.82 | bwd_inner_microstep: 3364.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 14:36:22,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.15 | bwd: 3364.83 | bwd_inner: 3364.04 | bwd_allreduce: 0.76 | step: 6.54 7%|▋ | 719/10000 [1:06:42<14:18:02, 5.55s/it] {'loss': 0.0409, 'grad_norm': 0.300537109375, 'learning_rate': 3.9816127093922964e-05, 'epoch': 0.72} 7%|▋ | 719/10000 [1:06:42<14:18:02, 5.55s/it][2025-06-19 14:36:27,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:36:27,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.59 | bwd_microstep: 3320.36 | bwd_inner_microstep: 3319.42 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.22 [2025-06-19 14:36:27,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.59 | bwd: 3320.38 | bwd_inner: 3319.42 | bwd_allreduce: 0.91 | step: 7.23 7%|▋ | 720/10000 [1:06:48<14:14:25, 5.52s/it] {'loss': 0.1384, 'grad_norm': 
0.9383010864257812, 'learning_rate': 3.98152497269771e-05, 'epoch': 0.72} 7%|▋ | 720/10000 [1:06:48<14:14:25, 5.52s/it][2025-06-19 14:36:32,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:36:32,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.03 | bwd_microstep: 3307.73 | bwd_inner_microstep: 3306.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 14:36:32,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.03 | bwd: 3307.74 | bwd_inner: 3306.93 | bwd_allreduce: 0.77 | step: 6.88 7%|▋ | 721/10000 [1:06:53<14:11:53, 5.51s/it] {'loss': 0.1495, 'grad_norm': 0.8408471345901489, 'learning_rate': 3.9814370281503506e-05, 'epoch': 0.72} 7%|▋ | 721/10000 [1:06:53<14:11:53, 5.51s/it][2025-06-19 14:36:38,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:36:38,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.06 | bwd_microstep: 3375.76 | bwd_inner_microstep: 3374.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 14:36:38,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.06 | bwd: 3375.77 | bwd_inner: 3374.97 | bwd_allreduce: 0.76 | step: 7.03 7%|▋ | 722/10000 [1:06:59<14:13:41, 5.52s/it] {'loss': 0.1332, 'grad_norm': 0.6096661686897278, 'learning_rate': 3.981348875759443e-05, 'epoch': 0.72} 7%|▋ | 722/10000 [1:06:59<14:13:41, 5.52s/it][2025-06-19 14:36:44,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 14:36:44,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.89 | bwd_microstep: 3362.25 | bwd_inner_microstep: 3361.48 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.59 [2025-06-19 14:36:44,052] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.89 | bwd: 3362.27 | bwd_inner: 3361.48 | bwd_allreduce: 0.75 | step: 6.59 7%|▋ | 723/10000 [1:07:04<14:14:01, 5.52s/it] {'loss': 0.0896, 'grad_norm': 0.55284583568573, 'learning_rate': 3.981260515534233e-05, 'epoch': 0.72} 7%|▋ | 723/10000 [1:07:04<14:14:01, 5.52s/it][2025-06-19 14:36:49,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:36:49,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.30 | bwd_microstep: 3369.43 | bwd_inner_microstep: 3368.33 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.80 [2025-06-19 14:36:49,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.30 | bwd: 3369.46 | bwd_inner: 3368.33 | bwd_allreduce: 1.05 | step: 7.81 7%|▋ | 724/10000 [1:07:10<14:14:35, 5.53s/it] {'loss': 0.1848, 'grad_norm': 1.295580267906189, 'learning_rate': 3.981171947483992e-05, 'epoch': 0.72} 7%|▋ | 724/10000 [1:07:10<14:14:35, 5.53s/it][2025-06-19 14:36:55,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:36:55,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2171.83 | bwd_microstep: 3369.02 | bwd_inner_microstep: 3367.96 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.53 [2025-06-19 14:36:55,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2171.84 | bwd: 3369.04 | bwd_inner: 3367.97 | bwd_allreduce: 1.00 | step: 7.54 7%|▋ | 725/10000 [1:07:15<14:17:24, 5.55s/it] {'loss': 0.0499, 'grad_norm': 0.3548991084098816, 'learning_rate': 3.981083171618007e-05, 'epoch': 0.72} 7%|▋ | 725/10000 [1:07:15<14:17:24, 5.55s/it][2025-06-19 14:37:00,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.73 [2025-06-19 14:37:00,704] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2150.50 | bwd_microstep: 3328.73 | bwd_inner_microstep: 3327.35 | bwd_allreduce_microstep: 1.24 | step_microstep: 9.48 [2025-06-19 14:37:00,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2150.50 | bwd: 3328.79 | bwd_inner: 3327.35 | bwd_allreduce: 1.31 | step: 9.49 7%|▋ | 726/10000 [1:07:21<14:16:39, 5.54s/it] {'loss': 0.0626, 'grad_norm': 0.5020381212234497, 'learning_rate': 3.9809941879455924e-05, 'epoch': 0.73} 7%|▋ | 726/10000 [1:07:21<14:16:39, 5.54s/it][2025-06-19 14:37:06,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 14:37:06,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.14 | bwd_microstep: 3327.84 | bwd_inner_microstep: 3326.93 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.31 [2025-06-19 14:37:06,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.14 | bwd: 3327.86 | bwd_inner: 3326.93 | bwd_allreduce: 0.87 | step: 8.34 7%|▋ | 727/10000 [1:07:27<14:16:20, 5.54s/it] {'loss': 0.139, 'grad_norm': 0.9419936537742615, 'learning_rate': 3.9809049964760815e-05, 'epoch': 0.73} 7%|▋ | 727/10000 [1:07:27<14:16:20, 5.54s/it][2025-06-19 14:37:11,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:37:11,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.90 | bwd_microstep: 3323.59 | bwd_inner_microstep: 3322.70 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.31 [2025-06-19 14:37:11,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.90 | bwd: 3323.61 | bwd_inner: 3322.70 | bwd_allreduce: 0.84 | step: 7.32 7%|▋ | 728/10000 [1:07:32<14:13:52, 5.53s/it] {'loss': 0.0683, 'grad_norm': 0.6217383742332458, 'learning_rate': 3.9808155972188306e-05, 'epoch': 0.73} 7%|▋ | 728/10000 [1:07:32<14:13:52, 
5.53s/it][2025-06-19 14:37:17,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-19 14:37:17,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.85 | bwd_microstep: 3315.47 | bwd_inner_microstep: 3314.60 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.55 [2025-06-19 14:37:17,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.85 | bwd: 3315.49 | bwd_inner: 3314.60 | bwd_allreduce: 0.83 | step: 7.55 7%|▋ | 729/10000 [1:07:38<14:11:17, 5.51s/it] {'loss': 0.1806, 'grad_norm': 0.634537935256958, 'learning_rate': 3.9807259901832166e-05, 'epoch': 0.73} 7%|▋ | 729/10000 [1:07:38<14:11:17, 5.51s/it][2025-06-19 14:37:22,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:37:22,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.02 | bwd_microstep: 3309.64 | bwd_inner_microstep: 3308.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 14:37:22,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.02 | bwd: 3309.66 | bwd_inner: 3308.85 | bwd_allreduce: 0.77 | step: 6.67 7%|▋ | 730/10000 [1:07:43<14:08:59, 5.50s/it] {'loss': 0.1485, 'grad_norm': 0.6905151605606079, 'learning_rate': 3.980636175378639e-05, 'epoch': 0.73} 7%|▋ | 730/10000 [1:07:43<14:08:59, 5.50s/it][2025-06-19 14:37:28,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:37:28,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.82 | bwd_microstep: 3317.36 | bwd_inner_microstep: 3316.33 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.73 [2025-06-19 14:37:28,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.82 | bwd: 3317.38 | bwd_inner: 3316.33 | bwd_allreduce: 1.00 | 
step: 7.74 7%|▋ | 731/10000 [1:07:48<14:07:19, 5.48s/it] {'loss': 0.1083, 'grad_norm': 0.7480597496032715, 'learning_rate': 3.9805461528145185e-05, 'epoch': 0.73} 7%|▋ | 731/10000 [1:07:48<14:07:19, 5.48s/it][2025-06-19 14:37:33,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 14:37:33,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.99 | bwd_microstep: 3318.33 | bwd_inner_microstep: 3317.40 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.24 [2025-06-19 14:37:33,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.99 | bwd: 3318.35 | bwd_inner: 3317.40 | bwd_allreduce: 0.90 | step: 7.24 7%|▋ | 732/10000 [1:07:54<14:06:58, 5.48s/it] {'loss': 0.1129, 'grad_norm': 0.5167878866195679, 'learning_rate': 3.980455922500299e-05, 'epoch': 0.73} 7%|▋ | 732/10000 [1:07:54<14:06:58, 5.48s/it][2025-06-19 14:37:39,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:37:39,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.70 | bwd_microstep: 3318.38 | bwd_inner_microstep: 3317.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-19 14:37:39,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.70 | bwd: 3318.40 | bwd_inner: 3317.58 | bwd_allreduce: 0.77 | step: 7.01 7%|▋ | 733/10000 [1:07:59<14:06:59, 5.48s/it] {'loss': 0.1445, 'grad_norm': 0.7168350219726562, 'learning_rate': 3.980365484445445e-05, 'epoch': 0.73} 7%|▋ | 733/10000 [1:07:59<14:06:59, 5.48s/it][2025-06-19 14:37:44,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:37:44,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.15 | bwd_microstep: 3366.27 | bwd_inner_microstep: 3365.31 | 
bwd_allreduce_microstep: 0.92 | step_microstep: 7.30 [2025-06-19 14:37:44,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.15 | bwd: 3366.29 | bwd_inner: 3365.31 | bwd_allreduce: 0.94 | step: 7.31 7%|▋ | 734/10000 [1:08:05<14:09:28, 5.50s/it] {'loss': 0.1429, 'grad_norm': 0.7547043561935425, 'learning_rate': 3.9802748386594425e-05, 'epoch': 0.73} 7%|▋ | 734/10000 [1:08:05<14:09:28, 5.50s/it][2025-06-19 14:37:50,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:37:50,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.03 | bwd_microstep: 3366.85 | bwd_inner_microstep: 3366.01 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.53 [2025-06-19 14:37:50,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.03 | bwd: 3366.87 | bwd_inner: 3366.01 | bwd_allreduce: 0.80 | step: 7.53 7%|▋ | 735/10000 [1:08:10<14:11:19, 5.51s/it] {'loss': 0.0573, 'grad_norm': 0.36384448409080505, 'learning_rate': 3.9801839851518e-05, 'epoch': 0.73} 7%|▋ | 735/10000 [1:08:10<14:11:19, 5.51s/it][2025-06-19 14:37:55,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:37:55,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.38 | bwd_microstep: 3325.15 | bwd_inner_microstep: 3324.09 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.71 [2025-06-19 14:37:55,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.38 | bwd: 3325.18 | bwd_inner: 3324.09 | bwd_allreduce: 1.01 | step: 7.71 7%|▋ | 736/10000 [1:08:16<14:10:22, 5.51s/it] {'loss': 0.0737, 'grad_norm': 0.3300948441028595, 'learning_rate': 3.9800929239320485e-05, 'epoch': 0.74} 7%|▋ | 736/10000 [1:08:16<14:10:22, 5.51s/it][2025-06-19 14:38:01,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:38:01,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2187.93 | bwd_microstep: 3384.92 | bwd_inner_microstep: 3384.07 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.19 [2025-06-19 14:38:01,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2187.93 | bwd: 3384.95 | bwd_inner: 3384.07 | bwd_allreduce: 0.81 | step: 7.19 7%|▋ | 737/10000 [1:08:22<14:15:32, 5.54s/it] {'loss': 0.1457, 'grad_norm': 0.5238727331161499, 'learning_rate': 3.980001655009739e-05, 'epoch': 0.74} 7%|▋ | 737/10000 [1:08:22<14:15:32, 5.54s/it][2025-06-19 14:38:06,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 14:38:06,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2171.36 | bwd_microstep: 3383.41 | bwd_inner_microstep: 3382.38 | bwd_allreduce_microstep: 0.96 | step_microstep: 9.53 [2025-06-19 14:38:06,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2171.36 | bwd: 3383.44 | bwd_inner: 3382.38 | bwd_allreduce: 0.99 | step: 9.57 7%|▋ | 738/10000 [1:08:27<14:18:55, 5.56s/it] {'loss': 0.1386, 'grad_norm': 0.709325909614563, 'learning_rate': 3.979910178394445e-05, 'epoch': 0.74} 7%|▋ | 738/10000 [1:08:27<14:18:55, 5.56s/it][2025-06-19 14:38:12,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.72 [2025-06-19 14:38:12,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.72 | bwd_microstep: 3332.38 | bwd_inner_microstep: 3331.31 | bwd_allreduce_microstep: 0.99 | step_microstep: 9.22 [2025-06-19 14:38:12,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.73 | bwd: 3332.41 | bwd_inner: 3331.31 | bwd_allreduce: 1.03 | step: 9.23 7%|▋ | 739/10000 [1:08:33<14:17:56, 5.56s/it] {'loss': 0.0683, 'grad_norm': 0.29630813002586365, 
'learning_rate': 3.979818494095762e-05, 'epoch': 0.74} 7%|▋ | 739/10000 [1:08:33<14:17:56, 5.56s/it][2025-06-19 14:38:17,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:38:17,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.78 | bwd_microstep: 3312.98 | bwd_inner_microstep: 3311.95 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.60 [2025-06-19 14:38:17,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.78 | bwd: 3313.01 | bwd_inner: 3311.95 | bwd_allreduce: 0.99 | step: 7.61 7%|▋ | 740/10000 [1:08:38<14:13:49, 5.53s/it] {'loss': 0.0949, 'grad_norm': 0.5213107466697693, 'learning_rate': 3.979726602123308e-05, 'epoch': 0.74} 7%|▋ | 740/10000 [1:08:38<14:13:49, 5.53s/it][2025-06-19 14:38:23,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 14:38:23,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.83 | bwd_microstep: 3316.12 | bwd_inner_microstep: 3315.01 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.95 [2025-06-19 14:38:23,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.83 | bwd: 3316.14 | bwd_inner: 3315.01 | bwd_allreduce: 1.06 | step: 7.96 7%|▋ | 741/10000 [1:08:44<14:11:56, 5.52s/it] {'loss': 0.1077, 'grad_norm': 0.6796413064002991, 'learning_rate': 3.979634502486722e-05, 'epoch': 0.74} 7%|▋ | 741/10000 [1:08:44<14:11:56, 5.52s/it][2025-06-19 14:38:28,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:38:28,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.56 | bwd_microstep: 3310.47 | bwd_inner_microstep: 3309.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 14:38:28,883] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2099.56 | bwd: 3310.48 | bwd_inner: 3309.68 | bwd_allreduce: 0.76 | step: 6.72 7%|▋ | 742/10000 [1:08:49<14:08:59, 5.50s/it] {'loss': 0.1318, 'grad_norm': 0.7546108961105347, 'learning_rate': 3.9795421951956646e-05, 'epoch': 0.74} 7%|▋ | 742/10000 [1:08:49<14:08:59, 5.50s/it][2025-06-19 14:38:34,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:38:34,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.51 | bwd_microstep: 3315.27 | bwd_inner_microstep: 3314.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-19 14:38:34,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.51 | bwd: 3315.29 | bwd_inner: 3314.47 | bwd_allreduce: 0.77 | step: 6.92 7%|▋ | 743/10000 [1:08:55<14:06:34, 5.49s/it] {'loss': 0.1967, 'grad_norm': 0.8277217149734497, 'learning_rate': 3.979449680259818e-05, 'epoch': 0.74} 7%|▋ | 743/10000 [1:08:55<14:06:34, 5.49s/it][2025-06-19 14:38:39,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:38:39,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.04 | bwd_microstep: 3319.60 | bwd_inner_microstep: 3318.71 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.98 [2025-06-19 14:38:39,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.04 | bwd: 3319.62 | bwd_inner: 3318.71 | bwd_allreduce: 0.86 | step: 6.99 7%|▋ | 744/10000 [1:09:00<14:05:16, 5.48s/it] {'loss': 0.0827, 'grad_norm': 0.6151697039604187, 'learning_rate': 3.979356957688886e-05, 'epoch': 0.74} 7%|▋ | 744/10000 [1:09:00<14:05:16, 5.48s/it][2025-06-19 14:38:45,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 14:38:45,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2101.51 | bwd_microstep: 3317.60 | bwd_inner_microstep: 3316.31 | bwd_allreduce_microstep: 1.21 | step_microstep: 8.79 [2025-06-19 14:38:45,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.51 | bwd: 3317.62 | bwd_inner: 3316.31 | bwd_allreduce: 1.25 | step: 8.80 7%|▋ | 745/10000 [1:09:06<14:04:46, 5.48s/it] {'loss': 0.1081, 'grad_norm': 0.5382412075996399, 'learning_rate': 3.979264027492597e-05, 'epoch': 0.74} 7%|▋ | 745/10000 [1:09:06<14:04:46, 5.48s/it][2025-06-19 14:38:50,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:38:50,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.93 | bwd_microstep: 3316.73 | bwd_inner_microstep: 3315.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 14:38:50,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.93 | bwd: 3316.74 | bwd_inner: 3315.93 | bwd_allreduce: 0.77 | step: 6.88 7%|▋ | 746/10000 [1:09:11<14:04:41, 5.48s/it] {'loss': 0.0647, 'grad_norm': 0.30278560519218445, 'learning_rate': 3.9791708896806964e-05, 'epoch': 0.75} 7%|▋ | 746/10000 [1:09:11<14:04:41, 5.48s/it][2025-06-19 14:38:56,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.72 [2025-06-19 14:38:56,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.81 | bwd_microstep: 3322.29 | bwd_inner_microstep: 3320.98 | bwd_allreduce_microstep: 1.19 | step_microstep: 9.40 [2025-06-19 14:38:56,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.81 | bwd: 3322.33 | bwd_inner: 3320.98 | bwd_allreduce: 1.25 | step: 9.41 7%|▋ | 747/10000 [1:09:17<14:04:49, 5.48s/it] {'loss': 0.1351, 'grad_norm': 0.6789901256561279, 'learning_rate': 3.979077544262955e-05, 'epoch': 0.75} 7%|▋ | 747/10000 [1:09:17<14:04:49, 5.48s/it][2025-06-19 14:39:01,822] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-19 14:39:01,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2170.74 | bwd_microstep: 3383.44 | bwd_inner_microstep: 3382.22 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.91 [2025-06-19 14:39:01,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2170.74 | bwd: 3383.48 | bwd_inner: 3382.22 | bwd_allreduce: 1.15 | step: 8.92 7%|▋ | 748/10000 [1:09:22<14:10:55, 5.52s/it] {'loss': 0.0694, 'grad_norm': 0.3926640450954437, 'learning_rate': 3.978983991249165e-05, 'epoch': 0.75} 7%|▋ | 748/10000 [1:09:22<14:10:55, 5.52s/it][2025-06-19 14:39:07,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 14:39:07,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.19 | bwd_microstep: 3328.42 | bwd_inner_microstep: 3326.94 | bwd_allreduce_microstep: 1.34 | step_microstep: 9.79 [2025-06-19 14:39:07,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.19 | bwd: 3328.46 | bwd_inner: 3326.94 | bwd_allreduce: 1.41 | step: 9.80 7%|▋ | 749/10000 [1:09:28<14:12:07, 5.53s/it] {'loss': 0.0983, 'grad_norm': 0.6003742218017578, 'learning_rate': 3.978890230649139e-05, 'epoch': 0.75} 7%|▋ | 749/10000 [1:09:28<14:12:07, 5.53s/it][2025-06-19 14:39:12,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.76 | optimizer_step: 2.73 [2025-06-19 14:39:12,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2177.17 | bwd_microstep: 3340.04 | bwd_inner_microstep: 3338.89 | bwd_allreduce_microstep: 1.06 | step_microstep: 10.62 [2025-06-19 14:39:12,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2177.17 | bwd: 3340.06 | bwd_inner: 3338.89 | bwd_allreduce: 1.10 | step: 10.62 8%|▊ | 750/10000 
[1:09:33<14:14:16, 5.54s/it] {'loss': 0.1309, 'grad_norm': 0.76543128490448, 'learning_rate': 3.9787962624727126e-05, 'epoch': 0.75} 8%|▊ | 750/10000 [1:09:33<14:14:16, 5.54s/it][2025-06-19 14:39:18,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:39:18,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2150.25 | bwd_microstep: 3334.72 | bwd_inner_microstep: 3333.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 14:39:18,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2150.24 | bwd: 3334.73 | bwd_inner: 3333.94 | bwd_allreduce: 0.76 | step: 6.60 8%|▊ | 751/10000 [1:09:39<14:14:12, 5.54s/it] {'loss': 0.1381, 'grad_norm': 0.6480944752693176, 'learning_rate': 3.978702086729741e-05, 'epoch': 0.75} 8%|▊ | 751/10000 [1:09:39<14:14:12, 5.54s/it][2025-06-19 14:39:23,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 14:39:23,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.37 | bwd_microstep: 3337.20 | bwd_inner_microstep: 3336.17 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.10 [2025-06-19 14:39:23,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.37 | bwd: 3337.22 | bwd_inner: 3336.17 | bwd_allreduce: 1.00 | step: 7.11 8%|▊ | 752/10000 [1:09:44<14:12:08, 5.53s/it] {'loss': 0.0776, 'grad_norm': 0.367703914642334, 'learning_rate': 3.978607703430105e-05, 'epoch': 0.75} 8%|▊ | 752/10000 [1:09:44<14:12:08, 5.53s/it][2025-06-19 14:39:29,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:39:29,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.18 | bwd_microstep: 3374.52 | bwd_inner_microstep: 3373.71 | bwd_allreduce_microstep: 0.76 | 
step_microstep: 6.65 [2025-06-19 14:39:29,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.18 | bwd: 3374.54 | bwd_inner: 3373.71 | bwd_allreduce: 0.78 | step: 6.66 8%|▊ | 753/10000 [1:09:50<14:13:10, 5.54s/it] {'loss': 0.1059, 'grad_norm': 0.6976423263549805, 'learning_rate': 3.9785131125837035e-05, 'epoch': 0.75} 8%|▊ | 753/10000 [1:09:50<14:13:10, 5.54s/it][2025-06-19 14:39:35,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:39:35,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.42 | bwd_microstep: 3324.25 | bwd_inner_microstep: 3323.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 14:39:35,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.42 | bwd: 3324.26 | bwd_inner: 3323.45 | bwd_allreduce: 0.77 | step: 6.94 8%|▊ | 754/10000 [1:09:55<14:11:28, 5.53s/it] {'loss': 0.121, 'grad_norm': 0.7236694097518921, 'learning_rate': 3.978418314200459e-05, 'epoch': 0.75} 8%|▊ | 754/10000 [1:09:55<14:11:28, 5.53s/it][2025-06-19 14:39:40,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.74 [2025-06-19 14:39:40,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.63 | bwd_microstep: 3326.44 | bwd_inner_microstep: 3325.64 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 14:39:40,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.64 | bwd: 3326.46 | bwd_inner: 3325.64 | bwd_allreduce: 0.78 | step: 6.72 8%|▊ | 755/10000 [1:10:01<14:08:35, 5.51s/it] {'loss': 0.0725, 'grad_norm': 0.5854897499084473, 'learning_rate': 3.978323308290316e-05, 'epoch': 0.76} 8%|▊ | 755/10000 [1:10:01<14:08:35, 5.51s/it][2025-06-19 14:39:46,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 
[2025-06-19 14:39:46,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.29 | bwd_microstep: 3394.23 | bwd_inner_microstep: 3393.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 14:39:46,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.29 | bwd: 3394.24 | bwd_inner: 3393.45 | bwd_allreduce: 0.75 | step: 6.54 8%|▊ | 756/10000 [1:10:06<14:11:37, 5.53s/it] {'loss': 0.0903, 'grad_norm': 0.4940344989299774, 'learning_rate': 3.978228094863239e-05, 'epoch': 0.76} 8%|▊ | 756/10000 [1:10:06<14:11:37, 5.53s/it][2025-06-19 14:39:51,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:39:51,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.45 | bwd_microstep: 3320.56 | bwd_inner_microstep: 3319.60 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.03 [2025-06-19 14:39:51,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.45 | bwd: 3320.58 | bwd_inner: 3319.60 | bwd_allreduce: 0.92 | step: 7.03 8%|▊ | 757/10000 [1:10:12<14:08:45, 5.51s/it] {'loss': 0.0837, 'grad_norm': 0.3667879104614258, 'learning_rate': 3.978132673929216e-05, 'epoch': 0.76} 8%|▊ | 757/10000 [1:10:12<14:08:45, 5.51s/it][2025-06-19 14:39:57,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:39:57,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.70 | bwd_microstep: 3365.07 | bwd_inner_microstep: 3364.18 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.25 [2025-06-19 14:39:57,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.70 | bwd: 3365.08 | bwd_inner: 3364.18 | bwd_allreduce: 0.85 | step: 7.26 8%|▊ | 758/10000 [1:10:17<14:10:19, 5.52s/it] {'loss': 0.1191, 'grad_norm': 0.4948112666606903, 'learning_rate': 3.978037045498257e-05, 'epoch': 0.76} 8%|▊ | 
758/10000 [1:10:17<14:10:19, 5.52s/it][2025-06-19 14:40:02,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:40:02,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.91 | bwd_microstep: 3317.34 | bwd_inner_microstep: 3316.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 14:40:02,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.91 | bwd: 3317.35 | bwd_inner: 3316.56 | bwd_allreduce: 0.75 | step: 6.58 8%|▊ | 759/10000 [1:10:23<14:08:06, 5.51s/it] {'loss': 0.1403, 'grad_norm': 0.6984390020370483, 'learning_rate': 3.977941209580391e-05, 'epoch': 0.76} 8%|▊ | 759/10000 [1:10:23<14:08:06, 5.51s/it][2025-06-19 14:40:08,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:40:08,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.04 | bwd_microstep: 3328.99 | bwd_inner_microstep: 3328.14 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.07 [2025-06-19 14:40:08,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.04 | bwd: 3329.02 | bwd_inner: 3328.14 | bwd_allreduce: 0.82 | step: 7.07 8%|▊ | 760/10000 [1:10:28<14:06:53, 5.50s/it] {'loss': 0.1641, 'grad_norm': 0.6959852576255798, 'learning_rate': 3.977845166185673e-05, 'epoch': 0.76} 8%|▊ | 760/10000 [1:10:28<14:06:53, 5.50s/it][2025-06-19 14:40:13,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:40:13,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.79 | bwd_microstep: 3327.81 | bwd_inner_microstep: 3326.98 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.95 [2025-06-19 14:40:13,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.79 | bwd: 3327.84 | bwd_inner: 3326.98 
| bwd_allreduce: 0.80 | step: 6.95 8%|▊ | 761/10000 [1:10:34<14:06:48, 5.50s/it] {'loss': 0.1342, 'grad_norm': 0.7092061042785645, 'learning_rate': 3.9777489153241765e-05, 'epoch': 0.76} 8%|▊ | 761/10000 [1:10:34<14:06:48, 5.50s/it][2025-06-19 14:40:19,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 14:40:19,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.14 | bwd_microstep: 3344.14 | bwd_inner_microstep: 3343.13 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.86 [2025-06-19 14:40:19,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.14 | bwd: 3344.16 | bwd_inner: 3343.13 | bwd_allreduce: 0.96 | step: 7.87 8%|▊ | 762/10000 [1:10:39<14:08:04, 5.51s/it] {'loss': 0.0719, 'grad_norm': 0.34626439213752747, 'learning_rate': 3.977652457005998e-05, 'epoch': 0.76} 8%|▊ | 762/10000 [1:10:39<14:08:04, 5.51s/it][2025-06-19 14:40:24,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.72 [2025-06-19 14:40:24,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2179.71 | bwd_microstep: 3347.61 | bwd_inner_microstep: 3346.74 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.49 [2025-06-19 14:40:24,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2179.71 | bwd: 3347.64 | bwd_inner: 3346.74 | bwd_allreduce: 0.84 | step: 7.49 8%|▊ | 763/10000 [1:10:45<14:11:06, 5.53s/it] {'loss': 0.115, 'grad_norm': 0.6207411289215088, 'learning_rate': 3.977555791241255e-05, 'epoch': 0.76} 8%|▊ | 763/10000 [1:10:45<14:11:06, 5.53s/it][2025-06-19 14:40:30,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:40:30,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2167.09 | bwd_microstep: 3381.87 | 
bwd_inner_microstep: 3381.01 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.89 [2025-06-19 14:40:30,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2167.09 | bwd: 3381.89 | bwd_inner: 3381.01 | bwd_allreduce: 0.82 | step: 6.90 8%|▊ | 764/10000 [1:10:51<14:14:05, 5.55s/it] {'loss': 0.1118, 'grad_norm': 0.6843888163566589, 'learning_rate': 3.9774589180400874e-05, 'epoch': 0.76} 8%|▊ | 764/10000 [1:10:51<14:14:05, 5.55s/it][2025-06-19 14:40:35,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 14:40:35,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.26 | bwd_microstep: 3324.25 | bwd_inner_microstep: 3323.48 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.47 [2025-06-19 14:40:35,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.26 | bwd: 3324.27 | bwd_inner: 3323.48 | bwd_allreduce: 0.75 | step: 6.47 8%|▊ | 765/10000 [1:10:56<14:10:36, 5.53s/it] {'loss': 0.1257, 'grad_norm': 0.9832025766372681, 'learning_rate': 3.977361837412657e-05, 'epoch': 0.77} 8%|▊ | 765/10000 [1:10:56<14:10:36, 5.53s/it][2025-06-19 14:40:41,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-19 14:40:41,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.01 | bwd_microstep: 3332.81 | bwd_inner_microstep: 3331.73 | bwd_allreduce_microstep: 1.01 | step_microstep: 8.12 [2025-06-19 14:40:41,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.01 | bwd: 3332.83 | bwd_inner: 3331.73 | bwd_allreduce: 1.04 | step: 8.14 8%|▊ | 766/10000 [1:11:02<14:08:56, 5.52s/it] {'loss': 0.1672, 'grad_norm': 1.2187511920928955, 'learning_rate': 3.9772645493691475e-05, 'epoch': 0.77} 8%|▊ | 766/10000 [1:11:02<14:08:56, 5.52s/it][2025-06-19 14:40:46,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:40:46,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.15 | bwd_microstep: 3329.89 | bwd_inner_microstep: 3329.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 14:40:46,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.15 | bwd: 3329.90 | bwd_inner: 3329.11 | bwd_allreduce: 0.76 | step: 6.69 8%|▊ | 767/10000 [1:11:07<14:07:39, 5.51s/it] {'loss': 0.155, 'grad_norm': 0.751334011554718, 'learning_rate': 3.9771670539197636e-05, 'epoch': 0.77} 8%|▊ | 767/10000 [1:11:07<14:07:39, 5.51s/it][2025-06-19 14:40:52,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:40:52,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.30 | bwd_microstep: 3331.14 | bwd_inner_microstep: 3330.13 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.20 [2025-06-19 14:40:52,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.30 | bwd: 3331.17 | bwd_inner: 3330.13 | bwd_allreduce: 0.97 | step: 7.20 8%|▊ | 768/10000 [1:11:13<14:07:10, 5.51s/it] {'loss': 0.0785, 'grad_norm': 0.4629436135292053, 'learning_rate': 3.977069351074732e-05, 'epoch': 0.77} 8%|▊ | 768/10000 [1:11:13<14:07:10, 5.51s/it][2025-06-19 14:40:57,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:40:57,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.48 | bwd_microstep: 3314.00 | bwd_inner_microstep: 3313.16 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.74 [2025-06-19 14:40:57,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.48 | bwd: 3314.02 | bwd_inner: 3313.16 | bwd_allreduce: 0.80 | step: 6.75 8%|▊ | 769/10000 [1:11:18<14:04:52, 5.49s/it] {'loss': 0.1211, 'grad_norm': 
0.7273602485656738, 'learning_rate': 3.9769714408443005e-05, 'epoch': 0.77} 8%|▊ | 769/10000 [1:11:18<14:04:52, 5.49s/it][2025-06-19 14:41:03,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:41:03,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.27 | bwd_microstep: 3374.81 | bwd_inner_microstep: 3373.83 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.10 [2025-06-19 14:41:03,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.27 | bwd: 3374.83 | bwd_inner: 3373.83 | bwd_allreduce: 0.95 | step: 7.10 8%|▊ | 770/10000 [1:11:24<14:07:23, 5.51s/it] {'loss': 0.0721, 'grad_norm': 0.43169522285461426, 'learning_rate': 3.9768733232387414e-05, 'epoch': 0.77} 8%|▊ | 770/10000 [1:11:24<14:07:23, 5.51s/it][2025-06-19 14:41:08,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 14:41:08,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.17 | bwd_microstep: 3328.67 | bwd_inner_microstep: 3327.89 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.49 [2025-06-19 14:41:08,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.17 | bwd: 3328.68 | bwd_inner: 3327.89 | bwd_allreduce: 0.75 | step: 6.49 8%|▊ | 771/10000 [1:11:29<14:05:57, 5.50s/it] {'loss': 0.0929, 'grad_norm': 0.5197500586509705, 'learning_rate': 3.976774998268345e-05, 'epoch': 0.77} 8%|▊ | 771/10000 [1:11:29<14:05:57, 5.50s/it][2025-06-19 14:41:14,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 14:41:14,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.26 | bwd_microstep: 3337.81 | bwd_inner_microstep: 3336.73 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.63 [2025-06-19 14:41:14,207] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.26 | bwd: 3337.84 | bwd_inner: 3336.73 | bwd_allreduce: 1.03 | step: 7.64 8%|▊ | 772/10000 [1:11:35<14:06:11, 5.50s/it] {'loss': 0.0597, 'grad_norm': 0.38957664370536804, 'learning_rate': 3.976676465943426e-05, 'epoch': 0.77} 8%|▊ | 772/10000 [1:11:35<14:06:11, 5.50s/it][2025-06-19 14:41:19,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 14:41:19,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.19 | bwd_microstep: 3342.60 | bwd_inner_microstep: 3341.76 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.02 [2025-06-19 14:41:19,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.19 | bwd: 3342.62 | bwd_inner: 3341.76 | bwd_allreduce: 0.81 | step: 7.02 8%|▊ | 773/10000 [1:11:40<14:07:41, 5.51s/it] {'loss': 0.1412, 'grad_norm': 0.8601365685462952, 'learning_rate': 3.976577726274319e-05, 'epoch': 0.77} 8%|▊ | 773/10000 [1:11:40<14:07:41, 5.51s/it][2025-06-19 14:41:25,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.75 | optimizer_step: 2.72 [2025-06-19 14:41:25,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2172.14 | bwd_microstep: 3426.03 | bwd_inner_microstep: 3424.97 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.27 [2025-06-19 14:41:25,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2172.14 | bwd: 3426.06 | bwd_inner: 3424.97 | bwd_allreduce: 1.02 | step: 7.26 8%|▊ | 774/10000 [1:11:46<14:13:43, 5.55s/it] {'loss': 0.1433, 'grad_norm': 1.3793449401855469, 'learning_rate': 3.976478779271383e-05, 'epoch': 0.77} 8%|▊ | 774/10000 [1:11:46<14:13:43, 5.55s/it][2025-06-19 14:41:30,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:41:30,955] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.66 | bwd_microstep: 3372.99 | bwd_inner_microstep: 3372.16 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.17 [2025-06-19 14:41:30,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.66 | bwd: 3373.00 | bwd_inner: 3372.16 | bwd_allreduce: 0.80 | step: 7.17 8%|▊ | 775/10000 [1:11:51<14:14:11, 5.56s/it] {'loss': 0.0809, 'grad_norm': 0.4909508526325226, 'learning_rate': 3.976379624944996e-05, 'epoch': 0.78} 8%|▊ | 775/10000 [1:11:51<14:14:11, 5.56s/it][2025-06-19 14:41:36,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:41:36,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.15 | bwd_microstep: 3378.26 | bwd_inner_microstep: 3377.36 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.59 [2025-06-19 14:41:36,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.15 | bwd: 3378.28 | bwd_inner: 3377.36 | bwd_allreduce: 0.88 | step: 6.59 8%|▊ | 776/10000 [1:11:57<14:13:53, 5.55s/it] {'loss': 0.1178, 'grad_norm': 0.7664084434509277, 'learning_rate': 3.976280263305559e-05, 'epoch': 0.78} 8%|▊ | 776/10000 [1:11:57<14:13:53, 5.55s/it][2025-06-19 14:41:42,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:41:42,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.83 | bwd_microstep: 3324.55 | bwd_inner_microstep: 3323.44 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.99 [2025-06-19 14:41:42,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.84 | bwd: 3324.58 | bwd_inner: 3323.44 | bwd_allreduce: 1.07 | step: 8.00 8%|▊ | 777/10000 [1:12:02<14:11:39, 5.54s/it] {'loss': 0.1595, 'grad_norm': 1.2498611211776733, 'learning_rate': 3.9761806943634947e-05, 'epoch': 0.78} 8%|▊ | 777/10000 [1:12:02<14:11:39, 
5.54s/it][2025-06-19 14:41:47,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:41:47,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.77 | bwd_microstep: 3344.74 | bwd_inner_microstep: 3343.83 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.34 [2025-06-19 14:41:47,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.77 | bwd: 3344.77 | bwd_inner: 3343.83 | bwd_allreduce: 0.87 | step: 7.34 8%|▊ | 778/10000 [1:12:08<14:10:54, 5.54s/it] {'loss': 0.2399, 'grad_norm': 1.146748423576355, 'learning_rate': 3.9760809181292466e-05, 'epoch': 0.78} 8%|▊ | 778/10000 [1:12:08<14:10:54, 5.54s/it][2025-06-19 14:41:53,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:41:53,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.73 | bwd_microstep: 3323.47 | bwd_inner_microstep: 3322.54 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.75 [2025-06-19 14:41:53,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.72 | bwd: 3323.50 | bwd_inner: 3322.54 | bwd_allreduce: 0.88 | step: 7.76 8%|▊ | 779/10000 [1:12:13<14:10:02, 5.53s/it] {'loss': 0.1778, 'grad_norm': 1.3699699640274048, 'learning_rate': 3.975980934613282e-05, 'epoch': 0.78} 8%|▊ | 779/10000 [1:12:13<14:10:02, 5.53s/it][2025-06-19 14:41:58,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 14:41:58,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2172.51 | bwd_microstep: 3375.58 | bwd_inner_microstep: 3374.51 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.57 [2025-06-19 14:41:58,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2172.51 | bwd: 3375.61 | bwd_inner: 3374.51 | bwd_allreduce: 1.03 | 
step: 7.57 8%|▊ | 780/10000 [1:12:19<14:12:55, 5.55s/it] {'loss': 0.0668, 'grad_norm': 0.5118109583854675, 'learning_rate': 3.975880743826088e-05, 'epoch': 0.78} 8%|▊ | 780/10000 [1:12:19<14:12:55, 5.55s/it][2025-06-19 14:42:04,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:42:04,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.01 | bwd_microstep: 3381.51 | bwd_inner_microstep: 3380.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 14:42:04,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.01 | bwd: 3381.52 | bwd_inner: 3380.71 | bwd_allreduce: 0.77 | step: 6.97 8%|▊ | 781/10000 [1:12:25<14:13:28, 5.55s/it] {'loss': 0.1018, 'grad_norm': 0.5515004992485046, 'learning_rate': 3.975780345778175e-05, 'epoch': 0.78} 8%|▊ | 781/10000 [1:12:25<14:13:28, 5.55s/it][2025-06-19 14:42:09,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:42:09,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.64 | bwd_microstep: 3332.81 | bwd_inner_microstep: 3331.90 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.24 [2025-06-19 14:42:09,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.64 | bwd: 3332.84 | bwd_inner: 3331.90 | bwd_allreduce: 0.87 | step: 7.23 8%|▊ | 782/10000 [1:12:30<14:10:23, 5.54s/it] {'loss': 0.2579, 'grad_norm': 1.145370602607727, 'learning_rate': 3.975679740480073e-05, 'epoch': 0.78} 8%|▊ | 782/10000 [1:12:30<14:10:23, 5.54s/it][2025-06-19 14:42:15,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:42:15,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.07 | bwd_microstep: 3330.06 | bwd_inner_microstep: 3329.26 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 7.21 [2025-06-19 14:42:15,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.07 | bwd: 3330.08 | bwd_inner: 3329.26 | bwd_allreduce: 0.77 | step: 7.21 8%|▊ | 783/10000 [1:12:36<14:08:17, 5.52s/it] {'loss': 0.1117, 'grad_norm': 0.6611597537994385, 'learning_rate': 3.9755789279423355e-05, 'epoch': 0.78} 8%|▊ | 783/10000 [1:12:36<14:08:17, 5.52s/it][2025-06-19 14:42:20,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 14:42:20,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.27 | bwd_microstep: 3326.48 | bwd_inner_microstep: 3325.55 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.90 [2025-06-19 14:42:20,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.27 | bwd: 3326.49 | bwd_inner: 3325.55 | bwd_allreduce: 0.90 | step: 6.90 8%|▊ | 784/10000 [1:12:41<14:06:06, 5.51s/it] {'loss': 0.1192, 'grad_norm': 0.5498432517051697, 'learning_rate': 3.975477908175537e-05, 'epoch': 0.78} 8%|▊ | 784/10000 [1:12:41<14:06:06, 5.51s/it][2025-06-19 14:42:26,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:42:26,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.70 | bwd_microstep: 3323.73 | bwd_inner_microstep: 3322.89 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.77 [2025-06-19 14:42:26,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.70 | bwd: 3323.75 | bwd_inner: 3322.89 | bwd_allreduce: 0.82 | step: 6.77 8%|▊ | 785/10000 [1:12:46<14:04:25, 5.50s/it] {'loss': 0.0658, 'grad_norm': 0.3923082649707794, 'learning_rate': 3.9753766811902756e-05, 'epoch': 0.79} 8%|▊ | 785/10000 [1:12:46<14:04:25, 5.50s/it][2025-06-19 14:42:31,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:42:31,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.38 | bwd_microstep: 3375.96 | bwd_inner_microstep: 3375.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 14:42:31,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.38 | bwd: 3375.97 | bwd_inner: 3375.15 | bwd_allreduce: 0.78 | step: 7.21 8%|▊ | 786/10000 [1:12:52<14:06:50, 5.51s/it] {'loss': 0.1362, 'grad_norm': 0.5724232196807861, 'learning_rate': 3.9752752469971676e-05, 'epoch': 0.79} 8%|▊ | 786/10000 [1:12:52<14:06:50, 5.51s/it][2025-06-19 14:42:37,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:42:37,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.13 | bwd_microstep: 3371.50 | bwd_inner_microstep: 3370.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 14:42:37,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.13 | bwd: 3371.51 | bwd_inner: 3370.69 | bwd_allreduce: 0.78 | step: 7.13 8%|▊ | 787/10000 [1:12:58<14:08:23, 5.53s/it] {'loss': 0.0931, 'grad_norm': 0.7216118574142456, 'learning_rate': 3.975173605606854e-05, 'epoch': 0.79} 8%|▊ | 787/10000 [1:12:58<14:08:23, 5.53s/it][2025-06-19 14:42:42,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:42:42,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.96 | bwd_microstep: 3330.30 | bwd_inner_microstep: 3329.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.59 [2025-06-19 14:42:42,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.96 | bwd: 3330.31 | bwd_inner: 3329.51 | bwd_allreduce: 0.76 | step: 6.59 8%|▊ | 788/10000 [1:13:03<14:05:58, 5.51s/it] {'loss': 0.0799, 'grad_norm': 0.4354689121246338, 
'learning_rate': 3.9750717570299964e-05, 'epoch': 0.79} 8%|▊ | 788/10000 [1:13:03<14:05:58, 5.51s/it][2025-06-19 14:42:48,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:42:48,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.69 | bwd_microstep: 3332.88 | bwd_inner_microstep: 3332.00 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.81 [2025-06-19 14:42:48,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.69 | bwd: 3332.91 | bwd_inner: 3332.00 | bwd_allreduce: 0.84 | step: 7.81 8%|▊ | 789/10000 [1:13:09<14:04:46, 5.50s/it] {'loss': 0.0825, 'grad_norm': 0.4284481108188629, 'learning_rate': 3.9749697012772784e-05, 'epoch': 0.79} 8%|▊ | 789/10000 [1:13:09<14:04:46, 5.50s/it][2025-06-19 14:42:53,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:42:53,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.89 | bwd_microstep: 3323.03 | bwd_inner_microstep: 3322.08 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.47 [2025-06-19 14:42:53,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.89 | bwd: 3323.06 | bwd_inner: 3322.08 | bwd_allreduce: 0.90 | step: 7.48 8%|▊ | 790/10000 [1:13:14<14:04:55, 5.50s/it] {'loss': 0.1743, 'grad_norm': 1.0043925046920776, 'learning_rate': 3.974867438359404e-05, 'epoch': 0.79} 8%|▊ | 790/10000 [1:13:14<14:04:55, 5.50s/it][2025-06-19 14:42:59,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 14:42:59,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.33 | bwd_microstep: 3327.22 | bwd_inner_microstep: 3326.15 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.51 [2025-06-19 14:42:59,235] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2136.33 | bwd: 3327.25 | bwd_inner: 3326.15 | bwd_allreduce: 1.03 | step: 7.52 8%|▊ | 791/10000 [1:13:20<14:05:10, 5.51s/it] {'loss': 0.0734, 'grad_norm': 0.3298061490058899, 'learning_rate': 3.974764968287101e-05, 'epoch': 0.79} 8%|▊ | 791/10000 [1:13:20<14:05:10, 5.51s/it][2025-06-19 14:43:04,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.71 | optimizer_step: 2.73 [2025-06-19 14:43:04,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.37 | bwd_microstep: 3335.87 | bwd_inner_microstep: 3334.41 | bwd_allreduce_microstep: 1.37 | step_microstep: 10.56 [2025-06-19 14:43:04,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.37 | bwd: 3335.91 | bwd_inner: 3334.41 | bwd_allreduce: 1.42 | step: 10.62 8%|▊ | 792/10000 [1:13:25<14:06:56, 5.52s/it] {'loss': 0.1201, 'grad_norm': 0.6789962649345398, 'learning_rate': 3.974662291071118e-05, 'epoch': 0.79} 8%|▊ | 792/10000 [1:13:25<14:06:56, 5.52s/it][2025-06-19 14:43:10,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 14:43:10,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2154.15 | bwd_microstep: 3370.78 | bwd_inner_microstep: 3369.67 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.74 [2025-06-19 14:43:10,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2154.15 | bwd: 3370.80 | bwd_inner: 3369.67 | bwd_allreduce: 1.07 | step: 7.74 8%|▊ | 793/10000 [1:13:31<14:10:00, 5.54s/it] {'loss': 0.0622, 'grad_norm': 0.5240780115127563, 'learning_rate': 3.974559406722226e-05, 'epoch': 0.79} 8%|▊ | 793/10000 [1:13:31<14:10:00, 5.54s/it][2025-06-19 14:43:15,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:43:15,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2130.86 | bwd_microstep: 3369.45 | bwd_inner_microstep: 3368.60 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.89 [2025-06-19 14:43:15,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.86 | bwd: 3369.46 | bwd_inner: 3368.60 | bwd_allreduce: 0.82 | step: 6.90 8%|▊ | 794/10000 [1:13:36<14:10:09, 5.54s/it] {'loss': 0.2036, 'grad_norm': 1.2260699272155762, 'learning_rate': 3.9744563152512156e-05, 'epoch': 0.79} 8%|▊ | 794/10000 [1:13:36<14:10:09, 5.54s/it][2025-06-19 14:43:21,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:43:21,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.30 | bwd_microstep: 3321.27 | bwd_inner_microstep: 3320.29 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.57 [2025-06-19 14:43:21,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.30 | bwd: 3321.28 | bwd_inner: 3320.29 | bwd_allreduce: 0.95 | step: 7.57 8%|▊ | 795/10000 [1:13:42<14:07:44, 5.53s/it] {'loss': 0.1056, 'grad_norm': 0.488321453332901, 'learning_rate': 3.974353016668902e-05, 'epoch': 0.8} 8%|▊ | 795/10000 [1:13:42<14:07:44, 5.53s/it][2025-06-19 14:43:26,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 14:43:26,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.57 | bwd_microstep: 3316.93 | bwd_inner_microstep: 3316.11 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.51 [2025-06-19 14:43:26,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.57 | bwd: 3316.95 | bwd_inner: 3316.11 | bwd_allreduce: 0.80 | step: 7.52 8%|▊ | 796/10000 [1:13:47<14:05:11, 5.51s/it] {'loss': 0.1352, 'grad_norm': 0.9080814719200134, 'learning_rate': 3.97424951098612e-05, 'epoch': 0.8} 8%|▊ | 796/10000 [1:13:47<14:05:11, 5.51s/it][2025-06-19 14:43:32,358] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:43:32,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.30 | bwd_microstep: 3324.94 | bwd_inner_microstep: 3324.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 14:43:32,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.30 | bwd: 3324.96 | bwd_inner: 3324.13 | bwd_allreduce: 0.78 | step: 7.23 8%|▊ | 797/10000 [1:13:53<14:03:46, 5.50s/it] {'loss': 0.1528, 'grad_norm': 0.9225308895111084, 'learning_rate': 3.974145798213727e-05, 'epoch': 0.8} 8%|▊ | 797/10000 [1:13:53<14:03:46, 5.50s/it][2025-06-19 14:43:37,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:43:37,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.51 | bwd_microstep: 3328.91 | bwd_inner_microstep: 3327.98 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.40 [2025-06-19 14:43:37,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.51 | bwd: 3328.92 | bwd_inner: 3327.98 | bwd_allreduce: 0.90 | step: 7.40 8%|▊ | 798/10000 [1:13:58<14:02:26, 5.49s/it] {'loss': 0.1239, 'grad_norm': 1.1158043146133423, 'learning_rate': 3.9740418783626025e-05, 'epoch': 0.8} 8%|▊ | 798/10000 [1:13:58<14:02:26, 5.49s/it][2025-06-19 14:43:43,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 14:43:43,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.70 | bwd_microstep: 3370.75 | bwd_inner_microstep: 3369.78 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.09 [2025-06-19 14:43:43,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.70 | bwd: 3370.77 | bwd_inner: 3369.78 | bwd_allreduce: 0.93 | step: 7.10 8%|▊ | 799/10000 [1:14:04<14:04:28, 
5.51s/it] {'loss': 0.1745, 'grad_norm': 0.9956660866737366, 'learning_rate': 3.9739377514436467e-05, 'epoch': 0.8} 8%|▊ | 799/10000 [1:14:04<14:04:28, 5.51s/it][2025-06-19 14:43:48,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:43:48,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.25 | bwd_microstep: 3327.89 | bwd_inner_microstep: 3327.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 14:43:48,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.25 | bwd: 3327.91 | bwd_inner: 3327.08 | bwd_allreduce: 0.78 | step: 7.15 8%|▊ | 800/10000 [1:14:09<14:03:18, 5.50s/it] {'loss': 0.1111, 'grad_norm': 0.6294748783111572, 'learning_rate': 3.9738334174677816e-05, 'epoch': 0.8} 8%|▊ | 800/10000 [1:14:09<14:03:18, 5.50s/it][2025-06-19 14:43:54,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 14:43:54,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.33 | bwd_microstep: 3342.13 | bwd_inner_microstep: 3341.08 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.52 [2025-06-19 14:43:54,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.33 | bwd: 3342.15 | bwd_inner: 3341.08 | bwd_allreduce: 1.00 | step: 7.52 8%|▊ | 801/10000 [1:14:15<14:03:18, 5.50s/it] {'loss': 0.1371, 'grad_norm': 1.5324684381484985, 'learning_rate': 3.9737288764459524e-05, 'epoch': 0.8} 8%|▊ | 801/10000 [1:14:15<14:03:18, 5.50s/it][2025-06-19 14:43:59,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-19 14:43:59,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2150.75 | bwd_microstep: 3329.22 | bwd_inner_microstep: 3328.11 | bwd_allreduce_microstep: 1.03 | step_microstep: 8.27 
[2025-06-19 14:43:59,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2150.75 | bwd: 3329.25 | bwd_inner: 3328.11 | bwd_allreduce: 1.07 | step: 8.27 8%|▊ | 802/10000 [1:14:20<14:04:41, 5.51s/it] {'loss': 0.1263, 'grad_norm': 0.8459265828132629, 'learning_rate': 3.9736241283891244e-05, 'epoch': 0.8} 8%|▊ | 802/10000 [1:14:20<14:04:41, 5.51s/it][2025-06-19 14:44:05,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:44:05,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.03 | bwd_microstep: 3386.57 | bwd_inner_microstep: 3385.43 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.70 [2025-06-19 14:44:05,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.03 | bwd: 3386.60 | bwd_inner: 3385.44 | bwd_allreduce: 1.10 | step: 8.71 8%|▊ | 803/10000 [1:14:26<14:07:18, 5.53s/it] {'loss': 0.1369, 'grad_norm': 1.1664867401123047, 'learning_rate': 3.9735191733082846e-05, 'epoch': 0.8} 8%|▊ | 803/10000 [1:14:26<14:07:18, 5.53s/it][2025-06-19 14:44:11,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 14:44:11,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2187.01 | bwd_microstep: 3382.47 | bwd_inner_microstep: 3381.58 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.89 [2025-06-19 14:44:11,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2187.01 | bwd: 3382.49 | bwd_inner: 3381.58 | bwd_allreduce: 0.85 | step: 7.90 8%|▊ | 804/10000 [1:14:31<14:11:34, 5.56s/it] {'loss': 0.0781, 'grad_norm': 0.5942212343215942, 'learning_rate': 3.973414011214443e-05, 'epoch': 0.8} 8%|▊ | 804/10000 [1:14:31<14:11:34, 5.56s/it][2025-06-19 14:44:16,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 
14:44:16,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.99 | bwd_microstep: 3328.67 | bwd_inner_microstep: 3327.53 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.62 [2025-06-19 14:44:16,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.99 | bwd: 3328.69 | bwd_inner: 3327.53 | bwd_allreduce: 1.10 | step: 8.63 8%|▊ | 805/10000 [1:14:37<14:09:59, 5.55s/it] {'loss': 0.0872, 'grad_norm': 0.46074584126472473, 'learning_rate': 3.973308642118631e-05, 'epoch': 0.81} 8%|▊ | 805/10000 [1:14:37<14:09:59, 5.55s/it][2025-06-19 14:44:22,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:44:22,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.14 | bwd_microstep: 3332.25 | bwd_inner_microstep: 3331.29 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.27 [2025-06-19 14:44:22,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.14 | bwd: 3332.26 | bwd_inner: 3331.29 | bwd_allreduce: 0.93 | step: 7.27 8%|▊ | 806/10000 [1:14:42<14:09:03, 5.54s/it] {'loss': 0.1297, 'grad_norm': 0.7187538743019104, 'learning_rate': 3.973203066031901e-05, 'epoch': 0.81} 8%|▊ | 806/10000 [1:14:42<14:09:03, 5.54s/it][2025-06-19 14:44:27,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 14:44:27,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.92 | bwd_microstep: 3321.59 | bwd_inner_microstep: 3320.44 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.83 [2025-06-19 14:44:27,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.92 | bwd: 3321.62 | bwd_inner: 3320.44 | bwd_allreduce: 1.10 | step: 8.84 8%|▊ | 807/10000 [1:14:48<14:06:25, 5.52s/it] {'loss': 0.1759, 'grad_norm': 0.9448801875114441, 'learning_rate': 3.973097282965327e-05, 'epoch': 0.81} 8%|▊ | 807/10000 
[1:14:48<14:06:25, 5.52s/it][2025-06-19 14:44:33,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:44:33,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.83 | bwd_microstep: 3369.62 | bwd_inner_microstep: 3368.48 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.86 [2025-06-19 14:44:33,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.83 | bwd: 3369.65 | bwd_inner: 3368.48 | bwd_allreduce: 1.09 | step: 7.86 8%|▊ | 808/10000 [1:14:53<14:07:35, 5.53s/it] {'loss': 0.1161, 'grad_norm': 0.5539291501045227, 'learning_rate': 3.972991292930005e-05, 'epoch': 0.81} 8%|▊ | 808/10000 [1:14:53<14:07:35, 5.53s/it][2025-06-19 14:44:38,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:44:38,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.70 | bwd_microstep: 3365.77 | bwd_inner_microstep: 3364.92 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.49 [2025-06-19 14:44:38,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.70 | bwd: 3365.79 | bwd_inner: 3364.92 | bwd_allreduce: 0.82 | step: 7.49 8%|▊ | 809/10000 [1:14:59<14:08:07, 5.54s/it] {'loss': 0.0602, 'grad_norm': 0.3275673985481262, 'learning_rate': 3.972885095937054e-05, 'epoch': 0.81} 8%|▊ | 809/10000 [1:14:59<14:08:07, 5.54s/it][2025-06-19 14:44:44,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 14:44:44,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2170.06 | bwd_microstep: 3397.81 | bwd_inner_microstep: 3396.66 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.16 [2025-06-19 14:44:44,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2170.06 | bwd: 3397.83 | bwd_inner: 3396.66 | 
bwd_allreduce: 1.12 | step: 8.16 8%|▊ | 810/10000 [1:15:05<14:11:36, 5.56s/it] {'loss': 0.0751, 'grad_norm': 0.4602261185646057, 'learning_rate': 3.972778691997612e-05, 'epoch': 0.81} 8%|▊ | 810/10000 [1:15:05<14:11:36, 5.56s/it][2025-06-19 14:44:49,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:44:49,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.36 | bwd_microstep: 3312.27 | bwd_inner_microstep: 3311.26 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.61 [2025-06-19 14:44:49,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.36 | bwd: 3312.28 | bwd_inner: 3311.26 | bwd_allreduce: 0.97 | step: 7.62 8%|▊ | 811/10000 [1:15:10<14:08:09, 5.54s/it] {'loss': 0.133, 'grad_norm': 0.8563638925552368, 'learning_rate': 3.9726720811228426e-05, 'epoch': 0.81} 8%|▊ | 811/10000 [1:15:10<14:08:09, 5.54s/it][2025-06-19 14:44:55,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.73 [2025-06-19 14:44:55,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.76 | bwd_microstep: 3314.49 | bwd_inner_microstep: 3313.29 | bwd_allreduce_microstep: 1.11 | step_microstep: 8.73 [2025-06-19 14:44:55,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.76 | bwd: 3314.52 | bwd_inner: 3313.29 | bwd_allreduce: 1.15 | step: 8.74 8%|▊ | 812/10000 [1:15:16<14:05:22, 5.52s/it] {'loss': 0.1089, 'grad_norm': 0.529263973236084, 'learning_rate': 3.972565263323926e-05, 'epoch': 0.81} 8%|▊ | 812/10000 [1:15:16<14:05:22, 5.52s/it][2025-06-19 14:45:00,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 14:45:00,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.46 | bwd_microstep: 3319.88 | bwd_inner_microstep: 
3318.97 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.98 [2025-06-19 14:45:00,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.46 | bwd: 3319.90 | bwd_inner: 3318.97 | bwd_allreduce: 0.87 | step: 7.99 8%|▊ | 813/10000 [1:15:21<14:04:28, 5.52s/it] {'loss': 0.158, 'grad_norm': 0.8524863719940186, 'learning_rate': 3.972458238612069e-05, 'epoch': 0.81} 8%|▊ | 813/10000 [1:15:21<14:04:28, 5.52s/it][2025-06-19 14:45:06,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:45:06,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.82 | bwd_microstep: 3321.72 | bwd_inner_microstep: 3320.81 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.65 [2025-06-19 14:45:06,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.82 | bwd: 3321.74 | bwd_inner: 3320.81 | bwd_allreduce: 0.87 | step: 7.65 8%|▊ | 814/10000 [1:15:27<14:04:30, 5.52s/it] {'loss': 0.0959, 'grad_norm': 0.8451917767524719, 'learning_rate': 3.9723510069984966e-05, 'epoch': 0.81} 8%|▊ | 814/10000 [1:15:27<14:04:30, 5.52s/it][2025-06-19 14:45:11,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 14:45:11,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.93 | bwd_microstep: 3316.90 | bwd_inner_microstep: 3316.05 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.12 [2025-06-19 14:45:11,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.93 | bwd: 3316.93 | bwd_inner: 3316.05 | bwd_allreduce: 0.82 | step: 7.13 8%|▊ | 815/10000 [1:15:32<14:03:52, 5.51s/it] {'loss': 0.0797, 'grad_norm': 0.3823253810405731, 'learning_rate': 3.972243568494458e-05, 'epoch': 0.81} 8%|▊ | 815/10000 [1:15:32<14:03:52, 5.51s/it][2025-06-19 14:45:17,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 14:45:17,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2164.78 | bwd_microstep: 3367.76 | bwd_inner_microstep: 3366.83 | bwd_allreduce_microstep: 0.85 | step_microstep: 8.10 [2025-06-19 14:45:17,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2164.78 | bwd: 3367.78 | bwd_inner: 3366.83 | bwd_allreduce: 0.88 | step: 8.11 8%|▊ | 816/10000 [1:15:38<14:06:43, 5.53s/it] {'loss': 0.1216, 'grad_norm': 0.6611133813858032, 'learning_rate': 3.9721359231112225e-05, 'epoch': 0.82} 8%|▊ | 816/10000 [1:15:38<14:06:43, 5.53s/it][2025-06-19 14:45:22,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:45:22,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.05 | bwd_microstep: 3306.14 | bwd_inner_microstep: 3305.31 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.71 [2025-06-19 14:45:22,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.05 | bwd: 3306.15 | bwd_inner: 3305.31 | bwd_allreduce: 0.79 | step: 6.71 8%|▊ | 817/10000 [1:15:43<14:03:50, 5.51s/it] {'loss': 0.0781, 'grad_norm': 0.3695961833000183, 'learning_rate': 3.972028070860081e-05, 'epoch': 0.82} 8%|▊ | 817/10000 [1:15:43<14:03:50, 5.51s/it][2025-06-19 14:45:28,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 14:45:28,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.77 | bwd_microstep: 3313.31 | bwd_inner_microstep: 3312.36 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.17 [2025-06-19 14:45:28,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.77 | bwd: 3313.32 | bwd_inner: 3312.36 | bwd_allreduce: 0.92 | step: 7.18 8%|▊ | 818/10000 [1:15:49<14:01:27, 5.50s/it] {'loss': 0.1549, 'grad_norm': 0.7253401875495911, 
'learning_rate': 3.971920011752348e-05, 'epoch': 0.82} 8%|▊ | 818/10000 [1:15:49<14:01:27, 5.50s/it][2025-06-19 14:45:33,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:45:33,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.18 | bwd_microstep: 3360.25 | bwd_inner_microstep: 3359.25 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.30 [2025-06-19 14:45:33,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.18 | bwd: 3360.26 | bwd_inner: 3359.25 | bwd_allreduce: 0.97 | step: 7.31 8%|▊ | 819/10000 [1:15:54<14:03:11, 5.51s/it] {'loss': 0.1059, 'grad_norm': 0.5566462278366089, 'learning_rate': 3.9718117457993576e-05, 'epoch': 0.82} 8%|▊ | 819/10000 [1:15:54<14:03:11, 5.51s/it][2025-06-19 14:45:39,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 14:45:39,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.78 | bwd_microstep: 3318.53 | bwd_inner_microstep: 3317.47 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.59 [2025-06-19 14:45:39,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.78 | bwd: 3318.55 | bwd_inner: 3317.47 | bwd_allreduce: 1.01 | step: 7.60 8%|▊ | 820/10000 [1:16:00<14:02:02, 5.50s/it] {'loss': 0.1457, 'grad_norm': 0.9355897903442383, 'learning_rate': 3.971703273012467e-05, 'epoch': 0.82} 8%|▊ | 820/10000 [1:16:00<14:02:02, 5.50s/it][2025-06-19 14:45:44,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:45:44,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.99 | bwd_microstep: 3323.86 | bwd_inner_microstep: 3322.79 | bwd_allreduce_microstep: 0.98 | step_microstep: 8.06 [2025-06-19 14:45:44,870] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2135.99 | bwd: 3323.89 | bwd_inner: 3322.79 | bwd_allreduce: 1.02 | step: 8.07 8%|▊ | 821/10000 [1:16:05<14:02:26, 5.51s/it] {'loss': 0.1359, 'grad_norm': 1.3541327714920044, 'learning_rate': 3.9715945934030536e-05, 'epoch': 0.82} 8%|▊ | 821/10000 [1:16:05<14:02:26, 5.51s/it][2025-06-19 14:45:50,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:45:50,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.94 | bwd_microstep: 3310.73 | bwd_inner_microstep: 3309.67 | bwd_allreduce_microstep: 0.98 | step_microstep: 8.03 [2025-06-19 14:45:50,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.94 | bwd: 3310.76 | bwd_inner: 3309.67 | bwd_allreduce: 1.01 | step: 8.04 8%|▊ | 822/10000 [1:16:11<14:02:33, 5.51s/it] {'loss': 0.1339, 'grad_norm': 0.765977144241333, 'learning_rate': 3.971485706982518e-05, 'epoch': 0.82} 8%|▊ | 822/10000 [1:16:11<14:02:33, 5.51s/it][2025-06-19 14:45:55,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:45:55,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.71 | bwd_microstep: 3320.86 | bwd_inner_microstep: 3319.91 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.61 [2025-06-19 14:45:55,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.71 | bwd: 3320.89 | bwd_inner: 3319.91 | bwd_allreduce: 0.90 | step: 7.61 8%|▊ | 823/10000 [1:16:16<14:02:29, 5.51s/it] {'loss': 0.1769, 'grad_norm': 1.2466291189193726, 'learning_rate': 3.971376613762281e-05, 'epoch': 0.82} 8%|▊ | 823/10000 [1:16:16<14:02:29, 5.51s/it][2025-06-19 14:46:01,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.72 [2025-06-19 14:46:01,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2180.98 | bwd_microstep: 3365.80 | bwd_inner_microstep: 3364.72 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.28 [2025-06-19 14:46:01,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2180.98 | bwd: 3365.84 | bwd_inner: 3364.72 | bwd_allreduce: 1.04 | step: 8.29 8%|▊ | 824/10000 [1:16:22<14:06:45, 5.54s/it] {'loss': 0.0773, 'grad_norm': 0.4895480275154114, 'learning_rate': 3.971267313753787e-05, 'epoch': 0.82} 8%|▊ | 824/10000 [1:16:22<14:06:45, 5.54s/it][2025-06-19 14:46:07,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.72 [2025-06-19 14:46:07,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.65 | bwd_microstep: 3317.01 | bwd_inner_microstep: 3315.84 | bwd_allreduce_microstep: 1.07 | step_microstep: 8.89 [2025-06-19 14:46:07,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.65 | bwd: 3317.05 | bwd_inner: 3315.84 | bwd_allreduce: 1.12 | step: 8.89 8%|▊ | 825/10000 [1:16:27<14:05:51, 5.53s/it] {'loss': 0.1112, 'grad_norm': 0.48426881432533264, 'learning_rate': 3.971157806968501e-05, 'epoch': 0.82} 8%|▊ | 825/10000 [1:16:27<14:05:51, 5.53s/it][2025-06-19 14:46:12,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:46:12,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.39 | bwd_microstep: 3361.18 | bwd_inner_microstep: 3360.32 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.87 [2025-06-19 14:46:12,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.39 | bwd: 3361.20 | bwd_inner: 3360.32 | bwd_allreduce: 0.84 | step: 6.88 8%|▊ | 826/10000 [1:16:33<14:06:50, 5.54s/it] {'loss': 0.0911, 'grad_norm': 0.44446220993995667, 'learning_rate': 3.971048093417909e-05, 'epoch': 0.83} 8%|▊ | 826/10000 [1:16:33<14:06:50, 5.54s/it][2025-06-19 14:46:18,037] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 14:46:18,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.61 | bwd_microstep: 3312.89 | bwd_inner_microstep: 3312.05 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.05 [2025-06-19 14:46:18,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.61 | bwd: 3312.92 | bwd_inner: 3312.05 | bwd_allreduce: 0.81 | step: 7.05 8%|▊ | 827/10000 [1:16:38<14:03:23, 5.52s/it] {'loss': 0.1545, 'grad_norm': 0.8640658855438232, 'learning_rate': 3.970938173113521e-05, 'epoch': 0.83} 8%|▊ | 827/10000 [1:16:38<14:03:23, 5.52s/it][2025-06-19 14:46:23,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 14:46:23,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.90 | bwd_microstep: 3306.27 | bwd_inner_microstep: 3305.12 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.64 [2025-06-19 14:46:23,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.90 | bwd: 3306.30 | bwd_inner: 3305.12 | bwd_allreduce: 1.10 | step: 7.64 8%|▊ | 828/10000 [1:16:44<14:00:44, 5.50s/it] {'loss': 0.1661, 'grad_norm': 1.3273354768753052, 'learning_rate': 3.9708280460668646e-05, 'epoch': 0.83} 8%|▊ | 828/10000 [1:16:44<14:00:44, 5.50s/it][2025-06-19 14:46:28,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:46:28,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.03 | bwd_microstep: 3311.40 | bwd_inner_microstep: 3310.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-19 14:46:28,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.04 | bwd: 3311.41 | bwd_inner: 3310.60 | bwd_allreduce: 0.77 | step: 6.82 8%|▊ | 829/10000 
[1:16:49<13:59:12, 5.49s/it] {'loss': 0.1059, 'grad_norm': 0.8582706451416016, 'learning_rate': 3.970717712289494e-05, 'epoch': 0.83} 8%|▊ | 829/10000 [1:16:49<13:59:12, 5.49s/it][2025-06-19 14:46:34,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:46:34,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.67 | bwd_microstep: 3323.04 | bwd_inner_microstep: 3322.15 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.92 [2025-06-19 14:46:34,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.68 | bwd: 3323.06 | bwd_inner: 3322.15 | bwd_allreduce: 0.86 | step: 6.93 8%|▊ | 830/10000 [1:16:55<13:58:26, 5.49s/it] {'loss': 0.0895, 'grad_norm': 0.4706730842590332, 'learning_rate': 3.9706071717929815e-05, 'epoch': 0.83} 8%|▊ | 830/10000 [1:16:55<13:58:26, 5.49s/it][2025-06-19 14:46:39,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.82 [2025-06-19 14:46:39,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.88 | bwd_microstep: 3320.78 | bwd_inner_microstep: 3319.91 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.09 [2025-06-19 14:46:39,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.88 | bwd: 3320.80 | bwd_inner: 3319.91 | bwd_allreduce: 0.83 | step: 7.08 8%|▊ | 831/10000 [1:17:00<13:58:12, 5.49s/it] {'loss': 0.1303, 'grad_norm': 0.899128794670105, 'learning_rate': 3.970496424588922e-05, 'epoch': 0.83} 8%|▊ | 831/10000 [1:17:00<13:58:12, 5.49s/it][2025-06-19 14:46:45,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:46:45,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.28 | bwd_microstep: 3373.39 | bwd_inner_microstep: 3372.48 | bwd_allreduce_microstep: 0.86 | 
step_microstep: 6.96 [2025-06-19 14:46:45,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.28 | bwd: 3373.41 | bwd_inner: 3372.48 | bwd_allreduce: 0.88 | step: 6.96 8%|▊ | 832/10000 [1:17:06<14:01:22, 5.51s/it] {'loss': 0.1282, 'grad_norm': 0.6476308703422546, 'learning_rate': 3.970385470688934e-05, 'epoch': 0.83} 8%|▊ | 832/10000 [1:17:06<14:01:22, 5.51s/it][2025-06-19 14:46:50,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 14:46:50,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.15 | bwd_microstep: 3316.47 | bwd_inner_microstep: 3315.45 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.43 [2025-06-19 14:46:50,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.15 | bwd: 3316.49 | bwd_inner: 3315.45 | bwd_allreduce: 0.99 | step: 7.44 8%|▊ | 833/10000 [1:17:11<14:00:30, 5.50s/it] {'loss': 0.1168, 'grad_norm': 0.5791074633598328, 'learning_rate': 3.9702743101046544e-05, 'epoch': 0.83} 8%|▊ | 833/10000 [1:17:11<14:00:30, 5.50s/it][2025-06-19 14:46:56,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 14:46:56,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.39 | bwd_microstep: 3312.90 | bwd_inner_microstep: 3311.76 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.51 [2025-06-19 14:46:56,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.39 | bwd: 3312.91 | bwd_inner: 3311.76 | bwd_allreduce: 1.10 | step: 7.51 8%|▊ | 834/10000 [1:17:17<13:58:53, 5.49s/it] {'loss': 0.1156, 'grad_norm': 0.5542662143707275, 'learning_rate': 3.9701629428477436e-05, 'epoch': 0.83} 8%|▊ | 834/10000 [1:17:17<13:58:53, 5.49s/it][2025-06-19 14:47:01,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 
2.73 [2025-06-19 14:47:01,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.80 | bwd_microstep: 3313.12 | bwd_inner_microstep: 3312.26 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.02 [2025-06-19 14:47:01,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.80 | bwd: 3313.14 | bwd_inner: 3312.26 | bwd_allreduce: 0.83 | step: 7.02 8%|▊ | 835/10000 [1:17:22<13:58:12, 5.49s/it] {'loss': 0.1385, 'grad_norm': 1.1090102195739746, 'learning_rate': 3.9700513689298844e-05, 'epoch': 0.83} 8%|▊ | 835/10000 [1:17:22<13:58:12, 5.49s/it][2025-06-19 14:47:07,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:47:07,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.68 | bwd_microstep: 3316.98 | bwd_inner_microstep: 3316.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 14:47:07,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.68 | bwd: 3316.99 | bwd_inner: 3316.18 | bwd_allreduce: 0.77 | step: 7.00 8%|▊ | 836/10000 [1:17:28<13:57:45, 5.49s/it] {'loss': 0.0928, 'grad_norm': 0.4846842885017395, 'learning_rate': 3.96993958836278e-05, 'epoch': 0.84} 8%|▊ | 836/10000 [1:17:28<13:57:45, 5.49s/it][2025-06-19 14:47:12,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 14:47:12,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.08 | bwd_microstep: 3316.91 | bwd_inner_microstep: 3316.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 14:47:12,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.08 | bwd: 3316.93 | bwd_inner: 3316.14 | bwd_allreduce: 0.75 | step: 6.57 8%|▊ | 837/10000 [1:17:33<13:56:35, 5.48s/it] {'loss': 0.113, 'grad_norm': 0.9258860349655151, 'learning_rate': 3.969827601158155e-05, 'epoch': 0.84} 8%|▊ 
| 837/10000 [1:17:33<13:56:35, 5.48s/it][2025-06-19 14:47:18,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 14:47:18,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.47 | bwd_microstep: 3336.00 | bwd_inner_microstep: 3335.08 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.07 [2025-06-19 14:47:18,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.47 | bwd: 3336.02 | bwd_inner: 3335.08 | bwd_allreduce: 0.89 | step: 7.08 8%|▊ | 838/10000 [1:17:39<13:56:53, 5.48s/it] {'loss': 0.1121, 'grad_norm': 0.5417500734329224, 'learning_rate': 3.969715407327758e-05, 'epoch': 0.84} 8%|▊ | 838/10000 [1:17:39<13:56:53, 5.48s/it][2025-06-19 14:47:23,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:47:23,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.60 | bwd_microstep: 3324.31 | bwd_inner_microstep: 3323.48 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.32 [2025-06-19 14:47:23,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.60 | bwd: 3324.33 | bwd_inner: 3323.48 | bwd_allreduce: 0.80 | step: 7.33 8%|▊ | 839/10000 [1:17:44<13:56:43, 5.48s/it] {'loss': 0.1039, 'grad_norm': 0.474083811044693, 'learning_rate': 3.969603006883355e-05, 'epoch': 0.84} 8%|▊ | 839/10000 [1:17:44<13:56:43, 5.48s/it][2025-06-19 14:47:29,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:47:29,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.15 | bwd_microstep: 3319.82 | bwd_inner_microstep: 3319.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 14:47:29,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.15 | bwd: 3319.83 | bwd_inner: 
3319.03 | bwd_allreduce: 0.76 | step: 6.66 8%|▊ | 840/10000 [1:17:50<13:55:53, 5.48s/it] {'loss': 0.0996, 'grad_norm': 0.37970760464668274, 'learning_rate': 3.9694903998367384e-05, 'epoch': 0.84} 8%|▊ | 840/10000 [1:17:50<13:55:53, 5.48s/it][2025-06-19 14:47:34,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.96 [2025-06-19 14:47:34,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.50 | bwd_microstep: 3366.71 | bwd_inner_microstep: 3365.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.66 [2025-06-19 14:47:34,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.50 | bwd: 3366.73 | bwd_inner: 3365.90 | bwd_allreduce: 0.78 | step: 7.66 8%|▊ | 841/10000 [1:17:55<13:58:27, 5.49s/it] {'loss': 0.0652, 'grad_norm': 0.24797062575817108, 'learning_rate': 3.96937758619972e-05, 'epoch': 0.84} 8%|▊ | 841/10000 [1:17:55<13:58:27, 5.49s/it][2025-06-19 14:47:40,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 14:47:40,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.83 | bwd_microstep: 3374.40 | bwd_inner_microstep: 3373.23 | bwd_allreduce_microstep: 1.08 | step_microstep: 8.22 [2025-06-19 14:47:40,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.83 | bwd: 3374.43 | bwd_inner: 3373.23 | bwd_allreduce: 1.12 | step: 8.22 8%|▊ | 842/10000 [1:18:01<14:01:00, 5.51s/it] {'loss': 0.0819, 'grad_norm': 0.5629715919494629, 'learning_rate': 3.9692645659841325e-05, 'epoch': 0.84} 8%|▊ | 842/10000 [1:18:01<14:01:00, 5.51s/it][2025-06-19 14:47:45,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 14:47:45,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.95 | bwd_microstep: 3378.51 | 
bwd_inner_microstep: 3377.58 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.48 [2025-06-19 14:47:45,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.95 | bwd: 3378.54 | bwd_inner: 3377.58 | bwd_allreduce: 0.88 | step: 7.48 8%|▊ | 843/10000 [1:18:06<14:03:59, 5.53s/it] {'loss': 0.1154, 'grad_norm': 0.6536188125610352, 'learning_rate': 3.969151339201832e-05, 'epoch': 0.84} 8%|▊ | 843/10000 [1:18:06<14:03:59, 5.53s/it][2025-06-19 14:47:51,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:47:51,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2162.27 | bwd_microstep: 3385.28 | bwd_inner_microstep: 3384.32 | bwd_allreduce_microstep: 0.88 | step_microstep: 8.37 [2025-06-19 14:47:51,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2162.27 | bwd: 3385.31 | bwd_inner: 3384.32 | bwd_allreduce: 0.91 | step: 8.38 8%|▊ | 844/10000 [1:18:12<14:07:08, 5.55s/it] {'loss': 0.0829, 'grad_norm': 0.35579970479011536, 'learning_rate': 3.9690379058646946e-05, 'epoch': 0.84} 8%|▊ | 844/10000 [1:18:12<14:07:08, 5.55s/it][2025-06-19 14:47:57,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:47:57,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.20 | bwd_microstep: 3331.77 | bwd_inner_microstep: 3330.87 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.51 [2025-06-19 14:47:57,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.19 | bwd: 3331.80 | bwd_inner: 3330.87 | bwd_allreduce: 0.85 | step: 7.52 8%|▊ | 845/10000 [1:18:17<14:05:46, 5.54s/it] {'loss': 0.1063, 'grad_norm': 0.4738246500492096, 'learning_rate': 3.96892426598462e-05, 'epoch': 0.84} 8%|▊ | 845/10000 [1:18:17<14:05:46, 5.54s/it][2025-06-19 14:48:02,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 14:48:02,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.63 | bwd_microstep: 3345.22 | bwd_inner_microstep: 3344.37 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.06 [2025-06-19 14:48:02,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.63 | bwd: 3345.25 | bwd_inner: 3344.37 | bwd_allreduce: 0.81 | step: 7.07 8%|▊ | 846/10000 [1:18:23<14:04:35, 5.54s/it] {'loss': 0.078, 'grad_norm': 0.48059865832328796, 'learning_rate': 3.9688104195735275e-05, 'epoch': 0.85} 8%|▊ | 846/10000 [1:18:23<14:04:35, 5.54s/it][2025-06-19 14:48:08,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 14:48:08,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.38 | bwd_microstep: 3403.70 | bwd_inner_microstep: 3402.43 | bwd_allreduce_microstep: 1.19 | step_microstep: 8.42 [2025-06-19 14:48:08,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.38 | bwd: 3403.73 | bwd_inner: 3402.43 | bwd_allreduce: 1.23 | step: 8.41 8%|▊ | 847/10000 [1:18:28<14:06:55, 5.55s/it] {'loss': 0.0831, 'grad_norm': 0.3680099546909332, 'learning_rate': 3.96869636664336e-05, 'epoch': 0.85} 8%|▊ | 847/10000 [1:18:28<14:06:55, 5.55s/it][2025-06-19 14:48:13,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.72 [2025-06-19 14:48:13,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.23 | bwd_microstep: 3374.42 | bwd_inner_microstep: 3373.24 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.43 [2025-06-19 14:48:13,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.23 | bwd: 3374.45 | bwd_inner: 3373.24 | bwd_allreduce: 1.13 | step: 8.44 8%|▊ | 848/10000 [1:18:34<14:08:02, 5.56s/it] {'loss': 0.0682, 'grad_norm': 
0.37462425231933594, 'learning_rate': 3.96858210720608e-05, 'epoch': 0.85} 8%|▊ | 848/10000 [1:18:34<14:08:02, 5.56s/it][2025-06-19 14:48:19,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:48:19,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.89 | bwd_microstep: 3387.75 | bwd_inner_microstep: 3386.89 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.89 [2025-06-19 14:48:19,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.89 | bwd: 3387.77 | bwd_inner: 3386.89 | bwd_allreduce: 0.82 | step: 6.88 8%|▊ | 849/10000 [1:18:40<14:09:23, 5.57s/it] {'loss': 0.1148, 'grad_norm': 0.7190620303153992, 'learning_rate': 3.968467641273675e-05, 'epoch': 0.85} 8%|▊ | 849/10000 [1:18:40<14:09:23, 5.57s/it][2025-06-19 14:48:24,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:48:24,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.41 | bwd_microstep: 3323.50 | bwd_inner_microstep: 3322.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 14:48:24,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.41 | bwd: 3323.52 | bwd_inner: 3322.71 | bwd_allreduce: 0.76 | step: 6.57 8%|▊ | 850/10000 [1:18:45<14:04:54, 5.54s/it] {'loss': 0.0845, 'grad_norm': 0.4286629259586334, 'learning_rate': 3.968352968858149e-05, 'epoch': 0.85} 8%|▊ | 850/10000 [1:18:45<14:04:54, 5.54s/it][2025-06-19 14:48:30,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:48:30,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2161.60 | bwd_microstep: 3405.89 | bwd_inner_microstep: 3405.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 14:48:30,431] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2161.60 | bwd: 3405.91 | bwd_inner: 3405.10 | bwd_allreduce: 0.77 | step: 6.67 9%|▊ | 851/10000 [1:18:51<14:07:47, 5.56s/it] {'loss': 0.0975, 'grad_norm': 0.5144550800323486, 'learning_rate': 3.9682380899715324e-05, 'epoch': 0.85} 9%|▊ | 851/10000 [1:18:51<14:07:47, 5.56s/it][2025-06-19 14:48:35,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.74 [2025-06-19 14:48:35,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.29 | bwd_microstep: 3324.31 | bwd_inner_microstep: 3323.47 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.23 [2025-06-19 14:48:35,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.29 | bwd: 3324.32 | bwd_inner: 3323.47 | bwd_allreduce: 0.81 | step: 7.23 9%|▊ | 852/10000 [1:18:56<14:04:26, 5.54s/it] {'loss': 0.1284, 'grad_norm': 0.8325817584991455, 'learning_rate': 3.9681230046258744e-05, 'epoch': 0.85} 9%|▊ | 852/10000 [1:18:56<14:04:26, 5.54s/it][2025-06-19 14:48:41,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:48:41,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.08 | bwd_microstep: 3327.92 | bwd_inner_microstep: 3327.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-19 14:48:41,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.08 | bwd: 3327.94 | bwd_inner: 3327.11 | bwd_allreduce: 0.78 | step: 6.91 9%|▊ | 853/10000 [1:19:02<14:01:53, 5.52s/it] {'loss': 0.0775, 'grad_norm': 0.4411362409591675, 'learning_rate': 3.968007712833248e-05, 'epoch': 0.85} 9%|▊ | 853/10000 [1:19:02<14:01:53, 5.52s/it][2025-06-19 14:48:46,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:48:46,954] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.06 | bwd_microstep: 3369.49 | bwd_inner_microstep: 3368.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.55 [2025-06-19 14:48:46,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.06 | bwd: 3369.51 | bwd_inner: 3368.70 | bwd_allreduce: 0.76 | step: 6.55 9%|▊ | 854/10000 [1:19:07<14:03:01, 5.53s/it] {'loss': 0.0921, 'grad_norm': 0.8020077347755432, 'learning_rate': 3.9678922146057466e-05, 'epoch': 0.85} 9%|▊ | 854/10000 [1:19:07<14:03:01, 5.53s/it][2025-06-19 14:48:52,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 14:48:52,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.62 | bwd_microstep: 3374.62 | bwd_inner_microstep: 3373.73 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.87 [2025-06-19 14:48:52,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.62 | bwd: 3374.64 | bwd_inner: 3373.73 | bwd_allreduce: 0.85 | step: 6.88 9%|▊ | 855/10000 [1:19:13<14:03:28, 5.53s/it] {'loss': 0.1396, 'grad_norm': 0.8844695687294006, 'learning_rate': 3.967776509955485e-05, 'epoch': 0.85} 9%|▊ | 855/10000 [1:19:13<14:03:28, 5.53s/it][2025-06-19 14:48:57,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 14:48:57,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.21 | bwd_microstep: 3324.22 | bwd_inner_microstep: 3323.19 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.71 [2025-06-19 14:48:57,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.21 | bwd: 3324.24 | bwd_inner: 3323.19 | bwd_allreduce: 1.00 | step: 7.71 9%|▊ | 856/10000 [1:19:18<14:00:56, 5.52s/it] {'loss': 0.1148, 'grad_norm': 0.47414690256118774, 'learning_rate': 3.9676605988946e-05, 'epoch': 0.86} 9%|▊ | 856/10000 [1:19:18<14:00:56, 
5.52s/it][2025-06-19 14:49:03,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:49:03,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.63 | bwd_microstep: 3335.32 | bwd_inner_microstep: 3334.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.61 [2025-06-19 14:49:03,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.63 | bwd: 3335.33 | bwd_inner: 3334.53 | bwd_allreduce: 0.76 | step: 6.61 9%|▊ | 857/10000 [1:19:24<13:59:55, 5.51s/it] {'loss': 0.1182, 'grad_norm': 0.8280680179595947, 'learning_rate': 3.96754448143525e-05, 'epoch': 0.86} 9%|▊ | 857/10000 [1:19:24<13:59:55, 5.51s/it][2025-06-19 14:49:08,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:49:08,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.26 | bwd_microstep: 3320.18 | bwd_inner_microstep: 3319.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 14:49:08,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.26 | bwd: 3320.20 | bwd_inner: 3319.40 | bwd_allreduce: 0.75 | step: 6.56 9%|▊ | 858/10000 [1:19:29<13:57:44, 5.50s/it] {'loss': 0.1148, 'grad_norm': 0.8139774203300476, 'learning_rate': 3.9674281575896165e-05, 'epoch': 0.86} 9%|▊ | 858/10000 [1:19:29<13:57:44, 5.50s/it][2025-06-19 14:49:14,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 14:49:14,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.04 | bwd_microstep: 3328.56 | bwd_inner_microstep: 3327.77 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 14:49:14,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.04 | bwd: 3328.58 | bwd_inner: 3327.77 | bwd_allreduce: 0.76 | 
step: 6.65 9%|▊ | 859/10000 [1:19:35<13:56:23, 5.49s/it] {'loss': 0.1204, 'grad_norm': 1.0138441324234009, 'learning_rate': 3.9673116273699e-05, 'epoch': 0.86} 9%|▊ | 859/10000 [1:19:35<13:56:23, 5.49s/it][2025-06-19 14:49:19,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:49:19,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.11 | bwd_microstep: 3335.53 | bwd_inner_microstep: 3334.53 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.63 [2025-06-19 14:49:19,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.11 | bwd: 3335.55 | bwd_inner: 3334.53 | bwd_allreduce: 0.97 | step: 7.64 9%|▊ | 860/10000 [1:19:40<13:55:55, 5.49s/it] {'loss': 0.0874, 'grad_norm': 0.36927708983421326, 'learning_rate': 3.9671948907883245e-05, 'epoch': 0.86} 9%|▊ | 860/10000 [1:19:40<13:55:55, 5.49s/it][2025-06-19 14:49:25,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:49:25,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.05 | bwd_microstep: 3328.51 | bwd_inner_microstep: 3327.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 14:49:25,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.05 | bwd: 3328.53 | bwd_inner: 3327.72 | bwd_allreduce: 0.76 | step: 6.91 9%|▊ | 861/10000 [1:19:46<13:55:38, 5.49s/it] {'loss': 0.1128, 'grad_norm': 0.659802258014679, 'learning_rate': 3.967077947857134e-05, 'epoch': 0.86} 9%|▊ | 861/10000 [1:19:46<13:55:38, 5.49s/it][2025-06-19 14:49:30,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 14:49:30,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.44 | bwd_microstep: 3403.35 | bwd_inner_microstep: 3402.48 | 
bwd_allreduce_microstep: 0.83 | step_microstep: 7.12 [2025-06-19 14:49:30,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.44 | bwd: 3403.37 | bwd_inner: 3402.48 | bwd_allreduce: 0.85 | step: 7.12 9%|▊ | 862/10000 [1:19:51<14:00:14, 5.52s/it] {'loss': 0.0902, 'grad_norm': 0.573872983455658, 'learning_rate': 3.966960798588598e-05, 'epoch': 0.86} 9%|▊ | 862/10000 [1:19:51<14:00:14, 5.52s/it][2025-06-19 14:49:36,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:49:36,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2161.96 | bwd_microstep: 3389.34 | bwd_inner_microstep: 3388.32 | bwd_allreduce_microstep: 0.94 | step_microstep: 8.64 [2025-06-19 14:49:36,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2161.96 | bwd: 3389.37 | bwd_inner: 3388.32 | bwd_allreduce: 0.98 | step: 8.64 9%|▊ | 863/10000 [1:19:57<14:03:51, 5.54s/it] {'loss': 0.1936, 'grad_norm': 1.5033625364303589, 'learning_rate': 3.9668434429950015e-05, 'epoch': 0.86} 9%|▊ | 863/10000 [1:19:57<14:03:51, 5.54s/it][2025-06-19 14:49:42,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.72 [2025-06-19 14:49:42,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2170.60 | bwd_microstep: 3391.43 | bwd_inner_microstep: 3390.26 | bwd_allreduce_microstep: 1.08 | step_microstep: 8.91 [2025-06-19 14:49:42,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2170.60 | bwd: 3391.46 | bwd_inner: 3390.26 | bwd_allreduce: 1.13 | step: 8.92 9%|▊ | 864/10000 [1:20:02<14:07:03, 5.56s/it] {'loss': 0.1013, 'grad_norm': 0.6412862539291382, 'learning_rate': 3.9667258810886566e-05, 'epoch': 0.86} 9%|▊ | 864/10000 [1:20:02<14:07:03, 5.56s/it][2025-06-19 14:49:47,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:49:47,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.44 | bwd_microstep: 3337.39 | bwd_inner_microstep: 3336.46 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.11 [2025-06-19 14:49:47,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.44 | bwd: 3337.41 | bwd_inner: 3336.46 | bwd_allreduce: 0.88 | step: 8.13 9%|▊ | 865/10000 [1:20:08<14:05:43, 5.55s/it] {'loss': 0.1278, 'grad_norm': 0.7462999224662781, 'learning_rate': 3.966608112881896e-05, 'epoch': 0.86} 9%|▊ | 865/10000 [1:20:08<14:05:43, 5.55s/it][2025-06-19 14:49:53,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:49:53,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2165.74 | bwd_microstep: 3384.57 | bwd_inner_microstep: 3383.58 | bwd_allreduce_microstep: 0.94 | step_microstep: 6.96 [2025-06-19 14:49:53,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2165.74 | bwd: 3384.59 | bwd_inner: 3383.58 | bwd_allreduce: 0.96 | step: 6.96 9%|▊ | 866/10000 [1:20:14<14:07:18, 5.57s/it] {'loss': 0.1462, 'grad_norm': 0.8664668202400208, 'learning_rate': 3.966490138387071e-05, 'epoch': 0.87} 9%|▊ | 866/10000 [1:20:14<14:07:18, 5.57s/it][2025-06-19 14:49:58,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 14:49:58,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.88 | bwd_microstep: 3332.06 | bwd_inner_microstep: 3331.16 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.98 [2025-06-19 14:49:58,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.88 | bwd: 3332.09 | bwd_inner: 3331.16 | bwd_allreduce: 0.86 | step: 7.97 9%|▊ | 867/10000 [1:20:19<14:04:03, 5.55s/it] {'loss': 0.0906, 'grad_norm': 0.531995952129364, 
'learning_rate': 3.9663719576165556e-05, 'epoch': 0.87} 9%|▊ | 867/10000 [1:20:19<14:04:03, 5.55s/it][2025-06-19 14:50:04,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:50:04,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2154.32 | bwd_microstep: 3373.10 | bwd_inner_microstep: 3372.22 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.26 [2025-06-19 14:50:04,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2154.32 | bwd: 3373.13 | bwd_inner: 3372.22 | bwd_allreduce: 0.83 | step: 7.27 9%|▊ | 868/10000 [1:20:25<14:04:58, 5.55s/it] {'loss': 0.1609, 'grad_norm': 1.0076195001602173, 'learning_rate': 3.96625357058275e-05, 'epoch': 0.87} 9%|▊ | 868/10000 [1:20:25<14:04:58, 5.55s/it][2025-06-19 14:50:09,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 14:50:09,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.32 | bwd_microstep: 3374.57 | bwd_inner_microstep: 3373.76 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.63 [2025-06-19 14:50:09,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.32 | bwd: 3374.58 | bwd_inner: 3373.76 | bwd_allreduce: 0.78 | step: 7.63 9%|▊ | 869/10000 [1:20:30<14:05:32, 5.56s/it] {'loss': 0.16, 'grad_norm': 1.6123408079147339, 'learning_rate': 3.96613497729807e-05, 'epoch': 0.87} 9%|▊ | 869/10000 [1:20:30<14:05:32, 5.56s/it][2025-06-19 14:50:15,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 14:50:15,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.37 | bwd_microstep: 3377.22 | bwd_inner_microstep: 3376.07 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.51 [2025-06-19 14:50:15,495] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2143.37 | bwd: 3377.24 | bwd_inner: 3376.07 | bwd_allreduce: 1.11 | step: 7.52 9%|▊ | 870/10000 [1:20:36<14:05:40, 5.56s/it] {'loss': 0.1083, 'grad_norm': 0.43785151839256287, 'learning_rate': 3.966016177774956e-05, 'epoch': 0.87} 9%|▊ | 870/10000 [1:20:36<14:05:40, 5.56s/it][2025-06-19 14:50:20,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:50:20,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.81 | bwd_microstep: 3338.14 | bwd_inner_microstep: 3337.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 14:50:20,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.81 | bwd: 3338.16 | bwd_inner: 3337.34 | bwd_allreduce: 0.78 | step: 7.20 9%|▊ | 871/10000 [1:20:41<14:02:41, 5.54s/it] {'loss': 0.1467, 'grad_norm': 0.6996121406555176, 'learning_rate': 3.965897172025869e-05, 'epoch': 0.87} 9%|▊ | 871/10000 [1:20:41<14:02:41, 5.54s/it][2025-06-19 14:50:26,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 14:50:26,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.06 | bwd_microstep: 3327.59 | bwd_inner_microstep: 3326.73 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.07 [2025-06-19 14:50:26,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.06 | bwd: 3327.61 | bwd_inner: 3326.73 | bwd_allreduce: 0.82 | step: 7.08 9%|▊ | 872/10000 [1:20:47<14:00:23, 5.52s/it] {'loss': 0.0557, 'grad_norm': 0.4250989854335785, 'learning_rate': 3.9657779600632934e-05, 'epoch': 0.87} 9%|▊ | 872/10000 [1:20:47<14:00:23, 5.52s/it][2025-06-19 14:50:32,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 14:50:32,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2145.19 | bwd_microstep: 3385.58 | bwd_inner_microstep: 3384.67 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.09 [2025-06-19 14:50:32,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.19 | bwd: 3385.61 | bwd_inner: 3384.67 | bwd_allreduce: 0.87 | step: 8.10 9%|▊ | 873/10000 [1:20:52<14:02:35, 5.54s/it] {'loss': 0.0645, 'grad_norm': 0.33145928382873535, 'learning_rate': 3.965658541899733e-05, 'epoch': 0.87} 9%|▊ | 873/10000 [1:20:52<14:02:35, 5.54s/it][2025-06-19 14:50:37,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 14:50:37,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2154.90 | bwd_microstep: 3394.12 | bwd_inner_microstep: 3393.24 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.82 [2025-06-19 14:50:37,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2154.90 | bwd: 3394.14 | bwd_inner: 3393.24 | bwd_allreduce: 0.85 | step: 6.83 9%|▊ | 874/10000 [1:20:58<14:04:34, 5.55s/it] {'loss': 0.1005, 'grad_norm': 0.5711426138877869, 'learning_rate': 3.965538917547714e-05, 'epoch': 0.87} 9%|▊ | 874/10000 [1:20:58<14:04:34, 5.55s/it][2025-06-19 14:50:43,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:50:43,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.56 | bwd_microstep: 3371.24 | bwd_inner_microstep: 3370.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 14:50:43,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.57 | bwd: 3371.25 | bwd_inner: 3370.43 | bwd_allreduce: 0.78 | step: 7.18 9%|▉ | 875/10000 [1:21:03<14:04:27, 5.55s/it] {'loss': 0.0809, 'grad_norm': 0.44735926389694214, 'learning_rate': 3.965419087019785e-05, 'epoch': 0.88} 9%|▉ | 875/10000 [1:21:03<14:04:27, 5.55s/it][2025-06-19 14:50:48,662] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:50:48,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.59 | bwd_microstep: 3325.99 | bwd_inner_microstep: 3325.05 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.07 [2025-06-19 14:50:48,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.59 | bwd: 3326.01 | bwd_inner: 3325.05 | bwd_allreduce: 0.91 | step: 7.07 9%|▉ | 876/10000 [1:21:09<14:00:42, 5.53s/it] {'loss': 0.1055, 'grad_norm': 0.5347434878349304, 'learning_rate': 3.965299050328516e-05, 'epoch': 0.88} 9%|▉ | 876/10000 [1:21:09<14:00:42, 5.53s/it][2025-06-19 14:50:54,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:50:54,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.29 | bwd_microstep: 3370.60 | bwd_inner_microstep: 3369.67 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.03 [2025-06-19 14:50:54,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.29 | bwd: 3370.61 | bwd_inner: 3369.67 | bwd_allreduce: 0.90 | step: 7.03 9%|▉ | 877/10000 [1:21:14<14:01:00, 5.53s/it] {'loss': 0.0499, 'grad_norm': 0.37258002161979675, 'learning_rate': 3.965178807486497e-05, 'epoch': 0.88} 9%|▉ | 877/10000 [1:21:15<14:01:00, 5.53s/it][2025-06-19 14:50:59,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:50:59,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.62 | bwd_microstep: 3367.46 | bwd_inner_microstep: 3366.58 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.40 [2025-06-19 14:50:59,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.62 | bwd: 3367.48 | bwd_inner: 3366.58 | bwd_allreduce: 0.85 | step: 7.41 9%|▉ | 878/10000 
[1:21:20<14:01:38, 5.54s/it] {'loss': 0.1022, 'grad_norm': 0.7422018647193909, 'learning_rate': 3.9650583585063434e-05, 'epoch': 0.88} 9%|▉ | 878/10000 [1:21:20<14:01:38, 5.54s/it][2025-06-19 14:51:05,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:51:05,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.57 | bwd_microstep: 3317.87 | bwd_inner_microstep: 3317.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 14:51:05,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.57 | bwd: 3317.88 | bwd_inner: 3317.07 | bwd_allreduce: 0.76 | step: 6.60 9%|▉ | 879/10000 [1:21:26<13:58:21, 5.51s/it] {'loss': 0.0955, 'grad_norm': 0.5059086680412292, 'learning_rate': 3.9649377034006866e-05, 'epoch': 0.88} 9%|▉ | 879/10000 [1:21:26<13:58:21, 5.51s/it][2025-06-19 14:51:10,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:51:10,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.76 | bwd_microstep: 3311.57 | bwd_inner_microstep: 3310.64 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.43 [2025-06-19 14:51:10,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.76 | bwd: 3311.59 | bwd_inner: 3310.64 | bwd_allreduce: 0.91 | step: 7.43 9%|▉ | 880/10000 [1:21:31<13:55:57, 5.50s/it] {'loss': 0.1655, 'grad_norm': 1.3423963785171509, 'learning_rate': 3.964816842182185e-05, 'epoch': 0.88} 9%|▉ | 880/10000 [1:21:31<13:55:57, 5.50s/it][2025-06-19 14:51:16,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:51:16,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.99 | bwd_microstep: 3318.20 | bwd_inner_microstep: 3317.26 | bwd_allreduce_microstep: 0.89 | 
step_microstep: 7.21 [2025-06-19 14:51:16,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.99 | bwd: 3318.22 | bwd_inner: 3317.26 | bwd_allreduce: 0.91 | step: 7.21 9%|▉ | 881/10000 [1:21:36<13:54:35, 5.49s/it] {'loss': 0.055, 'grad_norm': 0.5740203261375427, 'learning_rate': 3.964695774863516e-05, 'epoch': 0.88} 9%|▉ | 881/10000 [1:21:36<13:54:35, 5.49s/it][2025-06-19 14:51:21,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 14:51:21,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.73 | bwd_microstep: 3333.09 | bwd_inner_microstep: 3332.20 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.63 [2025-06-19 14:51:21,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.73 | bwd: 3333.12 | bwd_inner: 3332.20 | bwd_allreduce: 0.85 | step: 7.64 9%|▉ | 882/10000 [1:21:42<13:54:09, 5.49s/it] {'loss': 0.1638, 'grad_norm': 0.905860185623169, 'learning_rate': 3.964574501457378e-05, 'epoch': 0.88} 9%|▉ | 882/10000 [1:21:42<13:54:09, 5.49s/it][2025-06-19 14:51:27,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 14:51:27,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2288.70 | bwd_microstep: 3315.08 | bwd_inner_microstep: 3314.22 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.24 [2025-06-19 14:51:27,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2288.70 | bwd: 3315.10 | bwd_inner: 3314.22 | bwd_allreduce: 0.82 | step: 7.25 9%|▉ | 883/10000 [1:21:48<14:01:22, 5.54s/it] {'loss': 0.1355, 'grad_norm': 0.8187962174415588, 'learning_rate': 3.9644530219764926e-05, 'epoch': 0.88} 9%|▉ | 883/10000 [1:21:48<14:01:22, 5.54s/it][2025-06-19 14:51:32,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.73 
[2025-06-19 14:51:32,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2161.96 | bwd_microstep: 3374.74 | bwd_inner_microstep: 3373.69 | bwd_allreduce_microstep: 0.95 | step_microstep: 9.11 [2025-06-19 14:51:32,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2161.96 | bwd: 3374.78 | bwd_inner: 3373.69 | bwd_allreduce: 1.00 | step: 9.12 9%|▉ | 884/10000 [1:21:53<14:03:36, 5.55s/it] {'loss': 0.149, 'grad_norm': 0.7868996858596802, 'learning_rate': 3.9643313364336026e-05, 'epoch': 0.88} 9%|▉ | 884/10000 [1:21:53<14:03:36, 5.55s/it][2025-06-19 14:51:38,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-19 14:51:38,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2156.14 | bwd_microstep: 3367.53 | bwd_inner_microstep: 3366.46 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.21 [2025-06-19 14:51:38,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2156.14 | bwd: 3367.56 | bwd_inner: 3366.46 | bwd_allreduce: 1.03 | step: 8.22 9%|▉ | 885/10000 [1:21:59<14:04:40, 5.56s/it] {'loss': 0.1317, 'grad_norm': 0.667934775352478, 'learning_rate': 3.9642094448414724e-05, 'epoch': 0.89} 9%|▉ | 885/10000 [1:21:59<14:04:40, 5.56s/it][2025-06-19 14:51:43,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:51:43,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.64 | bwd_microstep: 3323.76 | bwd_inner_microstep: 3322.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-19 14:51:43,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.64 | bwd: 3323.78 | bwd_inner: 3322.94 | bwd_allreduce: 0.78 | step: 6.81 9%|▉ | 886/10000 [1:22:04<14:02:29, 5.55s/it] {'loss': 0.1565, 'grad_norm': 0.7553960084915161, 'learning_rate': 3.9640873472128875e-05, 'epoch': 0.89} 9%|▉ | 
886/10000 [1:22:04<14:02:29, 5.55s/it][2025-06-19 14:51:49,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:51:49,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.24 | bwd_microstep: 3361.98 | bwd_inner_microstep: 3361.13 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.85 [2025-06-19 14:51:49,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.24 | bwd: 3361.99 | bwd_inner: 3361.14 | bwd_allreduce: 0.81 | step: 6.85 9%|▉ | 887/10000 [1:22:10<14:01:46, 5.54s/it] {'loss': 0.0813, 'grad_norm': 0.4032140374183655, 'learning_rate': 3.963965043560655e-05, 'epoch': 0.89} 9%|▉ | 887/10000 [1:22:10<14:01:46, 5.54s/it][2025-06-19 14:51:54,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:51:54,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.90 | bwd_microstep: 3318.74 | bwd_inner_microstep: 3317.86 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.49 [2025-06-19 14:51:54,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.90 | bwd: 3318.76 | bwd_inner: 3317.86 | bwd_allreduce: 0.84 | step: 7.49 9%|▉ | 888/10000 [1:22:15<13:58:57, 5.52s/it] {'loss': 0.0974, 'grad_norm': 0.5785618424415588, 'learning_rate': 3.963842533897605e-05, 'epoch': 0.89} 9%|▉ | 888/10000 [1:22:15<13:58:57, 5.52s/it][2025-06-19 14:52:00,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 14:52:00,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.16 | bwd_microstep: 3310.97 | bwd_inner_microstep: 3309.91 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.36 [2025-06-19 14:52:00,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.16 | bwd: 3310.99 | bwd_inner: 3309.91 
| bwd_allreduce: 1.03 | step: 7.36 9%|▉ | 889/10000 [1:22:21<13:56:17, 5.51s/it] {'loss': 0.0743, 'grad_norm': 0.5469874739646912, 'learning_rate': 3.963719818236588e-05, 'epoch': 0.89} 9%|▉ | 889/10000 [1:22:21<13:56:17, 5.51s/it][2025-06-19 14:52:05,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:52:05,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.15 | bwd_microstep: 3331.02 | bwd_inner_microstep: 3330.13 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.98 [2025-06-19 14:52:05,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.15 | bwd: 3331.05 | bwd_inner: 3330.13 | bwd_allreduce: 0.85 | step: 7.98 9%|▉ | 890/10000 [1:22:26<13:55:26, 5.50s/it] {'loss': 0.1883, 'grad_norm': 0.7435557842254639, 'learning_rate': 3.9635968965904755e-05, 'epoch': 0.89} 9%|▉ | 890/10000 [1:22:26<13:55:26, 5.50s/it][2025-06-19 14:52:11,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:52:11,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.11 | bwd_microstep: 3322.47 | bwd_inner_microstep: 3321.42 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.64 [2025-06-19 14:52:11,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.11 | bwd: 3322.49 | bwd_inner: 3321.42 | bwd_allreduce: 1.01 | step: 7.64 9%|▉ | 891/10000 [1:22:32<13:54:36, 5.50s/it] {'loss': 0.1986, 'grad_norm': 0.6696277856826782, 'learning_rate': 3.963473768972162e-05, 'epoch': 0.89} 9%|▉ | 891/10000 [1:22:32<13:54:36, 5.50s/it][2025-06-19 14:52:16,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:52:16,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.22 | bwd_microstep: 3331.02 | 
bwd_inner_microstep: 3330.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 14:52:16,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.22 | bwd: 3331.03 | bwd_inner: 3330.23 | bwd_allreduce: 0.76 | step: 6.72 9%|▉ | 892/10000 [1:22:37<13:54:08, 5.49s/it] {'loss': 0.0478, 'grad_norm': 0.6170392632484436, 'learning_rate': 3.963350435394563e-05, 'epoch': 0.89} 9%|▉ | 892/10000 [1:22:37<13:54:08, 5.49s/it][2025-06-19 14:52:22,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:52:22,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.05 | bwd_microstep: 3315.37 | bwd_inner_microstep: 3314.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 14:52:22,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.05 | bwd: 3315.39 | bwd_inner: 3314.58 | bwd_allreduce: 0.76 | step: 6.96 9%|▉ | 893/10000 [1:22:43<13:53:29, 5.49s/it] {'loss': 0.0636, 'grad_norm': 0.4756268262863159, 'learning_rate': 3.963226895870615e-05, 'epoch': 0.89} 9%|▉ | 893/10000 [1:22:43<13:53:29, 5.49s/it][2025-06-19 14:52:27,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 14:52:27,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.61 | bwd_microstep: 3368.42 | bwd_inner_microstep: 3367.31 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.80 [2025-06-19 14:52:27,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.61 | bwd: 3368.44 | bwd_inner: 3367.31 | bwd_allreduce: 1.08 | step: 7.81 9%|▉ | 894/10000 [1:22:48<13:56:03, 5.51s/it] {'loss': 0.1124, 'grad_norm': 0.42438453435897827, 'learning_rate': 3.963103150413277e-05, 'epoch': 0.89} 9%|▉ | 894/10000 [1:22:48<13:56:03, 5.51s/it][2025-06-19 14:52:33,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:52:33,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.29 | bwd_microstep: 3403.70 | bwd_inner_microstep: 3402.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 14:52:33,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.29 | bwd: 3403.72 | bwd_inner: 3402.92 | bwd_allreduce: 0.76 | step: 6.57 9%|▉ | 895/10000 [1:22:54<13:59:39, 5.53s/it] {'loss': 0.1441, 'grad_norm': 0.7315039038658142, 'learning_rate': 3.9629791990355306e-05, 'epoch': 0.9} 9%|▉ | 895/10000 [1:22:54<13:59:39, 5.53s/it][2025-06-19 14:52:39,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:52:39,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.47 | bwd_microstep: 3323.32 | bwd_inner_microstep: 3322.54 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.58 [2025-06-19 14:52:39,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.47 | bwd: 3323.33 | bwd_inner: 3322.54 | bwd_allreduce: 0.75 | step: 6.58 9%|▉ | 896/10000 [1:22:59<13:57:02, 5.52s/it] {'loss': 0.1703, 'grad_norm': 1.0005443096160889, 'learning_rate': 3.9628550417503765e-05, 'epoch': 0.9} 9%|▉ | 896/10000 [1:22:59<13:57:02, 5.52s/it][2025-06-19 14:52:44,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:52:44,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.83 | bwd_microstep: 3313.46 | bwd_inner_microstep: 3312.66 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 14:52:44,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.83 | bwd: 3313.47 | bwd_inner: 3312.66 | bwd_allreduce: 0.77 | step: 7.06 9%|▉ | 897/10000 [1:23:05<13:53:58, 5.50s/it] {'loss': 0.1234, 'grad_norm': 
0.6094174981117249, 'learning_rate': 3.962730678570838e-05, 'epoch': 0.9} 9%|▉ | 897/10000 [1:23:05<13:53:58, 5.50s/it][2025-06-19 14:52:49,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:52:49,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.50 | bwd_microstep: 3315.97 | bwd_inner_microstep: 3315.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 14:52:49,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.50 | bwd: 3315.98 | bwd_inner: 3315.18 | bwd_allreduce: 0.76 | step: 6.61 9%|▉ | 898/10000 [1:23:10<13:51:56, 5.48s/it] {'loss': 0.1089, 'grad_norm': 0.5352032780647278, 'learning_rate': 3.962606109509961e-05, 'epoch': 0.9} 9%|▉ | 898/10000 [1:23:10<13:51:56, 5.48s/it][2025-06-19 14:52:55,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 14:52:55,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.83 | bwd_microstep: 3311.05 | bwd_inner_microstep: 3310.12 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.03 [2025-06-19 14:52:55,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.83 | bwd: 3311.07 | bwd_inner: 3310.12 | bwd_allreduce: 0.90 | step: 7.04 9%|▉ | 899/10000 [1:23:16<13:50:20, 5.47s/it] {'loss': 0.1439, 'grad_norm': 0.5247402787208557, 'learning_rate': 3.9624813345808115e-05, 'epoch': 0.9} 9%|▉ | 899/10000 [1:23:16<13:50:20, 5.47s/it][2025-06-19 14:53:00,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:53:00,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.29 | bwd_microstep: 3365.46 | bwd_inner_microstep: 3364.64 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.27 [2025-06-19 14:53:00,908] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.29 | bwd: 3365.48 | bwd_inner: 3364.64 | bwd_allreduce: 0.79 | step: 7.27 9%|▉ | 900/10000 [1:23:21<13:53:16, 5.49s/it] {'loss': 0.1967, 'grad_norm': 0.8749784231185913, 'learning_rate': 3.9623563537964784e-05, 'epoch': 0.9} 9%|▉ | 900/10000 [1:23:21<13:53:16, 5.49s/it][2025-06-19 14:53:06,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:53:06,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.45 | bwd_microstep: 3363.62 | bwd_inner_microstep: 3362.59 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.04 [2025-06-19 14:53:06,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.45 | bwd: 3363.64 | bwd_inner: 3362.59 | bwd_allreduce: 0.99 | step: 7.05 9%|▉ | 901/10000 [1:23:27<13:54:36, 5.50s/it] {'loss': 0.1163, 'grad_norm': 0.6954323649406433, 'learning_rate': 3.962231167170071e-05, 'epoch': 0.9} 9%|▉ | 901/10000 [1:23:27<13:54:36, 5.50s/it][2025-06-19 14:53:12,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 14:53:12,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2154.75 | bwd_microstep: 3383.61 | bwd_inner_microstep: 3382.57 | bwd_allreduce_microstep: 0.97 | step_microstep: 8.19 [2025-06-19 14:53:12,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2154.75 | bwd: 3383.63 | bwd_inner: 3382.57 | bwd_allreduce: 1.00 | step: 8.21 9%|▉ | 902/10000 [1:23:32<13:58:10, 5.53s/it] {'loss': 0.1014, 'grad_norm': 0.4582231342792511, 'learning_rate': 3.962105774714722e-05, 'epoch': 0.9} 9%|▉ | 902/10000 [1:23:32<13:58:10, 5.53s/it][2025-06-19 14:53:17,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 14:53:17,530] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.51 | bwd_microstep: 3332.50 | bwd_inner_microstep: 3331.30 | bwd_allreduce_microstep: 1.03 | step_microstep: 10.59 [2025-06-19 14:53:17,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.51 | bwd: 3332.54 | bwd_inner: 3331.30 | bwd_allreduce: 1.13 | step: 10.60 9%|▉ | 903/10000 [1:23:38<13:58:04, 5.53s/it] {'loss': 0.1262, 'grad_norm': 0.6833867430686951, 'learning_rate': 3.961980176443583e-05, 'epoch': 0.9} 9%|▉ | 903/10000 [1:23:38<13:58:04, 5.53s/it][2025-06-19 14:53:23,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:53:23,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.24 | bwd_microstep: 3331.39 | bwd_inner_microstep: 3330.52 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.18 [2025-06-19 14:53:23,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.20 | bwd: 3331.42 | bwd_inner: 3330.52 | bwd_allreduce: 0.83 | step: 7.18 9%|▉ | 904/10000 [1:23:43<13:59:07, 5.54s/it] {'loss': 0.1331, 'grad_norm': 0.7064656019210815, 'learning_rate': 3.9618543723698294e-05, 'epoch': 0.9} 9%|▉ | 904/10000 [1:23:43<13:59:07, 5.54s/it][2025-06-19 14:53:28,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:53:28,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.08 | bwd_microstep: 3312.83 | bwd_inner_microstep: 3311.69 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.73 [2025-06-19 14:53:28,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.08 | bwd: 3312.84 | bwd_inner: 3311.69 | bwd_allreduce: 1.11 | step: 7.75 9%|▉ | 905/10000 [1:23:49<13:57:13, 5.52s/it] {'loss': 0.1438, 'grad_norm': 0.6209529638290405, 'learning_rate': 3.961728362506657e-05, 'epoch': 0.91} 9%|▉ | 905/10000 [1:23:49<13:57:13, 
5.52s/it][2025-06-19 14:53:34,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:53:34,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.05 | bwd_microstep: 3317.45 | bwd_inner_microstep: 3316.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 14:53:34,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.05 | bwd: 3317.46 | bwd_inner: 3316.66 | bwd_allreduce: 0.76 | step: 6.56 9%|▉ | 906/10000 [1:23:54<13:54:47, 5.51s/it] {'loss': 0.1318, 'grad_norm': 0.5496100783348083, 'learning_rate': 3.961602146867285e-05, 'epoch': 0.91} 9%|▉ | 906/10000 [1:23:54<13:54:47, 5.51s/it][2025-06-19 14:53:39,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:53:39,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.01 | bwd_microstep: 3322.43 | bwd_inner_microstep: 3321.59 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.76 [2025-06-19 14:53:39,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.01 | bwd: 3322.44 | bwd_inner: 3321.59 | bwd_allreduce: 0.80 | step: 6.76 9%|▉ | 907/10000 [1:24:00<13:52:51, 5.50s/it] {'loss': 0.1429, 'grad_norm': 0.5027588605880737, 'learning_rate': 3.9614757254649506e-05, 'epoch': 0.91} 9%|▉ | 907/10000 [1:24:00<13:52:51, 5.50s/it][2025-06-19 14:53:45,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:53:45,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.66 | bwd_microstep: 3367.25 | bwd_inner_microstep: 3366.35 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.49 [2025-06-19 14:53:45,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.66 | bwd: 3367.27 | bwd_inner: 3366.35 | bwd_allreduce: 0.85 | 
step: 7.48 9%|▉ | 908/10000 [1:24:05<13:55:38, 5.51s/it] {'loss': 0.072, 'grad_norm': 0.2775585353374481, 'learning_rate': 3.9613490983129175e-05, 'epoch': 0.91} 9%|▉ | 908/10000 [1:24:05<13:55:38, 5.51s/it][2025-06-19 14:53:50,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:53:50,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.31 | bwd_microstep: 3319.44 | bwd_inner_microstep: 3318.62 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 14:53:50,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.31 | bwd: 3319.46 | bwd_inner: 3318.62 | bwd_allreduce: 0.78 | step: 7.07 9%|▉ | 909/10000 [1:24:11<13:54:37, 5.51s/it] {'loss': 0.1351, 'grad_norm': 0.688682496547699, 'learning_rate': 3.9612222654244664e-05, 'epoch': 0.91} 9%|▉ | 909/10000 [1:24:11<13:54:37, 5.51s/it][2025-06-19 14:53:56,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 14:53:56,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.75 | bwd_microstep: 3366.09 | bwd_inner_microstep: 3365.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 14:53:56,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.75 | bwd: 3366.11 | bwd_inner: 3365.31 | bwd_allreduce: 0.76 | step: 6.52 9%|▉ | 910/10000 [1:24:16<13:55:49, 5.52s/it] {'loss': 0.1137, 'grad_norm': 0.4715980589389801, 'learning_rate': 3.961095226812902e-05, 'epoch': 0.91} 9%|▉ | 910/10000 [1:24:16<13:55:49, 5.52s/it][2025-06-19 14:54:01,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:54:01,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.46 | bwd_microstep: 3309.85 | bwd_inner_microstep: 3309.07 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 14:54:01,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.46 | bwd: 3309.86 | bwd_inner: 3309.07 | bwd_allreduce: 0.75 | step: 6.59 9%|▉ | 911/10000 [1:24:22<13:52:49, 5.50s/it] {'loss': 0.1459, 'grad_norm': 0.745646595954895, 'learning_rate': 3.960967982491549e-05, 'epoch': 0.91} 9%|▉ | 911/10000 [1:24:22<13:52:49, 5.50s/it][2025-06-19 14:54:07,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 14:54:07,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.70 | bwd_microstep: 3316.78 | bwd_inner_microstep: 3315.66 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.43 [2025-06-19 14:54:07,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.70 | bwd: 3316.80 | bwd_inner: 3315.66 | bwd_allreduce: 1.08 | step: 7.43 9%|▉ | 912/10000 [1:24:27<13:52:02, 5.49s/it] {'loss': 0.1689, 'grad_norm': 0.5855765342712402, 'learning_rate': 3.9608405324737564e-05, 'epoch': 0.91} 9%|▉ | 912/10000 [1:24:27<13:52:02, 5.49s/it][2025-06-19 14:54:12,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:54:12,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.47 | bwd_microstep: 3367.55 | bwd_inner_microstep: 3366.62 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.10 [2025-06-19 14:54:12,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.47 | bwd: 3367.57 | bwd_inner: 3366.62 | bwd_allreduce: 0.90 | step: 7.10 9%|▉ | 913/10000 [1:24:33<13:54:11, 5.51s/it] {'loss': 0.0932, 'grad_norm': 0.5593093633651733, 'learning_rate': 3.960712876772893e-05, 'epoch': 0.91} 9%|▉ | 913/10000 [1:24:33<13:54:11, 5.51s/it][2025-06-19 14:54:18,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:54:18,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.78 | bwd_microstep: 3315.64 | bwd_inner_microstep: 3314.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 14:54:18,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.78 | bwd: 3315.66 | bwd_inner: 3314.86 | bwd_allreduce: 0.75 | step: 6.66 9%|▉ | 914/10000 [1:24:38<13:51:34, 5.49s/it] {'loss': 0.1069, 'grad_norm': 0.34165525436401367, 'learning_rate': 3.9605850154023485e-05, 'epoch': 0.91} 9%|▉ | 914/10000 [1:24:38<13:51:34, 5.49s/it][2025-06-19 14:54:23,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:54:23,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.37 | bwd_microstep: 3312.04 | bwd_inner_microstep: 3311.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 14:54:23,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.37 | bwd: 3312.06 | bwd_inner: 3311.26 | bwd_allreduce: 0.75 | step: 6.55 9%|▉ | 915/10000 [1:24:44<13:49:35, 5.48s/it] {'loss': 0.2, 'grad_norm': 0.9226849675178528, 'learning_rate': 3.960456948375535e-05, 'epoch': 0.92} 9%|▉ | 915/10000 [1:24:44<13:49:35, 5.48s/it][2025-06-19 14:54:29,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:54:29,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.02 | bwd_microstep: 3372.75 | bwd_inner_microstep: 3371.88 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.86 [2025-06-19 14:54:29,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.02 | bwd: 3372.76 | bwd_inner: 3371.88 | bwd_allreduce: 0.84 | step: 6.87 9%|▉ | 916/10000 [1:24:49<13:52:09, 5.50s/it] {'loss': 0.1134, 'grad_norm': 0.46372902393341064, 
'learning_rate': 3.960328675705886e-05, 'epoch': 0.92} 9%|▉ | 916/10000 [1:24:49<13:52:09, 5.50s/it][2025-06-19 14:54:34,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 14:54:34,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.50 | bwd_microstep: 3312.65 | bwd_inner_microstep: 3311.55 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.67 [2025-06-19 14:54:34,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.50 | bwd: 3312.67 | bwd_inner: 3311.55 | bwd_allreduce: 1.06 | step: 7.67 9%|▉ | 917/10000 [1:24:55<13:50:29, 5.49s/it] {'loss': 0.1031, 'grad_norm': 0.4173961877822876, 'learning_rate': 3.960200197406858e-05, 'epoch': 0.92} 9%|▉ | 917/10000 [1:24:55<13:50:29, 5.49s/it][2025-06-19 14:54:39,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:54:39,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.83 | bwd_microstep: 3319.66 | bwd_inner_microstep: 3318.72 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.82 [2025-06-19 14:54:39,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.83 | bwd: 3319.69 | bwd_inner: 3318.72 | bwd_allreduce: 0.90 | step: 7.82 9%|▉ | 918/10000 [1:25:00<13:49:45, 5.48s/it] {'loss': 0.1265, 'grad_norm': 0.565726637840271, 'learning_rate': 3.9600715134919266e-05, 'epoch': 0.92} 9%|▉ | 918/10000 [1:25:00<13:49:45, 5.48s/it][2025-06-19 14:54:45,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:54:45,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.01 | bwd_microstep: 3303.88 | bwd_inner_microstep: 3303.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 14:54:45,423] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2102.01 | bwd: 3303.89 | bwd_inner: 3303.08 | bwd_allreduce: 0.76 | step: 6.67 9%|▉ | 919/10000 [1:25:06<13:48:10, 5.47s/it] {'loss': 0.0835, 'grad_norm': 0.40795186161994934, 'learning_rate': 3.95994262397459e-05, 'epoch': 0.92} 9%|▉ | 919/10000 [1:25:06<13:48:10, 5.47s/it][2025-06-19 14:54:50,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-19 14:54:50,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.48 | bwd_microstep: 3315.53 | bwd_inner_microstep: 3314.09 | bwd_allreduce_microstep: 1.34 | step_microstep: 10.53 [2025-06-19 14:54:50,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.48 | bwd: 3315.57 | bwd_inner: 3314.09 | bwd_allreduce: 1.39 | step: 10.53 9%|▉ | 920/10000 [1:25:11<13:48:01, 5.47s/it] {'loss': 0.0911, 'grad_norm': 0.35353782773017883, 'learning_rate': 3.959813528868369e-05, 'epoch': 0.92} 9%|▉ | 920/10000 [1:25:11<13:48:01, 5.47s/it][2025-06-19 14:54:56,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-19 14:54:56,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2162.87 | bwd_microstep: 3373.62 | bwd_inner_microstep: 3372.38 | bwd_allreduce_microstep: 1.15 | step_microstep: 9.67 [2025-06-19 14:54:56,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2162.88 | bwd: 3373.65 | bwd_inner: 3372.38 | bwd_allreduce: 1.19 | step: 9.69 9%|▉ | 921/10000 [1:25:17<13:53:45, 5.51s/it] {'loss': 0.1788, 'grad_norm': 0.5883943438529968, 'learning_rate': 3.959684228186804e-05, 'epoch': 0.92} 9%|▉ | 921/10000 [1:25:17<13:53:45, 5.51s/it][2025-06-19 14:55:02,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 14:55:02,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2148.43 | bwd_microstep: 3365.50 | bwd_inner_microstep: 3364.27 | bwd_allreduce_microstep: 1.15 | step_microstep: 8.38 [2025-06-19 14:55:02,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.43 | bwd: 3365.51 | bwd_inner: 3364.27 | bwd_allreduce: 1.19 | step: 8.39 9%|▉ | 922/10000 [1:25:22<13:56:37, 5.53s/it] {'loss': 0.0828, 'grad_norm': 0.29434481263160706, 'learning_rate': 3.959554721943459e-05, 'epoch': 0.92} 9%|▉ | 922/10000 [1:25:22<13:56:37, 5.53s/it][2025-06-19 14:55:07,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:55:07,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2177.64 | bwd_microstep: 3369.76 | bwd_inner_microstep: 3368.73 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.66 [2025-06-19 14:55:07,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2177.64 | bwd: 3369.78 | bwd_inner: 3368.73 | bwd_allreduce: 0.98 | step: 7.67 9%|▉ | 923/10000 [1:25:28<13:59:57, 5.55s/it] {'loss': 0.0869, 'grad_norm': 0.2563900947570801, 'learning_rate': 3.959425010151919e-05, 'epoch': 0.92} 9%|▉ | 923/10000 [1:25:28<13:59:57, 5.55s/it][2025-06-19 14:55:13,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 14:55:13,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.93 | bwd_microstep: 3315.85 | bwd_inner_microstep: 3314.87 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.80 [2025-06-19 14:55:13,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.93 | bwd: 3315.86 | bwd_inner: 3314.87 | bwd_allreduce: 0.95 | step: 7.80 9%|▉ | 924/10000 [1:25:33<13:57:53, 5.54s/it] {'loss': 0.0782, 'grad_norm': 0.3409522473812103, 'learning_rate': 3.9592950928257886e-05, 'epoch': 0.92} 9%|▉ | 924/10000 [1:25:33<13:57:53, 5.54s/it][2025-06-19 14:55:18,724] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:55:18,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.91 | bwd_microstep: 3364.91 | bwd_inner_microstep: 3364.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 14:55:18,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.91 | bwd: 3364.92 | bwd_inner: 3364.12 | bwd_allreduce: 0.76 | step: 6.60 9%|▉ | 925/10000 [1:25:39<13:57:56, 5.54s/it] {'loss': 0.114, 'grad_norm': 0.5070955753326416, 'learning_rate': 3.9591649699786965e-05, 'epoch': 0.93} 9%|▉ | 925/10000 [1:25:39<13:57:56, 5.54s/it][2025-06-19 14:55:24,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:55:24,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.00 | bwd_microstep: 3359.04 | bwd_inner_microstep: 3358.08 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.32 [2025-06-19 14:55:24,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.00 | bwd: 3359.06 | bwd_inner: 3358.08 | bwd_allreduce: 0.92 | step: 7.32 9%|▉ | 926/10000 [1:25:45<13:57:06, 5.54s/it] {'loss': 0.0888, 'grad_norm': 0.3331439197063446, 'learning_rate': 3.959034641624292e-05, 'epoch': 0.93} 9%|▉ | 926/10000 [1:25:45<13:57:06, 5.54s/it][2025-06-19 14:55:29,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 14:55:29,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.26 | bwd_microstep: 3367.48 | bwd_inner_microstep: 3366.62 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.77 [2025-06-19 14:55:29,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.26 | bwd: 3367.49 | bwd_inner: 3366.62 | bwd_allreduce: 0.83 | step: 6.78 9%|▉ | 927/10000 [1:25:50<13:57:38, 
5.54s/it] {'loss': 0.0963, 'grad_norm': 0.4871496260166168, 'learning_rate': 3.9589041077762445e-05, 'epoch': 0.93} 9%|▉ | 927/10000 [1:25:50<13:57:38, 5.54s/it][2025-06-19 14:55:35,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.89 [2025-06-19 14:55:35,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.74 | bwd_microstep: 3355.70 | bwd_inner_microstep: 3354.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 14:55:35,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.74 | bwd: 3355.72 | bwd_inner: 3354.90 | bwd_allreduce: 0.77 | step: 7.08 9%|▉ | 928/10000 [1:25:56<13:57:09, 5.54s/it] {'loss': 0.1106, 'grad_norm': 0.671501874923706, 'learning_rate': 3.958773368448249e-05, 'epoch': 0.93} 9%|▉ | 928/10000 [1:25:56<13:57:09, 5.54s/it][2025-06-19 14:55:40,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.94 [2025-06-19 14:55:40,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.84 | bwd_microstep: 3307.98 | bwd_inner_microstep: 3306.87 | bwd_allreduce_microstep: 1.05 | step_microstep: 9.14 [2025-06-19 14:55:40,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.84 | bwd: 3307.99 | bwd_inner: 3306.87 | bwd_allreduce: 1.07 | step: 9.16 9%|▉ | 929/10000 [1:26:01<13:54:16, 5.52s/it] {'loss': 0.0924, 'grad_norm': 0.5202027559280396, 'learning_rate': 3.958642423654018e-05, 'epoch': 0.93} 9%|▉ | 929/10000 [1:26:01<13:54:16, 5.52s/it][2025-06-19 14:55:46,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 14:55:46,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.87 | bwd_microstep: 3321.58 | bwd_inner_microstep: 3320.63 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.57 
[2025-06-19 14:55:46,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.87 | bwd: 3321.60 | bwd_inner: 3320.63 | bwd_allreduce: 0.93 | step: 7.59 9%|▉ | 930/10000 [1:26:07<13:52:02, 5.50s/it] {'loss': 0.0841, 'grad_norm': 0.2949967682361603, 'learning_rate': 3.9585112734072865e-05, 'epoch': 0.93} 9%|▉ | 930/10000 [1:26:07<13:52:02, 5.50s/it][2025-06-19 14:55:51,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:55:51,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.25 | bwd_microstep: 3322.25 | bwd_inner_microstep: 3321.07 | bwd_allreduce_microstep: 1.11 | step_microstep: 7.63 [2025-06-19 14:55:51,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.25 | bwd: 3322.28 | bwd_inner: 3321.07 | bwd_allreduce: 1.15 | step: 7.63 9%|▉ | 931/10000 [1:26:12<13:50:59, 5.50s/it] {'loss': 0.1006, 'grad_norm': 0.657964289188385, 'learning_rate': 3.9583799177218126e-05, 'epoch': 0.93} 9%|▉ | 931/10000 [1:26:12<13:50:59, 5.50s/it][2025-06-19 14:55:57,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:55:57,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.68 | bwd_microstep: 3319.11 | bwd_inner_microstep: 3318.29 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-19 14:55:57,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.68 | bwd: 3319.12 | bwd_inner: 3318.29 | bwd_allreduce: 0.79 | step: 7.24 9%|▉ | 932/10000 [1:26:18<13:49:46, 5.49s/it] {'loss': 0.0781, 'grad_norm': 0.32835161685943604, 'learning_rate': 3.958248356611375e-05, 'epoch': 0.93} 9%|▉ | 932/10000 [1:26:18<13:49:46, 5.49s/it][2025-06-19 14:56:02,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
14:56:02,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.54 | bwd_microstep: 3390.88 | bwd_inner_microstep: 3390.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-19 14:56:02,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.54 | bwd: 3390.90 | bwd_inner: 3390.08 | bwd_allreduce: 0.77 | step: 6.91 9%|▉ | 933/10000 [1:26:23<13:53:39, 5.52s/it] {'loss': 0.0794, 'grad_norm': 0.36563336849212646, 'learning_rate': 3.958116590089772e-05, 'epoch': 0.93} 9%|▉ | 933/10000 [1:26:23<13:53:39, 5.52s/it][2025-06-19 14:56:08,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:56:08,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.05 | bwd_microstep: 3365.19 | bwd_inner_microstep: 3364.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.28 [2025-06-19 14:56:08,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.05 | bwd: 3365.20 | bwd_inner: 3364.38 | bwd_allreduce: 0.78 | step: 7.28 9%|▉ | 934/10000 [1:26:29<13:54:17, 5.52s/it] {'loss': 0.1192, 'grad_norm': 1.0907188653945923, 'learning_rate': 3.9579846181708275e-05, 'epoch': 0.93} 9%|▉ | 934/10000 [1:26:29<13:54:17, 5.52s/it][2025-06-19 14:56:13,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:56:13,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.69 | bwd_microstep: 3320.46 | bwd_inner_microstep: 3319.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 14:56:13,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.69 | bwd: 3320.48 | bwd_inner: 3319.67 | bwd_allreduce: 0.77 | step: 6.67 9%|▉ | 935/10000 [1:26:34<13:51:47, 5.51s/it] {'loss': 0.1602, 'grad_norm': 0.6687633395195007, 'learning_rate': 3.9578524408683835e-05, 'epoch': 0.94} 9%|▉ | 935/10000 
[1:26:34<13:51:47, 5.51s/it][2025-06-19 14:56:19,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:56:19,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.18 | bwd_microstep: 3313.83 | bwd_inner_microstep: 3313.00 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.38 [2025-06-19 14:56:19,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.18 | bwd: 3313.84 | bwd_inner: 3313.00 | bwd_allreduce: 0.79 | step: 7.38 9%|▉ | 936/10000 [1:26:40<13:49:28, 5.49s/it] {'loss': 0.086, 'grad_norm': 0.5475336909294128, 'learning_rate': 3.957720058196305e-05, 'epoch': 0.94} 9%|▉ | 936/10000 [1:26:40<13:49:28, 5.49s/it][2025-06-19 14:56:24,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:56:24,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.50 | bwd_microstep: 3330.51 | bwd_inner_microstep: 3329.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 14:56:24,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.50 | bwd: 3330.53 | bwd_inner: 3329.72 | bwd_allreduce: 0.77 | step: 6.75 9%|▉ | 937/10000 [1:26:45<13:48:50, 5.49s/it] {'loss': 0.1108, 'grad_norm': 0.5637602210044861, 'learning_rate': 3.957587470168479e-05, 'epoch': 0.94} 9%|▉ | 937/10000 [1:26:45<13:48:50, 5.49s/it][2025-06-19 14:56:30,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:56:30,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.63 | bwd_microstep: 3324.26 | bwd_inner_microstep: 3323.45 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 14:56:30,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.63 | bwd: 3324.28 | bwd_inner: 3323.45 | 
bwd_allreduce: 0.78 | step: 6.95 9%|▉ | 938/10000 [1:26:51<13:47:45, 5.48s/it] {'loss': 0.1351, 'grad_norm': 1.1121511459350586, 'learning_rate': 3.957454676798812e-05, 'epoch': 0.94} 9%|▉ | 938/10000 [1:26:51<13:47:45, 5.48s/it][2025-06-19 14:56:35,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:56:35,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.53 | bwd_microstep: 3325.54 | bwd_inner_microstep: 3324.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 14:56:35,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.53 | bwd: 3325.55 | bwd_inner: 3324.73 | bwd_allreduce: 0.78 | step: 7.16 9%|▉ | 939/10000 [1:26:56<13:47:06, 5.48s/it] {'loss': 0.0995, 'grad_norm': 0.621331512928009, 'learning_rate': 3.957321678101235e-05, 'epoch': 0.94} 9%|▉ | 939/10000 [1:26:56<13:47:06, 5.48s/it][2025-06-19 14:56:41,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 14:56:41,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.20 | bwd_microstep: 3405.09 | bwd_inner_microstep: 3404.21 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.42 [2025-06-19 14:56:41,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.20 | bwd: 3405.12 | bwd_inner: 3404.21 | bwd_allreduce: 0.84 | step: 7.43 9%|▉ | 940/10000 [1:27:02<13:52:38, 5.51s/it] {'loss': 0.0867, 'grad_norm': 0.7601809501647949, 'learning_rate': 3.957188474089698e-05, 'epoch': 0.94} 9%|▉ | 940/10000 [1:27:02<13:52:38, 5.51s/it][2025-06-19 14:56:46,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.90 [2025-06-19 14:56:46,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.97 | bwd_microstep: 3329.28 | bwd_inner_microstep: 
3328.35 | bwd_allreduce_microstep: 0.85 | step_microstep: 8.47 [2025-06-19 14:56:46,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.97 | bwd: 3329.30 | bwd_inner: 3328.35 | bwd_allreduce: 0.88 | step: 8.48 9%|▉ | 941/10000 [1:27:07<13:52:53, 5.52s/it] {'loss': 0.123, 'grad_norm': 1.0158611536026, 'learning_rate': 3.9570550647781735e-05, 'epoch': 0.94} 9%|▉ | 941/10000 [1:27:07<13:52:53, 5.52s/it][2025-06-19 14:56:52,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:56:52,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2179.60 | bwd_microstep: 3405.22 | bwd_inner_microstep: 3404.34 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.31 [2025-06-19 14:56:52,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2179.60 | bwd: 3405.25 | bwd_inner: 3404.34 | bwd_allreduce: 0.84 | step: 7.32 9%|▉ | 942/10000 [1:27:13<13:58:17, 5.55s/it] {'loss': 0.1035, 'grad_norm': 0.6450008749961853, 'learning_rate': 3.9569214501806555e-05, 'epoch': 0.94} 9%|▉ | 942/10000 [1:27:13<13:58:17, 5.55s/it][2025-06-19 14:56:58,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 14:56:58,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2162.86 | bwd_microstep: 3376.53 | bwd_inner_microstep: 3375.44 | bwd_allreduce_microstep: 1.02 | step_microstep: 9.03 [2025-06-19 14:56:58,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2162.86 | bwd: 3376.56 | bwd_inner: 3375.44 | bwd_allreduce: 1.06 | step: 9.04 9%|▉ | 943/10000 [1:27:18<14:00:02, 5.57s/it] {'loss': 0.1006, 'grad_norm': 0.5434111952781677, 'learning_rate': 3.9567876303111606e-05, 'epoch': 0.94} 9%|▉ | 943/10000 [1:27:18<14:00:02, 5.57s/it][2025-06-19 14:57:03,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.60 | optimizer_step: 2.89 [2025-06-19 14:57:03,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2172.84 | bwd_microstep: 3378.59 | bwd_inner_microstep: 3377.60 | bwd_allreduce_microstep: 0.88 | step_microstep: 8.23 [2025-06-19 14:57:03,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2172.84 | bwd: 3378.63 | bwd_inner: 3377.60 | bwd_allreduce: 0.93 | step: 8.24 9%|▉ | 944/10000 [1:27:24<14:01:35, 5.58s/it] {'loss': 0.1097, 'grad_norm': 0.8846123218536377, 'learning_rate': 3.9566536051837244e-05, 'epoch': 0.94} 9%|▉ | 944/10000 [1:27:24<14:01:35, 5.58s/it][2025-06-19 14:57:09,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 14:57:09,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.46 | bwd_microstep: 3372.33 | bwd_inner_microstep: 3371.50 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.06 [2025-06-19 14:57:09,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.47 | bwd: 3372.35 | bwd_inner: 3371.50 | bwd_allreduce: 0.80 | step: 7.07 9%|▉ | 945/10000 [1:27:29<14:00:33, 5.57s/it] {'loss': 0.1581, 'grad_norm': 1.0350284576416016, 'learning_rate': 3.9565193748124056e-05, 'epoch': 0.94} 9%|▉ | 945/10000 [1:27:29<14:00:33, 5.57s/it][2025-06-19 14:57:14,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:57:14,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.86 | bwd_microstep: 3340.17 | bwd_inner_microstep: 3339.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 14:57:14,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.86 | bwd: 3340.19 | bwd_inner: 3339.38 | bwd_allreduce: 0.77 | step: 7.02 9%|▉ | 946/10000 [1:27:35<13:57:08, 5.55s/it] {'loss': 0.1292, 'grad_norm': 0.6629927158355713, 
'learning_rate': 3.956384939211285e-05, 'epoch': 0.95} 9%|▉ | 946/10000 [1:27:35<13:57:08, 5.55s/it][2025-06-19 14:57:20,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 14:57:20,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.21 | bwd_microstep: 3329.98 | bwd_inner_microstep: 3328.89 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.91 [2025-06-19 14:57:20,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.21 | bwd: 3330.01 | bwd_inner: 3328.89 | bwd_allreduce: 1.05 | step: 7.92 9%|▉ | 947/10000 [1:27:40<13:55:00, 5.53s/it] {'loss': 0.1911, 'grad_norm': 1.287974238395691, 'learning_rate': 3.956250298394465e-05, 'epoch': 0.95} 9%|▉ | 947/10000 [1:27:40<13:55:00, 5.53s/it][2025-06-19 14:57:25,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 14:57:25,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.55 | bwd_microstep: 3390.00 | bwd_inner_microstep: 3389.09 | bwd_allreduce_microstep: 0.83 | step_microstep: 8.43 [2025-06-19 14:57:25,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.55 | bwd: 3390.03 | bwd_inner: 3389.09 | bwd_allreduce: 0.86 | step: 8.44 9%|▉ | 948/10000 [1:27:46<13:56:44, 5.55s/it] {'loss': 0.132, 'grad_norm': 1.2234538793563843, 'learning_rate': 3.956115452376067e-05, 'epoch': 0.95} 9%|▉ | 948/10000 [1:27:46<13:56:44, 5.55s/it][2025-06-19 14:57:31,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:57:31,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.71 | bwd_microstep: 3335.00 | bwd_inner_microstep: 3334.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 14:57:31,268] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2126.71 | bwd: 3335.02 | bwd_inner: 3334.21 | bwd_allreduce: 0.76 | step: 6.65 9%|▉ | 949/10000 [1:27:52<13:54:48, 5.53s/it] {'loss': 0.1598, 'grad_norm': 1.2967275381088257, 'learning_rate': 3.9559804011702375e-05, 'epoch': 0.95} 9%|▉ | 949/10000 [1:27:52<13:54:48, 5.53s/it][2025-06-19 14:57:36,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:57:36,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.58 | bwd_microstep: 3329.69 | bwd_inner_microstep: 3328.88 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-19 14:57:36,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.58 | bwd: 3329.71 | bwd_inner: 3328.88 | bwd_allreduce: 0.78 | step: 6.93 10%|▉ | 950/10000 [1:27:57<13:52:14, 5.52s/it] {'loss': 0.0583, 'grad_norm': 0.36266157031059265, 'learning_rate': 3.955845144791142e-05, 'epoch': 0.95} 10%|▉ | 950/10000 [1:27:57<13:52:14, 5.52s/it][2025-06-19 14:57:42,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 14:57:42,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.08 | bwd_microstep: 3329.41 | bwd_inner_microstep: 3328.52 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.62 [2025-06-19 14:57:42,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.08 | bwd: 3329.44 | bwd_inner: 3328.52 | bwd_allreduce: 0.85 | step: 7.62 10%|▉ | 951/10000 [1:28:03<13:50:57, 5.51s/it] {'loss': 0.2265, 'grad_norm': 1.4956556558609009, 'learning_rate': 3.955709683252967e-05, 'epoch': 0.95} 10%|▉ | 951/10000 [1:28:03<13:50:57, 5.51s/it][2025-06-19 14:57:47,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:57:47,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2138.69 | bwd_microstep: 3402.93 | bwd_inner_microstep: 3402.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 14:57:47,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.69 | bwd: 3402.95 | bwd_inner: 3402.14 | bwd_allreduce: 0.76 | step: 6.64 10%|▉ | 952/10000 [1:28:08<13:54:07, 5.53s/it] {'loss': 0.1828, 'grad_norm': 0.9449061751365662, 'learning_rate': 3.955574016569924e-05, 'epoch': 0.95} 10%|▉ | 952/10000 [1:28:08<13:54:07, 5.53s/it][2025-06-19 14:57:53,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:57:53,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.05 | bwd_microstep: 3324.83 | bwd_inner_microstep: 3324.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 14:57:53,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.05 | bwd: 3324.84 | bwd_inner: 3324.04 | bwd_allreduce: 0.76 | step: 6.60 10%|▉ | 953/10000 [1:28:14<13:51:14, 5.51s/it] {'loss': 0.1267, 'grad_norm': 1.0175763368606567, 'learning_rate': 3.955438144756242e-05, 'epoch': 0.95} 10%|▉ | 953/10000 [1:28:14<13:51:14, 5.51s/it][2025-06-19 14:57:58,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:57:58,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.49 | bwd_microstep: 3327.88 | bwd_inner_microstep: 3327.03 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.92 [2025-06-19 14:57:58,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.49 | bwd: 3327.90 | bwd_inner: 3327.03 | bwd_allreduce: 0.82 | step: 6.92 10%|▉ | 954/10000 [1:28:19<13:49:24, 5.50s/it] {'loss': 0.1134, 'grad_norm': 0.9995365738868713, 'learning_rate': 3.955302067826175e-05, 'epoch': 0.95} 10%|▉ | 954/10000 [1:28:19<13:49:24, 5.50s/it][2025-06-19 14:58:04,327] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 14:58:04,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.58 | bwd_microstep: 3387.77 | bwd_inner_microstep: 3386.85 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.95 [2025-06-19 14:58:04,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.58 | bwd: 3387.78 | bwd_inner: 3386.85 | bwd_allreduce: 0.89 | step: 6.95 10%|▉ | 955/10000 [1:28:25<13:52:07, 5.52s/it] {'loss': 0.0902, 'grad_norm': 0.8614029288291931, 'learning_rate': 3.955165785793995e-05, 'epoch': 0.95} 10%|▉ | 955/10000 [1:28:25<13:52:07, 5.52s/it][2025-06-19 14:58:09,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:58:09,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.29 | bwd_microstep: 3318.20 | bwd_inner_microstep: 3317.28 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.22 [2025-06-19 14:58:09,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.29 | bwd: 3318.22 | bwd_inner: 3317.28 | bwd_allreduce: 0.89 | step: 7.22 10%|▉ | 956/10000 [1:28:30<13:49:42, 5.50s/it] {'loss': 0.1186, 'grad_norm': 0.895531952381134, 'learning_rate': 3.955029298673999e-05, 'epoch': 0.96} 10%|▉ | 956/10000 [1:28:30<13:49:42, 5.50s/it][2025-06-19 14:58:15,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:58:15,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.09 | bwd_microstep: 3333.72 | bwd_inner_microstep: 3332.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 14:58:15,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.09 | bwd: 3333.73 | bwd_inner: 3332.92 | bwd_allreduce: 0.77 | step: 6.82 10%|▉ | 957/10000 
[1:28:36<13:49:07, 5.50s/it] {'loss': 0.0949, 'grad_norm': 0.5959306359291077, 'learning_rate': 3.954892606480503e-05, 'epoch': 0.96} 10%|▉ | 957/10000 [1:28:36<13:49:07, 5.50s/it][2025-06-19 14:58:20,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:58:20,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.16 | bwd_microstep: 3382.69 | bwd_inner_microstep: 3381.75 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.06 [2025-06-19 14:58:20,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.16 | bwd: 3382.71 | bwd_inner: 3381.75 | bwd_allreduce: 0.90 | step: 7.06 10%|▉ | 958/10000 [1:28:41<13:51:46, 5.52s/it] {'loss': 0.114, 'grad_norm': 0.5911118984222412, 'learning_rate': 3.9547557092278456e-05, 'epoch': 0.96} 10%|▉ | 958/10000 [1:28:41<13:51:46, 5.52s/it][2025-06-19 14:58:26,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 14:58:26,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.87 | bwd_microstep: 3380.03 | bwd_inner_microstep: 3379.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.89 [2025-06-19 14:58:26,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.87 | bwd: 3380.04 | bwd_inner: 3379.24 | bwd_allreduce: 0.76 | step: 6.89 10%|▉ | 959/10000 [1:28:47<13:53:08, 5.53s/it] {'loss': 0.1783, 'grad_norm': 0.6508193612098694, 'learning_rate': 3.9546186069303856e-05, 'epoch': 0.96} 10%|▉ | 959/10000 [1:28:47<13:53:08, 5.53s/it][2025-06-19 14:58:31,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:58:31,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.49 | bwd_microstep: 3324.80 | bwd_inner_microstep: 3323.93 | bwd_allreduce_microstep: 0.79 | 
step_microstep: 7.32 [2025-06-19 14:58:31,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.49 | bwd: 3324.82 | bwd_inner: 3323.93 | bwd_allreduce: 0.83 | step: 7.33 10%|▉ | 960/10000 [1:28:52<13:51:21, 5.52s/it] {'loss': 0.1551, 'grad_norm': 0.5240691304206848, 'learning_rate': 3.954481299602506e-05, 'epoch': 0.96} 10%|▉ | 960/10000 [1:28:52<13:51:21, 5.52s/it][2025-06-19 14:58:37,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 14:58:37,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.06 | bwd_microstep: 3334.43 | bwd_inner_microstep: 3333.45 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.58 [2025-06-19 14:58:37,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.06 | bwd: 3334.47 | bwd_inner: 3333.45 | bwd_allreduce: 0.92 | step: 7.59 10%|▉ | 961/10000 [1:28:58<13:51:41, 5.52s/it] {'loss': 0.1787, 'grad_norm': 1.1073660850524902, 'learning_rate': 3.9543437872586094e-05, 'epoch': 0.96} 10%|▉ | 961/10000 [1:28:58<13:51:41, 5.52s/it][2025-06-19 14:58:42,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-19 14:58:42,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.21 | bwd_microstep: 3337.54 | bwd_inner_microstep: 3336.24 | bwd_allreduce_microstep: 1.20 | step_microstep: 8.90 [2025-06-19 14:58:42,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.21 | bwd: 3337.57 | bwd_inner: 3336.24 | bwd_allreduce: 1.25 | step: 8.91 10%|▉ | 962/10000 [1:29:03<13:51:58, 5.52s/it] {'loss': 0.0765, 'grad_norm': 0.689035952091217, 'learning_rate': 3.95420606991312e-05, 'epoch': 0.96} 10%|▉ | 962/10000 [1:29:03<13:51:58, 5.52s/it][2025-06-19 14:58:48,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 
2.73 [2025-06-19 14:58:48,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.92 | bwd_microstep: 3384.31 | bwd_inner_microstep: 3383.44 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.50 [2025-06-19 14:58:48,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.92 | bwd: 3384.33 | bwd_inner: 3383.44 | bwd_allreduce: 0.83 | step: 7.50 10%|▉ | 963/10000 [1:29:09<13:55:10, 5.54s/it] {'loss': 0.0911, 'grad_norm': 0.5861360430717468, 'learning_rate': 3.9540681475804834e-05, 'epoch': 0.96} 10%|▉ | 963/10000 [1:29:09<13:55:10, 5.54s/it][2025-06-19 14:58:54,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:58:54,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.88 | bwd_microstep: 3383.24 | bwd_inner_microstep: 3382.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 14:58:54,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.88 | bwd: 3383.25 | bwd_inner: 3382.46 | bwd_allreduce: 0.75 | step: 6.61 10%|▉ | 964/10000 [1:29:14<13:56:07, 5.55s/it] {'loss': 0.1604, 'grad_norm': 1.090343713760376, 'learning_rate': 3.953930020275167e-05, 'epoch': 0.96} 10%|▉ | 964/10000 [1:29:14<13:56:07, 5.55s/it][2025-06-19 14:58:59,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 14:58:59,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.75 | bwd_microstep: 3383.03 | bwd_inner_microstep: 3381.99 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.21 [2025-06-19 14:58:59,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.75 | bwd: 3383.05 | bwd_inner: 3381.99 | bwd_allreduce: 1.00 | step: 7.22 10%|▉ | 965/10000 [1:29:20<13:56:27, 5.55s/it] {'loss': 0.1374, 'grad_norm': 0.6481055021286011, 'learning_rate': 3.95379168801166e-05, 'epoch': 0.96} 
10%|▉ | 965/10000 [1:29:20<13:56:27, 5.55s/it][2025-06-19 14:59:05,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:59:05,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.99 | bwd_microstep: 3339.92 | bwd_inner_microstep: 3339.10 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.30 [2025-06-19 14:59:05,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.99 | bwd: 3339.94 | bwd_inner: 3339.10 | bwd_allreduce: 0.79 | step: 7.32 10%|▉ | 966/10000 [1:29:25<13:54:30, 5.54s/it] {'loss': 0.0749, 'grad_norm': 0.5155017971992493, 'learning_rate': 3.953653150804473e-05, 'epoch': 0.97} 10%|▉ | 966/10000 [1:29:25<13:54:30, 5.54s/it][2025-06-19 14:59:10,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:59:10,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.02 | bwd_microstep: 3335.22 | bwd_inner_microstep: 3334.26 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.21 [2025-06-19 14:59:10,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.02 | bwd: 3335.24 | bwd_inner: 3334.26 | bwd_allreduce: 0.94 | step: 7.21 10%|▉ | 967/10000 [1:29:31<13:53:08, 5.53s/it] {'loss': 0.0889, 'grad_norm': 0.46920114755630493, 'learning_rate': 3.9535144086681375e-05, 'epoch': 0.97} 10%|▉ | 967/10000 [1:29:31<13:53:08, 5.53s/it][2025-06-19 14:59:16,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 14:59:16,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.13 | bwd_microstep: 3323.14 | bwd_inner_microstep: 3322.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 14:59:16,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.13 | bwd: 3323.15 | 
bwd_inner: 3322.35 | bwd_allreduce: 0.76 | step: 6.78 10%|▉ | 968/10000 [1:29:36<13:50:36, 5.52s/it] {'loss': 0.0918, 'grad_norm': 0.4742254912853241, 'learning_rate': 3.9533754616172076e-05, 'epoch': 0.97} 10%|▉ | 968/10000 [1:29:36<13:50:36, 5.52s/it][2025-06-19 14:59:21,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 14:59:21,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.87 | bwd_microstep: 3340.90 | bwd_inner_microstep: 3340.00 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.95 [2025-06-19 14:59:21,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.87 | bwd: 3340.91 | bwd_inner: 3340.00 | bwd_allreduce: 0.87 | step: 6.95 10%|▉ | 969/10000 [1:29:42<13:49:33, 5.51s/it] {'loss': 0.0797, 'grad_norm': 0.8867049217224121, 'learning_rate': 3.9532363096662566e-05, 'epoch': 0.97} 10%|▉ | 969/10000 [1:29:42<13:49:33, 5.51s/it][2025-06-19 14:59:27,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 14:59:27,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.87 | bwd_microstep: 3328.34 | bwd_inner_microstep: 3327.45 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.85 [2025-06-19 14:59:27,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.87 | bwd: 3328.36 | bwd_inner: 3327.45 | bwd_allreduce: 0.87 | step: 6.86 10%|▉ | 970/10000 [1:29:47<13:48:47, 5.51s/it] {'loss': 0.0915, 'grad_norm': 0.6286271214485168, 'learning_rate': 3.953096952829883e-05, 'epoch': 0.97} 10%|▉ | 970/10000 [1:29:47<13:48:47, 5.51s/it][2025-06-19 14:59:32,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 14:59:32,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.06 | bwd_microstep: 
3327.70 | bwd_inner_microstep: 3326.77 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.11 [2025-06-19 14:59:32,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.06 | bwd: 3327.71 | bwd_inner: 3326.77 | bwd_allreduce: 0.90 | step: 7.11 10%|▉ | 971/10000 [1:29:53<13:47:24, 5.50s/it] {'loss': 0.1561, 'grad_norm': 1.1884716749191284, 'learning_rate': 3.952957391122703e-05, 'epoch': 0.97} 10%|▉ | 971/10000 [1:29:53<13:47:24, 5.50s/it][2025-06-19 14:59:38,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 14:59:38,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.55 | bwd_microstep: 3326.69 | bwd_inner_microstep: 3325.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 14:59:38,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.55 | bwd: 3326.70 | bwd_inner: 3325.90 | bwd_allreduce: 0.76 | step: 6.63 10%|▉ | 972/10000 [1:29:58<13:46:27, 5.49s/it] {'loss': 0.0671, 'grad_norm': 0.29224830865859985, 'learning_rate': 3.952817624559357e-05, 'epoch': 0.97} 10%|▉ | 972/10000 [1:29:58<13:46:27, 5.49s/it][2025-06-19 14:59:43,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:59:43,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.83 | bwd_microstep: 3325.01 | bwd_inner_microstep: 3324.04 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.10 [2025-06-19 14:59:43,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.83 | bwd: 3325.02 | bwd_inner: 3324.04 | bwd_allreduce: 0.93 | step: 7.10 10%|▉ | 973/10000 [1:30:04<13:45:36, 5.49s/it] {'loss': 0.0643, 'grad_norm': 0.2834857702255249, 'learning_rate': 3.9526776531545053e-05, 'epoch': 0.97} 10%|▉ | 973/10000 [1:30:04<13:45:36, 5.49s/it][2025-06-19 14:59:49,156] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.72 [2025-06-19 14:59:49,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.17 | bwd_microstep: 3369.92 | bwd_inner_microstep: 3368.74 | bwd_allreduce_microstep: 1.10 | step_microstep: 8.56 [2025-06-19 14:59:49,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.17 | bwd: 3369.94 | bwd_inner: 3368.74 | bwd_allreduce: 1.14 | step: 8.57 10%|▉ | 974/10000 [1:30:09<13:48:27, 5.51s/it] {'loss': 0.0939, 'grad_norm': 0.405424565076828, 'learning_rate': 3.9525374769228304e-05, 'epoch': 0.97} 10%|▉ | 974/10000 [1:30:09<13:48:27, 5.51s/it][2025-06-19 14:59:54,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 14:59:54,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.85 | bwd_microstep: 3366.94 | bwd_inner_microstep: 3365.95 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.08 [2025-06-19 14:59:54,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.85 | bwd: 3366.96 | bwd_inner: 3365.95 | bwd_allreduce: 0.95 | step: 7.08 10%|▉ | 975/10000 [1:30:15<13:50:12, 5.52s/it] {'loss': 0.1629, 'grad_norm': 0.7115171551704407, 'learning_rate': 3.952397095879036e-05, 'epoch': 0.97} 10%|▉ | 975/10000 [1:30:15<13:50:12, 5.52s/it][2025-06-19 15:00:00,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:00:00,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.70 | bwd_microstep: 3315.03 | bwd_inner_microstep: 3314.23 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.71 [2025-06-19 15:00:00,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.70 | bwd: 3315.05 | bwd_inner: 3314.23 | bwd_allreduce: 0.78 | step: 6.72 10%|▉ | 976/10000 [1:30:20<13:47:52, 5.50s/it] {'loss': 0.077, 
'grad_norm': 0.2967051565647125, 'learning_rate': 3.952256510037848e-05, 'epoch': 0.98} 10%|▉ | 976/10000 [1:30:20<13:47:52, 5.50s/it][2025-06-19 15:00:05,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:00:05,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.40 | bwd_microstep: 3368.26 | bwd_inner_microstep: 3367.44 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.19 [2025-06-19 15:00:05,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.40 | bwd: 3368.27 | bwd_inner: 3367.44 | bwd_allreduce: 0.78 | step: 7.19 10%|▉ | 977/10000 [1:30:26<13:49:30, 5.52s/it] {'loss': 0.1799, 'grad_norm': 0.6750153303146362, 'learning_rate': 3.952115719414013e-05, 'epoch': 0.98} 10%|▉ | 977/10000 [1:30:26<13:49:30, 5.52s/it][2025-06-19 15:00:11,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:00:11,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.11 | bwd_microstep: 3367.05 | bwd_inner_microstep: 3366.22 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.00 [2025-06-19 15:00:11,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.11 | bwd: 3367.07 | bwd_inner: 3366.22 | bwd_allreduce: 0.80 | step: 7.00 10%|▉ | 978/10000 [1:30:32<13:50:34, 5.52s/it] {'loss': 0.0969, 'grad_norm': 0.4372217357158661, 'learning_rate': 3.9519747240222985e-05, 'epoch': 0.98} 10%|▉ | 978/10000 [1:30:32<13:50:34, 5.52s/it][2025-06-19 15:00:16,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 15:00:16,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.42 | bwd_microstep: 3320.73 | bwd_inner_microstep: 3319.54 | bwd_allreduce_microstep: 1.10 | step_microstep: 8.06 [2025-06-19 15:00:16,736] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.42 | bwd: 3320.76 | bwd_inner: 3319.54 | bwd_allreduce: 1.14 | step: 8.06 10%|▉ | 979/10000 [1:30:37<13:48:18, 5.51s/it] {'loss': 0.0832, 'grad_norm': 0.32390815019607544, 'learning_rate': 3.951833523877495e-05, 'epoch': 0.98} 10%|▉ | 979/10000 [1:30:37<13:48:18, 5.51s/it][2025-06-19 15:00:22,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 15:00:22,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2150.16 | bwd_microstep: 3329.26 | bwd_inner_microstep: 3327.82 | bwd_allreduce_microstep: 1.32 | step_microstep: 10.33 [2025-06-19 15:00:22,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2150.16 | bwd: 3329.31 | bwd_inner: 3327.82 | bwd_allreduce: 1.38 | step: 10.36 10%|▉ | 980/10000 [1:30:43<13:49:25, 5.52s/it] {'loss': 0.0706, 'grad_norm': 0.27529555559158325, 'learning_rate': 3.9516921189944135e-05, 'epoch': 0.98} 10%|▉ | 980/10000 [1:30:43<13:49:25, 5.52s/it][2025-06-19 15:00:27,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.72 [2025-06-19 15:00:27,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2166.38 | bwd_microstep: 3381.11 | bwd_inner_microstep: 3379.91 | bwd_allreduce_microstep: 1.12 | step_microstep: 8.31 [2025-06-19 15:00:27,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2166.38 | bwd: 3381.14 | bwd_inner: 3379.91 | bwd_allreduce: 1.16 | step: 8.31 10%|▉ | 981/10000 [1:30:48<13:53:18, 5.54s/it] {'loss': 0.0875, 'grad_norm': 0.9421545267105103, 'learning_rate': 3.951550509387887e-05, 'epoch': 0.98} 10%|▉ | 981/10000 [1:30:48<13:53:18, 5.54s/it][2025-06-19 15:00:33,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:00:33,471] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2163.72 | bwd_microstep: 3379.89 | bwd_inner_microstep: 3378.92 | bwd_allreduce_microstep: 0.87 | step_microstep: 8.08 [2025-06-19 15:00:33,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2163.72 | bwd: 3379.92 | bwd_inner: 3378.92 | bwd_allreduce: 0.91 | step: 8.08 10%|▉ | 982/10000 [1:30:54<13:55:38, 5.56s/it] {'loss': 0.157, 'grad_norm': 0.9993123412132263, 'learning_rate': 3.95140869507277e-05, 'epoch': 0.98} 10%|▉ | 982/10000 [1:30:54<13:55:38, 5.56s/it][2025-06-19 15:00:38,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 15:00:38,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.61 | bwd_microstep: 3318.02 | bwd_inner_microstep: 3316.99 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.46 [2025-06-19 15:00:38,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.61 | bwd: 3318.04 | bwd_inner: 3316.99 | bwd_allreduce: 0.99 | step: 7.45 10%|▉ | 983/10000 [1:30:59<13:53:06, 5.54s/it] {'loss': 0.0832, 'grad_norm': 0.5311493873596191, 'learning_rate': 3.9512666760639374e-05, 'epoch': 0.98} 10%|▉ | 983/10000 [1:30:59<13:53:06, 5.54s/it][2025-06-19 15:00:44,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 15:00:44,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.21 | bwd_microstep: 3323.17 | bwd_inner_microstep: 3322.31 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.07 [2025-06-19 15:00:44,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.21 | bwd: 3323.18 | bwd_inner: 3322.31 | bwd_allreduce: 0.83 | step: 7.08 10%|▉ | 984/10000 [1:31:05<13:50:33, 5.53s/it] {'loss': 0.0504, 'grad_norm': 0.3424258530139923, 'learning_rate': 3.951124452376286e-05, 'epoch': 0.98} 10%|▉ | 984/10000 [1:31:05<13:50:33, 
5.53s/it][2025-06-19 15:00:49,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:00:49,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.58 | bwd_microstep: 3321.71 | bwd_inner_microstep: 3320.48 | bwd_allreduce_microstep: 1.14 | step_microstep: 8.73 [2025-06-19 15:00:49,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.58 | bwd: 3321.74 | bwd_inner: 3320.48 | bwd_allreduce: 1.18 | step: 8.74 10%|▉ | 985/10000 [1:31:10<13:48:43, 5.52s/it] {'loss': 0.1204, 'grad_norm': 0.4814141094684601, 'learning_rate': 3.950982024024735e-05, 'epoch': 0.98} 10%|▉ | 985/10000 [1:31:10<13:48:43, 5.52s/it][2025-06-19 15:00:55,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:00:55,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.09 | bwd_microstep: 3327.92 | bwd_inner_microstep: 3327.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-19 15:00:55,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.09 | bwd: 3327.93 | bwd_inner: 3327.11 | bwd_allreduce: 0.78 | step: 6.78 10%|▉ | 986/10000 [1:31:16<13:47:21, 5.51s/it] {'loss': 0.1198, 'grad_norm': 0.5827894806861877, 'learning_rate': 3.950839391024225e-05, 'epoch': 0.99} 10%|▉ | 986/10000 [1:31:16<13:47:21, 5.51s/it][2025-06-19 15:01:00,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:01:00,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.15 | bwd_microstep: 3377.45 | bwd_inner_microstep: 3376.57 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.92 [2025-06-19 15:01:00,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.15 | bwd: 3377.48 | bwd_inner: 3376.57 | bwd_allreduce: 0.84 | 
step: 6.92 10%|▉ | 987/10000 [1:31:21<13:49:00, 5.52s/it] {'loss': 0.1091, 'grad_norm': 0.7691855430603027, 'learning_rate': 3.950696553389716e-05, 'epoch': 0.99} 10%|▉ | 987/10000 [1:31:21<13:49:00, 5.52s/it][2025-06-19 15:01:06,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 15:01:06,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.48 | bwd_microstep: 3406.41 | bwd_inner_microstep: 3405.50 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.58 [2025-06-19 15:01:06,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.48 | bwd: 3406.44 | bwd_inner: 3405.50 | bwd_allreduce: 0.87 | step: 7.57 10%|▉ | 988/10000 [1:31:27<13:52:36, 5.54s/it] {'loss': 0.1266, 'grad_norm': 0.5679335594177246, 'learning_rate': 3.9505535111361934e-05, 'epoch': 0.99} 10%|▉ | 988/10000 [1:31:27<13:52:36, 5.54s/it][2025-06-19 15:01:12,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:01:12,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.18 | bwd_microstep: 3315.36 | bwd_inner_microstep: 3314.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 15:01:12,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.18 | bwd: 3315.38 | bwd_inner: 3314.58 | bwd_allreduce: 0.76 | step: 6.60 10%|▉ | 989/10000 [1:31:32<13:49:28, 5.52s/it] {'loss': 0.064, 'grad_norm': 0.3782638609409332, 'learning_rate': 3.95041026427866e-05, 'epoch': 0.99} 10%|▉ | 989/10000 [1:31:32<13:49:28, 5.52s/it][2025-06-19 15:01:17,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:01:17,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.99 | bwd_microstep: 3315.04 | bwd_inner_microstep: 3313.92 | 
bwd_allreduce_microstep: 1.04 | step_microstep: 7.27 [2025-06-19 15:01:17,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.99 | bwd: 3315.06 | bwd_inner: 3313.92 | bwd_allreduce: 1.07 | step: 7.27 10%|▉ | 990/10000 [1:31:38<13:47:16, 5.51s/it] {'loss': 0.1017, 'grad_norm': 0.5915607810020447, 'learning_rate': 3.9502668128321415e-05, 'epoch': 0.99} 10%|▉ | 990/10000 [1:31:38<13:47:16, 5.51s/it][2025-06-19 15:01:23,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:01:23,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.85 | bwd_microstep: 3367.97 | bwd_inner_microstep: 3367.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 15:01:23,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.85 | bwd: 3367.99 | bwd_inner: 3367.19 | bwd_allreduce: 0.76 | step: 6.61 10%|▉ | 991/10000 [1:31:43<13:48:19, 5.52s/it] {'loss': 0.0626, 'grad_norm': 0.29962363839149475, 'learning_rate': 3.9501231568116855e-05, 'epoch': 0.99} 10%|▉ | 991/10000 [1:31:43<13:48:19, 5.52s/it][2025-06-19 15:01:28,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:01:28,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.36 | bwd_microstep: 3315.48 | bwd_inner_microstep: 3314.66 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.22 [2025-06-19 15:01:28,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.36 | bwd: 3315.49 | bwd_inner: 3314.66 | bwd_allreduce: 0.79 | step: 7.22 10%|▉ | 992/10000 [1:31:49<13:45:43, 5.50s/it] {'loss': 0.1873, 'grad_norm': 1.4336761236190796, 'learning_rate': 3.9499792962323614e-05, 'epoch': 0.99} 10%|▉ | 992/10000 [1:31:49<13:45:43, 5.50s/it][2025-06-19 15:01:34,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:01:34,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.11 | bwd_microstep: 3319.37 | bwd_inner_microstep: 3318.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-19 15:01:34,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.11 | bwd: 3319.38 | bwd_inner: 3318.58 | bwd_allreduce: 0.76 | step: 6.76 10%|▉ | 993/10000 [1:31:54<13:43:43, 5.49s/it] {'loss': 0.1355, 'grad_norm': 0.7983993291854858, 'learning_rate': 3.94983523110926e-05, 'epoch': 0.99} 10%|▉ | 993/10000 [1:31:54<13:43:43, 5.49s/it][2025-06-19 15:01:39,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:01:39,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.08 | bwd_microstep: 3321.99 | bwd_inner_microstep: 3321.17 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.33 [2025-06-19 15:01:39,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.08 | bwd: 3322.01 | bwd_inner: 3321.17 | bwd_allreduce: 0.80 | step: 7.34 10%|▉ | 994/10000 [1:32:00<13:42:28, 5.48s/it] {'loss': 0.1528, 'grad_norm': 0.5396023392677307, 'learning_rate': 3.9496909614574915e-05, 'epoch': 0.99} 10%|▉ | 994/10000 [1:32:00<13:42:28, 5.48s/it][2025-06-19 15:01:44,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:01:44,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.56 | bwd_microstep: 3361.67 | bwd_inner_microstep: 3360.75 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.86 [2025-06-19 15:01:44,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.56 | bwd: 3361.68 | bwd_inner: 3360.75 | bwd_allreduce: 0.89 | step: 6.86 10%|▉ | 995/10000 [1:32:05<13:44:18, 5.49s/it] {'loss': 0.1101, 'grad_norm': 0.4984806180000305, 
'learning_rate': 3.949546487292191e-05, 'epoch': 0.99} 10%|▉ | 995/10000 [1:32:05<13:44:18, 5.49s/it][2025-06-19 15:01:50,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:01:50,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.65 | bwd_microstep: 3366.62 | bwd_inner_microstep: 3365.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 15:01:50,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.65 | bwd: 3366.63 | bwd_inner: 3365.81 | bwd_allreduce: 0.78 | step: 7.08 10%|▉ | 996/10000 [1:32:11<13:45:50, 5.50s/it] {'loss': 0.0705, 'grad_norm': 0.265344500541687, 'learning_rate': 3.949401808628511e-05, 'epoch': 1.0} 10%|▉ | 996/10000 [1:32:11<13:45:50, 5.50s/it][2025-06-19 15:01:56,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:01:56,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.74 | bwd_microstep: 3352.72 | bwd_inner_microstep: 3351.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.62 [2025-06-19 15:01:56,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.74 | bwd: 3352.74 | bwd_inner: 3351.92 | bwd_allreduce: 0.77 | step: 6.63 10%|▉ | 997/10000 [1:32:16<13:46:01, 5.51s/it] {'loss': 0.1125, 'grad_norm': 0.57488614320755, 'learning_rate': 3.9492569254816286e-05, 'epoch': 1.0} 10%|▉ | 997/10000 [1:32:16<13:46:01, 5.51s/it][2025-06-19 15:02:01,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:02:01,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.09 | bwd_microstep: 3316.46 | bwd_inner_microstep: 3315.59 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.30 [2025-06-19 15:02:01,477] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2099.09 | bwd: 3316.48 | bwd_inner: 3315.59 | bwd_allreduce: 0.83 | step: 7.30 10%|▉ | 998/10000 [1:32:22<13:43:45, 5.49s/it] {'loss': 0.0555, 'grad_norm': 0.36115750670433044, 'learning_rate': 3.949111837866742e-05, 'epoch': 1.0} 10%|▉ | 998/10000 [1:32:22<13:43:45, 5.49s/it][2025-06-19 15:02:07,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:02:07,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2155.56 | bwd_microstep: 3364.16 | bwd_inner_microstep: 3363.29 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.23 [2025-06-19 15:02:07,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2155.56 | bwd: 3364.18 | bwd_inner: 3363.29 | bwd_allreduce: 0.83 | step: 7.24 10%|▉ | 999/10000 [1:32:27<13:47:04, 5.51s/it] {'loss': 0.1617, 'grad_norm': 0.8480632305145264, 'learning_rate': 3.94896654579907e-05, 'epoch': 1.0} 10%|▉ | 999/10000 [1:32:27<13:47:04, 5.51s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
[2025-06-19 15:02:14,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:02:14,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.92 | bwd_microstep: 3347.50 | bwd_inner_microstep: 3346.59 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.23 [2025-06-19 15:02:14,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.92 | bwd: 3347.53 | bwd_inner: 3346.59 | bwd_allreduce: 0.87 | step: 8.23 10%|█ | 1000/10000 [1:32:35<15:15:13, 6.10s/it] {'loss': 0.1254, 'grad_norm': 0.8585809469223022, 'learning_rate': 3.948821049293853e-05, 'epoch': 1.0} 10%|█ | 1000/10000 [1:32:35<15:15:13, 6.10s/it]evaluate! [INFO|trainer.py:3910] 2025-06-19 15:02:27,558 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 15:02:27,562 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 15:02:27,562 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 15:03:22,264 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 15:03:22,267 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 15:03:22,267 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 15:03:22,268 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json evaluate! 
[INFO|trainer.py:3910] 2025-06-19 15:03:40,678 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 15:03:40,685 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 15:03:40,686 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 15:04:42,489 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 15:04:42,493 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 15:04:42,493 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 15:04:42,493 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json [2025-06-19 15:04:48,057] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 15:04:55,495] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 15:05:01,809] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto 
detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 15:05:07,725] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 15:05:27,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.72 | optimizer_step: 2.81 [2025-06-19 15:05:27,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.30 | bwd_microstep: 3257.80 | bwd_inner_microstep: 3256.27 | bwd_allreduce_microstep: 1.41 | step_microstep: 10.01 [2025-06-19 15:05:27,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.28 | bwd: 3257.84 | bwd_inner: 3256.27 | bwd_allreduce: 1.47 | step: 9.99 10%|█ | 1001/10000 [1:35:48<155:14:59, 62.11s/it] {'loss': 0.1382, 'grad_norm': 0.625300407409668, 'learning_rate': 3.948675348366352e-05, 'epoch': 1.0} 10%|█ | 1001/10000 [1:35:48<155:14:59, 62.11s/it][2025-06-19 15:05:32,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 15:05:32,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2090.07 | bwd_microstep: 3277.38 | bwd_inner_microstep: 3276.21 | bwd_allreduce_microstep: 1.11 | step_microstep: 8.40 [2025-06-19 15:05:32,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.07 | bwd: 3277.40 | bwd_inner: 3276.21 | bwd_allreduce: 1.14 | step: 8.41 
10%|█ | 1002/10000 [1:35:53<112:43:43, 45.10s/it] {'loss': 0.1216, 'grad_norm': 0.5768898725509644, 'learning_rate': 3.948529443031851e-05, 'epoch': 1.0} 10%|█ | 1002/10000 [1:35:53<112:43:43, 45.10s/it][2025-06-19 15:05:38,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:05:38,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.93 | bwd_microstep: 3333.56 | bwd_inner_microstep: 3332.59 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.64 [2025-06-19 15:05:38,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.93 | bwd: 3333.58 | bwd_inner: 3332.59 | bwd_allreduce: 0.95 | step: 7.64 10%|█ | 1003/10000 [1:35:59<83:02:07, 33.23s/it] {'loss': 0.1055, 'grad_norm': 0.37353548407554626, 'learning_rate': 3.9483833333056546e-05, 'epoch': 1.0} 10%|█ | 1003/10000 [1:35:59<83:02:07, 33.23s/it][2025-06-19 15:05:43,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.89 [2025-06-19 15:05:43,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2094.92 | bwd_microstep: 3296.53 | bwd_inner_microstep: 3295.63 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.75 [2025-06-19 15:05:43,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2094.92 | bwd: 3296.55 | bwd_inner: 3295.63 | bwd_allreduce: 0.86 | step: 7.75 10%|█ | 1004/10000 [1:36:04<62:11:48, 24.89s/it] {'loss': 0.0948, 'grad_norm': 0.4600435495376587, 'learning_rate': 3.94823701920309e-05, 'epoch': 1.0} 10%|█ | 1004/10000 [1:36:04<62:11:48, 24.89s/it][2025-06-19 15:05:49,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 15:05:49,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.51 | bwd_microstep: 3345.19 | bwd_inner_microstep: 3344.18 | 
bwd_allreduce_microstep: 0.96 | step_microstep: 7.62 [2025-06-19 15:05:49,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.51 | bwd: 3345.21 | bwd_inner: 3344.18 | bwd_allreduce: 0.98 | step: 7.62 10%|█ | 1005/10000 [1:36:10<47:40:22, 19.08s/it] {'loss': 0.1176, 'grad_norm': 0.5084125995635986, 'learning_rate': 3.9480905007395035e-05, 'epoch': 1.0} 10%|█ | 1005/10000 [1:36:10<47:40:22, 19.08s/it][2025-06-19 15:05:54,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:05:54,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.56 | bwd_microstep: 3347.06 | bwd_inner_microstep: 3346.09 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.46 [2025-06-19 15:05:54,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.56 | bwd: 3347.09 | bwd_inner: 3346.09 | bwd_allreduce: 0.93 | step: 7.45 10%|█ | 1006/10000 [1:36:15<37:30:09, 15.01s/it] {'loss': 0.083, 'grad_norm': 0.3084852397441864, 'learning_rate': 3.9479437779302656e-05, 'epoch': 1.01} 10%|█ | 1006/10000 [1:36:15<37:30:09, 15.01s/it][2025-06-19 15:06:00,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:06:00,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.41 | bwd_microstep: 3342.57 | bwd_inner_microstep: 3341.52 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.34 [2025-06-19 15:06:00,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.41 | bwd: 3342.59 | bwd_inner: 3341.52 | bwd_allreduce: 1.01 | step: 7.35 10%|█ | 1007/10000 [1:36:21<30:22:50, 12.16s/it] {'loss': 0.0608, 'grad_norm': 0.30721956491470337, 'learning_rate': 3.947796850790766e-05, 'epoch': 1.01} 10%|█ | 1007/10000 [1:36:21<30:22:50, 12.16s/it][2025-06-19 15:06:05,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:06:05,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.76 | bwd_microstep: 3334.03 | bwd_inner_microstep: 3333.17 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.61 [2025-06-19 15:06:05,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.70 | bwd: 3334.06 | bwd_inner: 3333.17 | bwd_allreduce: 0.82 | step: 7.61 10%|█ | 1008/10000 [1:36:26<25:23:14, 10.16s/it] {'loss': 0.0942, 'grad_norm': 0.5646988749504089, 'learning_rate': 3.9476497193364156e-05, 'epoch': 1.01} 10%|█ | 1008/10000 [1:36:26<25:23:14, 10.16s/it][2025-06-19 15:06:11,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:06:11,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.80 | bwd_microstep: 3346.84 | bwd_inner_microstep: 3345.86 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.81 [2025-06-19 15:06:11,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.80 | bwd: 3346.86 | bwd_inner: 3345.86 | bwd_allreduce: 0.95 | step: 7.82 10%|█ | 1009/10000 [1:36:32<21:54:25, 8.77s/it] {'loss': 0.0905, 'grad_norm': 0.5624771118164062, 'learning_rate': 3.94750238358265e-05, 'epoch': 1.01} 10%|█ | 1009/10000 [1:36:32<21:54:25, 8.77s/it][2025-06-19 15:06:16,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:06:16,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.92 | bwd_microstep: 3298.24 | bwd_inner_microstep: 3297.23 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.46 [2025-06-19 15:06:16,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.92 | bwd: 3298.26 | bwd_inner: 3297.23 | bwd_allreduce: 0.97 | step: 7.47 10%|█ | 1010/10000 [1:36:37<19:25:06, 7.78s/it] {'loss': 0.0676, 'grad_norm': 0.32379278540611267, 
'learning_rate': 3.947354843544923e-05, 'epoch': 1.01} 10%|█ | 1010/10000 [1:36:37<19:25:06, 7.78s/it][2025-06-19 15:06:22,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:06:22,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.07 | bwd_microstep: 3354.49 | bwd_inner_microstep: 3353.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 15:06:22,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.07 | bwd: 3354.51 | bwd_inner: 3353.69 | bwd_allreduce: 0.77 | step: 7.14 10%|█ | 1011/10000 [1:36:43<17:44:11, 7.10s/it] {'loss': 0.084, 'grad_norm': 0.31934037804603577, 'learning_rate': 3.9472070992387103e-05, 'epoch': 1.01} 10%|█ | 1011/10000 [1:36:43<17:44:11, 7.10s/it][2025-06-19 15:06:27,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:06:27,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2094.56 | bwd_microstep: 3303.67 | bwd_inner_microstep: 3302.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 15:06:27,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2094.56 | bwd: 3303.69 | bwd_inner: 3302.87 | bwd_allreduce: 0.77 | step: 6.75 10%|█ | 1012/10000 [1:36:48<16:29:19, 6.60s/it] {'loss': 0.1267, 'grad_norm': 0.6358789801597595, 'learning_rate': 3.94705915067951e-05, 'epoch': 1.01} 10%|█ | 1012/10000 [1:36:48<16:29:19, 6.60s/it][2025-06-19 15:06:33,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:06:33,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.26 | bwd_microstep: 3358.72 | bwd_inner_microstep: 3357.85 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.52 [2025-06-19 15:06:33,235] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.26 | bwd: 3358.74 | bwd_inner: 3357.85 | bwd_allreduce: 0.85 | step: 7.52 10%|█ | 1013/10000 [1:36:54<15:41:41, 6.29s/it] {'loss': 0.1147, 'grad_norm': 0.4109918773174286, 'learning_rate': 3.9469109978828424e-05, 'epoch': 1.01} 10%|█ | 1013/10000 [1:36:54<15:41:41, 6.29s/it][2025-06-19 15:06:38,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:06:38,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.61 | bwd_microstep: 3363.47 | bwd_inner_microstep: 3362.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-19 15:06:38,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.61 | bwd: 3363.49 | bwd_inner: 3362.65 | bwd_allreduce: 0.79 | step: 6.90 10%|█ | 1014/10000 [1:36:59<15:07:59, 6.06s/it] {'loss': 0.1331, 'grad_norm': 0.6468644142150879, 'learning_rate': 3.9467626408642466e-05, 'epoch': 1.01} 10%|█ | 1014/10000 [1:36:59<15:07:59, 6.06s/it][2025-06-19 15:06:44,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:06:44,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.62 | bwd_microstep: 3369.11 | bwd_inner_microstep: 3368.08 | bwd_allreduce_microstep: 0.98 | step_microstep: 8.04 [2025-06-19 15:06:44,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.62 | bwd: 3369.13 | bwd_inner: 3368.08 | bwd_allreduce: 1.00 | step: 8.05 10%|█ | 1015/10000 [1:37:05<14:45:03, 5.91s/it] {'loss': 0.0989, 'grad_norm': 0.3932211995124817, 'learning_rate': 3.9466140796392846e-05, 'epoch': 1.01} 10%|█ | 1015/10000 [1:37:05<14:45:03, 5.91s/it][2025-06-19 15:06:49,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 15:06:49,818] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.88 | bwd_microstep: 3325.82 | bwd_inner_microstep: 3324.77 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.97 [2025-06-19 15:06:49,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.89 | bwd: 3325.84 | bwd_inner: 3324.77 | bwd_allreduce: 1.02 | step: 7.98 10%|█ | 1016/10000 [1:37:10<14:26:09, 5.78s/it] {'loss': 0.0864, 'grad_norm': 0.33612194657325745, 'learning_rate': 3.9464653142235405e-05, 'epoch': 1.02} 10%|█ | 1016/10000 [1:37:10<14:26:09, 5.78s/it][2025-06-19 15:06:55,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:06:55,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.64 | bwd_microstep: 3375.95 | bwd_inner_microstep: 3375.13 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.92 [2025-06-19 15:06:55,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.64 | bwd: 3375.97 | bwd_inner: 3375.13 | bwd_allreduce: 0.79 | step: 6.92 10%|█ | 1017/10000 [1:37:16<14:16:02, 5.72s/it] {'loss': 0.0778, 'grad_norm': 2.282911777496338, 'learning_rate': 3.946316344632619e-05, 'epoch': 1.02} 10%|█ | 1017/10000 [1:37:16<14:16:02, 5.72s/it][2025-06-19 15:07:00,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:07:00,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.82 | bwd_microstep: 3401.55 | bwd_inner_microstep: 3400.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 15:07:00,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.82 | bwd: 3401.56 | bwd_inner: 3400.76 | bwd_allreduce: 0.76 | step: 6.65 10%|█ | 1018/10000 [1:37:21<14:09:44, 5.68s/it] {'loss': 0.0788, 'grad_norm': 0.32446983456611633, 'learning_rate': 3.946167170882145e-05, 'epoch': 1.02} 10%|█ | 1018/10000 
[1:37:21<14:09:44, 5.68s/it][2025-06-19 15:07:06,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 15:07:06,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.27 | bwd_microstep: 3327.00 | bwd_inner_microstep: 3326.01 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.43 [2025-06-19 15:07:06,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.27 | bwd: 3327.02 | bwd_inner: 3326.01 | bwd_allreduce: 0.97 | step: 7.43 10%|█ | 1019/10000 [1:37:27<14:00:28, 5.62s/it] {'loss': 0.1126, 'grad_norm': 0.407632440328598, 'learning_rate': 3.9460177929877685e-05, 'epoch': 1.02} 10%|█ | 1019/10000 [1:37:27<14:00:28, 5.62s/it][2025-06-19 15:07:12,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:07:12,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.91 | bwd_microstep: 3397.40 | bwd_inner_microstep: 3396.36 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.28 [2025-06-19 15:07:12,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.91 | bwd: 3397.42 | bwd_inner: 3396.36 | bwd_allreduce: 1.00 | step: 7.28 10%|█ | 1020/10000 [1:37:32<13:58:53, 5.61s/it] {'loss': 0.105, 'grad_norm': 0.5029900670051575, 'learning_rate': 3.9458682109651566e-05, 'epoch': 1.02} 10%|█ | 1020/10000 [1:37:32<13:58:53, 5.61s/it][2025-06-19 15:07:17,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:07:17,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.61 | bwd_microstep: 3326.34 | bwd_inner_microstep: 3325.24 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.99 [2025-06-19 15:07:17,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.61 | bwd: 3326.37 | bwd_inner: 3325.24 | 
bwd_allreduce: 1.07 | step: 8.00 10%|█ | 1021/10000 [1:37:38<13:53:17, 5.57s/it] {'loss': 0.0954, 'grad_norm': 0.34708815813064575, 'learning_rate': 3.9457184248300006e-05, 'epoch': 1.02} 10%|█ | 1021/10000 [1:37:38<13:53:17, 5.57s/it][2025-06-19 15:07:23,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:07:23,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.95 | bwd_microstep: 3377.34 | bwd_inner_microstep: 3376.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 15:07:23,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.95 | bwd: 3377.35 | bwd_inner: 3376.55 | bwd_allreduce: 0.75 | step: 6.58 10%|█ | 1022/10000 [1:37:43<13:52:26, 5.56s/it] {'loss': 0.0654, 'grad_norm': 0.27580681443214417, 'learning_rate': 3.945568434598012e-05, 'epoch': 1.02} 10%|█ | 1022/10000 [1:37:43<13:52:26, 5.56s/it][2025-06-19 15:07:28,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:07:28,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.15 | bwd_microstep: 3399.79 | bwd_inner_microstep: 3398.95 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.27 [2025-06-19 15:07:28,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.15 | bwd: 3399.81 | bwd_inner: 3398.95 | bwd_allreduce: 0.82 | step: 7.28 10%|█ | 1023/10000 [1:37:49<13:53:05, 5.57s/it] {'loss': 0.0788, 'grad_norm': 0.4733675718307495, 'learning_rate': 3.9454182402849246e-05, 'epoch': 1.02} 10%|█ | 1023/10000 [1:37:49<13:53:05, 5.57s/it][2025-06-19 15:07:34,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:07:34,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.10 | bwd_microstep: 3367.83 | 
bwd_inner_microstep: 3366.89 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.02 [2025-06-19 15:07:34,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.10 | bwd: 3367.84 | bwd_inner: 3366.89 | bwd_allreduce: 0.91 | step: 7.02 10%|█ | 1024/10000 [1:37:54<13:51:47, 5.56s/it] {'loss': 0.1381, 'grad_norm': 0.5586032867431641, 'learning_rate': 3.9452678419064925e-05, 'epoch': 1.02} 10%|█ | 1024/10000 [1:37:54<13:51:47, 5.56s/it][2025-06-19 15:07:39,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:07:39,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.33 | bwd_microstep: 3324.06 | bwd_inner_microstep: 3323.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 15:07:39,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.33 | bwd: 3324.08 | bwd_inner: 3323.26 | bwd_allreduce: 0.77 | step: 7.05 10%|█ | 1025/10000 [1:38:00<13:47:59, 5.54s/it] {'loss': 0.0595, 'grad_norm': 0.23945145308971405, 'learning_rate': 3.945117239478492e-05, 'epoch': 1.02} 10%|█ | 1025/10000 [1:38:00<13:47:59, 5.54s/it][2025-06-19 15:07:45,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 15:07:45,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.90 | bwd_microstep: 3335.31 | bwd_inner_microstep: 3334.39 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.05 [2025-06-19 15:07:45,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.90 | bwd: 3335.33 | bwd_inner: 3334.39 | bwd_allreduce: 0.89 | step: 7.05 10%|█ | 1026/10000 [1:38:05<13:45:35, 5.52s/it] {'loss': 0.0753, 'grad_norm': 0.28514206409454346, 'learning_rate': 3.944966433016721e-05, 'epoch': 1.03} 10%|█ | 1026/10000 [1:38:05<13:45:35, 5.52s/it][2025-06-19 15:07:50,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:07:50,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.53 | bwd_microstep: 3325.72 | bwd_inner_microstep: 3324.87 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.90 [2025-06-19 15:07:50,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.53 | bwd: 3325.74 | bwd_inner: 3324.87 | bwd_allreduce: 0.81 | step: 6.90 10%|█ | 1027/10000 [1:38:11<13:44:06, 5.51s/it] {'loss': 0.0543, 'grad_norm': 0.24073009192943573, 'learning_rate': 3.944815422536998e-05, 'epoch': 1.03} 10%|█ | 1027/10000 [1:38:11<13:44:06, 5.51s/it][2025-06-19 15:07:56,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:07:56,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.28 | bwd_microstep: 3318.19 | bwd_inner_microstep: 3317.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 15:07:56,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.28 | bwd: 3318.21 | bwd_inner: 3317.41 | bwd_allreduce: 0.76 | step: 6.60 10%|█ | 1028/10000 [1:38:16<13:42:17, 5.50s/it] {'loss': 0.083, 'grad_norm': 0.45637306571006775, 'learning_rate': 3.944664208055163e-05, 'epoch': 1.03} 10%|█ | 1028/10000 [1:38:16<13:42:17, 5.50s/it][2025-06-19 15:08:01,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:08:01,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.33 | bwd_microstep: 3323.36 | bwd_inner_microstep: 3322.56 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 15:08:01,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.33 | bwd: 3323.38 | bwd_inner: 3322.56 | bwd_allreduce: 0.77 | step: 7.12 10%|█ | 1029/10000 [1:38:22<13:40:51, 5.49s/it] {'loss': 0.0777, 
'grad_norm': 0.5032928586006165, 'learning_rate': 3.944512789587078e-05, 'epoch': 1.03} 10%|█ | 1029/10000 [1:38:22<13:40:51, 5.49s/it][2025-06-19 15:08:07,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:08:07,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.87 | bwd_microstep: 3368.00 | bwd_inner_microstep: 3367.05 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.27 [2025-06-19 15:08:07,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.87 | bwd: 3368.01 | bwd_inner: 3367.05 | bwd_allreduce: 0.92 | step: 7.27 10%|█ | 1030/10000 [1:38:27<13:43:03, 5.51s/it] {'loss': 0.1418, 'grad_norm': 0.5617329478263855, 'learning_rate': 3.944361167148626e-05, 'epoch': 1.03} 10%|█ | 1030/10000 [1:38:27<13:43:03, 5.51s/it][2025-06-19 15:08:12,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:08:12,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.87 | bwd_microstep: 3315.31 | bwd_inner_microstep: 3314.49 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.25 [2025-06-19 15:08:12,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.87 | bwd: 3315.33 | bwd_inner: 3314.49 | bwd_allreduce: 0.79 | step: 7.25 10%|█ | 1031/10000 [1:38:33<13:41:20, 5.49s/it] {'loss': 0.0976, 'grad_norm': 0.4736160635948181, 'learning_rate': 3.944209340755713e-05, 'epoch': 1.03} 10%|█ | 1031/10000 [1:38:33<13:41:20, 5.49s/it][2025-06-19 15:08:18,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:08:18,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.65 | bwd_microstep: 3367.01 | bwd_inner_microstep: 3366.07 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.02 [2025-06-19 
15:08:18,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.65 | bwd: 3367.02 | bwd_inner: 3366.07 | bwd_allreduce: 0.91 | step: 7.02 10%|█ | 1032/10000 [1:38:38<13:43:10, 5.51s/it] {'loss': 0.0486, 'grad_norm': 0.32397937774658203, 'learning_rate': 3.944057310424262e-05, 'epoch': 1.03} 10%|█ | 1032/10000 [1:38:38<13:43:10, 5.51s/it][2025-06-19 15:08:23,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:08:23,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.24 | bwd_microstep: 3383.42 | bwd_inner_microstep: 3382.56 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.03 [2025-06-19 15:08:23,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.24 | bwd: 3383.44 | bwd_inner: 3382.56 | bwd_allreduce: 0.83 | step: 7.03 10%|█ | 1033/10000 [1:38:44<13:45:22, 5.52s/it] {'loss': 0.1141, 'grad_norm': 0.7620226740837097, 'learning_rate': 3.943905076170222e-05, 'epoch': 1.03} 10%|█ | 1033/10000 [1:38:44<13:45:22, 5.52s/it][2025-06-19 15:08:29,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:08:29,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.45 | bwd_microstep: 3319.36 | bwd_inner_microstep: 3318.24 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.04 [2025-06-19 15:08:29,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.45 | bwd: 3319.38 | bwd_inner: 3318.24 | bwd_allreduce: 1.09 | step: 8.05 10%|█ | 1034/10000 [1:38:49<13:43:30, 5.51s/it] {'loss': 0.1406, 'grad_norm': 0.5589507222175598, 'learning_rate': 3.943752638009562e-05, 'epoch': 1.03} 10%|█ | 1034/10000 [1:38:49<13:43:30, 5.51s/it][2025-06-19 15:08:34,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 
15:08:34,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.70 | bwd_microstep: 3326.55 | bwd_inner_microstep: 3325.69 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.65 [2025-06-19 15:08:34,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.70 | bwd: 3326.56 | bwd_inner: 3325.69 | bwd_allreduce: 0.83 | step: 7.66 10%|█ | 1035/10000 [1:38:55<13:41:55, 5.50s/it] {'loss': 0.0903, 'grad_norm': 0.5102023482322693, 'learning_rate': 3.943599995958271e-05, 'epoch': 1.03} 10%|█ | 1035/10000 [1:38:55<13:41:55, 5.50s/it][2025-06-19 15:08:40,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:08:40,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.70 | bwd_microstep: 3317.05 | bwd_inner_microstep: 3316.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.57 [2025-06-19 15:08:40,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.70 | bwd: 3317.06 | bwd_inner: 3316.26 | bwd_allreduce: 0.76 | step: 6.57 10%|█ | 1036/10000 [1:39:00<13:40:30, 5.49s/it] {'loss': 0.1786, 'grad_norm': 0.8087278008460999, 'learning_rate': 3.9434471500323616e-05, 'epoch': 1.04} 10%|█ | 1036/10000 [1:39:00<13:40:30, 5.49s/it][2025-06-19 15:08:45,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:08:45,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.85 | bwd_microstep: 3318.82 | bwd_inner_microstep: 3317.97 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.50 [2025-06-19 15:08:45,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.85 | bwd: 3318.84 | bwd_inner: 3317.97 | bwd_allreduce: 0.82 | step: 7.50 10%|█ | 1037/10000 [1:39:06<13:39:21, 5.48s/it] {'loss': 0.1512, 'grad_norm': 0.7793768644332886, 'learning_rate': 3.9432941002478654e-05, 'epoch': 1.04} 10%|█ | 
1037/10000 [1:39:06<13:39:21, 5.48s/it][2025-06-19 15:08:51,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:08:51,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.09 | bwd_microstep: 3324.99 | bwd_inner_microstep: 3324.08 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.90 [2025-06-19 15:08:51,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.09 | bwd: 3325.01 | bwd_inner: 3324.08 | bwd_allreduce: 0.88 | step: 7.92 10%|█ | 1038/10000 [1:39:11<13:38:40, 5.48s/it] {'loss': 0.0608, 'grad_norm': 0.3423984944820404, 'learning_rate': 3.9431408466208376e-05, 'epoch': 1.04} 10%|█ | 1038/10000 [1:39:11<13:38:40, 5.48s/it][2025-06-19 15:08:56,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:08:56,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.64 | bwd_microstep: 3373.55 | bwd_inner_microstep: 3372.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-19 15:08:56,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.64 | bwd: 3373.56 | bwd_inner: 3372.75 | bwd_allreduce: 0.77 | step: 7.07 10%|█ | 1039/10000 [1:39:17<13:41:34, 5.50s/it] {'loss': 0.1011, 'grad_norm': 0.6525359153747559, 'learning_rate': 3.942987389167353e-05, 'epoch': 1.04} 10%|█ | 1039/10000 [1:39:17<13:41:34, 5.50s/it][2025-06-19 15:09:02,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:09:02,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.85 | bwd_microstep: 3323.21 | bwd_inner_microstep: 3322.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 15:09:02,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.85 | bwd: 3323.23 | 
bwd_inner: 3322.43 | bwd_allreduce: 0.76 | step: 6.63 10%|█ | 1040/10000 [1:39:22<13:40:18, 5.49s/it] {'loss': 0.0827, 'grad_norm': 0.6305307149887085, 'learning_rate': 3.942833727903509e-05, 'epoch': 1.04} 10%|█ | 1040/10000 [1:39:22<13:40:18, 5.49s/it][2025-06-19 15:09:07,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 15:09:07,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.82 | bwd_microstep: 3320.07 | bwd_inner_microstep: 3318.96 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.04 [2025-06-19 15:09:07,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.82 | bwd: 3320.09 | bwd_inner: 3318.96 | bwd_allreduce: 1.08 | step: 8.06 10%|█ | 1041/10000 [1:39:28<13:39:10, 5.49s/it] {'loss': 0.0463, 'grad_norm': 0.2730240523815155, 'learning_rate': 3.942679862845424e-05, 'epoch': 1.04} 10%|█ | 1041/10000 [1:39:28<13:39:10, 5.49s/it][2025-06-19 15:09:13,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:09:13,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.62 | bwd_microstep: 3373.14 | bwd_inner_microstep: 3371.97 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.53 [2025-06-19 15:09:13,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.62 | bwd: 3373.15 | bwd_inner: 3371.97 | bwd_allreduce: 0.95 | step: 7.53 10%|█ | 1042/10000 [1:39:33<13:42:23, 5.51s/it] {'loss': 0.0709, 'grad_norm': 0.5192554593086243, 'learning_rate': 3.942525794009238e-05, 'epoch': 1.04} 10%|█ | 1042/10000 [1:39:33<13:42:23, 5.51s/it][2025-06-19 15:09:18,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:09:18,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.90 | 
bwd_microstep: 3362.05 | bwd_inner_microstep: 3361.25 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.71 [2025-06-19 15:09:18,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.91 | bwd: 3362.06 | bwd_inner: 3361.25 | bwd_allreduce: 0.77 | step: 6.71 10%|█ | 1043/10000 [1:39:39<13:43:19, 5.52s/it] {'loss': 0.0676, 'grad_norm': 0.36582574248313904, 'learning_rate': 3.9423715214111106e-05, 'epoch': 1.04} 10%|█ | 1043/10000 [1:39:39<13:43:19, 5.52s/it][2025-06-19 15:09:24,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:09:24,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.98 | bwd_microstep: 3315.92 | bwd_inner_microstep: 3315.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.26 [2025-06-19 15:09:24,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.98 | bwd: 3315.93 | bwd_inner: 3315.11 | bwd_allreduce: 0.78 | step: 7.26 10%|█ | 1044/10000 [1:39:44<13:40:41, 5.50s/it] {'loss': 0.1231, 'grad_norm': 0.8457975387573242, 'learning_rate': 3.942217045067226e-05, 'epoch': 1.04} 10%|█ | 1044/10000 [1:39:44<13:40:41, 5.50s/it][2025-06-19 15:09:29,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:09:29,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.29 | bwd_microstep: 3321.41 | bwd_inner_microstep: 3320.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 15:09:29,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.29 | bwd: 3321.43 | bwd_inner: 3320.62 | bwd_allreduce: 0.76 | step: 6.73 10%|█ | 1045/10000 [1:39:50<13:39:35, 5.49s/it] {'loss': 0.0882, 'grad_norm': 0.6265797019004822, 'learning_rate': 3.9420623649937884e-05, 'epoch': 1.04} 10%|█ | 1045/10000 [1:39:50<13:39:35, 5.49s/it][2025-06-19 15:09:35,023] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:09:35,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.08 | bwd_microstep: 3319.28 | bwd_inner_microstep: 3318.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.03 [2025-06-19 15:09:35,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.08 | bwd: 3319.29 | bwd_inner: 3318.48 | bwd_allreduce: 0.77 | step: 7.03 10%|█ | 1046/10000 [1:39:55<13:38:18, 5.48s/it] {'loss': 0.0749, 'grad_norm': 0.7962648868560791, 'learning_rate': 3.941907481207021e-05, 'epoch': 1.05} 10%|█ | 1046/10000 [1:39:55<13:38:18, 5.48s/it][2025-06-19 15:09:40,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:09:40,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.37 | bwd_microstep: 3373.99 | bwd_inner_microstep: 3373.14 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.70 [2025-06-19 15:09:40,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.37 | bwd: 3374.01 | bwd_inner: 3373.14 | bwd_allreduce: 0.82 | step: 6.71 10%|█ | 1047/10000 [1:40:01<13:40:41, 5.50s/it] {'loss': 0.0757, 'grad_norm': 0.768625795841217, 'learning_rate': 3.941752393723172e-05, 'epoch': 1.05} 10%|█ | 1047/10000 [1:40:01<13:40:41, 5.50s/it][2025-06-19 15:09:46,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:09:46,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.34 | bwd_microstep: 3319.71 | bwd_inner_microstep: 3318.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 15:09:46,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.34 | bwd: 3319.73 | bwd_inner: 3318.91 | bwd_allreduce: 0.78 | step: 7.24 10%|█ | 1048/10000 
[1:40:06<13:38:51, 5.49s/it] {'loss': 0.1082, 'grad_norm': 0.7858297228813171, 'learning_rate': 3.941597102558509e-05, 'epoch': 1.05} 10%|█ | 1048/10000 [1:40:06<13:38:51, 5.49s/it][2025-06-19 15:09:51,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:09:51,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.57 | bwd_microstep: 3318.19 | bwd_inner_microstep: 3317.25 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.19 [2025-06-19 15:09:51,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.57 | bwd: 3318.21 | bwd_inner: 3317.25 | bwd_allreduce: 0.90 | step: 7.18 10%|█ | 1049/10000 [1:40:12<13:37:33, 5.48s/it] {'loss': 0.1227, 'grad_norm': 0.8154328465461731, 'learning_rate': 3.941441607729321e-05, 'epoch': 1.05} 10%|█ | 1049/10000 [1:40:12<13:37:33, 5.48s/it][2025-06-19 15:09:57,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:09:57,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.29 | bwd_microstep: 3363.40 | bwd_inner_microstep: 3362.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 15:09:57,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.29 | bwd: 3363.42 | bwd_inner: 3362.60 | bwd_allreduce: 0.78 | step: 7.13 10%|█ | 1050/10000 [1:40:17<13:39:50, 5.50s/it] {'loss': 0.1121, 'grad_norm': 1.5111379623413086, 'learning_rate': 3.9412859092519184e-05, 'epoch': 1.05} 10%|█ | 1050/10000 [1:40:17<13:39:50, 5.50s/it][2025-06-19 15:10:02,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:10:02,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.95 | bwd_microstep: 3314.51 | bwd_inner_microstep: 3313.71 | bwd_allreduce_microstep: 0.76 
| step_microstep: 7.27 [2025-06-19 15:10:02,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.95 | bwd: 3314.53 | bwd_inner: 3313.71 | bwd_allreduce: 0.77 | step: 7.27 11%|█ | 1051/10000 [1:40:23<13:38:40, 5.49s/it] {'loss': 0.2487, 'grad_norm': 0.8436018228530884, 'learning_rate': 3.9411300071426345e-05, 'epoch': 1.05} 11%|█ | 1051/10000 [1:40:23<13:38:40, 5.49s/it][2025-06-19 15:10:07,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:10:07,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.24 | bwd_microstep: 3311.98 | bwd_inner_microstep: 3311.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 15:10:07,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.24 | bwd: 3311.99 | bwd_inner: 3311.19 | bwd_allreduce: 0.77 | step: 6.62 11%|█ | 1052/10000 [1:40:28<13:37:32, 5.48s/it] {'loss': 0.2031, 'grad_norm': 1.489958643913269, 'learning_rate': 3.940973901417821e-05, 'epoch': 1.05} 11%|█ | 1052/10000 [1:40:28<13:37:32, 5.48s/it][2025-06-19 15:10:13,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:10:13,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.61 | bwd_microstep: 3361.09 | bwd_inner_microstep: 3360.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 15:10:13,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.61 | bwd: 3361.10 | bwd_inner: 3360.29 | bwd_allreduce: 0.76 | step: 6.96 11%|█ | 1053/10000 [1:40:34<13:39:25, 5.50s/it] {'loss': 0.1467, 'grad_norm': 1.0714610815048218, 'learning_rate': 3.940817592093854e-05, 'epoch': 1.05} 11%|█ | 1053/10000 [1:40:34<13:39:25, 5.50s/it][2025-06-19 15:10:19,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | 
optimizer_step: 2.72 [2025-06-19 15:10:19,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.48 | bwd_microstep: 3367.34 | bwd_inner_microstep: 3366.53 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.65 [2025-06-19 15:10:19,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.48 | bwd: 3367.35 | bwd_inner: 3366.53 | bwd_allreduce: 0.78 | step: 6.65 11%|█ | 1054/10000 [1:40:39<13:40:41, 5.50s/it] {'loss': 0.1003, 'grad_norm': 0.8732638359069824, 'learning_rate': 3.94066107918713e-05, 'epoch': 1.05} 11%|█ | 1054/10000 [1:40:39<13:40:41, 5.50s/it][2025-06-19 15:10:24,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:10:24,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.04 | bwd_microstep: 3365.08 | bwd_inner_microstep: 3364.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 15:10:24,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.04 | bwd: 3365.10 | bwd_inner: 3364.27 | bwd_allreduce: 0.78 | step: 7.05 11%|█ | 1055/10000 [1:40:45<13:41:32, 5.51s/it] {'loss': 0.065, 'grad_norm': 0.4949965178966522, 'learning_rate': 3.9405043627140644e-05, 'epoch': 1.05} 11%|█ | 1055/10000 [1:40:45<13:41:32, 5.51s/it][2025-06-19 15:10:30,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:10:30,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.20 | bwd_microstep: 3359.74 | bwd_inner_microstep: 3358.85 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.88 [2025-06-19 15:10:30,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.20 | bwd: 3359.76 | bwd_inner: 3358.85 | bwd_allreduce: 0.86 | step: 6.88 11%|█ | 1056/10000 [1:40:50<13:42:11, 5.52s/it] {'loss': 0.0665, 'grad_norm': 0.4105072319507599, 'learning_rate': 
3.940347442691098e-05, 'epoch': 1.06} 11%|█ | 1056/10000 [1:40:50<13:42:11, 5.52s/it][2025-06-19 15:10:35,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:10:35,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.97 | bwd_microstep: 3316.96 | bwd_inner_microstep: 3316.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 15:10:35,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.96 | bwd: 3316.97 | bwd_inner: 3316.17 | bwd_allreduce: 0.76 | step: 6.89 11%|█ | 1057/10000 [1:40:56<13:39:50, 5.50s/it] {'loss': 0.1157, 'grad_norm': 0.9519124031066895, 'learning_rate': 3.94019031913469e-05, 'epoch': 1.06} 11%|█ | 1057/10000 [1:40:56<13:39:50, 5.50s/it][2025-06-19 15:10:40,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:10:40,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.48 | bwd_microstep: 3317.19 | bwd_inner_microstep: 3316.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.04 [2025-06-19 15:10:40,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.48 | bwd: 3317.20 | bwd_inner: 3316.39 | bwd_allreduce: 0.77 | step: 7.04 11%|█ | 1058/10000 [1:41:01<13:37:53, 5.49s/it] {'loss': 0.0741, 'grad_norm': 0.6677231192588806, 'learning_rate': 3.9400329920613224e-05, 'epoch': 1.06} 11%|█ | 1058/10000 [1:41:01<13:37:53, 5.49s/it][2025-06-19 15:10:46,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:10:46,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.83 | bwd_microstep: 3365.75 | bwd_inner_microstep: 3364.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 15:10:46,510] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2121.83 | bwd: 3365.77 | bwd_inner: 3364.96 | bwd_allreduce: 0.76 | step: 6.74 11%|█ | 1059/10000 [1:41:07<13:39:26, 5.50s/it] {'loss': 0.1037, 'grad_norm': 0.634925365447998, 'learning_rate': 3.939875461487498e-05, 'epoch': 1.06} 11%|█ | 1059/10000 [1:41:07<13:39:26, 5.50s/it][2025-06-19 15:10:51,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:10:51,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.63 | bwd_microstep: 3320.07 | bwd_inner_microstep: 3319.28 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 15:10:51,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.63 | bwd: 3320.08 | bwd_inner: 3319.28 | bwd_allreduce: 0.76 | step: 6.71 11%|█ | 1060/10000 [1:41:12<13:38:03, 5.49s/it] {'loss': 0.0842, 'grad_norm': 0.5012531876564026, 'learning_rate': 3.939717727429741e-05, 'epoch': 1.06} 11%|█ | 1060/10000 [1:41:12<13:38:03, 5.49s/it][2025-06-19 15:10:57,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:10:57,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.58 | bwd_microstep: 3369.95 | bwd_inner_microstep: 3369.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 15:10:57,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.58 | bwd: 3369.97 | bwd_inner: 3369.15 | bwd_allreduce: 0.77 | step: 6.67 11%|█ | 1061/10000 [1:41:18<13:39:43, 5.50s/it] {'loss': 0.0408, 'grad_norm': 0.3904375731945038, 'learning_rate': 3.9395597899045965e-05, 'epoch': 1.06} 11%|█ | 1061/10000 [1:41:18<13:39:43, 5.50s/it][2025-06-19 15:11:02,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:11:02,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2097.98 | bwd_microstep: 3321.86 | bwd_inner_microstep: 3321.05 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.13 [2025-06-19 15:11:02,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.98 | bwd: 3321.88 | bwd_inner: 3321.05 | bwd_allreduce: 0.79 | step: 7.13 11%|█ | 1062/10000 [1:41:23<13:37:54, 5.49s/it] {'loss': 0.0758, 'grad_norm': 0.5615527033805847, 'learning_rate': 3.9394016489286325e-05, 'epoch': 1.06} 11%|█ | 1062/10000 [1:41:23<13:37:54, 5.49s/it][2025-06-19 15:11:08,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:11:08,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.99 | bwd_microstep: 3319.54 | bwd_inner_microstep: 3318.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 15:11:08,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.99 | bwd: 3319.56 | bwd_inner: 3318.76 | bwd_allreduce: 0.76 | step: 6.72 11%|█ | 1063/10000 [1:41:29<13:36:24, 5.48s/it] {'loss': 0.1343, 'grad_norm': 0.6401475071907043, 'learning_rate': 3.939243304518436e-05, 'epoch': 1.06} 11%|█ | 1063/10000 [1:41:29<13:36:24, 5.48s/it][2025-06-19 15:11:14,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:11:14,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.86 | bwd_microstep: 3394.05 | bwd_inner_microstep: 3393.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 15:11:14,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.85 | bwd: 3394.06 | bwd_inner: 3393.26 | bwd_allreduce: 0.76 | step: 6.67 11%|█ | 1064/10000 [1:41:34<13:40:16, 5.51s/it] {'loss': 0.0722, 'grad_norm': 0.4957994520664215, 'learning_rate': 3.939084756690617e-05, 'epoch': 1.06} 11%|█ | 1064/10000 [1:41:34<13:40:16, 5.51s/it][2025-06-19 15:11:19,539] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:11:19,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.83 | bwd_microstep: 3370.10 | bwd_inner_microstep: 3369.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 15:11:19,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.83 | bwd: 3370.12 | bwd_inner: 3369.31 | bwd_allreduce: 0.76 | step: 6.64 11%|█ | 1065/10000 [1:41:40<13:41:30, 5.52s/it] {'loss': 0.0611, 'grad_norm': 0.5306382775306702, 'learning_rate': 3.938926005461807e-05, 'epoch': 1.06} 11%|█ | 1065/10000 [1:41:40<13:41:30, 5.52s/it][2025-06-19 15:11:25,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:11:25,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.52 | bwd_microstep: 3367.39 | bwd_inner_microstep: 3366.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 15:11:25,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.52 | bwd: 3367.40 | bwd_inner: 3366.59 | bwd_allreduce: 0.77 | step: 6.90 11%|█ | 1066/10000 [1:41:45<13:41:57, 5.52s/it] {'loss': 0.0521, 'grad_norm': 0.31466904282569885, 'learning_rate': 3.938767050848658e-05, 'epoch': 1.07} 11%|█ | 1066/10000 [1:41:45<13:41:57, 5.52s/it][2025-06-19 15:11:30,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:11:30,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.75 | bwd_microstep: 3361.78 | bwd_inner_microstep: 3360.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 15:11:30,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.75 | bwd: 3361.80 | bwd_inner: 3360.98 | bwd_allreduce: 0.77 | step: 6.95 11%|█ | 1067/10000 
[1:41:51<13:42:02, 5.52s/it] {'loss': 0.1439, 'grad_norm': 1.3146475553512573, 'learning_rate': 3.938607892867843e-05, 'epoch': 1.07} 11%|█ | 1067/10000 [1:41:51<13:42:02, 5.52s/it][2025-06-19 15:11:36,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-19 15:11:36,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.25 | bwd_microstep: 3317.06 | bwd_inner_microstep: 3315.94 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.79 [2025-06-19 15:11:36,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.25 | bwd: 3317.09 | bwd_inner: 3315.94 | bwd_allreduce: 1.07 | step: 7.78 11%|█ | 1068/10000 [1:41:56<13:39:40, 5.51s/it] {'loss': 0.0679, 'grad_norm': 0.6791871786117554, 'learning_rate': 3.938448531536057e-05, 'epoch': 1.07} 11%|█ | 1068/10000 [1:41:56<13:39:40, 5.51s/it][2025-06-19 15:11:41,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:11:41,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.12 | bwd_microstep: 3397.83 | bwd_inner_microstep: 3396.93 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.42 [2025-06-19 15:11:41,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.12 | bwd: 3397.85 | bwd_inner: 3396.93 | bwd_allreduce: 0.87 | step: 7.42 11%|█ | 1069/10000 [1:42:02<13:42:42, 5.53s/it] {'loss': 0.0864, 'grad_norm': 0.6367367506027222, 'learning_rate': 3.938288966870018e-05, 'epoch': 1.07} 11%|█ | 1069/10000 [1:42:02<13:42:42, 5.53s/it][2025-06-19 15:11:47,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:11:47,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.61 | bwd_microstep: 3320.25 | bwd_inner_microstep: 3319.45 | bwd_allreduce_microstep: 0.76 
| step_microstep: 6.68 [2025-06-19 15:11:47,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.61 | bwd: 3320.27 | bwd_inner: 3319.45 | bwd_allreduce: 0.78 | step: 6.69 11%|█ | 1070/10000 [1:42:07<13:39:47, 5.51s/it] {'loss': 0.171, 'grad_norm': 1.026442289352417, 'learning_rate': 3.9381291988864613e-05, 'epoch': 1.07} 11%|█ | 1070/10000 [1:42:07<13:39:47, 5.51s/it][2025-06-19 15:11:52,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:11:52,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.57 | bwd_microstep: 3318.34 | bwd_inner_microstep: 3317.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 15:11:52,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.57 | bwd: 3318.36 | bwd_inner: 3317.53 | bwd_allreduce: 0.78 | step: 7.04 11%|█ | 1071/10000 [1:42:13<13:37:42, 5.49s/it] {'loss': 0.0733, 'grad_norm': 0.8266109824180603, 'learning_rate': 3.937969227602147e-05, 'epoch': 1.07} 11%|█ | 1071/10000 [1:42:13<13:37:42, 5.49s/it][2025-06-19 15:11:58,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:11:58,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.33 | bwd_microstep: 3311.84 | bwd_inner_microstep: 3310.90 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.03 [2025-06-19 15:11:58,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.33 | bwd: 3311.86 | bwd_inner: 3310.90 | bwd_allreduce: 0.92 | step: 7.04 11%|█ | 1072/10000 [1:42:18<13:35:45, 5.48s/it] {'loss': 0.1292, 'grad_norm': 0.9339929819107056, 'learning_rate': 3.9378090530338544e-05, 'epoch': 1.07} 11%|█ | 1072/10000 [1:42:18<13:35:45, 5.48s/it][2025-06-19 15:12:03,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.73 [2025-06-19 15:12:03,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.75 | bwd_microstep: 3311.64 | bwd_inner_microstep: 3310.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.50 [2025-06-19 15:12:03,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.75 | bwd: 3311.65 | bwd_inner: 3310.86 | bwd_allreduce: 0.75 | step: 6.50 11%|█ | 1073/10000 [1:42:24<13:34:38, 5.48s/it] {'loss': 0.0827, 'grad_norm': 1.1040101051330566, 'learning_rate': 3.937648675198387e-05, 'epoch': 1.07} 11%|█ | 1073/10000 [1:42:24<13:34:38, 5.48s/it][2025-06-19 15:12:08,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:12:08,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.99 | bwd_microstep: 3317.16 | bwd_inner_microstep: 3316.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 15:12:08,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.99 | bwd: 3317.17 | bwd_inner: 3316.35 | bwd_allreduce: 0.78 | step: 6.99 11%|█ | 1074/10000 [1:42:29<13:33:49, 5.47s/it] {'loss': 0.1185, 'grad_norm': 0.7213156819343567, 'learning_rate': 3.937488094112566e-05, 'epoch': 1.07} 11%|█ | 1074/10000 [1:42:29<13:33:49, 5.47s/it][2025-06-19 15:12:14,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.86 [2025-06-19 15:12:14,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.40 | bwd_microstep: 3324.75 | bwd_inner_microstep: 3323.97 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 15:12:14,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.40 | bwd: 3324.77 | bwd_inner: 3323.97 | bwd_allreduce: 0.76 | step: 6.77 11%|█ | 1075/10000 [1:42:35<13:34:04, 5.47s/it] {'loss': 0.0969, 'grad_norm': 0.7711635231971741, 'learning_rate': 
3.9373273097932354e-05, 'epoch': 1.07} 11%|█ | 1075/10000 [1:42:35<13:34:04, 5.47s/it][2025-06-19 15:12:19,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:12:19,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.79 | bwd_microstep: 3400.26 | bwd_inner_microstep: 3399.25 | bwd_allreduce_microstep: 0.97 | step_microstep: 6.93 [2025-06-19 15:12:19,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.80 | bwd: 3400.27 | bwd_inner: 3399.25 | bwd_allreduce: 0.98 | step: 6.93 11%|█ | 1076/10000 [1:42:40<13:38:27, 5.50s/it] {'loss': 0.0551, 'grad_norm': 0.4648144841194153, 'learning_rate': 3.9371663222572625e-05, 'epoch': 1.08} 11%|█ | 1076/10000 [1:42:40<13:38:27, 5.50s/it][2025-06-19 15:12:25,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:12:25,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.08 | bwd_microstep: 3311.73 | bwd_inner_microstep: 3310.89 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.37 [2025-06-19 15:12:25,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.08 | bwd: 3311.74 | bwd_inner: 3310.89 | bwd_allreduce: 0.82 | step: 7.37 11%|█ | 1077/10000 [1:42:46<13:36:22, 5.49s/it] {'loss': 0.1131, 'grad_norm': 1.0989906787872314, 'learning_rate': 3.9370051315215326e-05, 'epoch': 1.08} 11%|█ | 1077/10000 [1:42:46<13:36:22, 5.49s/it][2025-06-19 15:12:30,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:12:30,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.03 | bwd_microstep: 3322.30 | bwd_inner_microstep: 3321.47 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.29 [2025-06-19 15:12:30,923] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2111.03 | bwd: 3322.32 | bwd_inner: 3321.47 | bwd_allreduce: 0.80 | step: 7.29 11%|█ | 1078/10000 [1:42:51<13:35:45, 5.49s/it] {'loss': 0.1849, 'grad_norm': 0.7628182768821716, 'learning_rate': 3.9368437376029545e-05, 'epoch': 1.08} 11%|█ | 1078/10000 [1:42:51<13:35:45, 5.49s/it][2025-06-19 15:12:36,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:12:36,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.05 | bwd_microstep: 3364.99 | bwd_inner_microstep: 3364.08 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.93 [2025-06-19 15:12:36,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.05 | bwd: 3365.00 | bwd_inner: 3364.08 | bwd_allreduce: 0.88 | step: 6.93 11%|█ | 1079/10000 [1:42:57<13:37:55, 5.50s/it] {'loss': 0.1446, 'grad_norm': 1.0241882801055908, 'learning_rate': 3.936682140518457e-05, 'epoch': 1.08} 11%|█ | 1079/10000 [1:42:57<13:37:55, 5.50s/it][2025-06-19 15:12:42,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.74 [2025-06-19 15:12:42,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.77 | bwd_microstep: 3364.25 | bwd_inner_microstep: 3363.43 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.28 [2025-06-19 15:12:42,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.77 | bwd: 3364.26 | bwd_inner: 3363.43 | bwd_allreduce: 0.79 | step: 7.28 11%|█ | 1080/10000 [1:43:02<13:39:36, 5.51s/it] {'loss': 0.0879, 'grad_norm': 0.8424712419509888, 'learning_rate': 3.9365203402849915e-05, 'epoch': 1.08} 11%|█ | 1080/10000 [1:43:02<13:39:36, 5.51s/it][2025-06-19 15:12:47,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:12:47,540] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2135.33 | bwd_microstep: 3364.22 | bwd_inner_microstep: 3363.39 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.73 [2025-06-19 15:12:47,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.33 | bwd: 3364.23 | bwd_inner: 3363.39 | bwd_allreduce: 0.80 | step: 6.73 11%|█ | 1081/10000 [1:43:08<13:40:46, 5.52s/it] {'loss': 0.0986, 'grad_norm': 0.6288619041442871, 'learning_rate': 3.9363583369195306e-05, 'epoch': 1.08} 11%|█ | 1081/10000 [1:43:08<13:40:46, 5.52s/it][2025-06-19 15:12:53,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:12:53,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.61 | bwd_microstep: 3320.81 | bwd_inner_microstep: 3319.99 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.26 [2025-06-19 15:12:53,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.62 | bwd: 3320.83 | bwd_inner: 3319.99 | bwd_allreduce: 0.80 | step: 7.27 11%|█ | 1082/10000 [1:43:13<13:38:19, 5.51s/it] {'loss': 0.2175, 'grad_norm': 1.4845889806747437, 'learning_rate': 3.936196130439067e-05, 'epoch': 1.08} 11%|█ | 1082/10000 [1:43:13<13:38:19, 5.51s/it][2025-06-19 15:12:58,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:12:58,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.27 | bwd_microstep: 3376.51 | bwd_inner_microstep: 3375.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 15:12:58,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.27 | bwd: 3376.53 | bwd_inner: 3375.70 | bwd_allreduce: 0.77 | step: 6.99 11%|█ | 1083/10000 [1:43:19<13:40:20, 5.52s/it] {'loss': 0.0679, 'grad_norm': 0.7131879925727844, 'learning_rate': 3.936033720860615e-05, 'epoch': 1.08} 11%|█ | 1083/10000 [1:43:19<13:40:20, 5.52s/it][2025-06-19 
15:13:04,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 15:13:04,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.60 | bwd_microstep: 3321.98 | bwd_inner_microstep: 3320.95 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.48 [2025-06-19 15:13:04,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.60 | bwd: 3321.99 | bwd_inner: 3320.95 | bwd_allreduce: 1.00 | step: 7.49 11%|█ | 1084/10000 [1:43:24<13:38:20, 5.51s/it] {'loss': 0.1225, 'grad_norm': 0.9484450221061707, 'learning_rate': 3.935871108201211e-05, 'epoch': 1.08} 11%|█ | 1084/10000 [1:43:24<13:38:20, 5.51s/it][2025-06-19 15:13:09,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:13:09,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.06 | bwd_microstep: 3377.56 | bwd_inner_microstep: 3376.56 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.61 [2025-06-19 15:13:09,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.06 | bwd: 3377.58 | bwd_inner: 3376.56 | bwd_allreduce: 0.97 | step: 7.61 11%|█ | 1085/10000 [1:43:30<13:40:30, 5.52s/it] {'loss': 0.0667, 'grad_norm': 0.4976966381072998, 'learning_rate': 3.935708292477913e-05, 'epoch': 1.08} 11%|█ | 1085/10000 [1:43:30<13:40:30, 5.52s/it][2025-06-19 15:13:15,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:13:15,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.41 | bwd_microstep: 3331.71 | bwd_inner_microstep: 3330.86 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.51 [2025-06-19 15:13:15,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.41 | bwd: 3331.73 | bwd_inner: 3330.86 | bwd_allreduce: 0.81 | step: 7.51 11%|█ 
| 1086/10000 [1:43:35<13:38:59, 5.51s/it] {'loss': 0.0996, 'grad_norm': 0.5991069674491882, 'learning_rate': 3.935545273707798e-05, 'epoch': 1.09} 11%|█ | 1086/10000 [1:43:35<13:38:59, 5.51s/it][2025-06-19 15:13:20,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:13:20,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.13 | bwd_microstep: 3396.87 | bwd_inner_microstep: 3396.06 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-19 15:13:20,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.13 | bwd: 3396.88 | bwd_inner: 3396.06 | bwd_allreduce: 0.78 | step: 6.84 11%|█ | 1087/10000 [1:43:41<13:42:00, 5.53s/it] {'loss': 0.1032, 'grad_norm': 0.9053652286529541, 'learning_rate': 3.935382051907968e-05, 'epoch': 1.09} 11%|█ | 1087/10000 [1:43:41<13:42:00, 5.53s/it][2025-06-19 15:13:26,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:13:26,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.65 | bwd_microstep: 3327.89 | bwd_inner_microstep: 3326.96 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.45 [2025-06-19 15:13:26,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.65 | bwd: 3327.90 | bwd_inner: 3326.96 | bwd_allreduce: 0.90 | step: 7.45 11%|█ | 1088/10000 [1:43:46<13:39:34, 5.52s/it] {'loss': 0.2064, 'grad_norm': 1.08054518699646, 'learning_rate': 3.9352186270955426e-05, 'epoch': 1.09} 11%|█ | 1088/10000 [1:43:46<13:39:34, 5.52s/it][2025-06-19 15:13:31,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:13:31,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.16 | bwd_microstep: 3370.49 | bwd_inner_microstep: 3369.69 | 
bwd_allreduce_microstep: 0.76 | step_microstep: 6.65 [2025-06-19 15:13:31,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.16 | bwd: 3370.50 | bwd_inner: 3369.69 | bwd_allreduce: 0.77 | step: 6.65 11%|█ | 1089/10000 [1:43:52<13:40:21, 5.52s/it] {'loss': 0.2068, 'grad_norm': 1.2069363594055176, 'learning_rate': 3.935054999287665e-05, 'epoch': 1.09} 11%|█ | 1089/10000 [1:43:52<13:40:21, 5.52s/it][2025-06-19 15:13:37,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:13:37,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.46 | bwd_microstep: 3324.16 | bwd_inner_microstep: 3323.34 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.26 [2025-06-19 15:13:37,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.46 | bwd: 3324.18 | bwd_inner: 3323.34 | bwd_allreduce: 0.79 | step: 7.27 11%|█ | 1090/10000 [1:43:57<13:37:42, 5.51s/it] {'loss': 0.0699, 'grad_norm': 0.5165584087371826, 'learning_rate': 3.934891168501499e-05, 'epoch': 1.09} 11%|█ | 1090/10000 [1:43:57<13:37:42, 5.51s/it][2025-06-19 15:13:42,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:13:42,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.73 | bwd_microstep: 3370.93 | bwd_inner_microstep: 3370.09 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.86 [2025-06-19 15:13:42,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.73 | bwd: 3370.95 | bwd_inner: 3370.09 | bwd_allreduce: 0.80 | step: 6.86 11%|█ | 1091/10000 [1:44:03<13:39:12, 5.52s/it] {'loss': 0.0932, 'grad_norm': 0.6387456655502319, 'learning_rate': 3.934727134754229e-05, 'epoch': 1.09} 11%|█ | 1091/10000 [1:44:03<13:39:12, 5.52s/it][2025-06-19 15:13:48,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:13:48,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.60 | bwd_microstep: 3316.52 | bwd_inner_microstep: 3315.71 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 15:13:48,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.60 | bwd: 3316.53 | bwd_inner: 3315.71 | bwd_allreduce: 0.78 | step: 6.83 11%|█ | 1092/10000 [1:44:08<13:36:56, 5.50s/it] {'loss': 0.0645, 'grad_norm': 0.3783479630947113, 'learning_rate': 3.9345628980630624e-05, 'epoch': 1.09} 11%|█ | 1092/10000 [1:44:08<13:36:56, 5.50s/it][2025-06-19 15:13:53,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:13:53,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.84 | bwd_microstep: 3326.91 | bwd_inner_microstep: 3326.09 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-19 15:13:53,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.84 | bwd: 3326.93 | bwd_inner: 3326.09 | bwd_allreduce: 0.79 | step: 7.22 11%|█ | 1093/10000 [1:44:14<13:35:57, 5.50s/it] {'loss': 0.0999, 'grad_norm': 0.6143218278884888, 'learning_rate': 3.934398458445226e-05, 'epoch': 1.09} 11%|█ | 1093/10000 [1:44:14<13:35:57, 5.50s/it][2025-06-19 15:13:59,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:13:59,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.01 | bwd_microstep: 3377.43 | bwd_inner_microstep: 3376.64 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 15:13:59,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.01 | bwd: 3377.44 | bwd_inner: 3376.64 | bwd_allreduce: 0.76 | step: 6.61 11%|█ | 1094/10000 [1:44:19<13:38:18, 5.51s/it] {'loss': 0.0794, 'grad_norm': 0.6852134466171265, 
'learning_rate': 3.93423381591797e-05, 'epoch': 1.09} 11%|█ | 1094/10000 [1:44:20<13:38:18, 5.51s/it][2025-06-19 15:14:04,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 15:14:04,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.11 | bwd_microstep: 3324.44 | bwd_inner_microstep: 3323.63 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.87 [2025-06-19 15:14:04,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.11 | bwd: 3324.46 | bwd_inner: 3323.63 | bwd_allreduce: 0.79 | step: 6.88 11%|█ | 1095/10000 [1:44:25<13:36:19, 5.50s/it] {'loss': 0.1082, 'grad_norm': 0.5952850580215454, 'learning_rate': 3.934068970498563e-05, 'epoch': 1.09} 11%|█ | 1095/10000 [1:44:25<13:36:19, 5.50s/it][2025-06-19 15:14:10,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:14:10,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.97 | bwd_microstep: 3374.70 | bwd_inner_microstep: 3373.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 15:14:10,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.97 | bwd: 3374.71 | bwd_inner: 3373.90 | bwd_allreduce: 0.76 | step: 6.65 11%|█ | 1096/10000 [1:44:31<13:38:06, 5.51s/it] {'loss': 0.1375, 'grad_norm': 1.045731782913208, 'learning_rate': 3.933903922204297e-05, 'epoch': 1.1} 11%|█ | 1096/10000 [1:44:31<13:38:06, 5.51s/it][2025-06-19 15:14:15,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:14:15,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.56 | bwd_microstep: 3330.44 | bwd_inner_microstep: 3329.62 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.35 [2025-06-19 15:14:15,696] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2112.56 | bwd: 3330.46 | bwd_inner: 3329.62 | bwd_allreduce: 0.79 | step: 7.36 11%|█ | 1097/10000 [1:44:36<13:36:42, 5.50s/it] {'loss': 0.1223, 'grad_norm': 0.8040278553962708, 'learning_rate': 3.9337386710524855e-05, 'epoch': 1.1} 11%|█ | 1097/10000 [1:44:36<13:36:42, 5.50s/it][2025-06-19 15:14:21,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:14:21,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.00 | bwd_microstep: 3330.75 | bwd_inner_microstep: 3329.86 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.74 [2025-06-19 15:14:21,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.00 | bwd: 3330.77 | bwd_inner: 3329.86 | bwd_allreduce: 0.87 | step: 6.75 11%|█ | 1098/10000 [1:44:41<13:35:27, 5.50s/it] {'loss': 0.0562, 'grad_norm': 0.3823105990886688, 'learning_rate': 3.9335732170604624e-05, 'epoch': 1.1} 11%|█ | 1098/10000 [1:44:41<13:35:27, 5.50s/it][2025-06-19 15:14:26,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:14:26,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.14 | bwd_microstep: 3319.38 | bwd_inner_microstep: 3318.58 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 15:14:26,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.14 | bwd: 3319.40 | bwd_inner: 3318.58 | bwd_allreduce: 0.78 | step: 7.09 11%|█ | 1099/10000 [1:44:47<13:34:24, 5.49s/it] {'loss': 0.097, 'grad_norm': 0.7256019711494446, 'learning_rate': 3.933407560245583e-05, 'epoch': 1.1} 11%|█ | 1099/10000 [1:44:47<13:34:24, 5.49s/it][2025-06-19 15:14:32,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:14:32,209] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2143.04 | bwd_microstep: 3378.02 | bwd_inner_microstep: 3377.09 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.51 [2025-06-19 15:14:32,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.04 | bwd: 3378.03 | bwd_inner: 3377.09 | bwd_allreduce: 0.90 | step: 7.52 11%|█ | 1100/10000 [1:44:53<13:37:28, 5.51s/it] {'loss': 0.0999, 'grad_norm': 0.7563206553459167, 'learning_rate': 3.933241700625223e-05, 'epoch': 1.1} 11%|█ | 1100/10000 [1:44:53<13:37:28, 5.51s/it][2025-06-19 15:14:37,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:14:37,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.70 | bwd_microstep: 3324.76 | bwd_inner_microstep: 3323.80 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.04 [2025-06-19 15:14:37,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.70 | bwd: 3324.78 | bwd_inner: 3323.80 | bwd_allreduce: 0.93 | step: 7.04 11%|█ | 1101/10000 [1:44:58<13:35:49, 5.50s/it] {'loss': 0.0912, 'grad_norm': 0.5795722603797913, 'learning_rate': 3.9330756382167814e-05, 'epoch': 1.1} 11%|█ | 1101/10000 [1:44:58<13:35:49, 5.50s/it][2025-06-19 15:14:43,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:14:43,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.48 | bwd_microstep: 3324.76 | bwd_inner_microstep: 3323.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 15:14:43,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.48 | bwd: 3324.77 | bwd_inner: 3323.95 | bwd_allreduce: 0.78 | step: 7.19 11%|█ | 1102/10000 [1:45:03<13:35:00, 5.50s/it] {'loss': 0.1457, 'grad_norm': 0.7636048793792725, 'learning_rate': 3.932909373037677e-05, 'epoch': 1.1} 11%|█ | 1102/10000 [1:45:03<13:35:00, 5.50s/it][2025-06-19 
15:14:48,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 15:14:48,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.68 | bwd_microstep: 3325.76 | bwd_inner_microstep: 3324.82 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.02 [2025-06-19 15:14:48,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.68 | bwd: 3325.77 | bwd_inner: 3324.82 | bwd_allreduce: 0.91 | step: 7.02 11%|█ | 1103/10000 [1:45:09<13:33:54, 5.49s/it] {'loss': 0.1197, 'grad_norm': 0.8659831285476685, 'learning_rate': 3.93274290510535e-05, 'epoch': 1.1} 11%|█ | 1103/10000 [1:45:09<13:33:54, 5.49s/it][2025-06-19 15:14:54,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:14:54,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.11 | bwd_microstep: 3325.74 | bwd_inner_microstep: 3324.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.40 [2025-06-19 15:14:54,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.11 | bwd: 3325.76 | bwd_inner: 3324.94 | bwd_allreduce: 0.77 | step: 7.41 11%|█ | 1104/10000 [1:45:14<13:33:22, 5.49s/it] {'loss': 0.1175, 'grad_norm': 0.8022148013114929, 'learning_rate': 3.932576234437263e-05, 'epoch': 1.1} 11%|█ | 1104/10000 [1:45:14<13:33:22, 5.49s/it][2025-06-19 15:14:59,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:14:59,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.38 | bwd_microstep: 3377.02 | bwd_inner_microstep: 3376.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 15:14:59,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.38 | bwd: 3377.04 | bwd_inner: 3376.21 | bwd_allreduce: 0.78 | step: 7.24 11%|█ | 
1105/10000 [1:45:20<13:35:59, 5.50s/it] {'loss': 0.1079, 'grad_norm': 0.6541983485221863, 'learning_rate': 3.932409361050898e-05, 'epoch': 1.1} 11%|█ | 1105/10000 [1:45:20<13:35:59, 5.50s/it][2025-06-19 15:15:05,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 15:15:05,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.17 | bwd_microstep: 3328.99 | bwd_inner_microstep: 3327.92 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.73 [2025-06-19 15:15:05,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.17 | bwd: 3329.01 | bwd_inner: 3327.92 | bwd_allreduce: 1.04 | step: 7.73 11%|█ | 1106/10000 [1:45:25<13:35:03, 5.50s/it] {'loss': 0.1332, 'grad_norm': 0.7657188177108765, 'learning_rate': 3.9322422849637596e-05, 'epoch': 1.11} 11%|█ | 1106/10000 [1:45:25<13:35:03, 5.50s/it][2025-06-19 15:15:10,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.73 [2025-06-19 15:15:10,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.31 | bwd_microstep: 3319.78 | bwd_inner_microstep: 3318.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 15:15:10,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.31 | bwd: 3319.79 | bwd_inner: 3318.99 | bwd_allreduce: 0.77 | step: 7.08 11%|█ | 1107/10000 [1:45:31<13:33:41, 5.49s/it] {'loss': 0.1522, 'grad_norm': 0.8655588626861572, 'learning_rate': 3.9320750061933736e-05, 'epoch': 1.11} 11%|█ | 1107/10000 [1:45:31<13:33:41, 5.49s/it][2025-06-19 15:15:16,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:15:16,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.89 | bwd_microstep: 3326.00 | bwd_inner_microstep: 3325.05 | 
bwd_allreduce_microstep: 0.90 | step_microstep: 7.05 [2025-06-19 15:15:16,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.89 | bwd: 3326.02 | bwd_inner: 3325.05 | bwd_allreduce: 0.92 | step: 7.06 11%|█ | 1108/10000 [1:45:36<13:33:01, 5.49s/it] {'loss': 0.1527, 'grad_norm': 1.0228383541107178, 'learning_rate': 3.9319075247572856e-05, 'epoch': 1.11} 11%|█ | 1108/10000 [1:45:36<13:33:01, 5.49s/it][2025-06-19 15:15:21,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:15:21,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.84 | bwd_microstep: 3341.13 | bwd_inner_microstep: 3340.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-19 15:15:21,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.84 | bwd: 3341.15 | bwd_inner: 3340.34 | bwd_allreduce: 0.77 | step: 7.15 11%|█ | 1109/10000 [1:45:42<13:33:13, 5.49s/it] {'loss': 0.1131, 'grad_norm': 0.7751604318618774, 'learning_rate': 3.931739840673066e-05, 'epoch': 1.11} 11%|█ | 1109/10000 [1:45:42<13:33:13, 5.49s/it][2025-06-19 15:15:27,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:15:27,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.42 | bwd_microstep: 3371.52 | bwd_inner_microstep: 3370.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 15:15:27,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.42 | bwd: 3371.54 | bwd_inner: 3370.72 | bwd_allreduce: 0.77 | step: 6.80 11%|█ | 1110/10000 [1:45:47<13:35:32, 5.50s/it] {'loss': 0.0802, 'grad_norm': 0.4343569278717041, 'learning_rate': 3.931571953958301e-05, 'epoch': 1.11} 11%|█ | 1110/10000 [1:45:47<13:35:32, 5.50s/it][2025-06-19 15:15:32,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:15:32,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.15 | bwd_microstep: 3334.31 | bwd_inner_microstep: 3333.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 15:15:32,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.15 | bwd: 3334.32 | bwd_inner: 3333.51 | bwd_allreduce: 0.77 | step: 7.14 11%|█ | 1111/10000 [1:45:53<13:34:56, 5.50s/it] {'loss': 0.1073, 'grad_norm': 0.8200125098228455, 'learning_rate': 3.931403864630604e-05, 'epoch': 1.11} 11%|█ | 1111/10000 [1:45:53<13:34:56, 5.50s/it][2025-06-19 15:15:38,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:15:38,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.71 | bwd_microstep: 3330.80 | bwd_inner_microstep: 3330.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 15:15:38,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.71 | bwd: 3330.82 | bwd_inner: 3330.02 | bwd_allreduce: 0.76 | step: 6.63 11%|█ | 1112/10000 [1:45:58<13:33:40, 5.49s/it] {'loss': 0.0691, 'grad_norm': 0.4301200211048126, 'learning_rate': 3.931235572707605e-05, 'epoch': 1.11} 11%|█ | 1112/10000 [1:45:58<13:33:40, 5.49s/it][2025-06-19 15:15:43,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:15:43,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.48 | bwd_microstep: 3330.82 | bwd_inner_microstep: 3330.02 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 15:15:43,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.48 | bwd: 3330.84 | bwd_inner: 3330.02 | bwd_allreduce: 0.77 | step: 7.19 11%|█ | 1113/10000 [1:46:04<13:32:49, 5.49s/it] {'loss': 0.0944, 'grad_norm': 0.7025145292282104, 
'learning_rate': 3.931067078206957e-05, 'epoch': 1.11} 11%|█ | 1113/10000 [1:46:04<13:32:49, 5.49s/it][2025-06-19 15:15:49,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:15:49,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.28 | bwd_microstep: 3335.47 | bwd_inner_microstep: 3334.63 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.65 [2025-06-19 15:15:49,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.28 | bwd: 3335.49 | bwd_inner: 3334.63 | bwd_allreduce: 0.80 | step: 7.66 11%|█ | 1114/10000 [1:46:09<13:33:08, 5.49s/it] {'loss': 0.0879, 'grad_norm': 0.9484989047050476, 'learning_rate': 3.930898381146336e-05, 'epoch': 1.11} 11%|█ | 1114/10000 [1:46:09<13:33:08, 5.49s/it][2025-06-19 15:15:54,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:15:54,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.18 | bwd_microstep: 3375.32 | bwd_inner_microstep: 3374.51 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.56 [2025-06-19 15:15:54,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.18 | bwd: 3375.34 | bwd_inner: 3374.51 | bwd_allreduce: 0.79 | step: 7.57 11%|█ | 1115/10000 [1:46:15<13:36:00, 5.51s/it] {'loss': 0.0865, 'grad_norm': 0.6104817390441895, 'learning_rate': 3.930729481543436e-05, 'epoch': 1.11} 11%|█ | 1115/10000 [1:46:15<13:36:00, 5.51s/it][2025-06-19 15:16:00,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:16:00,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.02 | bwd_microstep: 3329.12 | bwd_inner_microstep: 3328.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 15:16:00,116] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.02 | bwd: 3329.14 | bwd_inner: 3328.33 | bwd_allreduce: 0.76 | step: 6.72 11%|█ | 1116/10000 [1:46:20<13:34:40, 5.50s/it] {'loss': 0.1886, 'grad_norm': 1.0692694187164307, 'learning_rate': 3.930560379415974e-05, 'epoch': 1.12} 11%|█ | 1116/10000 [1:46:20<13:34:40, 5.50s/it][2025-06-19 15:16:05,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:16:05,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.73 | bwd_microstep: 3377.24 | bwd_inner_microstep: 3376.43 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.41 [2025-06-19 15:16:05,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.73 | bwd: 3377.26 | bwd_inner: 3376.43 | bwd_allreduce: 0.78 | step: 7.42 11%|█ | 1117/10000 [1:46:26<13:36:50, 5.52s/it] {'loss': 0.1027, 'grad_norm': 0.5966610312461853, 'learning_rate': 3.930391074781688e-05, 'epoch': 1.12} 11%|█ | 1117/10000 [1:46:26<13:36:50, 5.52s/it][2025-06-19 15:16:11,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:16:11,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.55 | bwd_microstep: 3324.48 | bwd_inner_microstep: 3323.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 15:16:11,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.55 | bwd: 3324.49 | bwd_inner: 3323.69 | bwd_allreduce: 0.76 | step: 6.67 11%|█ | 1118/10000 [1:46:31<13:34:40, 5.50s/it] {'loss': 0.0551, 'grad_norm': 0.4982689619064331, 'learning_rate': 3.930221567658338e-05, 'epoch': 1.12} 11%|█ | 1118/10000 [1:46:31<13:34:40, 5.50s/it][2025-06-19 15:16:16,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-19 15:16:16,685] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.99 | bwd_microstep: 3379.64 | bwd_inner_microstep: 3378.77 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.74 [2025-06-19 15:16:16,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.99 | bwd: 3379.65 | bwd_inner: 3378.77 | bwd_allreduce: 0.83 | step: 7.75 11%|█ | 1119/10000 [1:46:37<13:36:31, 5.52s/it] {'loss': 0.0921, 'grad_norm': 0.8193861842155457, 'learning_rate': 3.930051858063704e-05, 'epoch': 1.12} 11%|█ | 1119/10000 [1:46:37<13:36:31, 5.52s/it][2025-06-19 15:16:22,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 15:16:22,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.30 | bwd_microstep: 3337.69 | bwd_inner_microstep: 3336.63 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.13 [2025-06-19 15:16:22,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.30 | bwd: 3337.71 | bwd_inner: 3336.63 | bwd_allreduce: 1.03 | step: 7.14 11%|█ | 1120/10000 [1:46:42<13:35:07, 5.51s/it] {'loss': 0.0812, 'grad_norm': 0.6491066813468933, 'learning_rate': 3.929881946015587e-05, 'epoch': 1.12} 11%|█ | 1120/10000 [1:46:42<13:35:07, 5.51s/it][2025-06-19 15:16:27,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:16:27,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.77 | bwd_microstep: 3344.61 | bwd_inner_microstep: 3343.79 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.34 [2025-06-19 15:16:27,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.77 | bwd: 3344.62 | bwd_inner: 3343.79 | bwd_allreduce: 0.79 | step: 7.34 11%|█ | 1121/10000 [1:46:48<13:34:42, 5.51s/it] {'loss': 0.0555, 'grad_norm': 0.4632624685764313, 'learning_rate': 3.929711831531811e-05, 'epoch': 1.12} 11%|█ | 1121/10000 
[1:46:48<13:34:42, 5.51s/it][2025-06-19 15:16:33,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:16:33,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.51 | bwd_microstep: 3376.71 | bwd_inner_microstep: 3375.52 | bwd_allreduce_microstep: 1.13 | step_microstep: 7.21 [2025-06-19 15:16:33,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.51 | bwd: 3376.73 | bwd_inner: 3375.52 | bwd_allreduce: 1.16 | step: 7.22 11%|█ | 1122/10000 [1:46:54<13:36:29, 5.52s/it] {'loss': 0.1058, 'grad_norm': 0.7335248589515686, 'learning_rate': 3.92954151463022e-05, 'epoch': 1.12} 11%|█ | 1122/10000 [1:46:54<13:36:29, 5.52s/it][2025-06-19 15:16:38,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:16:38,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.90 | bwd_microstep: 3322.93 | bwd_inner_microstep: 3322.11 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.34 [2025-06-19 15:16:38,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.90 | bwd: 3322.95 | bwd_inner: 3322.11 | bwd_allreduce: 0.79 | step: 7.35 11%|█ | 1123/10000 [1:46:59<13:34:07, 5.50s/it] {'loss': 0.1447, 'grad_norm': 0.6890469789505005, 'learning_rate': 3.92937099532868e-05, 'epoch': 1.12} 11%|█ | 1123/10000 [1:46:59<13:34:07, 5.50s/it][2025-06-19 15:16:44,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:16:44,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.67 | bwd_microstep: 3385.83 | bwd_inner_microstep: 3385.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 15:16:44,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.67 | bwd: 3385.84 | bwd_inner: 3385.03 | 
bwd_allreduce: 0.76 | step: 6.73 11%|█ | 1124/10000 [1:47:05<13:36:12, 5.52s/it] {'loss': 0.074, 'grad_norm': 1.2036255598068237, 'learning_rate': 3.9292002736450764e-05, 'epoch': 1.12} 11%|█ | 1124/10000 [1:47:05<13:36:12, 5.52s/it][2025-06-19 15:16:49,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:16:49,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.37 | bwd_microstep: 3333.31 | bwd_inner_microstep: 3332.20 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.80 [2025-06-19 15:16:49,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.37 | bwd: 3333.33 | bwd_inner: 3332.20 | bwd_allreduce: 1.07 | step: 7.80 11%|█▏ | 1125/10000 [1:47:10<13:34:13, 5.50s/it] {'loss': 0.1138, 'grad_norm': 1.0514097213745117, 'learning_rate': 3.929029349597318e-05, 'epoch': 1.12} 11%|█▏ | 1125/10000 [1:47:10<13:34:13, 5.50s/it][2025-06-19 15:16:55,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:16:55,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.53 | bwd_microstep: 3384.22 | bwd_inner_microstep: 3383.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-19 15:16:55,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.53 | bwd: 3384.23 | bwd_inner: 3383.42 | bwd_allreduce: 0.77 | step: 7.17 11%|█▏ | 1126/10000 [1:47:16<13:36:40, 5.52s/it] {'loss': 0.1216, 'grad_norm': 1.1080501079559326, 'learning_rate': 3.928858223203333e-05, 'epoch': 1.13} 11%|█▏ | 1126/10000 [1:47:16<13:36:40, 5.52s/it][2025-06-19 15:17:00,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:17:00,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.90 | bwd_microstep: 3331.85 | 
bwd_inner_microstep: 3331.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 15:17:00,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.90 | bwd: 3331.86 | bwd_inner: 3331.07 | bwd_allreduce: 0.75 | step: 6.56 11%|█▏ | 1127/10000 [1:47:21<13:34:38, 5.51s/it] {'loss': 0.2388, 'grad_norm': 0.9735575914382935, 'learning_rate': 3.928686894481074e-05, 'epoch': 1.13} 11%|█▏ | 1127/10000 [1:47:21<13:34:38, 5.51s/it][2025-06-19 15:17:06,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:17:06,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.46 | bwd_microstep: 3322.38 | bwd_inner_microstep: 3321.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 15:17:06,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.46 | bwd: 3322.39 | bwd_inner: 3321.59 | bwd_allreduce: 0.76 | step: 6.60 11%|█▏ | 1128/10000 [1:47:27<13:32:51, 5.50s/it] {'loss': 0.0678, 'grad_norm': 0.44981998205184937, 'learning_rate': 3.9285153634485106e-05, 'epoch': 1.13} 11%|█▏ | 1128/10000 [1:47:27<13:32:51, 5.50s/it][2025-06-19 15:17:11,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:17:11,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.47 | bwd_microstep: 3322.32 | bwd_inner_microstep: 3321.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 15:17:11,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.47 | bwd: 3322.33 | bwd_inner: 3321.53 | bwd_allreduce: 0.75 | step: 6.54 11%|█▏ | 1129/10000 [1:47:32<13:31:50, 5.49s/it] {'loss': 0.1036, 'grad_norm': 0.9004675149917603, 'learning_rate': 3.9283436301236354e-05, 'epoch': 1.13} 11%|█▏ | 1129/10000 [1:47:32<13:31:50, 5.49s/it][2025-06-19 15:17:17,165] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:17:17,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.30 | bwd_microstep: 3315.78 | bwd_inner_microstep: 3314.89 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.10 [2025-06-19 15:17:17,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.30 | bwd: 3315.79 | bwd_inner: 3314.89 | bwd_allreduce: 0.86 | step: 7.11 11%|█▏ | 1130/10000 [1:47:37<13:30:36, 5.48s/it] {'loss': 0.1773, 'grad_norm': 1.0437755584716797, 'learning_rate': 3.928171694524463e-05, 'epoch': 1.13} 11%|█▏ | 1130/10000 [1:47:37<13:30:36, 5.48s/it][2025-06-19 15:17:22,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:17:22,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.79 | bwd_microstep: 3327.10 | bwd_inner_microstep: 3326.17 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.91 [2025-06-19 15:17:22,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.79 | bwd: 3327.12 | bwd_inner: 3326.17 | bwd_allreduce: 0.90 | step: 6.92 11%|█▏ | 1131/10000 [1:47:43<13:30:18, 5.48s/it] {'loss': 0.1097, 'grad_norm': 0.6566056609153748, 'learning_rate': 3.92799955666903e-05, 'epoch': 1.13} 11%|█▏ | 1131/10000 [1:47:43<13:30:18, 5.48s/it][2025-06-19 15:17:28,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:17:28,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.07 | bwd_microstep: 3310.81 | bwd_inner_microstep: 3310.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 15:17:28,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.07 | bwd: 3310.82 | bwd_inner: 3310.03 | bwd_allreduce: 0.75 | step: 6.58 11%|█▏ | 1132/10000 [1:47:48<13:28:52, 5.47s/it] {'loss': 
0.1579, 'grad_norm': 1.3385934829711914, 'learning_rate': 3.927827216575391e-05, 'epoch': 1.13} 11%|█▏ | 1132/10000 [1:47:48<13:28:52, 5.47s/it][2025-06-19 15:17:33,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 15:17:33,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.27 | bwd_microstep: 3319.07 | bwd_inner_microstep: 3318.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.51 [2025-06-19 15:17:33,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.27 | bwd: 3319.08 | bwd_inner: 3318.29 | bwd_allreduce: 0.75 | step: 6.51 11%|█▏ | 1133/10000 [1:47:54<13:28:13, 5.47s/it] {'loss': 0.1591, 'grad_norm': 0.9049662947654724, 'learning_rate': 3.9276546742616236e-05, 'epoch': 1.13} 11%|█▏ | 1133/10000 [1:47:54<13:28:13, 5.47s/it][2025-06-19 15:17:39,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:17:39,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.73 | bwd_microstep: 3317.69 | bwd_inner_microstep: 3316.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.54 [2025-06-19 15:17:39,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.73 | bwd: 3317.70 | bwd_inner: 3316.90 | bwd_allreduce: 0.77 | step: 6.54 11%|█▏ | 1134/10000 [1:47:59<13:28:08, 5.47s/it] {'loss': 0.0816, 'grad_norm': 0.46161356568336487, 'learning_rate': 3.9274819297458283e-05, 'epoch': 1.13} 11%|█▏ | 1134/10000 [1:47:59<13:28:08, 5.47s/it][2025-06-19 15:17:44,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:17:44,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.12 | bwd_microstep: 3326.60 | bwd_inner_microstep: 3325.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.58 
[2025-06-19 15:17:44,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.12 | bwd: 3326.61 | bwd_inner: 3325.81 | bwd_allreduce: 0.76 | step: 6.57 11%|█▏ | 1135/10000 [1:48:05<13:28:07, 5.47s/it] {'loss': 0.065, 'grad_norm': 0.3936381936073303, 'learning_rate': 3.927308983046124e-05, 'epoch': 1.14} 11%|█▏ | 1135/10000 [1:48:05<13:28:07, 5.47s/it][2025-06-19 15:17:50,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 15:17:50,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.18 | bwd_microstep: 3377.39 | bwd_inner_microstep: 3376.19 | bwd_allreduce_microstep: 1.13 | step_microstep: 7.69 [2025-06-19 15:17:50,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.18 | bwd: 3377.41 | bwd_inner: 3376.19 | bwd_allreduce: 1.16 | step: 7.70 11%|█▏ | 1136/10000 [1:48:10<13:32:10, 5.50s/it] {'loss': 0.1541, 'grad_norm': 0.9454352259635925, 'learning_rate': 3.9271358341806515e-05, 'epoch': 1.14} 11%|█▏ | 1136/10000 [1:48:10<13:32:10, 5.50s/it][2025-06-19 15:17:55,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:17:55,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.37 | bwd_microstep: 3331.57 | bwd_inner_microstep: 3330.71 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.97 [2025-06-19 15:17:55,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.37 | bwd: 3331.59 | bwd_inner: 3330.71 | bwd_allreduce: 0.82 | step: 6.98 11%|█▏ | 1137/10000 [1:48:16<13:32:26, 5.50s/it] {'loss': 0.0837, 'grad_norm': 0.5472193360328674, 'learning_rate': 3.926962483167575e-05, 'epoch': 1.14} 11%|█▏ | 1137/10000 [1:48:16<13:32:26, 5.50s/it][2025-06-19 15:18:01,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 
[2025-06-19 15:18:01,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.43 | bwd_microstep: 3324.45 | bwd_inner_microstep: 3323.51 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.32 [2025-06-19 15:18:01,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.43 | bwd: 3324.46 | bwd_inner: 3323.51 | bwd_allreduce: 0.90 | step: 7.33 11%|█▏ | 1138/10000 [1:48:21<13:31:09, 5.49s/it] {'loss': 0.1164, 'grad_norm': 0.6317402124404907, 'learning_rate': 3.926788930025078e-05, 'epoch': 1.14} 11%|█▏ | 1138/10000 [1:48:21<13:31:09, 5.49s/it][2025-06-19 15:18:06,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 15:18:06,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.54 | bwd_microstep: 3317.37 | bwd_inner_microstep: 3316.59 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 15:18:06,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.54 | bwd: 3317.39 | bwd_inner: 3316.59 | bwd_allreduce: 0.75 | step: 6.61 11%|█▏ | 1139/10000 [1:48:27<13:29:58, 5.48s/it] {'loss': 0.0659, 'grad_norm': 0.6182806491851807, 'learning_rate': 3.926615174771364e-05, 'epoch': 1.14} 11%|█▏ | 1139/10000 [1:48:27<13:29:58, 5.48s/it][2025-06-19 15:18:11,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:18:11,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.19 | bwd_microstep: 3324.88 | bwd_inner_microstep: 3323.96 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.90 [2025-06-19 15:18:11,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.19 | bwd: 3324.89 | bwd_inner: 3323.96 | bwd_allreduce: 0.89 | step: 6.89 11%|█▏ | 1140/10000 [1:48:32<13:29:19, 5.48s/it] {'loss': 0.0526, 'grad_norm': 0.3239014446735382, 'learning_rate': 3.926441217424659e-05, 'epoch': 
1.14} 11%|█▏ | 1140/10000 [1:48:32<13:29:19, 5.48s/it][2025-06-19 15:18:17,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:18:17,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.55 | bwd_microstep: 3378.95 | bwd_inner_microstep: 3377.98 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.07 [2025-06-19 15:18:17,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.55 | bwd: 3378.97 | bwd_inner: 3377.98 | bwd_allreduce: 0.94 | step: 7.08 11%|█▏ | 1141/10000 [1:48:38<13:32:47, 5.50s/it] {'loss': 0.0604, 'grad_norm': 0.32904472947120667, 'learning_rate': 3.926267058003213e-05, 'epoch': 1.14} 11%|█▏ | 1141/10000 [1:48:38<13:32:47, 5.50s/it][2025-06-19 15:18:23,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:18:23,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.74 | bwd_microstep: 3313.52 | bwd_inner_microstep: 3312.63 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.86 [2025-06-19 15:18:23,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.74 | bwd: 3313.53 | bwd_inner: 3312.63 | bwd_allreduce: 0.86 | step: 6.87 11%|█▏ | 1142/10000 [1:48:43<13:30:59, 5.49s/it] {'loss': 0.1306, 'grad_norm': 1.3065555095672607, 'learning_rate': 3.926092696525291e-05, 'epoch': 1.14} 11%|█▏ | 1142/10000 [1:48:43<13:30:59, 5.49s/it][2025-06-19 15:18:28,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:18:28,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.13 | bwd_microstep: 3322.33 | bwd_inner_microstep: 3321.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 15:18:28,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.13 | bwd: 
3322.34 | bwd_inner: 3321.55 | bwd_allreduce: 0.75 | step: 6.56 11%|█▏ | 1143/10000 [1:48:49<13:30:05, 5.49s/it] {'loss': 0.0764, 'grad_norm': 0.3714255094528198, 'learning_rate': 3.925918133009185e-05, 'epoch': 1.14} 11%|█▏ | 1143/10000 [1:48:49<13:30:05, 5.49s/it][2025-06-19 15:18:33,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:18:33,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.65 | bwd_microstep: 3326.47 | bwd_inner_microstep: 3325.66 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 15:18:33,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.65 | bwd: 3326.49 | bwd_inner: 3325.66 | bwd_allreduce: 0.78 | step: 7.03 11%|█▏ | 1144/10000 [1:48:54<13:29:27, 5.48s/it] {'loss': 0.1139, 'grad_norm': 0.622846245765686, 'learning_rate': 3.925743367473206e-05, 'epoch': 1.14} 11%|█▏ | 1144/10000 [1:48:54<13:29:27, 5.48s/it][2025-06-19 15:18:39,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:18:39,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.19 | bwd_microstep: 3325.26 | bwd_inner_microstep: 3324.36 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.05 [2025-06-19 15:18:39,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.19 | bwd: 3325.28 | bwd_inner: 3324.36 | bwd_allreduce: 0.87 | step: 7.05 11%|█▏ | 1145/10000 [1:49:00<13:29:08, 5.48s/it] {'loss': 0.0656, 'grad_norm': 0.7123352885246277, 'learning_rate': 3.9255683999356845e-05, 'epoch': 1.15} 11%|█▏ | 1145/10000 [1:49:00<13:29:08, 5.48s/it][2025-06-19 15:18:44,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 15:18:44,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.39 
| bwd_microstep: 3324.22 | bwd_inner_microstep: 3323.01 | bwd_allreduce_microstep: 1.14 | step_microstep: 8.01 [2025-06-19 15:18:44,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.39 | bwd: 3324.24 | bwd_inner: 3323.01 | bwd_allreduce: 1.17 | step: 8.02 11%|█▏ | 1146/10000 [1:49:05<13:29:07, 5.48s/it] {'loss': 0.1396, 'grad_norm': 1.0731595754623413, 'learning_rate': 3.925393230414975e-05, 'epoch': 1.15} 11%|█▏ | 1146/10000 [1:49:05<13:29:07, 5.48s/it][2025-06-19 15:18:50,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:18:50,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.69 | bwd_microstep: 3366.68 | bwd_inner_microstep: 3365.66 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.07 [2025-06-19 15:18:50,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.69 | bwd: 3366.70 | bwd_inner: 3365.66 | bwd_allreduce: 0.98 | step: 7.07 11%|█▏ | 1147/10000 [1:49:11<13:31:46, 5.50s/it] {'loss': 0.0685, 'grad_norm': 0.42701059579849243, 'learning_rate': 3.925217858929451e-05, 'epoch': 1.15} 11%|█▏ | 1147/10000 [1:49:11<13:31:46, 5.50s/it][2025-06-19 15:18:55,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:18:55,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.63 | bwd_microstep: 3312.47 | bwd_inner_microstep: 3311.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 15:18:55,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.63 | bwd: 3312.48 | bwd_inner: 3311.69 | bwd_allreduce: 0.75 | step: 6.52 11%|█▏ | 1148/10000 [1:49:16<13:29:45, 5.49s/it] {'loss': 0.1235, 'grad_norm': 0.6714674234390259, 'learning_rate': 3.92504228549751e-05, 'epoch': 1.15} 11%|█▏ | 1148/10000 [1:49:16<13:29:45, 5.49s/it][2025-06-19 15:19:01,384] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:19:01,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.19 | bwd_microstep: 3315.25 | bwd_inner_microstep: 3314.34 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.93 [2025-06-19 15:19:01,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.19 | bwd: 3315.26 | bwd_inner: 3314.34 | bwd_allreduce: 0.89 | step: 6.93 11%|█▏ | 1149/10000 [1:49:22<13:28:29, 5.48s/it] {'loss': 0.1498, 'grad_norm': 0.8261744976043701, 'learning_rate': 3.924866510137567e-05, 'epoch': 1.15} 11%|█▏ | 1149/10000 [1:49:22<13:28:29, 5.48s/it][2025-06-19 15:19:06,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:19:06,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.70 | bwd_microstep: 3315.71 | bwd_inner_microstep: 3314.70 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.19 [2025-06-19 15:19:06,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.70 | bwd: 3315.73 | bwd_inner: 3314.70 | bwd_allreduce: 0.98 | step: 7.19 12%|█▏ | 1150/10000 [1:49:27<13:27:59, 5.48s/it] {'loss': 0.0761, 'grad_norm': 0.9228799343109131, 'learning_rate': 3.924690532868061e-05, 'epoch': 1.15} 12%|█▏ | 1150/10000 [1:49:27<13:27:59, 5.48s/it][2025-06-19 15:19:12,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:19:12,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.81 | bwd_microstep: 3307.39 | bwd_inner_microstep: 3306.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 15:19:12,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.81 | bwd: 3307.41 | bwd_inner: 3306.61 | bwd_allreduce: 0.76 | step: 6.61 12%|█▏ | 1151/10000 
[1:49:33<13:27:22, 5.47s/it] {'loss': 0.1121, 'grad_norm': 0.621994137763977, 'learning_rate': 3.924514353707451e-05, 'epoch': 1.15} 12%|█▏ | 1151/10000 [1:49:33<13:27:22, 5.47s/it][2025-06-19 15:19:17,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:19:17,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.20 | bwd_microstep: 3316.62 | bwd_inner_microstep: 3315.73 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.83 [2025-06-19 15:19:17,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.20 | bwd: 3316.64 | bwd_inner: 3315.73 | bwd_allreduce: 0.85 | step: 7.83 12%|█▏ | 1152/10000 [1:49:38<13:26:46, 5.47s/it] {'loss': 0.0492, 'grad_norm': 0.30580562353134155, 'learning_rate': 3.9243379726742167e-05, 'epoch': 1.15} 12%|█▏ | 1152/10000 [1:49:38<13:26:46, 5.47s/it][2025-06-19 15:19:23,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:19:23,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.12 | bwd_microstep: 3319.11 | bwd_inner_microstep: 3318.29 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.28 [2025-06-19 15:19:23,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.12 | bwd: 3319.12 | bwd_inner: 3318.29 | bwd_allreduce: 0.79 | step: 7.28 12%|█▏ | 1153/10000 [1:49:44<13:26:53, 5.47s/it] {'loss': 0.0991, 'grad_norm': 0.586516261100769, 'learning_rate': 3.9241613897868595e-05, 'epoch': 1.15} 12%|█▏ | 1153/10000 [1:49:44<13:26:53, 5.47s/it][2025-06-19 15:19:28,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:19:28,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.33 | bwd_microstep: 3317.31 | bwd_inner_microstep: 3316.37 | bwd_allreduce_microstep: 
0.89 | step_microstep: 7.18 [2025-06-19 15:19:28,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.33 | bwd: 3317.32 | bwd_inner: 3316.37 | bwd_allreduce: 0.91 | step: 7.19 12%|█▏ | 1154/10000 [1:49:49<13:26:35, 5.47s/it] {'loss': 0.1505, 'grad_norm': 1.1112151145935059, 'learning_rate': 3.923984605063904e-05, 'epoch': 1.15} 12%|█▏ | 1154/10000 [1:49:49<13:26:35, 5.47s/it][2025-06-19 15:19:34,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:19:34,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.80 | bwd_microstep: 3360.16 | bwd_inner_microstep: 3359.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 15:19:34,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.80 | bwd: 3360.18 | bwd_inner: 3359.37 | bwd_allreduce: 0.76 | step: 6.94 12%|█▏ | 1155/10000 [1:49:55<13:29:14, 5.49s/it] {'loss': 0.0654, 'grad_norm': 0.5380552411079407, 'learning_rate': 3.923807618523893e-05, 'epoch': 1.16} 12%|█▏ | 1155/10000 [1:49:55<13:29:14, 5.49s/it][2025-06-19 15:19:39,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:19:39,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.29 | bwd_microstep: 3326.35 | bwd_inner_microstep: 3325.55 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 15:19:39,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.29 | bwd: 3326.37 | bwd_inner: 3325.55 | bwd_allreduce: 0.77 | step: 7.12 12%|█▏ | 1156/10000 [1:50:00<13:28:26, 5.48s/it] {'loss': 0.1151, 'grad_norm': 0.5837368965148926, 'learning_rate': 3.9236304301853906e-05, 'epoch': 1.16} 12%|█▏ | 1156/10000 [1:50:00<13:28:26, 5.48s/it][2025-06-19 15:19:45,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.51 | optimizer_step: 2.73 [2025-06-19 15:19:45,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.62 | bwd_microstep: 3374.39 | bwd_inner_microstep: 3373.58 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 15:19:45,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.62 | bwd: 3374.40 | bwd_inner: 3373.58 | bwd_allreduce: 0.78 | step: 7.08 12%|█▏ | 1157/10000 [1:50:06<13:31:06, 5.50s/it] {'loss': 0.0878, 'grad_norm': 0.543116569519043, 'learning_rate': 3.923453040066985e-05, 'epoch': 1.16} 12%|█▏ | 1157/10000 [1:50:06<13:31:06, 5.50s/it][2025-06-19 15:19:50,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:19:50,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.14 | bwd_microstep: 3363.23 | bwd_inner_microstep: 3362.34 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.17 [2025-06-19 15:19:50,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.14 | bwd: 3363.25 | bwd_inner: 3362.34 | bwd_allreduce: 0.85 | step: 7.17 12%|█▏ | 1158/10000 [1:50:11<13:32:16, 5.51s/it] {'loss': 0.0754, 'grad_norm': 0.7960894107818604, 'learning_rate': 3.923275448187282e-05, 'epoch': 1.16} 12%|█▏ | 1158/10000 [1:50:11<13:32:16, 5.51s/it][2025-06-19 15:19:56,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:19:56,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.30 | bwd_microstep: 3316.77 | bwd_inner_microstep: 3315.86 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.84 [2025-06-19 15:19:56,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.30 | bwd: 3316.79 | bwd_inner: 3315.86 | bwd_allreduce: 0.88 | step: 6.85 12%|█▏ | 1159/10000 [1:50:17<13:29:34, 5.49s/it] {'loss': 0.0432, 'grad_norm': 0.2162005603313446, 'learning_rate': 
3.92309765456491e-05, 'epoch': 1.16} 12%|█▏ | 1159/10000 [1:50:17<13:29:34, 5.49s/it][2025-06-19 15:20:01,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:20:01,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.98 | bwd_microstep: 3315.25 | bwd_inner_microstep: 3314.43 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.33 [2025-06-19 15:20:01,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.98 | bwd: 3315.26 | bwd_inner: 3314.43 | bwd_allreduce: 0.79 | step: 7.33 12%|█▏ | 1160/10000 [1:50:22<13:28:18, 5.49s/it] {'loss': 0.1216, 'grad_norm': 0.7928875088691711, 'learning_rate': 3.9229196592185205e-05, 'epoch': 1.16} 12%|█▏ | 1160/10000 [1:50:22<13:28:18, 5.49s/it][2025-06-19 15:20:07,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:20:07,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.79 | bwd_microstep: 3355.90 | bwd_inner_microstep: 3354.95 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.18 [2025-06-19 15:20:07,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.79 | bwd: 3355.92 | bwd_inner: 3354.95 | bwd_allreduce: 0.91 | step: 7.18 12%|█▏ | 1161/10000 [1:50:28<13:30:02, 5.50s/it] {'loss': 0.11, 'grad_norm': 0.906703531742096, 'learning_rate': 3.9227414621667826e-05, 'epoch': 1.16} 12%|█▏ | 1161/10000 [1:50:28<13:30:02, 5.50s/it][2025-06-19 15:20:12,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:20:12,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.05 | bwd_microstep: 3370.61 | bwd_inner_microstep: 3369.80 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.36 [2025-06-19 15:20:12,805] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2127.05 | bwd: 3370.62 | bwd_inner: 3369.80 | bwd_allreduce: 0.78 | step: 7.36 12%|█▏ | 1162/10000 [1:50:33<13:31:57, 5.51s/it] {'loss': 0.0567, 'grad_norm': 0.4313581585884094, 'learning_rate': 3.922563063428389e-05, 'epoch': 1.16} 12%|█▏ | 1162/10000 [1:50:33<13:31:57, 5.51s/it][2025-06-19 15:20:18,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 15:20:18,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.46 | bwd_microstep: 3369.51 | bwd_inner_microstep: 3368.53 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.81 [2025-06-19 15:20:18,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.46 | bwd: 3369.53 | bwd_inner: 3368.53 | bwd_allreduce: 0.95 | step: 7.81 12%|█▏ | 1163/10000 [1:50:39<13:32:51, 5.52s/it] {'loss': 0.1355, 'grad_norm': 1.1727808713912964, 'learning_rate': 3.9223844630220535e-05, 'epoch': 1.16} 12%|█▏ | 1163/10000 [1:50:39<13:32:51, 5.52s/it][2025-06-19 15:20:23,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 15:20:23,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.55 | bwd_microstep: 3315.37 | bwd_inner_microstep: 3314.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-19 15:20:23,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.55 | bwd: 3315.38 | bwd_inner: 3314.58 | bwd_allreduce: 0.76 | step: 6.81 12%|█▏ | 1164/10000 [1:50:44<13:30:06, 5.50s/it] {'loss': 0.085, 'grad_norm': 0.800372302532196, 'learning_rate': 3.9222056609665094e-05, 'epoch': 1.16} 12%|█▏ | 1164/10000 [1:50:44<13:30:06, 5.50s/it][2025-06-19 15:20:29,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:20:29,259] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2099.22 | bwd_microstep: 3320.92 | bwd_inner_microstep: 3319.95 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.83 [2025-06-19 15:20:29,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.22 | bwd: 3320.94 | bwd_inner: 3319.95 | bwd_allreduce: 0.94 | step: 7.83 12%|█▏ | 1165/10000 [1:50:50<13:28:18, 5.49s/it] {'loss': 0.1155, 'grad_norm': 1.0449179410934448, 'learning_rate': 3.922026657280513e-05, 'epoch': 1.17} 12%|█▏ | 1165/10000 [1:50:50<13:28:18, 5.49s/it][2025-06-19 15:20:34,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:20:34,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.23 | bwd_microstep: 3317.31 | bwd_inner_microstep: 3316.45 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.84 [2025-06-19 15:20:34,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.23 | bwd: 3317.32 | bwd_inner: 3316.45 | bwd_allreduce: 0.83 | step: 6.84 12%|█▏ | 1166/10000 [1:50:55<13:27:24, 5.48s/it] {'loss': 0.0862, 'grad_norm': 0.4542693495750427, 'learning_rate': 3.92184745198284e-05, 'epoch': 1.17} 12%|█▏ | 1166/10000 [1:50:55<13:27:24, 5.48s/it][2025-06-19 15:20:40,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:20:40,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.83 | bwd_microstep: 3313.03 | bwd_inner_microstep: 3312.23 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.28 [2025-06-19 15:20:40,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.83 | bwd: 3313.05 | bwd_inner: 3312.23 | bwd_allreduce: 0.78 | step: 7.28 12%|█▏ | 1167/10000 [1:51:00<13:26:25, 5.48s/it] {'loss': 0.0643, 'grad_norm': 0.6496155858039856, 'learning_rate': 3.9216680450922894e-05, 'epoch': 1.17} 12%|█▏ | 1167/10000 [1:51:00<13:26:25, 5.48s/it][2025-06-19 
15:20:45,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:20:45,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.16 | bwd_microstep: 3370.33 | bwd_inner_microstep: 3369.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-19 15:20:45,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.16 | bwd: 3370.34 | bwd_inner: 3369.54 | bwd_allreduce: 0.76 | step: 6.74 12%|█▏ | 1168/10000 [1:51:06<13:28:58, 5.50s/it] {'loss': 0.0973, 'grad_norm': 1.0718907117843628, 'learning_rate': 3.9214884366276805e-05, 'epoch': 1.17} 12%|█▏ | 1168/10000 [1:51:06<13:28:58, 5.50s/it][2025-06-19 15:20:51,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:20:51,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.11 | bwd_microstep: 3328.21 | bwd_inner_microstep: 3327.11 | bwd_allreduce_microstep: 1.03 | step_microstep: 8.25 [2025-06-19 15:20:51,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.11 | bwd: 3328.24 | bwd_inner: 3327.11 | bwd_allreduce: 1.06 | step: 8.25 12%|█▏ | 1169/10000 [1:51:12<13:28:09, 5.49s/it] {'loss': 0.0685, 'grad_norm': 1.2614762783050537, 'learning_rate': 3.921308626607851e-05, 'epoch': 1.17} 12%|█▏ | 1169/10000 [1:51:12<13:28:09, 5.49s/it][2025-06-19 15:20:56,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:20:56,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.25 | bwd_microstep: 3325.92 | bwd_inner_microstep: 3325.01 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.93 [2025-06-19 15:20:56,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.25 | bwd: 3325.94 | bwd_inner: 3325.01 | bwd_allreduce: 0.88 | step: 6.94 
12%|█▏ | 1170/10000 [1:51:17<13:27:36, 5.49s/it] {'loss': 0.0984, 'grad_norm': 0.7530120015144348, 'learning_rate': 3.921128615051665e-05, 'epoch': 1.17} 12%|█▏ | 1170/10000 [1:51:17<13:27:36, 5.49s/it][2025-06-19 15:21:02,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:21:02,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.03 | bwd_microstep: 3330.18 | bwd_inner_microstep: 3329.12 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.56 [2025-06-19 15:21:02,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.03 | bwd: 3330.20 | bwd_inner: 3329.12 | bwd_allreduce: 1.02 | step: 7.56 12%|█▏ | 1171/10000 [1:51:22<13:27:46, 5.49s/it] {'loss': 0.1736, 'grad_norm': 1.0485057830810547, 'learning_rate': 3.920948401978002e-05, 'epoch': 1.17} 12%|█▏ | 1171/10000 [1:51:22<13:27:46, 5.49s/it][2025-06-19 15:21:07,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:21:07,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.34 | bwd_microstep: 3317.39 | bwd_inner_microstep: 3316.41 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.14 [2025-06-19 15:21:07,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.34 | bwd: 3317.40 | bwd_inner: 3316.41 | bwd_allreduce: 0.95 | step: 7.15 12%|█▏ | 1172/10000 [1:51:28<13:26:47, 5.48s/it] {'loss': 0.2282, 'grad_norm': 1.1032674312591553, 'learning_rate': 3.920767987405768e-05, 'epoch': 1.17} 12%|█▏ | 1172/10000 [1:51:28<13:26:47, 5.48s/it][2025-06-19 15:21:13,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:21:13,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.12 | bwd_microstep: 3370.97 | bwd_inner_microstep: 3369.85 | 
bwd_allreduce_microstep: 1.05 | step_microstep: 7.84 [2025-06-19 15:21:13,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.12 | bwd: 3370.99 | bwd_inner: 3369.85 | bwd_allreduce: 1.08 | step: 7.85 12%|█▏ | 1173/10000 [1:51:34<13:29:41, 5.50s/it] {'loss': 0.0855, 'grad_norm': 0.5573028326034546, 'learning_rate': 3.9205873713538864e-05, 'epoch': 1.17} 12%|█▏ | 1173/10000 [1:51:34<13:29:41, 5.50s/it][2025-06-19 15:21:18,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-19 15:21:18,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.17 | bwd_microstep: 3319.35 | bwd_inner_microstep: 3318.32 | bwd_allreduce_microstep: 0.96 | step_microstep: 8.21 [2025-06-19 15:21:18,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.17 | bwd: 3319.37 | bwd_inner: 3318.32 | bwd_allreduce: 0.99 | step: 8.22 12%|█▏ | 1174/10000 [1:51:39<13:28:26, 5.50s/it] {'loss': 0.127, 'grad_norm': 0.9002314805984497, 'learning_rate': 3.9204065538413034e-05, 'epoch': 1.17} 12%|█▏ | 1174/10000 [1:51:39<13:28:26, 5.50s/it][2025-06-19 15:21:24,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:21:24,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.26 | bwd_microstep: 3317.11 | bwd_inner_microstep: 3316.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 15:21:24,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.26 | bwd: 3317.12 | bwd_inner: 3316.31 | bwd_allreduce: 0.77 | step: 6.80 12%|█▏ | 1175/10000 [1:51:44<13:27:15, 5.49s/it] {'loss': 0.0746, 'grad_norm': 0.6853030323982239, 'learning_rate': 3.920225534886986e-05, 'epoch': 1.18} 12%|█▏ | 1175/10000 [1:51:44<13:27:15, 5.49s/it][2025-06-19 15:21:29,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 15:21:29,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.99 | bwd_microstep: 3371.38 | bwd_inner_microstep: 3370.30 | bwd_allreduce_microstep: 1.02 | step_microstep: 8.07 [2025-06-19 15:21:29,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.99 | bwd: 3371.40 | bwd_inner: 3370.30 | bwd_allreduce: 1.04 | step: 8.08 12%|█▏ | 1176/10000 [1:51:50<13:29:54, 5.51s/it] {'loss': 0.0716, 'grad_norm': 0.5730908513069153, 'learning_rate': 3.920044314509921e-05, 'epoch': 1.18} 12%|█▏ | 1176/10000 [1:51:50<13:29:54, 5.51s/it][2025-06-19 15:21:35,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 15:21:35,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.39 | bwd_microstep: 3322.54 | bwd_inner_microstep: 3321.49 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.26 [2025-06-19 15:21:35,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.39 | bwd: 3322.57 | bwd_inner: 3321.49 | bwd_allreduce: 1.01 | step: 8.27 12%|█▏ | 1177/10000 [1:51:55<13:29:11, 5.50s/it] {'loss': 0.0789, 'grad_norm': 0.6347646117210388, 'learning_rate': 3.9198628927291194e-05, 'epoch': 1.18} 12%|█▏ | 1177/10000 [1:51:56<13:29:11, 5.50s/it][2025-06-19 15:21:40,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:21:40,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.18 | bwd_microstep: 3398.44 | bwd_inner_microstep: 3397.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 15:21:40,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.18 | bwd: 3398.45 | bwd_inner: 3397.65 | bwd_allreduce: 0.76 | step: 6.61 12%|█▏ | 1178/10000 [1:52:01<13:32:22, 5.53s/it] {'loss': 0.0739, 'grad_norm': 
0.8118216395378113, 'learning_rate': 3.919681269563611e-05, 'epoch': 1.18} 12%|█▏ | 1178/10000 [1:52:01<13:32:22, 5.53s/it][2025-06-19 15:21:46,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:21:46,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.13 | bwd_microstep: 3370.28 | bwd_inner_microstep: 3369.45 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.52 [2025-06-19 15:21:46,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.13 | bwd: 3370.31 | bwd_inner: 3369.45 | bwd_allreduce: 0.80 | step: 7.53 12%|█▏ | 1179/10000 [1:52:07<13:33:13, 5.53s/it] {'loss': 0.0743, 'grad_norm': 0.39170053601264954, 'learning_rate': 3.919499445032447e-05, 'epoch': 1.18} 12%|█▏ | 1179/10000 [1:52:07<13:33:13, 5.53s/it][2025-06-19 15:21:51,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:21:51,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.61 | bwd_microstep: 3383.57 | bwd_inner_microstep: 3382.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 15:21:51,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.61 | bwd: 3383.59 | bwd_inner: 3382.77 | bwd_allreduce: 0.77 | step: 7.05 12%|█▏ | 1180/10000 [1:52:12<13:34:18, 5.54s/it] {'loss': 0.1299, 'grad_norm': 0.8367404341697693, 'learning_rate': 3.919317419154699e-05, 'epoch': 1.18} 12%|█▏ | 1180/10000 [1:52:12<13:34:18, 5.54s/it][2025-06-19 15:21:57,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:21:57,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.21 | bwd_microstep: 3321.89 | bwd_inner_microstep: 3321.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.50 [2025-06-19 15:21:57,349] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.21 | bwd: 3321.91 | bwd_inner: 3321.11 | bwd_allreduce: 0.76 | step: 6.50 12%|█▏ | 1181/10000 [1:52:18<13:31:01, 5.52s/it] {'loss': 0.0606, 'grad_norm': 0.5243493318557739, 'learning_rate': 3.919135191949462e-05, 'epoch': 1.18} 12%|█▏ | 1181/10000 [1:52:18<13:31:01, 5.52s/it][2025-06-19 15:22:02,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:22:02,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.79 | bwd_microstep: 3328.42 | bwd_inner_microstep: 3327.64 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 15:22:02,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.79 | bwd: 3328.44 | bwd_inner: 3327.64 | bwd_allreduce: 0.75 | step: 6.55 12%|█▏ | 1182/10000 [1:52:23<13:29:02, 5.50s/it] {'loss': 0.1281, 'grad_norm': 0.6284741163253784, 'learning_rate': 3.918952763435851e-05, 'epoch': 1.18} 12%|█▏ | 1182/10000 [1:52:23<13:29:02, 5.50s/it][2025-06-19 15:22:08,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 15:22:08,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.51 | bwd_microstep: 3329.41 | bwd_inner_microstep: 3328.48 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.58 [2025-06-19 15:22:08,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.51 | bwd: 3329.42 | bwd_inner: 3328.48 | bwd_allreduce: 0.90 | step: 7.58 12%|█▏ | 1183/10000 [1:52:29<13:27:35, 5.50s/it] {'loss': 0.1008, 'grad_norm': 0.9065858125686646, 'learning_rate': 3.9187701336330004e-05, 'epoch': 1.18} 12%|█▏ | 1183/10000 [1:52:29<13:27:35, 5.50s/it][2025-06-19 15:22:13,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:22:13,779] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.81 | bwd_microstep: 3328.78 | bwd_inner_microstep: 3328.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 15:22:13,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.81 | bwd: 3328.80 | bwd_inner: 3328.00 | bwd_allreduce: 0.76 | step: 6.67 12%|█▏ | 1184/10000 [1:52:34<13:26:51, 5.49s/it] {'loss': 0.1104, 'grad_norm': 0.9155588746070862, 'learning_rate': 3.918587302560068e-05, 'epoch': 1.18} 12%|█▏ | 1184/10000 [1:52:34<13:26:51, 5.49s/it][2025-06-19 15:22:19,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:22:19,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.36 | bwd_microstep: 3337.90 | bwd_inner_microstep: 3337.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 15:22:19,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.36 | bwd: 3337.91 | bwd_inner: 3337.12 | bwd_allreduce: 0.75 | step: 6.61 12%|█▏ | 1185/10000 [1:52:40<13:26:18, 5.49s/it] {'loss': 0.0545, 'grad_norm': 0.27088141441345215, 'learning_rate': 3.9184042702362326e-05, 'epoch': 1.19} 12%|█▏ | 1185/10000 [1:52:40<13:26:18, 5.49s/it][2025-06-19 15:22:24,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:22:24,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.39 | bwd_microstep: 3381.48 | bwd_inner_microstep: 3380.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.43 [2025-06-19 15:22:24,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.39 | bwd: 3381.49 | bwd_inner: 3380.67 | bwd_allreduce: 0.78 | step: 7.44 12%|█▏ | 1186/10000 [1:52:45<13:29:03, 5.51s/it] {'loss': 0.1075, 'grad_norm': 0.86508709192276, 'learning_rate': 3.9182210366806926e-05, 'epoch': 1.19} 12%|█▏ | 
1186/10000 [1:52:45<13:29:03, 5.51s/it][2025-06-19 15:22:30,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:22:30,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.86 | bwd_microstep: 3318.55 | bwd_inner_microstep: 3317.73 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.92 [2025-06-19 15:22:30,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.86 | bwd: 3318.57 | bwd_inner: 3317.73 | bwd_allreduce: 0.79 | step: 6.92 12%|█▏ | 1187/10000 [1:52:51<13:27:25, 5.50s/it] {'loss': 0.1298, 'grad_norm': 0.8949330449104309, 'learning_rate': 3.9180376019126684e-05, 'epoch': 1.19} 12%|█▏ | 1187/10000 [1:52:51<13:27:25, 5.50s/it][2025-06-19 15:22:35,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:22:35,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.94 | bwd_microstep: 3373.60 | bwd_inner_microstep: 3372.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 15:22:35,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.94 | bwd: 3373.61 | bwd_inner: 3372.81 | bwd_allreduce: 0.76 | step: 6.59 12%|█▏ | 1188/10000 [1:52:56<13:29:25, 5.51s/it] {'loss': 0.0928, 'grad_norm': 0.6078923344612122, 'learning_rate': 3.917853965951402e-05, 'epoch': 1.19} 12%|█▏ | 1188/10000 [1:52:56<13:29:25, 5.51s/it][2025-06-19 15:22:41,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:22:41,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.48 | bwd_microstep: 3412.06 | bwd_inner_microstep: 3411.28 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 15:22:41,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.48 | bwd: 3412.07 | 
bwd_inner: 3411.28 | bwd_allreduce: 0.75 | step: 6.59 12%|█▏ | 1189/10000 [1:53:02<13:32:48, 5.53s/it] {'loss': 0.1533, 'grad_norm': 0.9445534944534302, 'learning_rate': 3.9176701288161554e-05, 'epoch': 1.19} 12%|█▏ | 1189/10000 [1:53:02<13:32:48, 5.53s/it][2025-06-19 15:22:46,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 15:22:46,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.55 | bwd_microstep: 3378.70 | bwd_inner_microstep: 3377.71 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.40 [2025-06-19 15:22:46,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.55 | bwd: 3378.71 | bwd_inner: 3377.71 | bwd_allreduce: 0.96 | step: 7.41 12%|█▏ | 1190/10000 [1:53:07<13:33:27, 5.54s/it] {'loss': 0.0545, 'grad_norm': 0.3892096281051636, 'learning_rate': 3.917486090526212e-05, 'epoch': 1.19} 12%|█▏ | 1190/10000 [1:53:07<13:33:27, 5.54s/it][2025-06-19 15:22:52,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:22:52,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.99 | bwd_microstep: 3376.51 | bwd_inner_microstep: 3375.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 15:22:52,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.99 | bwd: 3376.52 | bwd_inner: 3375.71 | bwd_allreduce: 0.77 | step: 6.96 12%|█▏ | 1191/10000 [1:53:13<13:34:16, 5.55s/it] {'loss': 0.0904, 'grad_norm': 0.5384933352470398, 'learning_rate': 3.917301851100878e-05, 'epoch': 1.19} 12%|█▏ | 1191/10000 [1:53:13<13:34:16, 5.55s/it][2025-06-19 15:22:57,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:22:57,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.19 | 
bwd_microstep: 3321.65 | bwd_inner_microstep: 3320.75 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.90 [2025-06-19 15:22:57,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.19 | bwd: 3321.66 | bwd_inner: 3320.75 | bwd_allreduce: 0.87 | step: 6.90 12%|█▏ | 1192/10000 [1:53:18<13:30:32, 5.52s/it] {'loss': 0.0525, 'grad_norm': 0.4427521526813507, 'learning_rate': 3.917117410559478e-05, 'epoch': 1.19} 12%|█▏ | 1192/10000 [1:53:18<13:30:32, 5.52s/it][2025-06-19 15:23:03,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:23:03,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.33 | bwd_microstep: 3383.63 | bwd_inner_microstep: 3382.75 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.88 [2025-06-19 15:23:03,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.33 | bwd: 3383.65 | bwd_inner: 3382.75 | bwd_allreduce: 0.85 | step: 6.88 12%|█▏ | 1193/10000 [1:53:24<13:31:52, 5.53s/it] {'loss': 0.0677, 'grad_norm': 0.5693679451942444, 'learning_rate': 3.9169327689213584e-05, 'epoch': 1.19} 12%|█▏ | 1193/10000 [1:53:24<13:31:52, 5.53s/it][2025-06-19 15:23:09,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:23:09,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.76 | bwd_microstep: 3326.70 | bwd_inner_microstep: 3325.71 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.04 [2025-06-19 15:23:09,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.76 | bwd: 3326.72 | bwd_inner: 3325.71 | bwd_allreduce: 0.96 | step: 7.04 12%|█▏ | 1194/10000 [1:53:29<13:29:23, 5.51s/it] {'loss': 0.1197, 'grad_norm': 0.6513360142707825, 'learning_rate': 3.916747926205889e-05, 'epoch': 1.19} 12%|█▏ | 1194/10000 [1:53:29<13:29:23, 5.51s/it][2025-06-19 15:23:14,503] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:23:14,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.22 | bwd_microstep: 3327.29 | bwd_inner_microstep: 3326.49 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 15:23:14,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.22 | bwd: 3327.31 | bwd_inner: 3326.49 | bwd_allreduce: 0.77 | step: 7.10 12%|█▏ | 1195/10000 [1:53:35<13:27:39, 5.50s/it] {'loss': 0.0906, 'grad_norm': 0.3739613890647888, 'learning_rate': 3.9165628824324576e-05, 'epoch': 1.2} 12%|█▏ | 1195/10000 [1:53:35<13:27:39, 5.50s/it][2025-06-19 15:23:20,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:23:20,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.30 | bwd_microstep: 3377.38 | bwd_inner_microstep: 3376.48 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.97 [2025-06-19 15:23:20,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.30 | bwd: 3377.40 | bwd_inner: 3376.48 | bwd_allreduce: 0.87 | step: 6.97 12%|█▏ | 1196/10000 [1:53:40<13:29:31, 5.52s/it] {'loss': 0.0723, 'grad_norm': 0.42054882645606995, 'learning_rate': 3.916377637620475e-05, 'epoch': 1.2} 12%|█▏ | 1196/10000 [1:53:40<13:29:31, 5.52s/it][2025-06-19 15:23:25,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:23:25,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.90 | bwd_microstep: 3382.23 | bwd_inner_microstep: 3381.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 15:23:25,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.90 | bwd: 3382.24 | bwd_inner: 3381.43 | bwd_allreduce: 0.77 | step: 7.02 12%|█▏ | 1197/10000 
[1:53:46<13:31:21, 5.53s/it] {'loss': 0.0849, 'grad_norm': 0.4711257815361023, 'learning_rate': 3.916192191789373e-05, 'epoch': 1.2} 12%|█▏ | 1197/10000 [1:53:46<13:31:21, 5.53s/it][2025-06-19 15:23:31,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:23:31,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.98 | bwd_microstep: 3330.23 | bwd_inner_microstep: 3329.44 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 15:23:31,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.98 | bwd: 3330.24 | bwd_inner: 3329.44 | bwd_allreduce: 0.76 | step: 6.77 12%|█▏ | 1198/10000 [1:53:51<13:28:54, 5.51s/it] {'loss': 0.061, 'grad_norm': 0.3291938602924347, 'learning_rate': 3.916006544958602e-05, 'epoch': 1.2} 12%|█▏ | 1198/10000 [1:53:51<13:28:54, 5.51s/it][2025-06-19 15:23:36,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 15:23:36,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.51 | bwd_microstep: 3329.52 | bwd_inner_microstep: 3328.60 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.39 [2025-06-19 15:23:36,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.51 | bwd: 3329.53 | bwd_inner: 3328.60 | bwd_allreduce: 0.89 | step: 7.39 12%|█▏ | 1199/10000 [1:53:57<13:27:34, 5.51s/it] {'loss': 0.0609, 'grad_norm': 0.5242598652839661, 'learning_rate': 3.915820697147638e-05, 'epoch': 1.2} 12%|█▏ | 1199/10000 [1:53:57<13:27:34, 5.51s/it][2025-06-19 15:23:42,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.73 [2025-06-19 15:23:42,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.00 | bwd_microstep: 3324.02 | bwd_inner_microstep: 3323.24 | bwd_allreduce_microstep: 0.74 
| step_microstep: 6.84 [2025-06-19 15:23:42,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.00 | bwd: 3324.03 | bwd_inner: 3323.24 | bwd_allreduce: 0.75 | step: 6.84 12%|█▏ | 1200/10000 [1:54:02<13:25:56, 5.50s/it] {'loss': 0.0613, 'grad_norm': 0.4831119477748871, 'learning_rate': 3.915634648375974e-05, 'epoch': 1.2} 12%|█▏ | 1200/10000 [1:54:02<13:25:56, 5.50s/it][2025-06-19 15:23:47,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:23:47,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.24 | bwd_microstep: 3336.83 | bwd_inner_microstep: 3335.98 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.72 [2025-06-19 15:23:47,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.24 | bwd: 3336.86 | bwd_inner: 3335.98 | bwd_allreduce: 0.82 | step: 7.72 12%|█▏ | 1201/10000 [1:54:08<13:25:40, 5.49s/it] {'loss': 0.0843, 'grad_norm': 0.6932913064956665, 'learning_rate': 3.915448398663127e-05, 'epoch': 1.2} 12%|█▏ | 1201/10000 [1:54:08<13:25:40, 5.49s/it][2025-06-19 15:23:53,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:23:53,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.11 | bwd_microstep: 3378.47 | bwd_inner_microstep: 3377.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 15:23:53,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.11 | bwd: 3378.48 | bwd_inner: 3377.69 | bwd_allreduce: 0.75 | step: 6.63 12%|█▏ | 1202/10000 [1:54:13<13:28:20, 5.51s/it] {'loss': 0.2267, 'grad_norm': 1.0174096822738647, 'learning_rate': 3.915261948028632e-05, 'epoch': 1.2} 12%|█▏ | 1202/10000 [1:54:13<13:28:20, 5.51s/it][2025-06-19 15:23:58,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.72 [2025-06-19 15:23:58,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.61 | bwd_microstep: 3398.55 | bwd_inner_microstep: 3397.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 15:23:58,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.61 | bwd: 3398.56 | bwd_inner: 3397.76 | bwd_allreduce: 0.76 | step: 6.73 12%|█▏ | 1203/10000 [1:54:19<13:30:52, 5.53s/it] {'loss': 0.1547, 'grad_norm': 0.976575493812561, 'learning_rate': 3.915075296492048e-05, 'epoch': 1.2} 12%|█▏ | 1203/10000 [1:54:19<13:30:52, 5.53s/it][2025-06-19 15:24:04,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:24:04,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.73 | bwd_microstep: 3376.41 | bwd_inner_microstep: 3375.51 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.01 [2025-06-19 15:24:04,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.73 | bwd: 3376.43 | bwd_inner: 3375.51 | bwd_allreduce: 0.87 | step: 7.01 12%|█▏ | 1204/10000 [1:54:25<13:31:46, 5.54s/it] {'loss': 0.104, 'grad_norm': 0.7636419534683228, 'learning_rate': 3.9148884440729535e-05, 'epoch': 1.2} 12%|█▏ | 1204/10000 [1:54:25<13:31:46, 5.54s/it][2025-06-19 15:24:09,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:24:09,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.95 | bwd_microstep: 3335.22 | bwd_inner_microstep: 3334.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 15:24:09,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.95 | bwd: 3335.23 | bwd_inner: 3334.42 | bwd_allreduce: 0.77 | step: 7.02 12%|█▏ | 1205/10000 [1:54:30<13:29:18, 5.52s/it] {'loss': 0.115, 'grad_norm': 0.465017169713974, 'learning_rate': 
3.914701390790948e-05, 'epoch': 1.21} 12%|█▏ | 1205/10000 [1:54:30<13:29:18, 5.52s/it][2025-06-19 15:24:15,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:24:15,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.48 | bwd_microstep: 3377.77 | bwd_inner_microstep: 3376.85 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.94 [2025-06-19 15:24:15,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.48 | bwd: 3377.79 | bwd_inner: 3376.85 | bwd_allreduce: 0.89 | step: 6.94 12%|█▏ | 1206/10000 [1:54:36<13:30:30, 5.53s/it] {'loss': 0.0624, 'grad_norm': 0.46604108810424805, 'learning_rate': 3.914514136665654e-05, 'epoch': 1.21} 12%|█▏ | 1206/10000 [1:54:36<13:30:30, 5.53s/it][2025-06-19 15:24:20,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:24:20,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.19 | bwd_microstep: 3373.14 | bwd_inner_microstep: 3372.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.55 [2025-06-19 15:24:20,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.19 | bwd: 3373.16 | bwd_inner: 3372.35 | bwd_allreduce: 0.76 | step: 6.56 12%|█▏ | 1207/10000 [1:54:41<13:31:19, 5.54s/it] {'loss': 0.0889, 'grad_norm': 0.7274256944656372, 'learning_rate': 3.9143266817167117e-05, 'epoch': 1.21} 12%|█▏ | 1207/10000 [1:54:41<13:31:19, 5.54s/it][2025-06-19 15:24:26,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:24:26,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.53 | bwd_microstep: 3325.44 | bwd_inner_microstep: 3324.58 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.24 [2025-06-19 15:24:26,273] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2106.53 | bwd: 3325.46 | bwd_inner: 3324.58 | bwd_allreduce: 0.82 | step: 7.24 12%|█▏ | 1208/10000 [1:54:47<13:28:20, 5.52s/it] {'loss': 0.1702, 'grad_norm': 1.1264451742172241, 'learning_rate': 3.9141390259637855e-05, 'epoch': 1.21} 12%|█▏ | 1208/10000 [1:54:47<13:28:20, 5.52s/it][2025-06-19 15:24:31,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:24:31,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.14 | bwd_microstep: 3378.44 | bwd_inner_microstep: 3377.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 15:24:31,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.14 | bwd: 3378.45 | bwd_inner: 3377.64 | bwd_allreduce: 0.77 | step: 6.70 12%|█▏ | 1209/10000 [1:54:52<13:29:59, 5.53s/it] {'loss': 0.0813, 'grad_norm': 0.3251538872718811, 'learning_rate': 3.913951169426559e-05, 'epoch': 1.21} 12%|█▏ | 1209/10000 [1:54:52<13:29:59, 5.53s/it][2025-06-19 15:24:37,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:24:37,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.09 | bwd_microstep: 3383.29 | bwd_inner_microstep: 3382.42 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.79 [2025-06-19 15:24:37,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.09 | bwd: 3383.31 | bwd_inner: 3382.42 | bwd_allreduce: 0.84 | step: 6.79 12%|█▏ | 1210/10000 [1:54:58<13:30:54, 5.54s/it] {'loss': 0.2389, 'grad_norm': 1.1774072647094727, 'learning_rate': 3.913763112124738e-05, 'epoch': 1.21} 12%|█▏ | 1210/10000 [1:54:58<13:30:54, 5.54s/it][2025-06-19 15:24:42,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:24:42,845] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2106.57 | bwd_microstep: 3317.61 | bwd_inner_microstep: 3316.76 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.30 [2025-06-19 15:24:42,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.57 | bwd: 3317.63 | bwd_inner: 3316.76 | bwd_allreduce: 0.82 | step: 7.31 12%|█▏ | 1211/10000 [1:55:03<13:27:47, 5.51s/it] {'loss': 0.1794, 'grad_norm': 1.3365474939346313, 'learning_rate': 3.9135748540780484e-05, 'epoch': 1.21} 12%|█▏ | 1211/10000 [1:55:03<13:27:47, 5.51s/it][2025-06-19 15:24:48,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:24:48,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.54 | bwd_microstep: 3323.35 | bwd_inner_microstep: 3322.42 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.03 [2025-06-19 15:24:48,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.54 | bwd: 3323.36 | bwd_inner: 3322.42 | bwd_allreduce: 0.90 | step: 7.03 12%|█▏ | 1212/10000 [1:55:09<13:25:58, 5.50s/it] {'loss': 0.0495, 'grad_norm': 0.3273323178291321, 'learning_rate': 3.913386395306238e-05, 'epoch': 1.21} 12%|█▏ | 1212/10000 [1:55:09<13:25:58, 5.50s/it][2025-06-19 15:24:53,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:24:53,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.05 | bwd_microstep: 3375.07 | bwd_inner_microstep: 3374.14 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.89 [2025-06-19 15:24:53,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.05 | bwd: 3375.08 | bwd_inner: 3374.14 | bwd_allreduce: 0.90 | step: 6.90 12%|█▏ | 1213/10000 [1:55:14<13:27:39, 5.51s/it] {'loss': 0.0827, 'grad_norm': 0.6014090776443481, 'learning_rate': 3.913197735829075e-05, 'epoch': 1.21} 12%|█▏ | 1213/10000 [1:55:14<13:27:39, 
5.51s/it][2025-06-19 15:24:59,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:24:59,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.32 | bwd_microstep: 3373.80 | bwd_inner_microstep: 3372.74 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.80 [2025-06-19 15:24:59,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.32 | bwd: 3373.82 | bwd_inner: 3372.74 | bwd_allreduce: 1.02 | step: 7.81 12%|█▏ | 1214/10000 [1:55:20<13:29:18, 5.53s/it] {'loss': 0.1915, 'grad_norm': 1.0341222286224365, 'learning_rate': 3.913008875666349e-05, 'epoch': 1.21} 12%|█▏ | 1214/10000 [1:55:20<13:29:18, 5.53s/it][2025-06-19 15:25:04,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:25:04,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.13 | bwd_microstep: 3380.00 | bwd_inner_microstep: 3379.05 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.20 [2025-06-19 15:25:04,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.13 | bwd: 3380.03 | bwd_inner: 3379.05 | bwd_allreduce: 0.90 | step: 7.19 12%|█▏ | 1215/10000 [1:55:25<13:30:27, 5.54s/it] {'loss': 0.0558, 'grad_norm': 0.34306758642196655, 'learning_rate': 3.91281981483787e-05, 'epoch': 1.22} 12%|█▏ | 1215/10000 [1:55:25<13:30:27, 5.54s/it][2025-06-19 15:25:10,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:25:10,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.45 | bwd_microstep: 3326.36 | bwd_inner_microstep: 3325.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-19 15:25:10,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.45 | bwd: 3326.38 | bwd_inner: 3325.58 | bwd_allreduce: 
0.76 | step: 6.85 12%|█▏ | 1216/10000 [1:55:31<13:28:18, 5.52s/it] {'loss': 0.0697, 'grad_norm': 0.5245057344436646, 'learning_rate': 3.912630553363471e-05, 'epoch': 1.22} 12%|█▏ | 1216/10000 [1:55:31<13:28:18, 5.52s/it][2025-06-19 15:25:15,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 15:25:15,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.79 | bwd_microstep: 3321.00 | bwd_inner_microstep: 3320.22 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 15:25:15,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.79 | bwd: 3321.01 | bwd_inner: 3320.22 | bwd_allreduce: 0.75 | step: 6.68 12%|█▏ | 1217/10000 [1:55:36<13:25:40, 5.50s/it] {'loss': 0.0667, 'grad_norm': 0.37847718596458435, 'learning_rate': 3.912441091263003e-05, 'epoch': 1.22} 12%|█▏ | 1217/10000 [1:55:36<13:25:40, 5.50s/it][2025-06-19 15:25:21,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:25:21,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.77 | bwd_microstep: 3319.52 | bwd_inner_microstep: 3318.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 15:25:21,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.77 | bwd: 3319.54 | bwd_inner: 3318.72 | bwd_allreduce: 0.77 | step: 6.99 12%|█▏ | 1218/10000 [1:55:42<13:23:47, 5.49s/it] {'loss': 0.0945, 'grad_norm': 0.39040932059288025, 'learning_rate': 3.912251428556341e-05, 'epoch': 1.22} 12%|█▏ | 1218/10000 [1:55:42<13:23:47, 5.49s/it][2025-06-19 15:25:26,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:25:26,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.22 | bwd_microstep: 3401.18 | 
bwd_inner_microstep: 3400.22 | bwd_allreduce_microstep: 0.91 | step_microstep: 6.93 [2025-06-19 15:25:26,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.22 | bwd: 3401.19 | bwd_inner: 3400.22 | bwd_allreduce: 0.93 | step: 6.93 12%|█▏ | 1219/10000 [1:55:47<13:27:31, 5.52s/it] {'loss': 0.0629, 'grad_norm': 0.3909898102283478, 'learning_rate': 3.9120615652633784e-05, 'epoch': 1.22} 12%|█▏ | 1219/10000 [1:55:47<13:27:31, 5.52s/it][2025-06-19 15:25:32,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:25:32,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.30 | bwd_microstep: 3328.16 | bwd_inner_microstep: 3327.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 15:25:32,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.30 | bwd: 3328.17 | bwd_inner: 3327.38 | bwd_allreduce: 0.76 | step: 6.56 12%|█▏ | 1220/10000 [1:55:53<13:25:26, 5.50s/it] {'loss': 0.0899, 'grad_norm': 1.191773772239685, 'learning_rate': 3.9118715014040326e-05, 'epoch': 1.22} 12%|█▏ | 1220/10000 [1:55:53<13:25:26, 5.50s/it][2025-06-19 15:25:37,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 15:25:37,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.32 | bwd_microstep: 3318.10 | bwd_inner_microstep: 3316.97 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.06 [2025-06-19 15:25:37,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.32 | bwd: 3318.12 | bwd_inner: 3316.97 | bwd_allreduce: 1.10 | step: 7.06 12%|█▏ | 1221/10000 [1:55:58<13:23:33, 5.49s/it] {'loss': 0.12, 'grad_norm': 0.7895259261131287, 'learning_rate': 3.911681236998239e-05, 'epoch': 1.22} 12%|█▏ | 1221/10000 [1:55:58<13:23:33, 5.49s/it][2025-06-19 15:25:43,357] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:25:43,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.40 | bwd_microstep: 3313.06 | bwd_inner_microstep: 3312.25 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.65 [2025-06-19 15:25:43,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.40 | bwd: 3313.07 | bwd_inner: 3312.25 | bwd_allreduce: 0.78 | step: 6.66 12%|█▏ | 1222/10000 [1:56:04<13:21:44, 5.48s/it] {'loss': 0.0808, 'grad_norm': 0.38537275791168213, 'learning_rate': 3.911490772065956e-05, 'epoch': 1.22} 12%|█▏ | 1222/10000 [1:56:04<13:21:44, 5.48s/it][2025-06-19 15:25:48,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:25:48,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.35 | bwd_microstep: 3374.17 | bwd_inner_microstep: 3373.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 15:25:48,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.35 | bwd: 3374.19 | bwd_inner: 3373.36 | bwd_allreduce: 0.78 | step: 7.07 12%|█▏ | 1223/10000 [1:56:09<13:24:17, 5.50s/it] {'loss': 0.0955, 'grad_norm': 0.7175774574279785, 'learning_rate': 3.911300106627163e-05, 'epoch': 1.22} 12%|█▏ | 1223/10000 [1:56:09<13:24:17, 5.50s/it][2025-06-19 15:25:54,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:25:54,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.60 | bwd_microstep: 3320.87 | bwd_inner_microstep: 3320.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 15:25:54,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.60 | bwd: 3320.89 | bwd_inner: 3320.08 | bwd_allreduce: 0.76 | step: 6.70 12%|█▏ | 1224/10000 [1:56:15<13:22:34, 5.49s/it] {'loss': 
0.0562, 'grad_norm': 0.47216472029685974, 'learning_rate': 3.911109240701859e-05, 'epoch': 1.22} 12%|█▏ | 1224/10000 [1:56:15<13:22:34, 5.49s/it][2025-06-19 15:25:59,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:25:59,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.57 | bwd_microstep: 3372.02 | bwd_inner_microstep: 3371.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 15:25:59,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.57 | bwd: 3372.04 | bwd_inner: 3371.24 | bwd_allreduce: 0.76 | step: 6.66 12%|█▏ | 1225/10000 [1:56:20<13:24:50, 5.50s/it] {'loss': 0.0992, 'grad_norm': 0.9029370546340942, 'learning_rate': 3.910918174310066e-05, 'epoch': 1.23} 12%|█▏ | 1225/10000 [1:56:20<13:24:50, 5.50s/it][2025-06-19 15:26:05,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:26:05,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.41 | bwd_microstep: 3321.17 | bwd_inner_microstep: 3320.13 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.36 [2025-06-19 15:26:05,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.41 | bwd: 3321.19 | bwd_inner: 3320.13 | bwd_allreduce: 1.01 | step: 7.37 12%|█▏ | 1226/10000 [1:56:26<13:23:13, 5.49s/it] {'loss': 0.2081, 'grad_norm': 1.2544755935668945, 'learning_rate': 3.9107269074718246e-05, 'epoch': 1.23} 12%|█▏ | 1226/10000 [1:56:26<13:23:13, 5.49s/it][2025-06-19 15:26:10,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:26:10,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.17 | bwd_microstep: 3321.99 | bwd_inner_microstep: 3321.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 
[2025-06-19 15:26:10,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.17 | bwd: 3322.00 | bwd_inner: 3321.19 | bwd_allreduce: 0.77 | step: 6.62 12%|█▏ | 1227/10000 [1:56:31<13:22:17, 5.49s/it] {'loss': 0.0803, 'grad_norm': 0.6371999979019165, 'learning_rate': 3.9105354402071986e-05, 'epoch': 1.23} 12%|█▏ | 1227/10000 [1:56:31<13:22:17, 5.49s/it][2025-06-19 15:26:16,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:26:16,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.21 | bwd_microstep: 3380.71 | bwd_inner_microstep: 3379.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 15:26:16,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.21 | bwd: 3380.72 | bwd_inner: 3379.91 | bwd_allreduce: 0.76 | step: 6.63 12%|█▏ | 1228/10000 [1:56:37<13:24:42, 5.50s/it] {'loss': 0.0938, 'grad_norm': 0.543780505657196, 'learning_rate': 3.9103437725362726e-05, 'epoch': 1.23} 12%|█▏ | 1228/10000 [1:56:37<13:24:42, 5.50s/it][2025-06-19 15:26:21,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:26:21,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.09 | bwd_microstep: 3318.29 | bwd_inner_microstep: 3317.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-19 15:26:21,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.09 | bwd: 3318.31 | bwd_inner: 3317.50 | bwd_allreduce: 0.77 | step: 7.13 12%|█▏ | 1229/10000 [1:56:42<13:22:57, 5.49s/it] {'loss': 0.2557, 'grad_norm': 1.0602160692214966, 'learning_rate': 3.9101519044791506e-05, 'epoch': 1.23} 12%|█▏ | 1229/10000 [1:56:42<13:22:57, 5.49s/it][2025-06-19 15:26:27,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 
[2025-06-19 15:26:27,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.94 | bwd_microstep: 3324.93 | bwd_inner_microstep: 3324.10 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.76 [2025-06-19 15:26:27,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.94 | bwd: 3324.95 | bwd_inner: 3324.10 | bwd_allreduce: 0.80 | step: 6.76 12%|█▏ | 1230/10000 [1:56:48<13:21:48, 5.49s/it] {'loss': 0.0846, 'grad_norm': 0.4052717387676239, 'learning_rate': 3.9099598360559586e-05, 'epoch': 1.23} 12%|█▏ | 1230/10000 [1:56:48<13:21:48, 5.49s/it][2025-06-19 15:26:32,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:26:32,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.42 | bwd_microstep: 3365.19 | bwd_inner_microstep: 3364.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 15:26:32,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.42 | bwd: 3365.20 | bwd_inner: 3364.41 | bwd_allreduce: 0.76 | step: 6.52 12%|█▏ | 1231/10000 [1:56:53<13:23:42, 5.50s/it] {'loss': 0.0635, 'grad_norm': 0.49719908833503723, 'learning_rate': 3.9097675672868453e-05, 'epoch': 1.23} 12%|█▏ | 1231/10000 [1:56:53<13:23:42, 5.50s/it][2025-06-19 15:26:38,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:26:38,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.28 | bwd_microstep: 3331.72 | bwd_inner_microstep: 3330.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 15:26:38,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.28 | bwd: 3331.73 | bwd_inner: 3330.94 | bwd_allreduce: 0.76 | step: 6.72 12%|█▏ | 1232/10000 [1:56:59<13:22:47, 5.49s/it] {'loss': 0.1208, 'grad_norm': 1.3943736553192139, 'learning_rate': 3.909575098191977e-05, 
'epoch': 1.23} 12%|█▏ | 1232/10000 [1:56:59<13:22:47, 5.49s/it][2025-06-19 15:26:43,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:26:43,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.75 | bwd_microstep: 3323.85 | bwd_inner_microstep: 3322.84 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.53 [2025-06-19 15:26:43,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.75 | bwd: 3323.86 | bwd_inner: 3322.84 | bwd_allreduce: 0.98 | step: 7.53 12%|█▏ | 1233/10000 [1:57:04<13:22:22, 5.49s/it] {'loss': 0.0804, 'grad_norm': 0.4343676269054413, 'learning_rate': 3.909382428791544e-05, 'epoch': 1.23} 12%|█▏ | 1233/10000 [1:57:04<13:22:22, 5.49s/it][2025-06-19 15:26:49,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:26:49,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.75 | bwd_microstep: 3369.05 | bwd_inner_microstep: 3368.23 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.28 [2025-06-19 15:26:49,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.76 | bwd: 3369.06 | bwd_inner: 3368.23 | bwd_allreduce: 0.78 | step: 7.28 12%|█▏ | 1234/10000 [1:57:10<13:24:50, 5.51s/it] {'loss': 0.0786, 'grad_norm': 0.464977890253067, 'learning_rate': 3.9091895591057555e-05, 'epoch': 1.23} 12%|█▏ | 1234/10000 [1:57:10<13:24:50, 5.51s/it][2025-06-19 15:26:54,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:26:54,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.14 | bwd_microstep: 3318.51 | bwd_inner_microstep: 3317.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-19 15:26:54,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2104.14 | bwd: 3318.52 | bwd_inner: 3317.73 | bwd_allreduce: 0.76 | step: 6.79 12%|█▏ | 1235/10000 [1:57:15<13:22:35, 5.49s/it] {'loss': 0.1419, 'grad_norm': 0.7021015882492065, 'learning_rate': 3.9089964891548433e-05, 'epoch': 1.23} 12%|█▏ | 1235/10000 [1:57:15<13:22:35, 5.49s/it][2025-06-19 15:27:00,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:27:00,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.00 | bwd_microstep: 3365.79 | bwd_inner_microstep: 3364.79 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.05 [2025-06-19 15:27:00,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.00 | bwd: 3365.81 | bwd_inner: 3364.79 | bwd_allreduce: 0.96 | step: 7.06 12%|█▏ | 1236/10000 [1:57:21<13:24:22, 5.51s/it] {'loss': 0.1383, 'grad_norm': 1.0661866664886475, 'learning_rate': 3.908803218959059e-05, 'epoch': 1.24} 12%|█▏ | 1236/10000 [1:57:21<13:24:22, 5.51s/it][2025-06-19 15:27:05,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 15:27:05,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.37 | bwd_microstep: 3369.70 | bwd_inner_microstep: 3368.71 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.53 [2025-06-19 15:27:05,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.37 | bwd: 3369.71 | bwd_inner: 3368.71 | bwd_allreduce: 0.95 | step: 7.53 12%|█▏ | 1237/10000 [1:57:26<13:25:42, 5.52s/it] {'loss': 0.1388, 'grad_norm': 1.1870977878570557, 'learning_rate': 3.9086097485386766e-05, 'epoch': 1.24} 12%|█▏ | 1237/10000 [1:57:26<13:25:42, 5.52s/it][2025-06-19 15:27:11,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:27:11,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2106.05 | bwd_microstep: 3322.34 | bwd_inner_microstep: 3321.36 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.68 [2025-06-19 15:27:11,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.05 | bwd: 3322.36 | bwd_inner: 3321.36 | bwd_allreduce: 0.95 | step: 7.69 12%|█▏ | 1238/10000 [1:57:32<13:23:51, 5.50s/it] {'loss': 0.1094, 'grad_norm': 0.7734361886978149, 'learning_rate': 3.9084160779139894e-05, 'epoch': 1.24} 12%|█▏ | 1238/10000 [1:57:32<13:23:51, 5.50s/it][2025-06-19 15:27:16,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:27:16,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.12 | bwd_microstep: 3320.39 | bwd_inner_microstep: 3319.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 15:27:16,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.12 | bwd: 3320.41 | bwd_inner: 3319.60 | bwd_allreduce: 0.76 | step: 6.76 12%|█▏ | 1239/10000 [1:57:37<13:22:17, 5.49s/it] {'loss': 0.1157, 'grad_norm': 0.6308117508888245, 'learning_rate': 3.9082222071053125e-05, 'epoch': 1.24} 12%|█▏ | 1239/10000 [1:57:37<13:22:17, 5.49s/it][2025-06-19 15:27:22,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:27:22,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.43 | bwd_microstep: 3368.17 | bwd_inner_microstep: 3367.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 15:27:22,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.43 | bwd: 3368.18 | bwd_inner: 3367.36 | bwd_allreduce: 0.78 | step: 7.17 12%|█▏ | 1240/10000 [1:57:43<13:23:54, 5.51s/it] {'loss': 0.1095, 'grad_norm': 0.560257077217102, 'learning_rate': 3.908028136132983e-05, 'epoch': 1.24} 12%|█▏ | 1240/10000 [1:57:43<13:23:54, 5.51s/it][2025-06-19 
15:27:27,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:27:27,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.87 | bwd_microstep: 3312.08 | bwd_inner_microstep: 3311.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 15:27:27,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.87 | bwd: 3312.09 | bwd_inner: 3311.28 | bwd_allreduce: 0.77 | step: 6.68 12%|█▏ | 1241/10000 [1:57:48<13:22:00, 5.49s/it] {'loss': 0.1463, 'grad_norm': 1.6229565143585205, 'learning_rate': 3.907833865017357e-05, 'epoch': 1.24} 12%|█▏ | 1241/10000 [1:57:48<13:22:00, 5.49s/it][2025-06-19 15:27:33,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:27:33,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.26 | bwd_microstep: 3396.56 | bwd_inner_microstep: 3395.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.57 [2025-06-19 15:27:33,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.26 | bwd: 3396.57 | bwd_inner: 3395.77 | bwd_allreduce: 0.76 | step: 6.57 12%|█▏ | 1242/10000 [1:57:54<13:25:28, 5.52s/it] {'loss': 0.0663, 'grad_norm': 0.4981778562068939, 'learning_rate': 3.9076393937788135e-05, 'epoch': 1.24} 12%|█▏ | 1242/10000 [1:57:54<13:25:28, 5.52s/it][2025-06-19 15:27:38,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:27:38,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.97 | bwd_microstep: 3313.32 | bwd_inner_microstep: 3312.45 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.54 [2025-06-19 15:27:38,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.98 | bwd: 3313.34 | bwd_inner: 3312.46 | bwd_allreduce: 0.83 | step: 7.54 
12%|█▏ | 1243/10000 [1:57:59<13:22:53, 5.50s/it] {'loss': 0.1097, 'grad_norm': 0.7768391966819763, 'learning_rate': 3.9074447224377505e-05, 'epoch': 1.24} 12%|█▏ | 1243/10000 [1:57:59<13:22:53, 5.50s/it][2025-06-19 15:27:44,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:27:44,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.89 | bwd_microstep: 3311.74 | bwd_inner_microstep: 3310.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 15:27:44,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.89 | bwd: 3311.75 | bwd_inner: 3310.95 | bwd_allreduce: 0.76 | step: 6.64 12%|█▏ | 1244/10000 [1:58:05<13:20:51, 5.49s/it] {'loss': 0.0776, 'grad_norm': 0.5062058568000793, 'learning_rate': 3.90724985101459e-05, 'epoch': 1.24} 12%|█▏ | 1244/10000 [1:58:05<13:20:51, 5.49s/it][2025-06-19 15:27:49,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:27:49,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.62 | bwd_microstep: 3310.11 | bwd_inner_microstep: 3309.29 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.39 [2025-06-19 15:27:49,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.62 | bwd: 3310.13 | bwd_inner: 3309.29 | bwd_allreduce: 0.79 | step: 7.39 12%|█▏ | 1245/10000 [1:58:10<13:19:15, 5.48s/it] {'loss': 0.1236, 'grad_norm': 0.5965850353240967, 'learning_rate': 3.907054779529771e-05, 'epoch': 1.25} 12%|█▏ | 1245/10000 [1:58:10<13:19:15, 5.48s/it][2025-06-19 15:27:55,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:27:55,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.88 | bwd_microstep: 3313.13 | bwd_inner_microstep: 3312.34 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 15:27:55,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.88 | bwd: 3313.14 | bwd_inner: 3312.34 | bwd_allreduce: 0.76 | step: 6.81 12%|█▏ | 1246/10000 [1:58:16<13:18:17, 5.47s/it] {'loss': 0.1439, 'grad_norm': 0.773430585861206, 'learning_rate': 3.906859508003757e-05, 'epoch': 1.25} 12%|█▏ | 1246/10000 [1:58:16<13:18:17, 5.47s/it][2025-06-19 15:28:00,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:28:00,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.81 | bwd_microstep: 3370.63 | bwd_inner_microstep: 3369.62 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.19 [2025-06-19 15:28:00,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.81 | bwd: 3370.64 | bwd_inner: 3369.62 | bwd_allreduce: 0.98 | step: 7.20 12%|█▏ | 1247/10000 [1:58:21<13:21:21, 5.49s/it] {'loss': 0.0638, 'grad_norm': 0.30459076166152954, 'learning_rate': 3.9066640364570305e-05, 'epoch': 1.25} 12%|█▏ | 1247/10000 [1:58:21<13:21:21, 5.49s/it][2025-06-19 15:28:06,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:28:06,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.84 | bwd_microstep: 3369.49 | bwd_inner_microstep: 3368.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-19 15:28:06,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.84 | bwd: 3369.51 | bwd_inner: 3368.69 | bwd_allreduce: 0.77 | step: 7.00 12%|█▏ | 1248/10000 [1:58:27<13:23:19, 5.51s/it] {'loss': 0.0851, 'grad_norm': 0.7790970802307129, 'learning_rate': 3.906468364910096e-05, 'epoch': 1.25} 12%|█▏ | 1248/10000 [1:58:27<13:23:19, 5.51s/it][2025-06-19 15:28:11,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:28:11,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.57 | bwd_microstep: 3307.21 | bwd_inner_microstep: 3306.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-19 15:28:11,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.57 | bwd: 3307.22 | bwd_inner: 3306.40 | bwd_allreduce: 0.78 | step: 6.87 12%|█▏ | 1249/10000 [1:58:32<13:20:56, 5.49s/it] {'loss': 0.1844, 'grad_norm': 1.7395756244659424, 'learning_rate': 3.9062724933834776e-05, 'epoch': 1.25} 12%|█▏ | 1249/10000 [1:58:32<13:20:56, 5.49s/it][2025-06-19 15:28:17,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:28:17,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.70 | bwd_microstep: 3315.25 | bwd_inner_microstep: 3314.39 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.22 [2025-06-19 15:28:17,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.70 | bwd: 3315.27 | bwd_inner: 3314.39 | bwd_allreduce: 0.83 | step: 7.22 12%|█▎ | 1250/10000 [1:58:38<13:19:15, 5.48s/it] {'loss': 0.1386, 'grad_norm': 0.5893047451972961, 'learning_rate': 3.906076421897722e-05, 'epoch': 1.25} 12%|█▎ | 1250/10000 [1:58:38<13:19:15, 5.48s/it][2025-06-19 15:28:22,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:28:22,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.51 | bwd_microstep: 3308.91 | bwd_inner_microstep: 3308.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 15:28:22,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.51 | bwd: 3308.93 | bwd_inner: 3308.11 | bwd_allreduce: 0.78 | step: 7.17 13%|█▎ | 1251/10000 [1:58:43<13:18:06, 5.47s/it] {'loss': 0.1533, 'grad_norm': 
0.9006274938583374, 'learning_rate': 3.9058801504733966e-05, 'epoch': 1.25} 13%|█▎ | 1251/10000 [1:58:43<13:18:06, 5.47s/it][2025-06-19 15:28:28,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:28:28,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.74 | bwd_microstep: 3369.62 | bwd_inner_microstep: 3368.82 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.59 [2025-06-19 15:28:28,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.74 | bwd: 3369.64 | bwd_inner: 3368.82 | bwd_allreduce: 0.77 | step: 6.59 13%|█▎ | 1252/10000 [1:58:49<13:20:57, 5.49s/it] {'loss': 0.127, 'grad_norm': 0.6756653189659119, 'learning_rate': 3.9056836791310885e-05, 'epoch': 1.25} 13%|█▎ | 1252/10000 [1:58:49<13:20:57, 5.49s/it][2025-06-19 15:28:33,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:28:33,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.52 | bwd_microstep: 3311.06 | bwd_inner_microstep: 3310.02 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.81 [2025-06-19 15:28:33,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.52 | bwd: 3311.09 | bwd_inner: 3310.02 | bwd_allreduce: 1.00 | step: 7.81 13%|█▎ | 1253/10000 [1:58:54<13:19:20, 5.48s/it] {'loss': 0.0903, 'grad_norm': 0.42058828473091125, 'learning_rate': 3.905487007891407e-05, 'epoch': 1.25} 13%|█▎ | 1253/10000 [1:58:54<13:19:20, 5.48s/it][2025-06-19 15:28:39,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:28:39,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.34 | bwd_microstep: 3305.47 | bwd_inner_microstep: 3304.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 15:28:39,150] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.34 | bwd: 3305.48 | bwd_inner: 3304.69 | bwd_allreduce: 0.75 | step: 6.54 13%|█▎ | 1254/10000 [1:58:59<13:17:36, 5.47s/it] {'loss': 0.104, 'grad_norm': 0.5014871954917908, 'learning_rate': 3.905290136774982e-05, 'epoch': 1.25} 13%|█▎ | 1254/10000 [1:58:59<13:17:36, 5.47s/it][2025-06-19 15:28:44,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:28:44,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.27 | bwd_microstep: 3361.93 | bwd_inner_microstep: 3361.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 15:28:44,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.27 | bwd: 3361.95 | bwd_inner: 3361.14 | bwd_allreduce: 0.76 | step: 6.85 13%|█▎ | 1255/10000 [1:59:05<13:19:42, 5.49s/it] {'loss': 0.075, 'grad_norm': 0.375897079706192, 'learning_rate': 3.9050930658024644e-05, 'epoch': 1.25} 13%|█▎ | 1255/10000 [1:59:05<13:19:42, 5.49s/it][2025-06-19 15:28:50,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:28:50,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.20 | bwd_microstep: 3308.65 | bwd_inner_microstep: 3307.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 15:28:50,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.20 | bwd: 3308.67 | bwd_inner: 3307.85 | bwd_allreduce: 0.77 | step: 7.12 13%|█▎ | 1256/10000 [1:59:10<13:17:52, 5.47s/it] {'loss': 0.0885, 'grad_norm': 0.4604320228099823, 'learning_rate': 3.904895794994526e-05, 'epoch': 1.26} 13%|█▎ | 1256/10000 [1:59:10<13:17:52, 5.47s/it][2025-06-19 15:28:55,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:28:55,671] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.66 | bwd_microstep: 3378.25 | bwd_inner_microstep: 3377.26 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.71 [2025-06-19 15:28:55,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.66 | bwd: 3378.28 | bwd_inner: 3377.26 | bwd_allreduce: 0.97 | step: 7.71 13%|█▎ | 1257/10000 [1:59:16<13:21:18, 5.50s/it] {'loss': 0.1033, 'grad_norm': 0.5123528838157654, 'learning_rate': 3.90469832437186e-05, 'epoch': 1.26} 13%|█▎ | 1257/10000 [1:59:16<13:21:18, 5.50s/it][2025-06-19 15:29:01,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:29:01,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.17 | bwd_microstep: 3357.64 | bwd_inner_microstep: 3356.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 15:29:01,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.17 | bwd: 3357.66 | bwd_inner: 3356.85 | bwd_allreduce: 0.77 | step: 7.05 13%|█▎ | 1258/10000 [1:59:22<13:22:40, 5.51s/it] {'loss': 0.0991, 'grad_norm': 0.6164708137512207, 'learning_rate': 3.9045006539551794e-05, 'epoch': 1.26} 13%|█▎ | 1258/10000 [1:59:22<13:22:40, 5.51s/it][2025-06-19 15:29:06,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:29:06,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.72 | bwd_microstep: 3307.04 | bwd_inner_microstep: 3306.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-19 15:29:06,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.72 | bwd: 3307.06 | bwd_inner: 3306.24 | bwd_allreduce: 0.77 | step: 7.00 13%|█▎ | 1259/10000 [1:59:27<13:19:48, 5.49s/it] {'loss': 0.0552, 'grad_norm': 0.41556593775749207, 'learning_rate': 3.90430278376522e-05, 'epoch': 1.26} 13%|█▎ | 1259/10000 
[1:59:27<13:19:48, 5.49s/it][2025-06-19 15:29:12,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:29:12,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.74 | bwd_microstep: 3314.97 | bwd_inner_microstep: 3314.11 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.90 [2025-06-19 15:29:12,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.74 | bwd: 3314.98 | bwd_inner: 3314.11 | bwd_allreduce: 0.83 | step: 6.90 13%|█▎ | 1260/10000 [1:59:32<13:18:22, 5.48s/it] {'loss': 0.1676, 'grad_norm': 0.7626948952674866, 'learning_rate': 3.904104713822736e-05, 'epoch': 1.26} 13%|█▎ | 1260/10000 [1:59:32<13:18:22, 5.48s/it][2025-06-19 15:29:17,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:29:17,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.64 | bwd_microstep: 3366.54 | bwd_inner_microstep: 3365.74 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 15:29:17,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.64 | bwd: 3366.56 | bwd_inner: 3365.74 | bwd_allreduce: 0.78 | step: 7.11 13%|█▎ | 1261/10000 [1:59:38<13:20:26, 5.50s/it] {'loss': 0.0624, 'grad_norm': 0.294053316116333, 'learning_rate': 3.903906444148504e-05, 'epoch': 1.26} 13%|█▎ | 1261/10000 [1:59:38<13:20:26, 5.50s/it][2025-06-19 15:29:23,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:29:23,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.88 | bwd_microstep: 3314.65 | bwd_inner_microstep: 3313.83 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.40 [2025-06-19 15:29:23,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.88 | bwd: 3314.67 | bwd_inner: 
3313.83 | bwd_allreduce: 0.80 | step: 7.40 13%|█▎ | 1262/10000 [1:59:43<13:18:52, 5.49s/it] {'loss': 0.0512, 'grad_norm': 0.2945284843444824, 'learning_rate': 3.903707974763323e-05, 'epoch': 1.26} 13%|█▎ | 1262/10000 [1:59:43<13:18:52, 5.49s/it][2025-06-19 15:29:28,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:29:28,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.41 | bwd_microstep: 3314.02 | bwd_inner_microstep: 3313.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.68 [2025-06-19 15:29:28,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.41 | bwd: 3314.04 | bwd_inner: 3313.22 | bwd_allreduce: 0.77 | step: 6.69 13%|█▎ | 1263/10000 [1:59:49<13:17:36, 5.48s/it] {'loss': 0.0616, 'grad_norm': 0.43942350149154663, 'learning_rate': 3.903509305688011e-05, 'epoch': 1.26} 13%|█▎ | 1263/10000 [1:59:49<13:17:36, 5.48s/it][2025-06-19 15:29:34,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:29:34,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.54 | bwd_microstep: 3309.04 | bwd_inner_microstep: 3308.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 15:29:34,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.54 | bwd: 3309.06 | bwd_inner: 3308.24 | bwd_allreduce: 0.77 | step: 6.80 13%|█▎ | 1264/10000 [1:59:54<13:16:08, 5.47s/it] {'loss': 0.1102, 'grad_norm': 0.471876859664917, 'learning_rate': 3.903310436943407e-05, 'epoch': 1.26} 13%|█▎ | 1264/10000 [1:59:54<13:16:08, 5.47s/it][2025-06-19 15:29:39,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:29:39,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.59 | bwd_microstep: 
3322.73 | bwd_inner_microstep: 3321.85 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.45 [2025-06-19 15:29:39,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.59 | bwd: 3322.75 | bwd_inner: 3321.85 | bwd_allreduce: 0.83 | step: 7.44 13%|█▎ | 1265/10000 [2:00:00<13:16:22, 5.47s/it] {'loss': 0.1138, 'grad_norm': 0.6418999433517456, 'learning_rate': 3.9031113685503726e-05, 'epoch': 1.27} 13%|█▎ | 1265/10000 [2:00:00<13:16:22, 5.47s/it][2025-06-19 15:29:44,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:29:44,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.70 | bwd_microstep: 3324.84 | bwd_inner_microstep: 3324.02 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.35 [2025-06-19 15:29:44,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.71 | bwd: 3324.86 | bwd_inner: 3324.02 | bwd_allreduce: 0.79 | step: 7.36 13%|█▎ | 1266/10000 [2:00:05<13:16:27, 5.47s/it] {'loss': 0.0785, 'grad_norm': 0.4262782037258148, 'learning_rate': 3.902912100529787e-05, 'epoch': 1.27} 13%|█▎ | 1266/10000 [2:00:05<13:16:27, 5.47s/it][2025-06-19 15:29:50,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:29:50,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.80 | bwd_microstep: 3361.77 | bwd_inner_microstep: 3360.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 15:29:50,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.80 | bwd: 3361.79 | bwd_inner: 3360.99 | bwd_allreduce: 0.76 | step: 6.65 13%|█▎ | 1267/10000 [2:00:11<13:19:01, 5.49s/it] {'loss': 0.1362, 'grad_norm': 0.7371950149536133, 'learning_rate': 3.9027126329025546e-05, 'epoch': 1.27} 13%|█▎ | 1267/10000 [2:00:11<13:19:01, 5.49s/it][2025-06-19 15:29:55,935] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:29:55,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.71 | bwd_microstep: 3308.64 | bwd_inner_microstep: 3307.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 15:29:55,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.71 | bwd: 3308.66 | bwd_inner: 3307.83 | bwd_allreduce: 0.78 | step: 7.25 13%|█▎ | 1268/10000 [2:00:16<13:17:05, 5.48s/it] {'loss': 0.0768, 'grad_norm': 0.430378258228302, 'learning_rate': 3.902512965689598e-05, 'epoch': 1.27} 13%|█▎ | 1268/10000 [2:00:16<13:17:05, 5.48s/it][2025-06-19 15:30:01,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:30:01,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.52 | bwd_microstep: 3318.08 | bwd_inner_microstep: 3317.12 | bwd_allreduce_microstep: 0.91 | step_microstep: 6.99 [2025-06-19 15:30:01,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.52 | bwd: 3318.09 | bwd_inner: 3317.12 | bwd_allreduce: 0.93 | step: 6.99 13%|█▎ | 1269/10000 [2:00:22<13:16:14, 5.47s/it] {'loss': 0.1391, 'grad_norm': 1.249907374382019, 'learning_rate': 3.9023130989118595e-05, 'epoch': 1.27} 13%|█▎ | 1269/10000 [2:00:22<13:16:14, 5.47s/it][2025-06-19 15:30:06,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.72 [2025-06-19 15:30:06,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.88 | bwd_microstep: 3378.14 | bwd_inner_microstep: 3377.31 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.52 [2025-06-19 15:30:06,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.88 | bwd: 3378.16 | bwd_inner: 3377.31 | bwd_allreduce: 0.80 | step: 7.53 13%|█▎ | 1270/10000 [2:00:27<13:19:27, 5.49s/it] 
{'loss': 0.0637, 'grad_norm': 0.4846266806125641, 'learning_rate': 3.9021130325903076e-05, 'epoch': 1.27} 13%|█▎ | 1270/10000 [2:00:27<13:19:27, 5.49s/it][2025-06-19 15:30:12,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:30:12,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.93 | bwd_microstep: 3323.08 | bwd_inner_microstep: 3322.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.31 [2025-06-19 15:30:12,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.93 | bwd: 3323.10 | bwd_inner: 3322.27 | bwd_allreduce: 0.78 | step: 7.31 13%|█▎ | 1271/10000 [2:00:33<13:18:20, 5.49s/it] {'loss': 0.0818, 'grad_norm': 0.6439415812492371, 'learning_rate': 3.901912766745926e-05, 'epoch': 1.27} 13%|█▎ | 1271/10000 [2:00:33<13:18:20, 5.49s/it][2025-06-19 15:30:17,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:30:17,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.59 | bwd_microstep: 3324.43 | bwd_inner_microstep: 3323.61 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.36 [2025-06-19 15:30:17,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.59 | bwd: 3324.45 | bwd_inner: 3323.61 | bwd_allreduce: 0.79 | step: 7.36 13%|█▎ | 1272/10000 [2:00:38<13:17:42, 5.48s/it] {'loss': 0.054, 'grad_norm': 0.328362375497818, 'learning_rate': 3.9017123013997225e-05, 'epoch': 1.27} 13%|█▎ | 1272/10000 [2:00:38<13:17:42, 5.48s/it][2025-06-19 15:30:23,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:30:23,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.52 | bwd_microstep: 3323.01 | bwd_inner_microstep: 3322.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.32 
[2025-06-19 15:30:23,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.52 | bwd: 3323.03 | bwd_inner: 3322.21 | bwd_allreduce: 0.78 | step: 7.33 13%|█▎ | 1273/10000 [2:00:44<13:16:50, 5.48s/it] {'loss': 0.1788, 'grad_norm': 0.977738082408905, 'learning_rate': 3.901511636572724e-05, 'epoch': 1.27} 13%|█▎ | 1273/10000 [2:00:44<13:16:50, 5.48s/it][2025-06-19 15:30:28,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:30:28,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.92 | bwd_microstep: 3369.79 | bwd_inner_microstep: 3368.97 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.32 [2025-06-19 15:30:28,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.92 | bwd: 3369.81 | bwd_inner: 3368.97 | bwd_allreduce: 0.78 | step: 7.32 13%|█▎ | 1274/10000 [2:00:49<13:19:10, 5.50s/it] {'loss': 0.1131, 'grad_norm': 1.106341004371643, 'learning_rate': 3.90131077228598e-05, 'epoch': 1.27} 13%|█▎ | 1274/10000 [2:00:49<13:19:10, 5.50s/it][2025-06-19 15:30:34,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:30:34,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.57 | bwd_microstep: 3374.40 | bwd_inner_microstep: 3373.41 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.71 [2025-06-19 15:30:34,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.57 | bwd: 3374.42 | bwd_inner: 3373.41 | bwd_allreduce: 0.95 | step: 7.71 13%|█▎ | 1275/10000 [2:00:55<13:21:01, 5.51s/it] {'loss': 0.1275, 'grad_norm': 0.8502394556999207, 'learning_rate': 3.901109708560561e-05, 'epoch': 1.27} 13%|█▎ | 1275/10000 [2:00:55<13:21:01, 5.51s/it][2025-06-19 15:30:39,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 
[2025-06-19 15:30:39,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.66 | bwd_microstep: 3369.97 | bwd_inner_microstep: 3369.06 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.11 [2025-06-19 15:30:39,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.66 | bwd: 3370.00 | bwd_inner: 3369.06 | bwd_allreduce: 0.88 | step: 7.12 13%|█▎ | 1276/10000 [2:01:00<13:22:33, 5.52s/it] {'loss': 0.1329, 'grad_norm': 0.7783505320549011, 'learning_rate': 3.900908445417556e-05, 'epoch': 1.28} 13%|█▎ | 1276/10000 [2:01:00<13:22:33, 5.52s/it][2025-06-19 15:30:45,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:30:45,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.48 | bwd_microstep: 3320.78 | bwd_inner_microstep: 3319.78 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.73 [2025-06-19 15:30:45,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.48 | bwd: 3320.80 | bwd_inner: 3319.78 | bwd_allreduce: 0.97 | step: 7.73 13%|█▎ | 1277/10000 [2:01:06<13:20:07, 5.50s/it] {'loss': 0.1189, 'grad_norm': 0.9586820006370544, 'learning_rate': 3.9007069828780786e-05, 'epoch': 1.28} 13%|█▎ | 1277/10000 [2:01:06<13:20:07, 5.50s/it][2025-06-19 15:30:50,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.88 [2025-06-19 15:30:50,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.08 | bwd_microstep: 3322.08 | bwd_inner_microstep: 3321.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 15:30:50,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.08 | bwd: 3322.09 | bwd_inner: 3321.29 | bwd_allreduce: 0.76 | step: 6.88 13%|█▎ | 1278/10000 [2:01:11<13:18:25, 5.49s/it] {'loss': 0.0716, 'grad_norm': 0.4900369942188263, 'learning_rate': 3.9005053209632586e-05, 
'epoch': 1.28} 13%|█▎ | 1278/10000 [2:01:11<13:18:25, 5.49s/it][2025-06-19 15:30:56,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:30:56,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.63 | bwd_microstep: 3376.73 | bwd_inner_microstep: 3375.79 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.60 [2025-06-19 15:30:56,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.63 | bwd: 3376.74 | bwd_inner: 3375.79 | bwd_allreduce: 0.91 | step: 7.60 13%|█▎ | 1279/10000 [2:01:17<13:20:42, 5.51s/it] {'loss': 0.1558, 'grad_norm': 1.1085431575775146, 'learning_rate': 3.900303459694252e-05, 'epoch': 1.28} 13%|█▎ | 1279/10000 [2:01:17<13:20:42, 5.51s/it][2025-06-19 15:31:01,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:31:01,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.72 | bwd_microstep: 3317.46 | bwd_inner_microstep: 3316.66 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 15:31:01,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.72 | bwd: 3317.48 | bwd_inner: 3316.66 | bwd_allreduce: 0.77 | step: 7.22 13%|█▎ | 1280/10000 [2:01:22<13:18:46, 5.50s/it] {'loss': 0.111, 'grad_norm': 0.742621898651123, 'learning_rate': 3.900101399092232e-05, 'epoch': 1.28} 13%|█▎ | 1280/10000 [2:01:22<13:18:46, 5.50s/it][2025-06-19 15:31:07,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:31:07,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.20 | bwd_microstep: 3316.83 | bwd_inner_microstep: 3316.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 15:31:07,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.20 
| bwd: 3316.84 | bwd_inner: 3316.04 | bwd_allreduce: 0.76 | step: 6.63 13%|█▎ | 1281/10000 [2:01:28<13:16:59, 5.48s/it] {'loss': 0.0822, 'grad_norm': 0.557936429977417, 'learning_rate': 3.899899139178394e-05, 'epoch': 1.28} 13%|█▎ | 1281/10000 [2:01:28<13:16:59, 5.48s/it][2025-06-19 15:31:12,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:31:12,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.72 | bwd_microstep: 3374.75 | bwd_inner_microstep: 3373.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 15:31:12,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.72 | bwd: 3374.76 | bwd_inner: 3373.96 | bwd_allreduce: 0.76 | step: 6.64 13%|█▎ | 1282/10000 [2:01:33<13:19:12, 5.50s/it] {'loss': 0.0703, 'grad_norm': 0.8689882755279541, 'learning_rate': 3.899696679973953e-05, 'epoch': 1.28} 13%|█▎ | 1282/10000 [2:01:33<13:19:12, 5.50s/it][2025-06-19 15:31:18,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:31:18,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.59 | bwd_microstep: 3326.48 | bwd_inner_microstep: 3325.66 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.10 [2025-06-19 15:31:18,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.59 | bwd: 3326.50 | bwd_inner: 3325.66 | bwd_allreduce: 0.79 | step: 7.10 13%|█▎ | 1283/10000 [2:01:39<13:17:46, 5.49s/it] {'loss': 0.0681, 'grad_norm': 0.4075831472873688, 'learning_rate': 3.899494021500148e-05, 'epoch': 1.28} 13%|█▎ | 1283/10000 [2:01:39<13:17:46, 5.49s/it][2025-06-19 15:31:23,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:31:23,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2113.78 | bwd_microstep: 3331.15 | bwd_inner_microstep: 3330.25 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.90 [2025-06-19 15:31:23,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.78 | bwd: 3331.16 | bwd_inner: 3330.25 | bwd_allreduce: 0.87 | step: 6.90 13%|█▎ | 1284/10000 [2:01:44<13:17:20, 5.49s/it] {'loss': 0.0803, 'grad_norm': 0.544410765171051, 'learning_rate': 3.899291163778236e-05, 'epoch': 1.28} 13%|█▎ | 1284/10000 [2:01:44<13:17:20, 5.49s/it][2025-06-19 15:31:29,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:31:29,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.08 | bwd_microstep: 3379.94 | bwd_inner_microstep: 3378.84 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.71 [2025-06-19 15:31:29,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.08 | bwd: 3379.95 | bwd_inner: 3378.84 | bwd_allreduce: 1.06 | step: 7.71 13%|█▎ | 1285/10000 [2:01:50<13:19:58, 5.51s/it] {'loss': 0.1342, 'grad_norm': 0.7424235939979553, 'learning_rate': 3.899088106829496e-05, 'epoch': 1.28} 13%|█▎ | 1285/10000 [2:01:50<13:19:58, 5.51s/it][2025-06-19 15:31:34,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.87 [2025-06-19 15:31:34,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.16 | bwd_microstep: 3328.50 | bwd_inner_microstep: 3327.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.16 [2025-06-19 15:31:34,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.16 | bwd: 3328.52 | bwd_inner: 3327.70 | bwd_allreduce: 0.77 | step: 7.17 13%|█▎ | 1286/10000 [2:01:55<13:19:05, 5.50s/it] {'loss': 0.0978, 'grad_norm': 0.7350720167160034, 'learning_rate': 3.8988848506752267e-05, 'epoch': 1.29} 13%|█▎ | 1286/10000 [2:01:55<13:19:05, 5.50s/it][2025-06-19 15:31:40,371] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:31:40,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.31 | bwd_microstep: 3317.90 | bwd_inner_microstep: 3317.07 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.91 [2025-06-19 15:31:40,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.31 | bwd: 3317.92 | bwd_inner: 3317.07 | bwd_allreduce: 0.79 | step: 6.91 13%|█▎ | 1287/10000 [2:02:01<13:17:20, 5.49s/it] {'loss': 0.1128, 'grad_norm': 0.9389936923980713, 'learning_rate': 3.8986813953367505e-05, 'epoch': 1.29} 13%|█▎ | 1287/10000 [2:02:01<13:17:20, 5.49s/it][2025-06-19 15:31:45,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:31:45,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.01 | bwd_microstep: 3379.00 | bwd_inner_microstep: 3378.17 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.34 [2025-06-19 15:31:45,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.01 | bwd: 3379.02 | bwd_inner: 3378.17 | bwd_allreduce: 0.80 | step: 7.34 13%|█▎ | 1288/10000 [2:02:06<13:19:54, 5.51s/it] {'loss': 0.1077, 'grad_norm': 0.5636472105979919, 'learning_rate': 3.898477740835407e-05, 'epoch': 1.29} 13%|█▎ | 1288/10000 [2:02:06<13:19:54, 5.51s/it][2025-06-19 15:31:51,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:31:51,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.01 | bwd_microstep: 3333.96 | bwd_inner_microstep: 3333.11 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.12 [2025-06-19 15:31:51,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.01 | bwd: 3333.98 | bwd_inner: 3333.11 | bwd_allreduce: 0.82 | step: 7.12 13%|█▎ | 1289/10000 
[2:02:12<13:18:48, 5.50s/it] {'loss': 0.0451, 'grad_norm': 0.28125372529029846, 'learning_rate': 3.8982738871925605e-05, 'epoch': 1.29} 13%|█▎ | 1289/10000 [2:02:12<13:18:48, 5.50s/it][2025-06-19 15:31:56,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:31:56,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.81 | bwd_microstep: 3381.43 | bwd_inner_microstep: 3380.63 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-19 15:31:56,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.81 | bwd: 3381.44 | bwd_inner: 3380.63 | bwd_allreduce: 0.77 | step: 6.88 13%|█▎ | 1290/10000 [2:02:17<13:20:56, 5.52s/it] {'loss': 0.233, 'grad_norm': 1.5091174840927124, 'learning_rate': 3.898069834429593e-05, 'epoch': 1.29} 13%|█▎ | 1290/10000 [2:02:17<13:20:56, 5.52s/it][2025-06-19 15:32:02,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:32:02,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.12 | bwd_microstep: 3323.54 | bwd_inner_microstep: 3322.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 15:32:02,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.12 | bwd: 3323.55 | bwd_inner: 3322.73 | bwd_allreduce: 0.78 | step: 7.17 13%|█▎ | 1291/10000 [2:02:23<13:18:52, 5.50s/it] {'loss': 0.1274, 'grad_norm': 0.720598042011261, 'learning_rate': 3.897865582567908e-05, 'epoch': 1.29} 13%|█▎ | 1291/10000 [2:02:23<13:18:52, 5.50s/it][2025-06-19 15:32:07,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 15:32:07,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.16 | bwd_microstep: 3330.62 | bwd_inner_microstep: 3329.43 | bwd_allreduce_microstep: 
1.12 | step_microstep: 7.83 [2025-06-19 15:32:07,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.16 | bwd: 3330.64 | bwd_inner: 3329.43 | bwd_allreduce: 1.15 | step: 7.84 13%|█▎ | 1292/10000 [2:02:28<13:17:48, 5.50s/it] {'loss': 0.18, 'grad_norm': 0.8733639717102051, 'learning_rate': 3.897661131628933e-05, 'epoch': 1.29} 13%|█▎ | 1292/10000 [2:02:28<13:17:48, 5.50s/it][2025-06-19 15:32:13,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:32:13,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.67 | bwd_microstep: 3371.96 | bwd_inner_microstep: 3371.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 15:32:13,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.67 | bwd: 3371.97 | bwd_inner: 3371.15 | bwd_allreduce: 0.78 | step: 6.96 13%|█▎ | 1293/10000 [2:02:34<13:20:06, 5.51s/it] {'loss': 0.1121, 'grad_norm': 0.6696823239326477, 'learning_rate': 3.897456481634113e-05, 'epoch': 1.29} 13%|█▎ | 1293/10000 [2:02:34<13:20:06, 5.51s/it][2025-06-19 15:32:18,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:32:18,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.16 | bwd_microstep: 3323.61 | bwd_inner_microstep: 3322.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-19 15:32:18,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.16 | bwd: 3323.63 | bwd_inner: 3322.81 | bwd_allreduce: 0.78 | step: 6.87 13%|█▎ | 1294/10000 [2:02:39<13:18:04, 5.50s/it] {'loss': 0.0709, 'grad_norm': 0.46143820881843567, 'learning_rate': 3.8972516326049135e-05, 'epoch': 1.29} 13%|█▎ | 1294/10000 [2:02:39<13:18:04, 5.50s/it][2025-06-19 15:32:24,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.51 | optimizer_step: 2.73 [2025-06-19 15:32:24,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.18 | bwd_microstep: 3372.40 | bwd_inner_microstep: 3371.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 15:32:24,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.18 | bwd: 3372.41 | bwd_inner: 3371.60 | bwd_allreduce: 0.76 | step: 6.64 13%|█▎ | 1295/10000 [2:02:45<13:19:58, 5.51s/it] {'loss': 0.0969, 'grad_norm': 0.6183587908744812, 'learning_rate': 3.8970465845628235e-05, 'epoch': 1.29} 13%|█▎ | 1295/10000 [2:02:45<13:19:58, 5.51s/it][2025-06-19 15:32:29,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:32:29,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.30 | bwd_microstep: 3323.75 | bwd_inner_microstep: 3322.91 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.10 [2025-06-19 15:32:29,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.30 | bwd: 3323.77 | bwd_inner: 3322.91 | bwd_allreduce: 0.80 | step: 7.10 13%|█▎ | 1296/10000 [2:02:50<13:18:13, 5.50s/it] {'loss': 0.1146, 'grad_norm': 1.2531260251998901, 'learning_rate': 3.8968413375293506e-05, 'epoch': 1.3} 13%|█▎ | 1296/10000 [2:02:50<13:18:13, 5.50s/it][2025-06-19 15:32:35,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.72 [2025-06-19 15:32:35,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.12 | bwd_microstep: 3398.22 | bwd_inner_microstep: 3397.42 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.04 [2025-06-19 15:32:35,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.12 | bwd: 3398.23 | bwd_inner: 3397.42 | bwd_allreduce: 0.77 | step: 7.05 13%|█▎ | 1297/10000 [2:02:56<13:21:29, 5.53s/it] {'loss': 0.0715, 'grad_norm': 0.406048059463501, 'learning_rate': 
3.896635891526026e-05, 'epoch': 1.3} 13%|█▎ | 1297/10000 [2:02:56<13:21:29, 5.53s/it][2025-06-19 15:32:41,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:32:41,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.72 | bwd_microstep: 3374.53 | bwd_inner_microstep: 3373.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 15:32:41,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.72 | bwd: 3374.55 | bwd_inner: 3373.70 | bwd_allreduce: 0.79 | step: 6.72 13%|█▎ | 1298/10000 [2:03:01<13:22:28, 5.53s/it] {'loss': 0.0752, 'grad_norm': 0.6068682670593262, 'learning_rate': 3.896430246574398e-05, 'epoch': 1.3} 13%|█▎ | 1298/10000 [2:03:01<13:22:28, 5.53s/it][2025-06-19 15:32:46,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:32:46,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.57 | bwd_microstep: 3378.57 | bwd_inner_microstep: 3377.71 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.68 [2025-06-19 15:32:46,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.57 | bwd: 3378.58 | bwd_inner: 3377.71 | bwd_allreduce: 0.83 | step: 6.68 13%|█▎ | 1299/10000 [2:03:07<13:22:47, 5.54s/it] {'loss': 0.1358, 'grad_norm': 1.0822025537490845, 'learning_rate': 3.89622440269604e-05, 'epoch': 1.3} 13%|█▎ | 1299/10000 [2:03:07<13:22:47, 5.54s/it][2025-06-19 15:32:52,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:32:52,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.53 | bwd_microstep: 3324.23 | bwd_inner_microstep: 3323.42 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.15 [2025-06-19 15:32:52,107] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2112.53 | bwd: 3324.25 | bwd_inner: 3323.42 | bwd_allreduce: 0.78 | step: 7.15 13%|█▎ | 1300/10000 [2:03:12<13:20:01, 5.52s/it] {'loss': 0.134, 'grad_norm': 0.6417814493179321, 'learning_rate': 3.896018359912541e-05, 'epoch': 1.3} 13%|█▎ | 1300/10000 [2:03:12<13:20:01, 5.52s/it][2025-06-19 15:32:57,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:32:57,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.25 | bwd_microstep: 3396.85 | bwd_inner_microstep: 3395.94 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.36 [2025-06-19 15:32:57,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.25 | bwd: 3396.86 | bwd_inner: 3395.94 | bwd_allreduce: 0.88 | step: 7.37 13%|█▎ | 1301/10000 [2:03:18<13:22:29, 5.54s/it] {'loss': 0.0815, 'grad_norm': 0.39158985018730164, 'learning_rate': 3.895812118245517e-05, 'epoch': 1.3} 13%|█▎ | 1301/10000 [2:03:18<13:22:29, 5.54s/it][2025-06-19 15:33:03,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:33:03,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.14 | bwd_microstep: 3324.90 | bwd_inner_microstep: 3324.03 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.10 [2025-06-19 15:33:03,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.14 | bwd: 3324.92 | bwd_inner: 3324.03 | bwd_allreduce: 0.84 | step: 7.11 13%|█▎ | 1302/10000 [2:03:23<13:19:48, 5.52s/it] {'loss': 0.1055, 'grad_norm': 0.9681405425071716, 'learning_rate': 3.8956056777166e-05, 'epoch': 1.3} 13%|█▎ | 1302/10000 [2:03:23<13:19:48, 5.52s/it][2025-06-19 15:33:08,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:33:08,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2133.15 | bwd_microstep: 3378.56 | bwd_inner_microstep: 3377.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 15:33:08,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.15 | bwd: 3378.57 | bwd_inner: 3377.76 | bwd_allreduce: 0.77 | step: 6.85 13%|█▎ | 1303/10000 [2:03:29<13:21:20, 5.53s/it] {'loss': 0.1295, 'grad_norm': 0.9063793420791626, 'learning_rate': 3.895399038347446e-05, 'epoch': 1.3} 13%|█▎ | 1303/10000 [2:03:29<13:21:20, 5.53s/it][2025-06-19 15:33:14,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:33:14,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.02 | bwd_microstep: 3321.03 | bwd_inner_microstep: 3320.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 15:33:14,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.02 | bwd: 3321.04 | bwd_inner: 3320.24 | bwd_allreduce: 0.76 | step: 6.66 13%|█▎ | 1304/10000 [2:03:34<13:18:24, 5.51s/it] {'loss': 0.1123, 'grad_norm': 1.0430949926376343, 'learning_rate': 3.895192200159729e-05, 'epoch': 1.3} 13%|█▎ | 1304/10000 [2:03:34<13:18:24, 5.51s/it][2025-06-19 15:33:19,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:33:19,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.86 | bwd_microstep: 3367.54 | bwd_inner_microstep: 3366.74 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.69 [2025-06-19 15:33:19,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.86 | bwd: 3367.56 | bwd_inner: 3366.74 | bwd_allreduce: 0.77 | step: 6.69 13%|█▎ | 1305/10000 [2:03:40<13:19:27, 5.52s/it] {'loss': 0.0708, 'grad_norm': 0.5854722857475281, 'learning_rate': 3.894985163175146e-05, 'epoch': 1.3} 13%|█▎ | 1305/10000 [2:03:40<13:19:27, 5.52s/it][2025-06-19 15:33:25,173] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:33:25,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.24 | bwd_microstep: 3319.74 | bwd_inner_microstep: 3318.64 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.11 [2025-06-19 15:33:25,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.24 | bwd: 3319.76 | bwd_inner: 3318.64 | bwd_allreduce: 1.06 | step: 7.11 13%|█▎ | 1306/10000 [2:03:45<13:16:57, 5.50s/it] {'loss': 0.0672, 'grad_norm': 0.5503528714179993, 'learning_rate': 3.894777927415414e-05, 'epoch': 1.31} 13%|█▎ | 1306/10000 [2:03:45<13:16:57, 5.50s/it][2025-06-19 15:33:30,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:33:30,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.11 | bwd_microstep: 3323.64 | bwd_inner_microstep: 3322.50 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.02 [2025-06-19 15:33:30,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.11 | bwd: 3323.66 | bwd_inner: 3322.50 | bwd_allreduce: 1.11 | step: 7.02 13%|█▎ | 1307/10000 [2:03:51<13:15:31, 5.49s/it] {'loss': 0.1261, 'grad_norm': 0.8997558951377869, 'learning_rate': 3.894570492902272e-05, 'epoch': 1.31} 13%|█▎ | 1307/10000 [2:03:51<13:15:31, 5.49s/it][2025-06-19 15:33:36,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:33:36,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.23 | bwd_microstep: 3372.82 | bwd_inner_microstep: 3371.84 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.55 [2025-06-19 15:33:36,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.23 | bwd: 3372.83 | bwd_inner: 3371.84 | bwd_allreduce: 0.95 | step: 7.55 13%|█▎ | 
1308/10000 [2:03:56<13:17:53, 5.51s/it] {'loss': 0.0853, 'grad_norm': 0.7599566578865051, 'learning_rate': 3.894362859657478e-05, 'epoch': 1.31} 13%|█▎ | 1308/10000 [2:03:56<13:17:53, 5.51s/it][2025-06-19 15:33:41,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:33:41,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.65 | bwd_microstep: 3373.38 | bwd_inner_microstep: 3372.52 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.73 [2025-06-19 15:33:41,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.65 | bwd: 3373.40 | bwd_inner: 3372.52 | bwd_allreduce: 0.83 | step: 6.73 13%|█▎ | 1309/10000 [2:04:02<13:19:28, 5.52s/it] {'loss': 0.057, 'grad_norm': 0.36182069778442383, 'learning_rate': 3.8941550277028126e-05, 'epoch': 1.31} 13%|█▎ | 1309/10000 [2:04:02<13:19:28, 5.52s/it][2025-06-19 15:33:47,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:33:47,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.97 | bwd_microstep: 3327.69 | bwd_inner_microstep: 3326.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 15:33:47,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.97 | bwd: 3327.71 | bwd_inner: 3326.90 | bwd_allreduce: 0.76 | step: 6.65 13%|█▎ | 1310/10000 [2:04:08<13:17:17, 5.50s/it] {'loss': 0.0947, 'grad_norm': 0.5988538265228271, 'learning_rate': 3.893946997060075e-05, 'epoch': 1.31} 13%|█▎ | 1310/10000 [2:04:08<13:17:17, 5.50s/it][2025-06-19 15:33:52,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:33:52,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.82 | bwd_microstep: 3321.00 | bwd_inner_microstep: 3320.17 | 
bwd_allreduce_microstep: 0.79 | step_microstep: 7.24 [2025-06-19 15:33:52,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.82 | bwd: 3321.02 | bwd_inner: 3320.17 | bwd_allreduce: 0.80 | step: 7.24 13%|█▎ | 1311/10000 [2:04:13<13:15:40, 5.49s/it] {'loss': 0.1057, 'grad_norm': 0.8218901753425598, 'learning_rate': 3.893738767751088e-05, 'epoch': 1.31} 13%|█▎ | 1311/10000 [2:04:13<13:15:40, 5.49s/it][2025-06-19 15:33:58,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:33:58,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.00 | bwd_microstep: 3325.06 | bwd_inner_microstep: 3324.09 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.62 [2025-06-19 15:33:58,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.00 | bwd: 3325.07 | bwd_inner: 3324.09 | bwd_allreduce: 0.94 | step: 6.62 13%|█▎ | 1312/10000 [2:04:18<13:14:28, 5.49s/it] {'loss': 0.0749, 'grad_norm': 0.9363371133804321, 'learning_rate': 3.893530339797693e-05, 'epoch': 1.31} 13%|█▎ | 1312/10000 [2:04:18<13:14:28, 5.49s/it][2025-06-19 15:34:03,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:34:03,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.59 | bwd_microstep: 3325.24 | bwd_inner_microstep: 3324.38 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.94 [2025-06-19 15:34:03,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.59 | bwd: 3325.26 | bwd_inner: 3324.38 | bwd_allreduce: 0.82 | step: 6.95 13%|█▎ | 1313/10000 [2:04:24<13:13:36, 5.48s/it] {'loss': 0.1132, 'grad_norm': 0.8994297981262207, 'learning_rate': 3.8933217132217535e-05, 'epoch': 1.31} 13%|█▎ | 1313/10000 [2:04:24<13:13:36, 5.48s/it][2025-06-19 15:34:09,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.51 | optimizer_step: 2.99 [2025-06-19 15:34:09,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.64 | bwd_microstep: 3327.34 | bwd_inner_microstep: 3326.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.59 [2025-06-19 15:34:09,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.64 | bwd: 3327.36 | bwd_inner: 3326.53 | bwd_allreduce: 0.78 | step: 7.59 13%|█▎ | 1314/10000 [2:04:29<13:13:32, 5.48s/it] {'loss': 0.0696, 'grad_norm': 0.47741392254829407, 'learning_rate': 3.8931128880451535e-05, 'epoch': 1.31} 13%|█▎ | 1314/10000 [2:04:29<13:13:32, 5.48s/it][2025-06-19 15:34:14,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.86 [2025-06-19 15:34:14,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.56 | bwd_microstep: 3321.81 | bwd_inner_microstep: 3321.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-19 15:34:14,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.56 | bwd: 3321.82 | bwd_inner: 3321.02 | bwd_allreduce: 0.76 | step: 6.86 13%|█▎ | 1315/10000 [2:04:35<13:12:45, 5.48s/it] {'loss': 0.0979, 'grad_norm': 0.5073701739311218, 'learning_rate': 3.8929038642897977e-05, 'epoch': 1.31} 13%|█▎ | 1315/10000 [2:04:35<13:12:45, 5.48s/it][2025-06-19 15:34:20,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:34:20,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.03 | bwd_microstep: 3405.52 | bwd_inner_microstep: 3404.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 15:34:20,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.03 | bwd: 3405.54 | bwd_inner: 3404.74 | bwd_allreduce: 0.76 | step: 6.65 13%|█▎ | 1316/10000 [2:04:40<13:17:13, 5.51s/it] {'loss': 0.0904, 'grad_norm': 
0.6758162975311279, 'learning_rate': 3.8926946419776126e-05, 'epoch': 1.32} 13%|█▎ | 1316/10000 [2:04:40<13:17:13, 5.51s/it][2025-06-19 15:34:25,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:34:25,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.65 | bwd_microstep: 3370.13 | bwd_inner_microstep: 3369.15 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.64 [2025-06-19 15:34:25,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.65 | bwd: 3370.14 | bwd_inner: 3369.15 | bwd_allreduce: 0.94 | step: 7.64 13%|█▎ | 1317/10000 [2:04:46<13:18:40, 5.52s/it] {'loss': 0.1178, 'grad_norm': 0.8420671820640564, 'learning_rate': 3.8924852211305426e-05, 'epoch': 1.32} 13%|█▎ | 1317/10000 [2:04:46<13:18:40, 5.52s/it][2025-06-19 15:34:31,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:34:31,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.55 | bwd_microstep: 3318.50 | bwd_inner_microstep: 3317.57 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.09 [2025-06-19 15:34:31,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.55 | bwd: 3318.51 | bwd_inner: 3317.57 | bwd_allreduce: 0.89 | step: 7.09 13%|█▎ | 1318/10000 [2:04:51<13:16:08, 5.50s/it] {'loss': 0.1179, 'grad_norm': 0.7699427604675293, 'learning_rate': 3.892275601770557e-05, 'epoch': 1.32} 13%|█▎ | 1318/10000 [2:04:51<13:16:08, 5.50s/it][2025-06-19 15:34:36,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:34:36,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.89 | bwd_microstep: 3367.70 | bwd_inner_microstep: 3366.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.39 [2025-06-19 15:34:36,689] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.89 | bwd: 3367.72 | bwd_inner: 3366.90 | bwd_allreduce: 0.78 | step: 7.39 13%|█▎ | 1319/10000 [2:04:57<13:17:39, 5.51s/it] {'loss': 0.1017, 'grad_norm': 0.737630307674408, 'learning_rate': 3.892065783919643e-05, 'epoch': 1.32} 13%|█▎ | 1319/10000 [2:04:57<13:17:39, 5.51s/it][2025-06-19 15:34:42,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:34:42,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.62 | bwd_microstep: 3322.33 | bwd_inner_microstep: 3321.40 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.99 [2025-06-19 15:34:42,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.62 | bwd: 3322.34 | bwd_inner: 3321.40 | bwd_allreduce: 0.90 | step: 6.99 13%|█▎ | 1320/10000 [2:05:02<13:15:42, 5.50s/it] {'loss': 0.0745, 'grad_norm': 0.47654619812965393, 'learning_rate': 3.8918557675998096e-05, 'epoch': 1.32} 13%|█▎ | 1320/10000 [2:05:02<13:15:42, 5.50s/it][2025-06-19 15:34:47,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:34:47,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.56 | bwd_microstep: 3322.11 | bwd_inner_microstep: 3321.32 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 15:34:47,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.56 | bwd: 3322.13 | bwd_inner: 3321.32 | bwd_allreduce: 0.76 | step: 6.65 13%|█▎ | 1321/10000 [2:05:08<13:14:02, 5.49s/it] {'loss': 0.0524, 'grad_norm': 0.5114338994026184, 'learning_rate': 3.891645552833086e-05, 'epoch': 1.32} 13%|█▎ | 1321/10000 [2:05:08<13:14:02, 5.49s/it][2025-06-19 15:34:53,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:34:53,149] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.18 | bwd_microstep: 3363.34 | bwd_inner_microstep: 3362.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 15:34:53,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.18 | bwd: 3363.35 | bwd_inner: 3362.53 | bwd_allreduce: 0.77 | step: 7.01 13%|█▎ | 1322/10000 [2:05:13<13:15:30, 5.50s/it] {'loss': 0.1141, 'grad_norm': 1.085618019104004, 'learning_rate': 3.891435139641524e-05, 'epoch': 1.32} 13%|█▎ | 1322/10000 [2:05:13<13:15:30, 5.50s/it][2025-06-19 15:34:58,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 15:34:58,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.15 | bwd_microstep: 3317.07 | bwd_inner_microstep: 3316.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 15:34:58,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.15 | bwd: 3317.09 | bwd_inner: 3316.29 | bwd_allreduce: 0.75 | step: 6.53 13%|█▎ | 1323/10000 [2:05:19<13:13:30, 5.49s/it] {'loss': 0.0952, 'grad_norm': 0.5695464611053467, 'learning_rate': 3.891224528047194e-05, 'epoch': 1.32} 13%|█▎ | 1323/10000 [2:05:19<13:13:30, 5.49s/it][2025-06-19 15:35:04,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:35:04,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.10 | bwd_microstep: 3364.05 | bwd_inner_microstep: 3363.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 15:35:04,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.10 | bwd: 3364.06 | bwd_inner: 3363.26 | bwd_allreduce: 0.77 | step: 6.79 13%|█▎ | 1324/10000 [2:05:24<13:15:04, 5.50s/it] {'loss': 0.0847, 'grad_norm': 0.8475387692451477, 'learning_rate': 3.8910137180721886e-05, 'epoch': 1.32} 13%|█▎ | 1324/10000 
[2:05:24<13:15:04, 5.50s/it][2025-06-19 15:35:09,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:35:09,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.23 | bwd_microstep: 3311.80 | bwd_inner_microstep: 3311.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 15:35:09,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.23 | bwd: 3311.82 | bwd_inner: 3311.02 | bwd_allreduce: 0.76 | step: 6.54 13%|█▎ | 1325/10000 [2:05:30<13:13:45, 5.49s/it] {'loss': 0.1206, 'grad_norm': 0.726182222366333, 'learning_rate': 3.8908027097386205e-05, 'epoch': 1.32} 13%|█▎ | 1325/10000 [2:05:30<13:13:45, 5.49s/it][2025-06-19 15:35:15,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:35:15,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.75 | bwd_microstep: 3375.55 | bwd_inner_microstep: 3374.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 15:35:15,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.75 | bwd: 3375.56 | bwd_inner: 3374.75 | bwd_allreduce: 0.76 | step: 7.05 13%|█▎ | 1326/10000 [2:05:35<13:15:53, 5.51s/it] {'loss': 0.1332, 'grad_norm': 1.33999764919281, 'learning_rate': 3.890591503068624e-05, 'epoch': 1.33} 13%|█▎ | 1326/10000 [2:05:35<13:15:53, 5.51s/it][2025-06-19 15:35:20,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:35:20,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.61 | bwd_microstep: 3311.47 | bwd_inner_microstep: 3310.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 15:35:20,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.61 | bwd: 3311.48 | bwd_inner: 3310.67 
| bwd_allreduce: 0.77 | step: 6.68 13%|█▎ | 1327/10000 [2:05:41<13:13:27, 5.49s/it] {'loss': 0.1121, 'grad_norm': 0.879493772983551, 'learning_rate': 3.890380098084353e-05, 'epoch': 1.33} 13%|█▎ | 1327/10000 [2:05:41<13:13:27, 5.49s/it][2025-06-19 15:35:26,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:35:26,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.53 | bwd_microstep: 3323.90 | bwd_inner_microstep: 3323.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.57 [2025-06-19 15:35:26,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.53 | bwd: 3323.92 | bwd_inner: 3323.11 | bwd_allreduce: 0.76 | step: 6.57 13%|█▎ | 1328/10000 [2:05:46<13:12:28, 5.48s/it] {'loss': 0.1072, 'grad_norm': 0.6529446840286255, 'learning_rate': 3.890168494807983e-05, 'epoch': 1.33} 13%|█▎ | 1328/10000 [2:05:46<13:12:28, 5.48s/it][2025-06-19 15:35:31,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:35:31,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.31 | bwd_microstep: 3318.97 | bwd_inner_microstep: 3318.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 15:35:31,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.31 | bwd: 3318.98 | bwd_inner: 3318.18 | bwd_allreduce: 0.76 | step: 6.62 13%|█▎ | 1329/10000 [2:05:52<13:11:22, 5.48s/it] {'loss': 0.0479, 'grad_norm': 0.3307812213897705, 'learning_rate': 3.8899566932617105e-05, 'epoch': 1.33} 13%|█▎ | 1329/10000 [2:05:52<13:11:22, 5.48s/it][2025-06-19 15:35:37,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:35:37,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.41 | bwd_microstep: 3369.82 | 
bwd_inner_microstep: 3369.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 15:35:37,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.41 | bwd: 3369.83 | bwd_inner: 3369.03 | bwd_allreduce: 0.76 | step: 6.60 13%|█▎ | 1330/10000 [2:05:57<13:13:30, 5.49s/it] {'loss': 0.1071, 'grad_norm': 0.48522478342056274, 'learning_rate': 3.889744693467753e-05, 'epoch': 1.33} 13%|█▎ | 1330/10000 [2:05:57<13:13:30, 5.49s/it][2025-06-19 15:35:42,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:35:42,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.98 | bwd_microstep: 3321.23 | bwd_inner_microstep: 3320.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 15:35:42,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.98 | bwd: 3321.25 | bwd_inner: 3320.44 | bwd_allreduce: 0.76 | step: 6.64 13%|█▎ | 1331/10000 [2:06:03<13:11:54, 5.48s/it] {'loss': 0.106, 'grad_norm': 0.4630322754383087, 'learning_rate': 3.8895324954483473e-05, 'epoch': 1.33} 13%|█▎ | 1331/10000 [2:06:03<13:11:54, 5.48s/it][2025-06-19 15:35:48,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:35:48,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.60 | bwd_microstep: 3369.50 | bwd_inner_microstep: 3368.54 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.47 [2025-06-19 15:35:48,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.60 | bwd: 3369.52 | bwd_inner: 3368.54 | bwd_allreduce: 0.93 | step: 7.48 13%|█▎ | 1332/10000 [2:06:08<13:14:07, 5.50s/it] {'loss': 0.1658, 'grad_norm': 0.642869770526886, 'learning_rate': 3.889320099225752e-05, 'epoch': 1.33} 13%|█▎ | 1332/10000 [2:06:08<13:14:07, 5.50s/it][2025-06-19 15:35:53,506] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:35:53,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.27 | bwd_microstep: 3316.89 | bwd_inner_microstep: 3316.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 15:35:53,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.27 | bwd: 3316.90 | bwd_inner: 3316.11 | bwd_allreduce: 0.75 | step: 6.63 13%|█▎ | 1333/10000 [2:06:14<13:12:42, 5.49s/it] {'loss': 0.1173, 'grad_norm': 1.001333236694336, 'learning_rate': 3.8891075048222476e-05, 'epoch': 1.33} 13%|█▎ | 1333/10000 [2:06:14<13:12:42, 5.49s/it][2025-06-19 15:35:59,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:35:59,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.23 | bwd_microstep: 3358.19 | bwd_inner_microstep: 3357.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 15:35:59,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.23 | bwd: 3358.21 | bwd_inner: 3357.38 | bwd_allreduce: 0.78 | step: 6.95 13%|█▎ | 1334/10000 [2:06:19<13:14:16, 5.50s/it] {'loss': 0.036, 'grad_norm': 0.28424790501594543, 'learning_rate': 3.888894712260133e-05, 'epoch': 1.33} 13%|█▎ | 1334/10000 [2:06:19<13:14:16, 5.50s/it][2025-06-19 15:36:04,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:36:04,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.20 | bwd_microstep: 3312.95 | bwd_inner_microstep: 3312.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 15:36:04,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.20 | bwd: 3312.97 | bwd_inner: 3312.17 | bwd_allreduce: 0.75 | step: 6.69 13%|█▎ | 1335/10000 [2:06:25<13:11:56, 5.48s/it] {'loss': 
0.1514, 'grad_norm': 1.2174854278564453, 'learning_rate': 3.88868172156173e-05, 'epoch': 1.33} 13%|█▎ | 1335/10000 [2:06:25<13:11:56, 5.48s/it][2025-06-19 15:36:09,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:36:09,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.44 | bwd_microstep: 3319.94 | bwd_inner_microstep: 3319.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 15:36:09,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.44 | bwd: 3319.95 | bwd_inner: 3319.14 | bwd_allreduce: 0.76 | step: 6.69 13%|█▎ | 1336/10000 [2:06:30<13:10:49, 5.48s/it] {'loss': 0.1213, 'grad_norm': 0.7968137264251709, 'learning_rate': 3.8884685327493806e-05, 'epoch': 1.34} 13%|█▎ | 1336/10000 [2:06:30<13:10:49, 5.48s/it][2025-06-19 15:36:15,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.92 [2025-06-19 15:36:15,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.83 | bwd_microstep: 3317.96 | bwd_inner_microstep: 3316.99 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.30 [2025-06-19 15:36:15,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.83 | bwd: 3317.97 | bwd_inner: 3316.99 | bwd_allreduce: 0.94 | step: 7.31 13%|█▎ | 1337/10000 [2:06:36<13:10:13, 5.47s/it] {'loss': 0.1672, 'grad_norm': 0.8188167214393616, 'learning_rate': 3.888255145845446e-05, 'epoch': 1.34} 13%|█▎ | 1337/10000 [2:06:36<13:10:13, 5.47s/it][2025-06-19 15:36:20,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:36:20,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.41 | bwd_microstep: 3320.34 | bwd_inner_microstep: 3319.53 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.83 
[2025-06-19 15:36:20,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.41 | bwd: 3320.36 | bwd_inner: 3319.53 | bwd_allreduce: 0.78 | step: 6.83 13%|█▎ | 1338/10000 [2:06:41<13:09:57, 5.47s/it] {'loss': 0.0767, 'grad_norm': 0.613538920879364, 'learning_rate': 3.88804156087231e-05, 'epoch': 1.34} 13%|█▎ | 1338/10000 [2:06:41<13:09:57, 5.47s/it][2025-06-19 15:36:26,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 15:36:26,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.39 | bwd_microstep: 3312.38 | bwd_inner_microstep: 3311.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 15:36:26,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.39 | bwd: 3312.40 | bwd_inner: 3311.57 | bwd_allreduce: 0.78 | step: 7.24 13%|█▎ | 1339/10000 [2:06:47<13:09:20, 5.47s/it] {'loss': 0.1494, 'grad_norm': 1.3554195165634155, 'learning_rate': 3.887827777852378e-05, 'epoch': 1.34} 13%|█▎ | 1339/10000 [2:06:47<13:09:20, 5.47s/it][2025-06-19 15:36:31,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 15:36:31,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.57 | bwd_microstep: 3315.36 | bwd_inner_microstep: 3314.41 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.21 [2025-06-19 15:36:31,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.57 | bwd: 3315.38 | bwd_inner: 3314.41 | bwd_allreduce: 0.92 | step: 7.21 13%|█▎ | 1340/10000 [2:06:52<13:09:13, 5.47s/it] {'loss': 0.0655, 'grad_norm': 0.7118255496025085, 'learning_rate': 3.887613796808073e-05, 'epoch': 1.34} 13%|█▎ | 1340/10000 [2:06:52<13:09:13, 5.47s/it][2025-06-19 15:36:37,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.72 
[2025-06-19 15:36:37,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.63 | bwd_microstep: 3320.45 | bwd_inner_microstep: 3319.19 | bwd_allreduce_microstep: 1.17 | step_microstep: 8.66 [2025-06-19 15:36:37,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.63 | bwd: 3320.48 | bwd_inner: 3319.19 | bwd_allreduce: 1.20 | step: 8.67 13%|█▎ | 1341/10000 [2:06:58<13:10:04, 5.47s/it] {'loss': 0.0845, 'grad_norm': 0.6818469166755676, 'learning_rate': 3.887399617761842e-05, 'epoch': 1.34} 13%|█▎ | 1341/10000 [2:06:58<13:10:04, 5.47s/it][2025-06-19 15:36:42,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:36:42,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2164.37 | bwd_microstep: 3372.13 | bwd_inner_microstep: 3371.27 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.67 [2025-06-19 15:36:42,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2164.37 | bwd: 3372.15 | bwd_inner: 3371.27 | bwd_allreduce: 0.82 | step: 7.68 13%|█▎ | 1342/10000 [2:07:03<13:14:49, 5.51s/it] {'loss': 0.0716, 'grad_norm': 0.5079470276832581, 'learning_rate': 3.88718524073615e-05, 'epoch': 1.34} 13%|█▎ | 1342/10000 [2:07:03<13:14:49, 5.51s/it][2025-06-19 15:36:48,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 15:36:48,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2165.79 | bwd_microstep: 3323.61 | bwd_inner_microstep: 3322.53 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.57 [2025-06-19 15:36:48,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2165.79 | bwd: 3323.64 | bwd_inner: 3322.53 | bwd_allreduce: 1.04 | step: 7.57 13%|█▎ | 1343/10000 [2:07:09<13:15:53, 5.52s/it] {'loss': 0.0531, 'grad_norm': 0.34119734168052673, 'learning_rate': 3.886970665753485e-05, 'epoch': 
1.34} 13%|█▎ | 1343/10000 [2:07:09<13:15:53, 5.52s/it][2025-06-19 15:36:53,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:36:53,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.89 | bwd_microstep: 3317.29 | bwd_inner_microstep: 3316.47 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.84 [2025-06-19 15:36:53,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.89 | bwd: 3317.31 | bwd_inner: 3316.47 | bwd_allreduce: 0.79 | step: 6.84 13%|█▎ | 1344/10000 [2:07:14<13:14:18, 5.51s/it] {'loss': 0.109, 'grad_norm': 0.741229236125946, 'learning_rate': 3.886755892836356e-05, 'epoch': 1.34} 13%|█▎ | 1344/10000 [2:07:14<13:14:18, 5.51s/it][2025-06-19 15:36:59,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 15:36:59,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.57 | bwd_microstep: 3318.37 | bwd_inner_microstep: 3317.26 | bwd_allreduce_microstep: 1.03 | step_microstep: 8.56 [2025-06-19 15:36:59,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.57 | bwd: 3318.40 | bwd_inner: 3317.26 | bwd_allreduce: 1.06 | step: 8.56 13%|█▎ | 1345/10000 [2:07:20<13:13:13, 5.50s/it] {'loss': 0.108, 'grad_norm': 0.6596812009811401, 'learning_rate': 3.8865409220072895e-05, 'epoch': 1.34} 13%|█▎ | 1345/10000 [2:07:20<13:13:13, 5.50s/it][2025-06-19 15:37:04,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:37:04,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.35 | bwd_microstep: 3307.68 | bwd_inner_microstep: 3306.83 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.08 [2025-06-19 15:37:04,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.35 | bwd: 
3307.70 | bwd_inner: 3306.83 | bwd_allreduce: 0.83 | step: 7.09 13%|█▎ | 1346/10000 [2:07:25<13:11:55, 5.49s/it] {'loss': 0.0559, 'grad_norm': 0.42764711380004883, 'learning_rate': 3.886325753288836e-05, 'epoch': 1.35} 13%|█▎ | 1346/10000 [2:07:25<13:11:55, 5.49s/it][2025-06-19 15:37:10,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:37:10,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.87 | bwd_microstep: 3323.59 | bwd_inner_microstep: 3322.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 15:37:10,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.87 | bwd: 3323.60 | bwd_inner: 3322.79 | bwd_allreduce: 0.77 | step: 6.76 13%|█▎ | 1347/10000 [2:07:31<13:11:01, 5.48s/it] {'loss': 0.1076, 'grad_norm': 0.7926489114761353, 'learning_rate': 3.8861103867035665e-05, 'epoch': 1.35} 13%|█▎ | 1347/10000 [2:07:31<13:11:01, 5.48s/it][2025-06-19 15:37:15,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.99 [2025-06-19 15:37:15,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.43 | bwd_microstep: 3328.29 | bwd_inner_microstep: 3327.42 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.91 [2025-06-19 15:37:15,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.43 | bwd: 3328.32 | bwd_inner: 3327.42 | bwd_allreduce: 0.84 | step: 7.90 13%|█▎ | 1348/10000 [2:07:36<13:10:43, 5.48s/it] {'loss': 0.108, 'grad_norm': 0.9781904816627502, 'learning_rate': 3.88589482227407e-05, 'epoch': 1.35} 13%|█▎ | 1348/10000 [2:07:36<13:10:43, 5.48s/it][2025-06-19 15:37:21,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 15:37:21,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.31 
| bwd_microstep: 3323.84 | bwd_inner_microstep: 3322.86 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.78 [2025-06-19 15:37:21,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.31 | bwd: 3323.87 | bwd_inner: 3322.86 | bwd_allreduce: 0.94 | step: 7.78 13%|█▎ | 1349/10000 [2:07:42<13:11:12, 5.49s/it] {'loss': 0.0996, 'grad_norm': 0.6405303478240967, 'learning_rate': 3.8856790600229606e-05, 'epoch': 1.35} 13%|█▎ | 1349/10000 [2:07:42<13:11:12, 5.49s/it][2025-06-19 15:37:26,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.78 [2025-06-19 15:37:26,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.97 | bwd_microstep: 3327.87 | bwd_inner_microstep: 3326.98 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.03 [2025-06-19 15:37:26,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.97 | bwd: 3327.88 | bwd_inner: 3326.98 | bwd_allreduce: 0.86 | step: 7.05 14%|█▎ | 1350/10000 [2:07:47<13:12:13, 5.50s/it] {'loss': 0.0797, 'grad_norm': 0.8276547193527222, 'learning_rate': 3.885463099972869e-05, 'epoch': 1.35} 14%|█▎ | 1350/10000 [2:07:47<13:12:13, 5.50s/it][2025-06-19 15:37:32,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.72 [2025-06-19 15:37:32,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.06 | bwd_microstep: 3368.93 | bwd_inner_microstep: 3368.04 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.26 [2025-06-19 15:37:32,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.06 | bwd: 3368.96 | bwd_inner: 3368.04 | bwd_allreduce: 0.84 | step: 7.27 14%|█▎ | 1351/10000 [2:07:53<13:13:53, 5.51s/it] {'loss': 0.1254, 'grad_norm': 1.153240442276001, 'learning_rate': 3.885246942146449e-05, 'epoch': 1.35} 14%|█▎ | 1351/10000 [2:07:53<13:13:53, 5.51s/it][2025-06-19 15:37:37,915] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:37:37,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2160.67 | bwd_microstep: 3370.40 | bwd_inner_microstep: 3369.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 15:37:37,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2160.67 | bwd: 3370.42 | bwd_inner: 3369.61 | bwd_allreduce: 0.76 | step: 6.68 14%|█▎ | 1352/10000 [2:07:58<13:16:28, 5.53s/it] {'loss': 0.1611, 'grad_norm': 1.347041368484497, 'learning_rate': 3.885030586566374e-05, 'epoch': 1.35} 14%|█▎ | 1352/10000 [2:07:58<13:16:28, 5.53s/it][2025-06-19 15:37:43,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:37:43,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.74 | bwd_microstep: 3320.06 | bwd_inner_microstep: 3319.18 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.89 [2025-06-19 15:37:43,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.74 | bwd: 3320.09 | bwd_inner: 3319.18 | bwd_allreduce: 0.84 | step: 7.90 14%|█▎ | 1353/10000 [2:08:04<13:14:28, 5.51s/it] {'loss': 0.0625, 'grad_norm': 0.4247153103351593, 'learning_rate': 3.884814033255339e-05, 'epoch': 1.35} 14%|█▎ | 1353/10000 [2:08:04<13:14:28, 5.51s/it][2025-06-19 15:37:48,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:37:48,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.83 | bwd_microstep: 3324.14 | bwd_inner_microstep: 3323.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.49 [2025-06-19 15:37:48,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.83 | bwd: 3324.15 | bwd_inner: 3323.34 | bwd_allreduce: 0.76 | step: 6.49 14%|█▎ | 1354/10000 
[2:08:09<13:14:23, 5.51s/it] {'loss': 0.1444, 'grad_norm': 0.7827091217041016, 'learning_rate': 3.88459728223606e-05, 'epoch': 1.35} 14%|█▎ | 1354/10000 [2:08:09<13:14:23, 5.51s/it][2025-06-19 15:37:54,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:37:54,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.41 | bwd_microstep: 3316.71 | bwd_inner_microstep: 3315.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 15:37:54,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.41 | bwd: 3316.72 | bwd_inner: 3315.93 | bwd_allreduce: 0.75 | step: 6.54 14%|█▎ | 1355/10000 [2:08:15<13:11:52, 5.50s/it] {'loss': 0.122, 'grad_norm': 0.7107463479042053, 'learning_rate': 3.884380333531273e-05, 'epoch': 1.35} 14%|█▎ | 1355/10000 [2:08:15<13:11:52, 5.50s/it][2025-06-19 15:37:59,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:37:59,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.97 | bwd_microstep: 3365.35 | bwd_inner_microstep: 3364.54 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.69 [2025-06-19 15:37:59,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.97 | bwd: 3365.37 | bwd_inner: 3364.54 | bwd_allreduce: 0.79 | step: 6.69 14%|█▎ | 1356/10000 [2:08:20<13:13:04, 5.50s/it] {'loss': 0.0441, 'grad_norm': 0.2318398505449295, 'learning_rate': 3.8841631871637346e-05, 'epoch': 1.36} 14%|█▎ | 1356/10000 [2:08:20<13:13:04, 5.50s/it][2025-06-19 15:38:05,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:38:05,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.94 | bwd_microstep: 3318.75 | bwd_inner_microstep: 3317.94 | bwd_allreduce_microstep: 
0.76 | step_microstep: 6.71 [2025-06-19 15:38:05,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.94 | bwd: 3318.76 | bwd_inner: 3317.94 | bwd_allreduce: 0.78 | step: 6.71 14%|█▎ | 1357/10000 [2:08:26<13:10:58, 5.49s/it] {'loss': 0.1692, 'grad_norm': 1.1284027099609375, 'learning_rate': 3.883945843156222e-05, 'epoch': 1.36} 14%|█▎ | 1357/10000 [2:08:26<13:10:58, 5.49s/it][2025-06-19 15:38:10,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:38:10,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.70 | bwd_microstep: 3319.28 | bwd_inner_microstep: 3318.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-19 15:38:10,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.70 | bwd: 3319.29 | bwd_inner: 3318.48 | bwd_allreduce: 0.77 | step: 6.92 14%|█▎ | 1358/10000 [2:08:31<13:09:57, 5.48s/it] {'loss': 0.0962, 'grad_norm': 0.6298588514328003, 'learning_rate': 3.883728301531535e-05, 'epoch': 1.36} 14%|█▎ | 1358/10000 [2:08:31<13:09:57, 5.48s/it][2025-06-19 15:38:16,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:38:16,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.83 | bwd_microstep: 3378.51 | bwd_inner_microstep: 3377.32 | bwd_allreduce_microstep: 1.13 | step_microstep: 7.30 [2025-06-19 15:38:16,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.83 | bwd: 3378.54 | bwd_inner: 3377.32 | bwd_allreduce: 1.16 | step: 7.30 14%|█▎ | 1359/10000 [2:08:37<13:12:43, 5.50s/it] {'loss': 0.2004, 'grad_norm': 0.8986247181892395, 'learning_rate': 3.8835105623124914e-05, 'epoch': 1.36} 14%|█▎ | 1359/10000 [2:08:37<13:12:43, 5.50s/it][2025-06-19 15:38:21,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.56 | optimizer_step: 2.73 [2025-06-19 15:38:21,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.35 | bwd_microstep: 3375.98 | bwd_inner_microstep: 3375.10 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.26 [2025-06-19 15:38:21,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.35 | bwd: 3376.01 | bwd_inner: 3375.10 | bwd_allreduce: 0.84 | step: 7.27 14%|█▎ | 1360/10000 [2:08:42<13:15:23, 5.52s/it] {'loss': 0.1404, 'grad_norm': 0.691447377204895, 'learning_rate': 3.883292625521931e-05, 'epoch': 1.36} 14%|█▎ | 1360/10000 [2:08:42<13:15:23, 5.52s/it][2025-06-19 15:38:27,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:38:27,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2157.20 | bwd_microstep: 3339.42 | bwd_inner_microstep: 3338.49 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.48 [2025-06-19 15:38:27,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2157.20 | bwd: 3339.45 | bwd_inner: 3338.49 | bwd_allreduce: 0.88 | step: 8.49 14%|█▎ | 1361/10000 [2:08:48<13:16:16, 5.53s/it] {'loss': 0.112, 'grad_norm': 0.7373256683349609, 'learning_rate': 3.8830744911827154e-05, 'epoch': 1.36} 14%|█▎ | 1361/10000 [2:08:48<13:16:16, 5.53s/it][2025-06-19 15:38:33,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:38:33,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.69 | bwd_microstep: 3327.38 | bwd_inner_microstep: 3326.48 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.80 [2025-06-19 15:38:33,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.69 | bwd: 3327.41 | bwd_inner: 3326.48 | bwd_allreduce: 0.85 | step: 7.80 14%|█▎ | 1362/10000 [2:08:53<13:15:51, 5.53s/it] {'loss': 0.0986, 'grad_norm': 0.5378904342651367, 'learning_rate': 
3.882856159317725e-05, 'epoch': 1.36} 14%|█▎ | 1362/10000 [2:08:53<13:15:51, 5.53s/it][2025-06-19 15:38:38,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:38:38,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2166.54 | bwd_microstep: 3378.38 | bwd_inner_microstep: 3377.51 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.06 [2025-06-19 15:38:38,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2166.54 | bwd: 3378.41 | bwd_inner: 3377.51 | bwd_allreduce: 0.83 | step: 7.07 14%|█▎ | 1363/10000 [2:08:59<13:18:31, 5.55s/it] {'loss': 0.1203, 'grad_norm': 0.6111442446708679, 'learning_rate': 3.8826376299498625e-05, 'epoch': 1.36} 14%|█▎ | 1363/10000 [2:08:59<13:18:31, 5.55s/it][2025-06-19 15:38:44,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:38:44,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2170.69 | bwd_microstep: 3381.60 | bwd_inner_microstep: 3380.54 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.03 [2025-06-19 15:38:44,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2170.69 | bwd: 3381.63 | bwd_inner: 3380.54 | bwd_allreduce: 1.02 | step: 8.04 14%|█▎ | 1364/10000 [2:09:04<13:20:34, 5.56s/it] {'loss': 0.1065, 'grad_norm': 0.47925660014152527, 'learning_rate': 3.88241890310205e-05, 'epoch': 1.36} 14%|█▎ | 1364/10000 [2:09:04<13:20:34, 5.56s/it][2025-06-19 15:38:49,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:38:49,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.32 | bwd_microstep: 3325.45 | bwd_inner_microstep: 3324.58 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.06 [2025-06-19 15:38:49,713] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2149.32 | bwd: 3325.48 | bwd_inner: 3324.58 | bwd_allreduce: 0.83 | step: 7.06 14%|█▎ | 1365/10000 [2:09:10<13:18:36, 5.55s/it] {'loss': 0.1138, 'grad_norm': 0.512391984462738, 'learning_rate': 3.8821999787972314e-05, 'epoch': 1.36} 14%|█▎ | 1365/10000 [2:09:10<13:18:36, 5.55s/it][2025-06-19 15:38:55,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:38:55,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.67 | bwd_microstep: 3326.24 | bwd_inner_microstep: 3325.37 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.05 [2025-06-19 15:38:55,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.67 | bwd: 3326.27 | bwd_inner: 3325.37 | bwd_allreduce: 0.83 | step: 7.05 14%|█▎ | 1366/10000 [2:09:16<13:16:54, 5.54s/it] {'loss': 0.0797, 'grad_norm': 0.49860695004463196, 'learning_rate': 3.88198085705837e-05, 'epoch': 1.37} 14%|█▎ | 1366/10000 [2:09:16<13:16:54, 5.54s/it][2025-06-19 15:39:00,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:39:00,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2161.88 | bwd_microstep: 3384.53 | bwd_inner_microstep: 3383.65 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.52 [2025-06-19 15:39:00,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2161.88 | bwd: 3384.56 | bwd_inner: 3383.65 | bwd_allreduce: 0.83 | step: 7.53 14%|█▎ | 1367/10000 [2:09:21<13:19:07, 5.55s/it] {'loss': 0.1339, 'grad_norm': 0.66061931848526, 'learning_rate': 3.8817615379084514e-05, 'epoch': 1.37} 14%|█▎ | 1367/10000 [2:09:21<13:19:07, 5.55s/it][2025-06-19 15:39:06,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:39:06,332] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2142.12 | bwd_microstep: 3328.12 | bwd_inner_microstep: 3327.21 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.28 [2025-06-19 15:39:06,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.12 | bwd: 3328.14 | bwd_inner: 3327.21 | bwd_allreduce: 0.89 | step: 7.29 14%|█▎ | 1368/10000 [2:09:27<13:17:16, 5.54s/it] {'loss': 0.1224, 'grad_norm': 0.5817689299583435, 'learning_rate': 3.881542021370481e-05, 'epoch': 1.37} 14%|█▎ | 1368/10000 [2:09:27<13:17:16, 5.54s/it][2025-06-19 15:39:11,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:39:11,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.24 | bwd_microstep: 3371.58 | bwd_inner_microstep: 3370.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 15:39:11,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.25 | bwd: 3371.60 | bwd_inner: 3370.78 | bwd_allreduce: 0.78 | step: 7.16 14%|█▎ | 1369/10000 [2:09:32<13:17:03, 5.54s/it] {'loss': 0.1582, 'grad_norm': 0.8440613746643066, 'learning_rate': 3.881322307467485e-05, 'epoch': 1.37} 14%|█▎ | 1369/10000 [2:09:32<13:17:03, 5.54s/it][2025-06-19 15:39:17,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:39:17,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.72 | bwd_microstep: 3367.86 | bwd_inner_microstep: 3367.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 15:39:17,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.72 | bwd: 3367.87 | bwd_inner: 3367.07 | bwd_allreduce: 0.76 | step: 6.73 14%|█▎ | 1370/10000 [2:09:38<13:16:51, 5.54s/it] {'loss': 0.1809, 'grad_norm': 0.9759716391563416, 'learning_rate': 3.8811023962225106e-05, 'epoch': 1.37} 14%|█▎ | 1370/10000 [2:09:38<13:16:51, 5.54s/it][2025-06-19 
15:39:22,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:39:22,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.81 | bwd_microstep: 3326.01 | bwd_inner_microstep: 3325.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 15:39:22,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.81 | bwd: 3326.03 | bwd_inner: 3325.23 | bwd_allreduce: 0.76 | step: 6.62 14%|█▎ | 1371/10000 [2:09:43<13:13:36, 5.52s/it] {'loss': 0.0589, 'grad_norm': 0.39048317074775696, 'learning_rate': 3.880882287658625e-05, 'epoch': 1.37} 14%|█▎ | 1371/10000 [2:09:43<13:13:36, 5.52s/it][2025-06-19 15:39:28,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:39:28,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.45 | bwd_microstep: 3376.89 | bwd_inner_microstep: 3376.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.69 [2025-06-19 15:39:28,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.45 | bwd: 3376.91 | bwd_inner: 3376.08 | bwd_allreduce: 0.78 | step: 7.69 14%|█▎ | 1372/10000 [2:09:49<13:14:57, 5.53s/it] {'loss': 0.0657, 'grad_norm': 0.3266226351261139, 'learning_rate': 3.880661981798918e-05, 'epoch': 1.37} 14%|█▎ | 1372/10000 [2:09:49<13:14:57, 5.53s/it][2025-06-19 15:39:34,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:39:34,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2161.04 | bwd_microstep: 3373.64 | bwd_inner_microstep: 3372.77 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.12 [2025-06-19 15:39:34,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2161.04 | bwd: 3373.66 | bwd_inner: 3372.77 | bwd_allreduce: 0.83 | step: 7.13 
14%|█▎ | 1373/10000 [2:09:54<13:16:59, 5.54s/it] {'loss': 0.1044, 'grad_norm': 0.5448059439659119, 'learning_rate': 3.8804414786664966e-05, 'epoch': 1.37} 14%|█▎ | 1373/10000 [2:09:54<13:16:59, 5.54s/it][2025-06-19 15:39:39,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 15:39:39,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.27 | bwd_microstep: 3381.35 | bwd_inner_microstep: 3380.25 | bwd_allreduce_microstep: 0.83 | step_microstep: 8.17 [2025-06-19 15:39:39,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.27 | bwd: 3381.37 | bwd_inner: 3380.25 | bwd_allreduce: 0.86 | step: 8.17 14%|█▎ | 1374/10000 [2:10:00<13:17:47, 5.55s/it] {'loss': 0.0814, 'grad_norm': 0.5302740335464478, 'learning_rate': 3.880220778284491e-05, 'epoch': 1.37} 14%|█▎ | 1374/10000 [2:10:00<13:17:47, 5.55s/it][2025-06-19 15:39:45,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.74 | optimizer_step: 2.73 [2025-06-19 15:39:45,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.72 | bwd_microstep: 3337.93 | bwd_inner_microstep: 3336.67 | bwd_allreduce_microstep: 1.09 | step_microstep: 11.02 [2025-06-19 15:39:45,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.72 | bwd: 3337.99 | bwd_inner: 3336.67 | bwd_allreduce: 1.17 | step: 11.04 14%|█▍ | 1375/10000 [2:10:05<13:16:52, 5.54s/it] {'loss': 0.16, 'grad_norm': 0.9368734359741211, 'learning_rate': 3.879999880676053e-05, 'epoch': 1.38} 14%|█▍ | 1375/10000 [2:10:05<13:16:52, 5.54s/it][2025-06-19 15:39:50,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:39:50,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.81 | bwd_microstep: 3324.60 | bwd_inner_microstep: 3323.81 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 15:39:50,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.81 | bwd: 3324.62 | bwd_inner: 3323.81 | bwd_allreduce: 0.76 | step: 6.72 14%|█▍ | 1376/10000 [2:10:11<13:14:55, 5.53s/it] {'loss': 0.089, 'grad_norm': 0.3774835467338562, 'learning_rate': 3.879778785864353e-05, 'epoch': 1.38} 14%|█▍ | 1376/10000 [2:10:11<13:14:55, 5.53s/it][2025-06-19 15:39:56,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:39:56,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.41 | bwd_microstep: 3332.58 | bwd_inner_microstep: 3331.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 15:39:56,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.41 | bwd: 3332.60 | bwd_inner: 3331.79 | bwd_allreduce: 0.76 | step: 6.63 14%|█▍ | 1377/10000 [2:10:16<13:14:30, 5.53s/it] {'loss': 0.0474, 'grad_norm': 0.295949786901474, 'learning_rate': 3.8795574938725816e-05, 'epoch': 1.38} 14%|█▍ | 1377/10000 [2:10:16<13:14:30, 5.53s/it][2025-06-19 15:40:01,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.77 | optimizer_step: 2.73 [2025-06-19 15:40:01,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.04 | bwd_microstep: 3327.25 | bwd_inner_microstep: 3326.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.57 [2025-06-19 15:40:01,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.04 | bwd: 3327.26 | bwd_inner: 3326.45 | bwd_allreduce: 0.77 | step: 7.58 14%|█▍ | 1378/10000 [2:10:22<13:13:13, 5.52s/it] {'loss': 0.0803, 'grad_norm': 0.344068706035614, 'learning_rate': 3.879336004723953e-05, 'epoch': 1.38} 14%|█▍ | 1378/10000 [2:10:22<13:13:13, 5.52s/it][2025-06-19 15:40:07,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:40:07,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.46 | bwd_microstep: 3318.25 | bwd_inner_microstep: 3317.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 15:40:07,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.46 | bwd: 3318.26 | bwd_inner: 3317.46 | bwd_allreduce: 0.76 | step: 6.62 14%|█▍ | 1379/10000 [2:10:27<13:10:41, 5.50s/it] {'loss': 0.0858, 'grad_norm': 0.4619540870189667, 'learning_rate': 3.8791143184417e-05, 'epoch': 1.38} 14%|█▍ | 1379/10000 [2:10:27<13:10:41, 5.50s/it][2025-06-19 15:40:12,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:40:12,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.29 | bwd_microstep: 3376.63 | bwd_inner_microstep: 3375.80 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.15 [2025-06-19 15:40:12,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.29 | bwd: 3376.64 | bwd_inner: 3375.80 | bwd_allreduce: 0.80 | step: 7.15 14%|█▍ | 1380/10000 [2:10:33<13:12:14, 5.51s/it] {'loss': 0.0879, 'grad_norm': 0.6087276935577393, 'learning_rate': 3.8788924350490764e-05, 'epoch': 1.38} 14%|█▍ | 1380/10000 [2:10:33<13:12:14, 5.51s/it][2025-06-19 15:40:18,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:40:18,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.04 | bwd_microstep: 3322.85 | bwd_inner_microstep: 3322.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 15:40:18,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.04 | bwd: 3322.86 | bwd_inner: 3322.06 | bwd_allreduce: 0.77 | step: 6.68 14%|█▍ | 1381/10000 [2:10:38<13:10:05, 5.50s/it] {'loss': 0.0937, 'grad_norm': 0.5411670804023743, 
'learning_rate': 3.878670354569356e-05, 'epoch': 1.38} 14%|█▍ | 1381/10000 [2:10:38<13:10:05, 5.50s/it][2025-06-19 15:40:23,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:40:23,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.13 | bwd_microstep: 3320.37 | bwd_inner_microstep: 3319.36 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.01 [2025-06-19 15:40:23,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.13 | bwd: 3320.38 | bwd_inner: 3319.36 | bwd_allreduce: 0.98 | step: 7.01 14%|█▍ | 1382/10000 [2:10:44<13:08:26, 5.49s/it] {'loss': 0.0584, 'grad_norm': 0.32920408248901367, 'learning_rate': 3.878448077025835e-05, 'epoch': 1.38} 14%|█▍ | 1382/10000 [2:10:44<13:08:26, 5.49s/it][2025-06-19 15:40:29,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:40:29,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.68 | bwd_microstep: 3324.98 | bwd_inner_microstep: 3324.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 15:40:29,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.68 | bwd: 3325.00 | bwd_inner: 3324.19 | bwd_allreduce: 0.76 | step: 6.67 14%|█▍ | 1383/10000 [2:10:49<13:07:26, 5.48s/it] {'loss': 0.0982, 'grad_norm': 0.44556233286857605, 'learning_rate': 3.878225602441829e-05, 'epoch': 1.38} 14%|█▍ | 1383/10000 [2:10:49<13:07:26, 5.48s/it][2025-06-19 15:40:34,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 15:40:34,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.33 | bwd_microstep: 3340.39 | bwd_inner_microstep: 3339.38 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.84 [2025-06-19 15:40:34,521] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.33 | bwd: 3340.42 | bwd_inner: 3339.38 | bwd_allreduce: 0.97 | step: 7.84 14%|█▍ | 1384/10000 [2:10:55<13:07:57, 5.49s/it] {'loss': 0.0592, 'grad_norm': 0.3953819274902344, 'learning_rate': 3.878002930840674e-05, 'epoch': 1.38} 14%|█▍ | 1384/10000 [2:10:55<13:07:57, 5.49s/it][2025-06-19 15:40:40,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.73 [2025-06-19 15:40:40,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2166.72 | bwd_microstep: 3395.99 | bwd_inner_microstep: 3394.61 | bwd_allreduce_microstep: 1.26 | step_microstep: 10.43 [2025-06-19 15:40:40,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2166.72 | bwd: 3396.04 | bwd_inner: 3394.61 | bwd_allreduce: 1.32 | step: 10.42 14%|█▍ | 1385/10000 [2:11:00<13:14:06, 5.53s/it] {'loss': 0.065, 'grad_norm': 0.3431144952774048, 'learning_rate': 3.877780062245727e-05, 'epoch': 1.39} 14%|█▍ | 1385/10000 [2:11:00<13:14:06, 5.53s/it][2025-06-19 15:40:45,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:40:45,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2188.91 | bwd_microstep: 3330.13 | bwd_inner_microstep: 3329.25 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.19 [2025-06-19 15:40:45,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2188.91 | bwd: 3330.15 | bwd_inner: 3329.25 | bwd_allreduce: 0.83 | step: 7.20 14%|█▍ | 1386/10000 [2:11:06<13:15:44, 5.54s/it] {'loss': 0.1379, 'grad_norm': 0.5334196090698242, 'learning_rate': 3.8775569966803674e-05, 'epoch': 1.39} 14%|█▍ | 1386/10000 [2:11:06<13:15:44, 5.54s/it][2025-06-19 15:40:51,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:40:51,322] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2172.84 | bwd_microstep: 3380.37 | bwd_inner_microstep: 3379.52 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.11 [2025-06-19 15:40:51,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2172.84 | bwd: 3380.39 | bwd_inner: 3379.52 | bwd_allreduce: 0.81 | step: 7.12 14%|█▍ | 1387/10000 [2:11:12<13:18:00, 5.56s/it] {'loss': 0.0807, 'grad_norm': 0.7766958475112915, 'learning_rate': 3.877333734167993e-05, 'epoch': 1.39} 14%|█▍ | 1387/10000 [2:11:12<13:18:00, 5.56s/it][2025-06-19 15:40:56,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:40:56,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.89 | bwd_microstep: 3320.79 | bwd_inner_microstep: 3319.66 | bwd_allreduce_microstep: 1.07 | step_microstep: 9.07 [2025-06-19 15:40:56,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.89 | bwd: 3320.82 | bwd_inner: 3319.66 | bwd_allreduce: 1.10 | step: 9.10 14%|█▍ | 1388/10000 [2:11:17<13:16:38, 5.55s/it] {'loss': 0.1108, 'grad_norm': 0.8114195466041565, 'learning_rate': 3.8771102747320226e-05, 'epoch': 1.39} 14%|█▍ | 1388/10000 [2:11:17<13:16:38, 5.55s/it][2025-06-19 15:41:02,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 15:41:02,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2156.71 | bwd_microstep: 3374.30 | bwd_inner_microstep: 3373.42 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.67 [2025-06-19 15:41:02,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2156.71 | bwd: 3374.32 | bwd_inner: 3373.42 | bwd_allreduce: 0.85 | step: 7.67 14%|█▍ | 1389/10000 [2:11:23<13:17:35, 5.56s/it] {'loss': 0.1714, 'grad_norm': 0.9879878759384155, 'learning_rate': 3.876886618395896e-05, 'epoch': 1.39} 14%|█▍ | 1389/10000 
[2:11:23<13:17:35, 5.56s/it][2025-06-19 15:41:07,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 15:41:07,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.03 | bwd_microstep: 3323.15 | bwd_inner_microstep: 3322.24 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.35 [2025-06-19 15:41:07,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.03 | bwd: 3323.17 | bwd_inner: 3322.24 | bwd_allreduce: 0.86 | step: 7.36 14%|█▍ | 1390/10000 [2:11:28<13:14:43, 5.54s/it] {'loss': 0.1014, 'grad_norm': 0.5160186886787415, 'learning_rate': 3.8766627651830745e-05, 'epoch': 1.39} 14%|█▍ | 1390/10000 [2:11:28<13:14:43, 5.54s/it][2025-06-19 15:41:13,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-19 15:41:13,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.12 | bwd_microstep: 3327.98 | bwd_inner_microstep: 3326.84 | bwd_allreduce_microstep: 1.04 | step_microstep: 8.70 [2025-06-19 15:41:13,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.12 | bwd: 3328.03 | bwd_inner: 3326.84 | bwd_allreduce: 1.09 | step: 8.70 14%|█▍ | 1391/10000 [2:11:34<13:14:00, 5.53s/it] {'loss': 0.0737, 'grad_norm': 0.5229598879814148, 'learning_rate': 3.8764387151170386e-05, 'epoch': 1.39} 14%|█▍ | 1391/10000 [2:11:34<13:14:00, 5.53s/it][2025-06-19 15:41:18,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:41:18,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.05 | bwd_microstep: 3339.08 | bwd_inner_microstep: 3338.21 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.26 [2025-06-19 15:41:18,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.05 | bwd: 3339.11 | bwd_inner: 
3338.21 | bwd_allreduce: 0.83 | step: 7.26 14%|█▍ | 1392/10000 [2:11:39<13:14:14, 5.54s/it] {'loss': 0.0884, 'grad_norm': 0.47836631536483765, 'learning_rate': 3.8762144682212905e-05, 'epoch': 1.39} 14%|█▍ | 1392/10000 [2:11:39<13:14:14, 5.54s/it][2025-06-19 15:41:24,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:41:24,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.96 | bwd_microstep: 3334.45 | bwd_inner_microstep: 3333.54 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.21 [2025-06-19 15:41:24,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.96 | bwd: 3334.48 | bwd_inner: 3333.54 | bwd_allreduce: 0.87 | step: 8.20 14%|█▍ | 1393/10000 [2:11:45<13:14:10, 5.54s/it] {'loss': 0.0689, 'grad_norm': 0.3520885407924652, 'learning_rate': 3.875990024519352e-05, 'epoch': 1.39} 14%|█▍ | 1393/10000 [2:11:45<13:14:10, 5.54s/it][2025-06-19 15:41:30,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:41:30,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.75 | bwd_microstep: 3335.30 | bwd_inner_microstep: 3334.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 15:41:30,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.75 | bwd: 3335.31 | bwd_inner: 3334.51 | bwd_allreduce: 0.76 | step: 6.62 14%|█▍ | 1394/10000 [2:11:50<13:13:40, 5.53s/it] {'loss': 0.0943, 'grad_norm': 0.9086644649505615, 'learning_rate': 3.875765384034767e-05, 'epoch': 1.39} 14%|█▍ | 1394/10000 [2:11:50<13:13:40, 5.53s/it][2025-06-19 15:41:35,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 15:41:35,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2157.14 | bwd_microstep: 
3413.70 | bwd_inner_microstep: 3412.66 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.86 [2025-06-19 15:41:35,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2157.14 | bwd: 3413.73 | bwd_inner: 3412.66 | bwd_allreduce: 1.00 | step: 7.87 14%|█▍ | 1395/10000 [2:11:56<13:16:50, 5.56s/it] {'loss': 0.1083, 'grad_norm': 0.8703147768974304, 'learning_rate': 3.875540546791099e-05, 'epoch': 1.4} 14%|█▍ | 1395/10000 [2:11:56<13:16:50, 5.56s/it][2025-06-19 15:41:41,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:41:41,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2329.83 | bwd_microstep: 3337.24 | bwd_inner_microstep: 3336.42 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.38 [2025-06-19 15:41:41,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2329.83 | bwd: 3337.25 | bwd_inner: 3336.42 | bwd_allreduce: 0.79 | step: 7.38 14%|█▍ | 1396/10000 [2:12:02<13:23:34, 5.60s/it] {'loss': 0.1227, 'grad_norm': 1.751427173614502, 'learning_rate': 3.875315512811932e-05, 'epoch': 1.4} 14%|█▍ | 1396/10000 [2:12:02<13:23:34, 5.60s/it][2025-06-19 15:41:46,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 15:41:46,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.16 | bwd_microstep: 3372.61 | bwd_inner_microstep: 3371.50 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.71 [2025-06-19 15:41:46,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.16 | bwd: 3372.63 | bwd_inner: 3371.50 | bwd_allreduce: 1.06 | step: 7.72 14%|█▍ | 1397/10000 [2:12:07<13:20:58, 5.59s/it] {'loss': 0.0814, 'grad_norm': 0.6243236660957336, 'learning_rate': 3.8750902821208716e-05, 'epoch': 1.4} 14%|█▍ | 1397/10000 [2:12:07<13:20:58, 5.59s/it][2025-06-19 15:41:52,476] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 15:41:52,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.34 | bwd_microstep: 3383.40 | bwd_inner_microstep: 3382.32 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.99 [2025-06-19 15:41:52,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.34 | bwd: 3383.43 | bwd_inner: 3382.32 | bwd_allreduce: 1.04 | step: 7.98 14%|█▍ | 1398/10000 [2:12:13<13:19:38, 5.58s/it] {'loss': 0.0571, 'grad_norm': 0.3612533509731293, 'learning_rate': 3.8748648547415434e-05, 'epoch': 1.4} 14%|█▍ | 1398/10000 [2:12:13<13:19:38, 5.58s/it][2025-06-19 15:41:58,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:41:58,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.20 | bwd_microstep: 3378.29 | bwd_inner_microstep: 3377.47 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.20 [2025-06-19 15:41:58,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.20 | bwd: 3378.30 | bwd_inner: 3377.47 | bwd_allreduce: 0.78 | step: 7.20 14%|█▍ | 1399/10000 [2:12:18<13:18:17, 5.57s/it] {'loss': 0.0609, 'grad_norm': 0.45178812742233276, 'learning_rate': 3.8746392306975925e-05, 'epoch': 1.4} 14%|█▍ | 1399/10000 [2:12:18<13:18:17, 5.57s/it][2025-06-19 15:42:03,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:42:03,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.09 | bwd_microstep: 3339.36 | bwd_inner_microstep: 3338.39 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.33 [2025-06-19 15:42:03,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.09 | bwd: 3339.38 | bwd_inner: 3338.39 | bwd_allreduce: 0.94 | step: 7.34 14%|█▍ | 1400/10000 [2:12:24<13:15:31, 5.55s/it] {'loss': 
0.1614, 'grad_norm': 0.900798499584198, 'learning_rate': 3.874413410012688e-05, 'epoch': 1.4} 14%|█▍ | 1400/10000 [2:12:24<13:15:31, 5.55s/it][2025-06-19 15:42:09,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:42:09,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.96 | bwd_microstep: 3390.21 | bwd_inner_microstep: 3389.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 15:42:09,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.96 | bwd: 3390.22 | bwd_inner: 3389.41 | bwd_allreduce: 0.77 | step: 6.69 14%|█▍ | 1401/10000 [2:12:29<13:16:52, 5.56s/it] {'loss': 0.151, 'grad_norm': 0.8069837689399719, 'learning_rate': 3.874187392710515e-05, 'epoch': 1.4} 14%|█▍ | 1401/10000 [2:12:29<13:16:52, 5.56s/it][2025-06-19 15:42:14,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:42:14,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.60 | bwd_microstep: 3380.63 | bwd_inner_microstep: 3379.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 15:42:14,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.60 | bwd: 3380.64 | bwd_inner: 3379.83 | bwd_allreduce: 0.77 | step: 7.14 14%|█▍ | 1402/10000 [2:12:35<13:16:53, 5.56s/it] {'loss': 0.1569, 'grad_norm': 1.4075284004211426, 'learning_rate': 3.8739611788147834e-05, 'epoch': 1.4} 14%|█▍ | 1402/10000 [2:12:35<13:16:53, 5.56s/it][2025-06-19 15:42:20,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:42:20,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.26 | bwd_microstep: 3335.59 | bwd_inner_microstep: 3334.73 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.87 [2025-06-19 
15:42:20,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.26 | bwd: 3335.61 | bwd_inner: 3334.73 | bwd_allreduce: 0.81 | step: 6.87 14%|█▍ | 1403/10000 [2:12:40<13:13:19, 5.54s/it] {'loss': 0.1755, 'grad_norm': 1.2453620433807373, 'learning_rate': 3.873734768349222e-05, 'epoch': 1.4} 14%|█▍ | 1403/10000 [2:12:40<13:13:19, 5.54s/it][2025-06-19 15:42:25,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:42:25,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.49 | bwd_microstep: 3378.45 | bwd_inner_microstep: 3377.58 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.41 [2025-06-19 15:42:25,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.49 | bwd: 3378.47 | bwd_inner: 3377.58 | bwd_allreduce: 0.85 | step: 7.41 14%|█▍ | 1404/10000 [2:12:46<13:13:43, 5.54s/it] {'loss': 0.0632, 'grad_norm': 0.4850558042526245, 'learning_rate': 3.873508161337579e-05, 'epoch': 1.4} 14%|█▍ | 1404/10000 [2:12:46<13:13:43, 5.54s/it][2025-06-19 15:42:31,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:42:31,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.84 | bwd_microstep: 3374.56 | bwd_inner_microstep: 3373.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 15:42:31,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.84 | bwd: 3374.58 | bwd_inner: 3373.78 | bwd_allreduce: 0.76 | step: 6.64 14%|█▍ | 1405/10000 [2:12:52<13:13:54, 5.54s/it] {'loss': 0.1581, 'grad_norm': 1.0088598728179932, 'learning_rate': 3.8732813578036254e-05, 'epoch': 1.41} 14%|█▍ | 1405/10000 [2:12:52<13:13:54, 5.54s/it][2025-06-19 15:42:36,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 
15:42:36,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.33 | bwd_microstep: 3372.02 | bwd_inner_microstep: 3371.10 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.45 [2025-06-19 15:42:36,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.33 | bwd: 3372.05 | bwd_inner: 3371.10 | bwd_allreduce: 0.87 | step: 8.45 14%|█▍ | 1406/10000 [2:12:57<13:13:44, 5.54s/it] {'loss': 0.0755, 'grad_norm': 0.48988527059555054, 'learning_rate': 3.873054357771151e-05, 'epoch': 1.41} 14%|█▍ | 1406/10000 [2:12:57<13:13:44, 5.54s/it][2025-06-19 15:42:42,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:42:42,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.22 | bwd_microstep: 3321.08 | bwd_inner_microstep: 3320.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-19 15:42:42,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.22 | bwd: 3321.10 | bwd_inner: 3320.27 | bwd_allreduce: 0.78 | step: 7.29 14%|█▍ | 1407/10000 [2:13:03<13:11:47, 5.53s/it] {'loss': 0.077, 'grad_norm': 0.7444928884506226, 'learning_rate': 3.872827161263968e-05, 'epoch': 1.41} 14%|█▍ | 1407/10000 [2:13:03<13:11:47, 5.53s/it][2025-06-19 15:42:47,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 15:42:47,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.57 | bwd_microstep: 3363.80 | bwd_inner_microstep: 3362.82 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.22 [2025-06-19 15:42:47,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.57 | bwd: 3363.82 | bwd_inner: 3362.82 | bwd_allreduce: 0.95 | step: 7.23 14%|█▍ | 1408/10000 [2:13:08<13:11:43, 5.53s/it] {'loss': 0.1194, 'grad_norm': 1.2234320640563965, 'learning_rate': 3.8725997683059085e-05, 'epoch': 1.41} 
14%|█▍ | 1408/10000 [2:13:08<13:11:43, 5.53s/it][2025-06-19 15:42:53,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:42:53,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.04 | bwd_microstep: 3322.85 | bwd_inner_microstep: 3322.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-19 15:42:53,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.04 | bwd: 3322.86 | bwd_inner: 3322.06 | bwd_allreduce: 0.76 | step: 6.94 14%|█▍ | 1409/10000 [2:13:14<13:09:17, 5.51s/it] {'loss': 0.1218, 'grad_norm': 0.7435863614082336, 'learning_rate': 3.872372178920823e-05, 'epoch': 1.41} 14%|█▍ | 1409/10000 [2:13:14<13:09:17, 5.51s/it][2025-06-19 15:42:58,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:42:58,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.15 | bwd_microstep: 3313.01 | bwd_inner_microstep: 3312.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 15:42:58,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.15 | bwd: 3313.02 | bwd_inner: 3312.22 | bwd_allreduce: 0.76 | step: 6.66 14%|█▍ | 1410/10000 [2:13:19<13:07:32, 5.50s/it] {'loss': 0.0882, 'grad_norm': 0.7698078155517578, 'learning_rate': 3.872144393132587e-05, 'epoch': 1.41} 14%|█▍ | 1410/10000 [2:13:19<13:07:32, 5.50s/it][2025-06-19 15:43:04,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:43:04,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.54 | bwd_microstep: 3363.42 | bwd_inner_microstep: 3362.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 15:43:04,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.54 | bwd: 
3363.43 | bwd_inner: 3362.60 | bwd_allreduce: 0.78 | step: 7.17 14%|█▍ | 1411/10000 [2:13:25<13:08:25, 5.51s/it] {'loss': 0.0664, 'grad_norm': 0.6235788464546204, 'learning_rate': 3.8719164109650924e-05, 'epoch': 1.41} 14%|█▍ | 1411/10000 [2:13:25<13:08:25, 5.51s/it][2025-06-19 15:43:09,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:43:09,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.07 | bwd_microstep: 3313.63 | bwd_inner_microstep: 3312.76 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.82 [2025-06-19 15:43:09,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.07 | bwd: 3313.64 | bwd_inner: 3312.76 | bwd_allreduce: 0.83 | step: 6.82 14%|█▍ | 1412/10000 [2:13:30<13:06:03, 5.49s/it] {'loss': 0.0862, 'grad_norm': 0.4199409782886505, 'learning_rate': 3.871688232442254e-05, 'epoch': 1.41} 14%|█▍ | 1412/10000 [2:13:30<13:06:03, 5.49s/it][2025-06-19 15:43:15,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:43:15,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.14 | bwd_microstep: 3306.55 | bwd_inner_microstep: 3305.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 15:43:15,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.14 | bwd: 3306.56 | bwd_inner: 3305.75 | bwd_allreduce: 0.76 | step: 6.91 14%|█▍ | 1413/10000 [2:13:35<13:04:16, 5.48s/it] {'loss': 0.068, 'grad_norm': 0.519716203212738, 'learning_rate': 3.8714598575880074e-05, 'epoch': 1.41} 14%|█▍ | 1413/10000 [2:13:36<13:04:16, 5.48s/it][2025-06-19 15:43:20,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:43:20,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.37 
| bwd_microstep: 3317.77 | bwd_inner_microstep: 3316.96 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-19 15:43:20,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.37 | bwd: 3317.78 | bwd_inner: 3316.96 | bwd_allreduce: 0.78 | step: 7.25 14%|█▍ | 1414/10000 [2:13:41<13:03:31, 5.48s/it] {'loss': 0.1011, 'grad_norm': 0.8193398118019104, 'learning_rate': 3.8712312864263066e-05, 'epoch': 1.41} 14%|█▍ | 1414/10000 [2:13:41<13:03:31, 5.48s/it][2025-06-19 15:43:26,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:43:26,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.82 | bwd_microstep: 3321.20 | bwd_inner_microstep: 3320.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 15:43:26,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.82 | bwd: 3321.22 | bwd_inner: 3320.41 | bwd_allreduce: 0.77 | step: 6.68 14%|█▍ | 1415/10000 [2:13:46<13:02:44, 5.47s/it] {'loss': 0.1559, 'grad_norm': 1.4048030376434326, 'learning_rate': 3.871002518981129e-05, 'epoch': 1.42} 14%|█▍ | 1415/10000 [2:13:46<13:02:44, 5.47s/it][2025-06-19 15:43:31,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:43:31,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.63 | bwd_microstep: 3324.73 | bwd_inner_microstep: 3323.87 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.41 [2025-06-19 15:43:31,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.63 | bwd: 3324.74 | bwd_inner: 3323.87 | bwd_allreduce: 0.82 | step: 7.41 14%|█▍ | 1416/10000 [2:13:52<13:02:43, 5.47s/it] {'loss': 0.0729, 'grad_norm': 1.2143603563308716, 'learning_rate': 3.8707735552764714e-05, 'epoch': 1.42} 14%|█▍ | 1416/10000 [2:13:52<13:02:43, 5.47s/it][2025-06-19 15:43:37,057] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:43:37,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.16 | bwd_microstep: 3316.64 | bwd_inner_microstep: 3315.84 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 15:43:37,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.16 | bwd: 3316.66 | bwd_inner: 3315.84 | bwd_allreduce: 0.77 | step: 6.64 14%|█▍ | 1417/10000 [2:13:57<13:02:11, 5.47s/it] {'loss': 0.0512, 'grad_norm': 0.580467164516449, 'learning_rate': 3.8705443953363495e-05, 'epoch': 1.42} 14%|█▍ | 1417/10000 [2:13:57<13:02:11, 5.47s/it][2025-06-19 15:43:42,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:43:42,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.33 | bwd_microstep: 3319.28 | bwd_inner_microstep: 3318.35 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.54 [2025-06-19 15:43:42,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.33 | bwd: 3319.30 | bwd_inner: 3318.35 | bwd_allreduce: 0.90 | step: 7.55 14%|█▍ | 1418/10000 [2:14:03<13:01:47, 5.47s/it] {'loss': 0.0927, 'grad_norm': 1.7914880514144897, 'learning_rate': 3.870315039184803e-05, 'epoch': 1.42} 14%|█▍ | 1418/10000 [2:14:03<13:01:47, 5.47s/it][2025-06-19 15:43:47,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 15:43:47,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.81 | bwd_microstep: 3324.32 | bwd_inner_microstep: 3323.21 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.56 [2025-06-19 15:43:47,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.81 | bwd: 3324.34 | bwd_inner: 3323.21 | bwd_allreduce: 1.07 | step: 7.57 14%|█▍ | 1419/10000 
[2:14:08<13:02:10, 5.47s/it] {'loss': 0.164, 'grad_norm': 0.9647075533866882, 'learning_rate': 3.870085486845888e-05, 'epoch': 1.42} 14%|█▍ | 1419/10000 [2:14:08<13:02:10, 5.47s/it][2025-06-19 15:43:53,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 15:43:53,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.82 | bwd_microstep: 3317.66 | bwd_inner_microstep: 3316.64 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.45 [2025-06-19 15:43:53,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.82 | bwd: 3317.68 | bwd_inner: 3316.64 | bwd_allreduce: 0.98 | step: 7.46 14%|█▍ | 1420/10000 [2:14:14<13:02:25, 5.47s/it] {'loss': 0.1793, 'grad_norm': 1.2395288944244385, 'learning_rate': 3.869855738343685e-05, 'epoch': 1.42} 14%|█▍ | 1420/10000 [2:14:14<13:02:25, 5.47s/it][2025-06-19 15:43:58,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 15:43:58,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.38 | bwd_microstep: 3317.50 | bwd_inner_microstep: 3316.17 | bwd_allreduce_microstep: 1.24 | step_microstep: 8.50 [2025-06-19 15:43:58,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.38 | bwd: 3317.52 | bwd_inner: 3316.17 | bwd_allreduce: 1.28 | step: 8.50 14%|█▍ | 1421/10000 [2:14:19<13:02:47, 5.47s/it] {'loss': 0.0748, 'grad_norm': 0.7001367807388306, 'learning_rate': 3.869625793702294e-05, 'epoch': 1.42} 14%|█▍ | 1421/10000 [2:14:19<13:02:47, 5.47s/it][2025-06-19 15:44:04,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:44:04,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.46 | bwd_microstep: 3371.91 | bwd_inner_microstep: 3370.94 | bwd_allreduce_microstep: 
0.92 | step_microstep: 6.89 [2025-06-19 15:44:04,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.46 | bwd: 3371.93 | bwd_inner: 3370.94 | bwd_allreduce: 0.94 | step: 6.89 14%|█▍ | 1422/10000 [2:14:25<13:06:50, 5.50s/it] {'loss': 0.0805, 'grad_norm': 0.5499696731567383, 'learning_rate': 3.8693956529458346e-05, 'epoch': 1.42} 14%|█▍ | 1422/10000 [2:14:25<13:06:50, 5.50s/it][2025-06-19 15:44:10,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:44:10,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.08 | bwd_microstep: 3321.48 | bwd_inner_microstep: 3320.35 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.65 [2025-06-19 15:44:10,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.08 | bwd: 3321.50 | bwd_inner: 3320.35 | bwd_allreduce: 1.09 | step: 7.66 14%|█▍ | 1423/10000 [2:14:30<13:05:51, 5.50s/it] {'loss': 0.1447, 'grad_norm': 0.9394729733467102, 'learning_rate': 3.869165316098446e-05, 'epoch': 1.42} 14%|█▍ | 1423/10000 [2:14:30<13:05:51, 5.50s/it][2025-06-19 15:44:15,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:44:15,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.21 | bwd_microstep: 3314.61 | bwd_inner_microstep: 3313.72 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.67 [2025-06-19 15:44:15,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.21 | bwd: 3314.63 | bwd_inner: 3313.72 | bwd_allreduce: 0.86 | step: 7.67 14%|█▍ | 1424/10000 [2:14:36<13:04:37, 5.49s/it] {'loss': 0.0963, 'grad_norm': 0.7863565683364868, 'learning_rate': 3.868934783184292e-05, 'epoch': 1.42} 14%|█▍ | 1424/10000 [2:14:36<13:04:37, 5.49s/it][2025-06-19 15:44:21,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.77 | optimizer_step: 2.72 [2025-06-19 15:44:21,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.89 | bwd_microstep: 3368.93 | bwd_inner_microstep: 3367.90 | bwd_allreduce_microstep: 0.97 | step_microstep: 8.44 [2025-06-19 15:44:21,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.89 | bwd: 3368.94 | bwd_inner: 3367.90 | bwd_allreduce: 0.99 | step: 8.44 14%|█▍ | 1425/10000 [2:14:41<13:06:41, 5.50s/it] {'loss': 0.1169, 'grad_norm': 0.8334065675735474, 'learning_rate': 3.868704054227553e-05, 'epoch': 1.43} 14%|█▍ | 1425/10000 [2:14:41<13:06:41, 5.50s/it][2025-06-19 15:44:26,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:44:26,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.46 | bwd_microstep: 3317.50 | bwd_inner_microstep: 3316.68 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.74 [2025-06-19 15:44:26,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.46 | bwd: 3317.51 | bwd_inner: 3316.68 | bwd_allreduce: 0.79 | step: 6.75 14%|█▍ | 1426/10000 [2:14:47<13:05:15, 5.50s/it] {'loss': 0.1715, 'grad_norm': 0.9771835803985596, 'learning_rate': 3.8684731292524316e-05, 'epoch': 1.43} 14%|█▍ | 1426/10000 [2:14:47<13:05:15, 5.50s/it][2025-06-19 15:44:31,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:44:31,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.69 | bwd_microstep: 3317.46 | bwd_inner_microstep: 3316.64 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-19 15:44:31,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.69 | bwd: 3317.48 | bwd_inner: 3316.64 | bwd_allreduce: 0.79 | step: 7.23 14%|█▍ | 1427/10000 [2:14:52<13:03:58, 5.49s/it] {'loss': 0.0462, 'grad_norm': 0.40480148792266846, 'learning_rate': 
3.868242008283151e-05, 'epoch': 1.43} 14%|█▍ | 1427/10000 [2:14:52<13:03:58, 5.49s/it][2025-06-19 15:44:37,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:44:37,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.57 | bwd_microstep: 3360.09 | bwd_inner_microstep: 3359.13 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.01 [2025-06-19 15:44:37,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.57 | bwd: 3360.10 | bwd_inner: 3359.13 | bwd_allreduce: 0.92 | step: 7.02 14%|█▍ | 1428/10000 [2:14:58<13:05:35, 5.50s/it] {'loss': 0.0687, 'grad_norm': 0.5289420485496521, 'learning_rate': 3.8680106913439536e-05, 'epoch': 1.43} 14%|█▍ | 1428/10000 [2:14:58<13:05:35, 5.50s/it][2025-06-19 15:44:43,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 15:44:43,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.02 | bwd_microstep: 3363.29 | bwd_inner_microstep: 3362.24 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.00 [2025-06-19 15:44:43,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.02 | bwd: 3363.30 | bwd_inner: 3362.24 | bwd_allreduce: 1.01 | step: 8.00 14%|█▍ | 1429/10000 [2:15:03<13:07:32, 5.51s/it] {'loss': 0.1232, 'grad_norm': 0.7354112267494202, 'learning_rate': 3.8677791784591054e-05, 'epoch': 1.43} 14%|█▍ | 1429/10000 [2:15:03<13:07:32, 5.51s/it][2025-06-19 15:44:48,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.86 | optimizer_step: 2.72 [2025-06-19 15:44:48,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.71 | bwd_microstep: 3312.78 | bwd_inner_microstep: 3311.87 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.28 [2025-06-19 15:44:48,497] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2106.71 | bwd: 3312.80 | bwd_inner: 3311.87 | bwd_allreduce: 0.88 | step: 7.29 14%|█▍ | 1430/10000 [2:15:09<13:05:24, 5.50s/it] {'loss': 0.1196, 'grad_norm': 0.8310404419898987, 'learning_rate': 3.8675474696528896e-05, 'epoch': 1.43} 14%|█▍ | 1430/10000 [2:15:09<13:05:24, 5.50s/it][2025-06-19 15:44:53,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:44:53,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.53 | bwd_microstep: 3310.16 | bwd_inner_microstep: 3309.35 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.25 [2025-06-19 15:44:53,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.53 | bwd: 3310.17 | bwd_inner: 3309.35 | bwd_allreduce: 0.78 | step: 7.25 14%|█▍ | 1431/10000 [2:15:14<13:03:48, 5.49s/it] {'loss': 0.0963, 'grad_norm': 0.6351244449615479, 'learning_rate': 3.8673155649496114e-05, 'epoch': 1.43} 14%|█▍ | 1431/10000 [2:15:14<13:03:48, 5.49s/it][2025-06-19 15:44:59,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 15:44:59,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.74 | bwd_microstep: 3316.00 | bwd_inner_microstep: 3315.18 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.66 [2025-06-19 15:44:59,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.74 | bwd: 3316.01 | bwd_inner: 3315.18 | bwd_allreduce: 0.79 | step: 7.66 14%|█▍ | 1432/10000 [2:15:20<13:02:53, 5.48s/it] {'loss': 0.1245, 'grad_norm': 0.6099069118499756, 'learning_rate': 3.8670834643735975e-05, 'epoch': 1.43} 14%|█▍ | 1432/10000 [2:15:20<13:02:53, 5.48s/it][2025-06-19 15:45:04,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:45:04,897] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2111.62 | bwd_microstep: 3314.19 | bwd_inner_microstep: 3313.23 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.66 [2025-06-19 15:45:04,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.62 | bwd: 3314.21 | bwd_inner: 3313.23 | bwd_allreduce: 0.93 | step: 7.67 14%|█▍ | 1433/10000 [2:15:25<13:02:10, 5.48s/it] {'loss': 0.1012, 'grad_norm': 0.7211999893188477, 'learning_rate': 3.866851167949193e-05, 'epoch': 1.43} 14%|█▍ | 1433/10000 [2:15:25<13:02:10, 5.48s/it][2025-06-19 15:45:10,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:45:10,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.30 | bwd_microstep: 3317.64 | bwd_inner_microstep: 3316.74 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.48 [2025-06-19 15:45:10,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.30 | bwd: 3317.65 | bwd_inner: 3316.74 | bwd_allreduce: 0.87 | step: 7.48 14%|█▍ | 1434/10000 [2:15:31<13:01:48, 5.48s/it] {'loss': 0.1481, 'grad_norm': 0.827780544757843, 'learning_rate': 3.8666186757007664e-05, 'epoch': 1.43} 14%|█▍ | 1434/10000 [2:15:31<13:01:48, 5.48s/it][2025-06-19 15:45:15,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 15:45:15,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.75 | bwd_microstep: 3363.37 | bwd_inner_microstep: 3362.32 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.85 [2025-06-19 15:45:15,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.75 | bwd: 3363.39 | bwd_inner: 3362.32 | bwd_allreduce: 1.02 | step: 7.86 14%|█▍ | 1435/10000 [2:15:36<13:04:31, 5.50s/it] {'loss': 0.0585, 'grad_norm': 0.7820310592651367, 'learning_rate': 3.866385987652703e-05, 'epoch': 1.44} 14%|█▍ | 1435/10000 [2:15:36<13:04:31, 
5.50s/it][2025-06-19 15:45:21,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:45:21,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.27 | bwd_microstep: 3318.51 | bwd_inner_microstep: 3317.55 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.47 [2025-06-19 15:45:21,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.27 | bwd: 3318.52 | bwd_inner: 3317.55 | bwd_allreduce: 0.92 | step: 7.47 14%|█▍ | 1436/10000 [2:15:42<13:03:01, 5.49s/it] {'loss': 0.0648, 'grad_norm': 0.48930227756500244, 'learning_rate': 3.8661531038294116e-05, 'epoch': 1.44} 14%|█▍ | 1436/10000 [2:15:42<13:03:01, 5.49s/it][2025-06-19 15:45:26,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:45:26,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.35 | bwd_microstep: 3308.76 | bwd_inner_microstep: 3307.90 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.84 [2025-06-19 15:45:26,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.35 | bwd: 3308.77 | bwd_inner: 3307.90 | bwd_allreduce: 0.83 | step: 6.85 14%|█▍ | 1437/10000 [2:15:47<13:01:28, 5.48s/it] {'loss': 0.1805, 'grad_norm': 1.1740748882293701, 'learning_rate': 3.8659200242553215e-05, 'epoch': 1.44} 14%|█▍ | 1437/10000 [2:15:47<13:01:28, 5.48s/it][2025-06-19 15:45:32,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:45:32,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.79 | bwd_microstep: 3314.37 | bwd_inner_microstep: 3313.40 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.51 [2025-06-19 15:45:32,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.79 | bwd: 3314.39 | bwd_inner: 3313.40 | 
bwd_allreduce: 0.93 | step: 7.52 14%|█▍ | 1438/10000 [2:15:53<13:00:41, 5.47s/it] {'loss': 0.1788, 'grad_norm': 1.202748417854309, 'learning_rate': 3.8656867489548806e-05, 'epoch': 1.44} 14%|█▍ | 1438/10000 [2:15:53<13:00:41, 5.47s/it][2025-06-19 15:45:37,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:45:37,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.61 | bwd_microstep: 3360.19 | bwd_inner_microstep: 3359.13 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.36 [2025-06-19 15:45:37,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.61 | bwd: 3360.21 | bwd_inner: 3359.13 | bwd_allreduce: 1.02 | step: 7.37 14%|█▍ | 1439/10000 [2:15:58<13:03:40, 5.49s/it] {'loss': 0.0921, 'grad_norm': 0.40372952818870544, 'learning_rate': 3.8654532779525575e-05, 'epoch': 1.44} 14%|█▍ | 1439/10000 [2:15:58<13:03:40, 5.49s/it][2025-06-19 15:45:43,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:45:43,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.85 | bwd_microstep: 3363.31 | bwd_inner_microstep: 3362.25 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.99 [2025-06-19 15:45:43,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.85 | bwd: 3363.33 | bwd_inner: 3362.25 | bwd_allreduce: 1.02 | step: 7.99 14%|█▍ | 1440/10000 [2:16:04<13:05:39, 5.51s/it] {'loss': 0.1145, 'grad_norm': 0.7457690834999084, 'learning_rate': 3.865219611272845e-05, 'epoch': 1.44} 14%|█▍ | 1440/10000 [2:16:04<13:05:39, 5.51s/it][2025-06-19 15:45:48,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:45:48,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.05 | bwd_microstep: 3319.29 | 
bwd_inner_microstep: 3318.29 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.64 [2025-06-19 15:45:48,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.05 | bwd: 3319.31 | bwd_inner: 3318.29 | bwd_allreduce: 0.97 | step: 7.65 14%|█▍ | 1441/10000 [2:16:09<13:03:49, 5.49s/it] {'loss': 0.0738, 'grad_norm': 0.8875943422317505, 'learning_rate': 3.864985748940251e-05, 'epoch': 1.44} 14%|█▍ | 1441/10000 [2:16:09<13:03:49, 5.49s/it][2025-06-19 15:45:54,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:45:54,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.06 | bwd_microstep: 3316.50 | bwd_inner_microstep: 3315.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 15:45:54,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.06 | bwd: 3316.52 | bwd_inner: 3315.71 | bwd_allreduce: 0.76 | step: 6.81 14%|█▍ | 1442/10000 [2:16:15<13:02:47, 5.49s/it] {'loss': 0.0969, 'grad_norm': 0.8118472695350647, 'learning_rate': 3.864751690979308e-05, 'epoch': 1.44} 14%|█▍ | 1442/10000 [2:16:15<13:02:47, 5.49s/it][2025-06-19 15:45:59,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:45:59,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.22 | bwd_microstep: 3324.88 | bwd_inner_microstep: 3323.93 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.56 [2025-06-19 15:45:59,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.22 | bwd: 3324.90 | bwd_inner: 3323.93 | bwd_allreduce: 0.91 | step: 7.57 14%|█▍ | 1443/10000 [2:16:20<13:01:53, 5.48s/it] {'loss': 0.1118, 'grad_norm': 0.5847957730293274, 'learning_rate': 3.864517437414567e-05, 'epoch': 1.44} 14%|█▍ | 1443/10000 [2:16:20<13:01:53, 5.48s/it][2025-06-19 15:46:05,245] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:46:05,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.98 | bwd_microstep: 3322.37 | bwd_inner_microstep: 3321.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 15:46:05,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.98 | bwd: 3322.39 | bwd_inner: 3321.58 | bwd_allreduce: 0.76 | step: 6.74 14%|█▍ | 1444/10000 [2:16:26<13:01:09, 5.48s/it] {'loss': 0.0978, 'grad_norm': 0.705467700958252, 'learning_rate': 3.864282988270601e-05, 'epoch': 1.44} 14%|█▍ | 1444/10000 [2:16:26<13:01:09, 5.48s/it][2025-06-19 15:46:10,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:46:10,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.81 | bwd_microstep: 3325.04 | bwd_inner_microstep: 3324.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 15:46:10,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.81 | bwd: 3325.05 | bwd_inner: 3324.24 | bwd_allreduce: 0.77 | step: 7.07 14%|█▍ | 1445/10000 [2:16:31<13:00:35, 5.47s/it] {'loss': 0.0843, 'grad_norm': 0.5665600895881653, 'learning_rate': 3.864048343572001e-05, 'epoch': 1.45} 14%|█▍ | 1445/10000 [2:16:31<13:00:35, 5.47s/it][2025-06-19 15:46:16,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:46:16,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.87 | bwd_microstep: 3321.46 | bwd_inner_microstep: 3320.36 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.73 [2025-06-19 15:46:16,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.87 | bwd: 3321.48 | bwd_inner: 3320.36 | bwd_allreduce: 1.05 | step: 7.73 14%|█▍ | 1446/10000 [2:16:36<13:00:24, 5.47s/it] {'loss': 0.089, 
'grad_norm': 0.42473334074020386, 'learning_rate': 3.863813503343382e-05, 'epoch': 1.45} 14%|█▍ | 1446/10000 [2:16:36<13:00:24, 5.47s/it][2025-06-19 15:46:21,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:46:21,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.44 | bwd_microstep: 3319.59 | bwd_inner_microstep: 3318.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 15:46:21,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.44 | bwd: 3319.60 | bwd_inner: 3318.80 | bwd_allreduce: 0.76 | step: 6.55 14%|█▍ | 1447/10000 [2:16:42<12:59:52, 5.47s/it] {'loss': 0.1227, 'grad_norm': 0.8001806139945984, 'learning_rate': 3.863578467609376e-05, 'epoch': 1.45} 14%|█▍ | 1447/10000 [2:16:42<12:59:52, 5.47s/it][2025-06-19 15:46:27,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:46:27,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.32 | bwd_microstep: 3327.23 | bwd_inner_microstep: 3326.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 15:46:27,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.32 | bwd: 3327.25 | bwd_inner: 3326.45 | bwd_allreduce: 0.75 | step: 6.53 14%|█▍ | 1448/10000 [2:16:47<12:59:49, 5.47s/it] {'loss': 0.1085, 'grad_norm': 0.9144657254219055, 'learning_rate': 3.863343236394638e-05, 'epoch': 1.45} 14%|█▍ | 1448/10000 [2:16:47<12:59:49, 5.47s/it][2025-06-19 15:46:32,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:46:32,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.15 | bwd_microstep: 3330.69 | bwd_inner_microstep: 3329.76 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.89 [2025-06-19 
15:46:32,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.15 | bwd: 3330.70 | bwd_inner: 3329.76 | bwd_allreduce: 0.90 | step: 6.90 14%|█▍ | 1449/10000 [2:16:53<12:59:50, 5.47s/it] {'loss': 0.0845, 'grad_norm': 0.5133355259895325, 'learning_rate': 3.8631078097238423e-05, 'epoch': 1.45} 14%|█▍ | 1449/10000 [2:16:53<12:59:50, 5.47s/it][2025-06-19 15:46:38,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:46:38,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.42 | bwd_microstep: 3327.78 | bwd_inner_microstep: 3326.98 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 15:46:38,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.42 | bwd: 3327.79 | bwd_inner: 3326.98 | bwd_allreduce: 0.77 | step: 7.01 14%|█▍ | 1450/10000 [2:16:58<13:00:12, 5.48s/it] {'loss': 0.0616, 'grad_norm': 0.29500091075897217, 'learning_rate': 3.862872187621685e-05, 'epoch': 1.45} 14%|█▍ | 1450/10000 [2:16:58<13:00:12, 5.48s/it][2025-06-19 15:46:43,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:46:43,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.32 | bwd_microstep: 3374.29 | bwd_inner_microstep: 3373.40 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.84 [2025-06-19 15:46:43,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.32 | bwd: 3374.31 | bwd_inner: 3373.40 | bwd_allreduce: 0.86 | step: 6.84 15%|█▍ | 1451/10000 [2:17:04<13:02:55, 5.49s/it] {'loss': 0.0566, 'grad_norm': 0.3444865345954895, 'learning_rate': 3.8626363701128804e-05, 'epoch': 1.45} 15%|█▍ | 1451/10000 [2:17:04<13:02:55, 5.49s/it][2025-06-19 15:46:49,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 
15:46:49,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.88 | bwd_microstep: 3328.96 | bwd_inner_microstep: 3327.99 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.06 [2025-06-19 15:46:49,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.88 | bwd: 3328.98 | bwd_inner: 3327.99 | bwd_allreduce: 0.94 | step: 7.07 15%|█▍ | 1452/10000 [2:17:09<13:01:52, 5.49s/it] {'loss': 0.1415, 'grad_norm': 1.0184954404830933, 'learning_rate': 3.862400357222166e-05, 'epoch': 1.45} 15%|█▍ | 1452/10000 [2:17:09<13:01:52, 5.49s/it][2025-06-19 15:46:54,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-19 15:46:54,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.21 | bwd_microstep: 3321.72 | bwd_inner_microstep: 3320.66 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.65 [2025-06-19 15:46:54,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.21 | bwd: 3321.73 | bwd_inner: 3320.66 | bwd_allreduce: 1.02 | step: 7.66 15%|█▍ | 1453/10000 [2:17:15<13:01:30, 5.49s/it] {'loss': 0.0813, 'grad_norm': 0.5985713601112366, 'learning_rate': 3.862164148974297e-05, 'epoch': 1.45} 15%|█▍ | 1453/10000 [2:17:15<13:01:30, 5.49s/it][2025-06-19 15:47:00,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 15:47:00,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.72 | bwd_microstep: 3378.61 | bwd_inner_microstep: 3377.53 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.90 [2025-06-19 15:47:00,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.72 | bwd: 3378.63 | bwd_inner: 3377.53 | bwd_allreduce: 1.04 | step: 7.92 15%|█▍ | 1454/10000 [2:17:20<13:04:41, 5.51s/it] {'loss': 0.0902, 'grad_norm': 0.5876815319061279, 'learning_rate': 3.861927745394052e-05, 'epoch': 1.45} 
15%|█▍ | 1454/10000 [2:17:20<13:04:41, 5.51s/it][2025-06-19 15:47:05,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 15:47:05,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.79 | bwd_microstep: 3368.74 | bwd_inner_microstep: 3367.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 15:47:05,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.79 | bwd: 3368.76 | bwd_inner: 3367.96 | bwd_allreduce: 0.75 | step: 6.53 15%|█▍ | 1455/10000 [2:17:26<13:06:08, 5.52s/it] {'loss': 0.0487, 'grad_norm': 0.22751227021217346, 'learning_rate': 3.861691146506228e-05, 'epoch': 1.46} 15%|█▍ | 1455/10000 [2:17:26<13:06:08, 5.52s/it][2025-06-19 15:47:11,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:47:11,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.49 | bwd_microstep: 3325.66 | bwd_inner_microstep: 3324.67 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.43 [2025-06-19 15:47:11,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.49 | bwd: 3325.68 | bwd_inner: 3324.67 | bwd_allreduce: 0.96 | step: 7.44 15%|█▍ | 1456/10000 [2:17:31<13:04:46, 5.51s/it] {'loss': 0.1059, 'grad_norm': 0.6426359415054321, 'learning_rate': 3.861454352335643e-05, 'epoch': 1.46} 15%|█▍ | 1456/10000 [2:17:31<13:04:46, 5.51s/it][2025-06-19 15:47:16,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:47:16,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.24 | bwd_microstep: 3328.09 | bwd_inner_microstep: 3327.20 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.87 [2025-06-19 15:47:16,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.24 | bwd: 
3328.11 | bwd_inner: 3327.20 | bwd_allreduce: 0.86 | step: 6.87 15%|█▍ | 1457/10000 [2:17:37<13:03:19, 5.50s/it] {'loss': 0.1162, 'grad_norm': 0.6561493277549744, 'learning_rate': 3.8612173629071355e-05, 'epoch': 1.46} 15%|█▍ | 1457/10000 [2:17:37<13:03:19, 5.50s/it][2025-06-19 15:47:22,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:47:22,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.91 | bwd_microstep: 3388.36 | bwd_inner_microstep: 3387.45 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.35 [2025-06-19 15:47:22,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.91 | bwd: 3388.38 | bwd_inner: 3387.45 | bwd_allreduce: 0.88 | step: 7.35 15%|█▍ | 1458/10000 [2:17:43<13:05:57, 5.52s/it] {'loss': 0.0721, 'grad_norm': 0.4829503297805786, 'learning_rate': 3.860980178245565e-05, 'epoch': 1.46} 15%|█▍ | 1458/10000 [2:17:43<13:05:57, 5.52s/it][2025-06-19 15:47:27,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:47:27,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.61 | bwd_microstep: 3325.53 | bwd_inner_microstep: 3324.72 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.21 [2025-06-19 15:47:27,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.61 | bwd: 3325.55 | bwd_inner: 3324.72 | bwd_allreduce: 0.78 | step: 7.21 15%|█▍ | 1459/10000 [2:17:48<13:04:15, 5.51s/it] {'loss': 0.1366, 'grad_norm': 0.8375245332717896, 'learning_rate': 3.8607427983758105e-05, 'epoch': 1.46} 15%|█▍ | 1459/10000 [2:17:48<13:04:15, 5.51s/it][2025-06-19 15:47:33,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:47:33,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2126.49 | bwd_microstep: 3371.84 | bwd_inner_microstep: 3371.02 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.84 [2025-06-19 15:47:33,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.49 | bwd: 3371.86 | bwd_inner: 3371.02 | bwd_allreduce: 0.79 | step: 6.84 15%|█▍ | 1460/10000 [2:17:54<13:05:27, 5.52s/it] {'loss': 0.1272, 'grad_norm': 0.7652011513710022, 'learning_rate': 3.8605052233227726e-05, 'epoch': 1.46} 15%|█▍ | 1460/10000 [2:17:54<13:05:27, 5.52s/it][2025-06-19 15:47:38,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:47:38,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.05 | bwd_microstep: 3378.42 | bwd_inner_microstep: 3377.45 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.06 [2025-06-19 15:47:38,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.05 | bwd: 3378.43 | bwd_inner: 3377.45 | bwd_allreduce: 0.94 | step: 7.07 15%|█▍ | 1461/10000 [2:17:59<13:07:11, 5.53s/it] {'loss': 0.2553, 'grad_norm': 1.4380732774734497, 'learning_rate': 3.860267453111372e-05, 'epoch': 1.46} 15%|█▍ | 1461/10000 [2:17:59<13:07:11, 5.53s/it][2025-06-19 15:47:44,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:47:44,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.17 | bwd_microstep: 3336.13 | bwd_inner_microstep: 3335.31 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.85 [2025-06-19 15:47:44,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.17 | bwd: 3336.15 | bwd_inner: 3335.31 | bwd_allreduce: 0.79 | step: 6.85 15%|█▍ | 1462/10000 [2:18:05<13:05:26, 5.52s/it] {'loss': 0.0632, 'grad_norm': 0.5867178440093994, 'learning_rate': 3.8600294877665495e-05, 'epoch': 1.46} 15%|█▍ | 1462/10000 [2:18:05<13:05:26, 5.52s/it][2025-06-19 15:47:49,775] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:47:49,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.41 | bwd_microstep: 3331.59 | bwd_inner_microstep: 3330.75 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.35 [2025-06-19 15:47:49,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.41 | bwd: 3331.61 | bwd_inner: 3330.75 | bwd_allreduce: 0.81 | step: 7.36 15%|█▍ | 1463/10000 [2:18:10<13:03:55, 5.51s/it] {'loss': 0.0628, 'grad_norm': 0.33991822600364685, 'learning_rate': 3.859791327313265e-05, 'epoch': 1.46} 15%|█▍ | 1463/10000 [2:18:10<13:03:55, 5.51s/it][2025-06-19 15:47:55,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:47:55,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.18 | bwd_microstep: 3332.23 | bwd_inner_microstep: 3331.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.73 [2025-06-19 15:47:55,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.18 | bwd: 3332.24 | bwd_inner: 3331.42 | bwd_allreduce: 0.78 | step: 6.73 15%|█▍ | 1464/10000 [2:18:16<13:03:04, 5.50s/it] {'loss': 0.0703, 'grad_norm': 0.46415019035339355, 'learning_rate': 3.859552971776503e-05, 'epoch': 1.46} 15%|█▍ | 1464/10000 [2:18:16<13:03:04, 5.50s/it][2025-06-19 15:48:00,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.74 | optimizer_step: 2.85 [2025-06-19 15:48:00,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.20 | bwd_microstep: 3332.53 | bwd_inner_microstep: 3331.43 | bwd_allreduce_microstep: 1.04 | step_microstep: 8.95 [2025-06-19 15:48:00,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.20 | bwd: 3332.54 | bwd_inner: 3331.43 | bwd_allreduce: 1.06 | step: 8.97 15%|█▍ | 1465/10000 
[2:18:21<13:02:13, 5.50s/it] {'loss': 0.115, 'grad_norm': 0.5508723258972168, 'learning_rate': 3.8593144211812645e-05, 'epoch': 1.47} 15%|█▍ | 1465/10000 [2:18:21<13:02:13, 5.50s/it][2025-06-19 15:48:06,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:48:06,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.30 | bwd_microstep: 3327.30 | bwd_inner_microstep: 3326.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 15:48:06,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.30 | bwd: 3327.31 | bwd_inner: 3326.51 | bwd_allreduce: 0.77 | step: 6.62 15%|█▍ | 1466/10000 [2:18:27<13:02:00, 5.50s/it] {'loss': 0.0835, 'grad_norm': 0.3877377510070801, 'learning_rate': 3.8590756755525724e-05, 'epoch': 1.47} 15%|█▍ | 1466/10000 [2:18:27<13:02:00, 5.50s/it][2025-06-19 15:48:11,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:48:11,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.95 | bwd_microstep: 3329.70 | bwd_inner_microstep: 3328.86 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.48 [2025-06-19 15:48:11,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.95 | bwd: 3329.71 | bwd_inner: 3328.86 | bwd_allreduce: 0.81 | step: 7.49 15%|█▍ | 1467/10000 [2:18:32<13:01:07, 5.49s/it] {'loss': 0.0471, 'grad_norm': 0.5809139609336853, 'learning_rate': 3.858836734915471e-05, 'epoch': 1.47} 15%|█▍ | 1467/10000 [2:18:32<13:01:07, 5.49s/it][2025-06-19 15:48:17,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:48:17,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.64 | bwd_microstep: 3342.48 | bwd_inner_microstep: 3341.67 | bwd_allreduce_microstep: 
0.77 | step_microstep: 7.21 [2025-06-19 15:48:17,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.64 | bwd: 3342.50 | bwd_inner: 3341.67 | bwd_allreduce: 0.79 | step: 7.22 15%|█▍ | 1468/10000 [2:18:38<13:01:19, 5.49s/it] {'loss': 0.0619, 'grad_norm': 0.4428235590457916, 'learning_rate': 3.858597599295023e-05, 'epoch': 1.47} 15%|█▍ | 1468/10000 [2:18:38<13:01:19, 5.49s/it][2025-06-19 15:48:22,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:48:22,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.11 | bwd_microstep: 3338.33 | bwd_inner_microstep: 3337.33 | bwd_allreduce_microstep: 0.95 | step_microstep: 8.04 [2025-06-19 15:48:22,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.11 | bwd: 3338.34 | bwd_inner: 3337.33 | bwd_allreduce: 0.97 | step: 8.04 15%|█▍ | 1469/10000 [2:18:43<13:01:21, 5.50s/it] {'loss': 0.0949, 'grad_norm': 0.694304883480072, 'learning_rate': 3.8583582687163114e-05, 'epoch': 1.47} 15%|█▍ | 1469/10000 [2:18:43<13:01:21, 5.50s/it][2025-06-19 15:48:28,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:48:28,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.51 | bwd_microstep: 3323.63 | bwd_inner_microstep: 3322.67 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.03 [2025-06-19 15:48:28,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.51 | bwd: 3323.65 | bwd_inner: 3322.67 | bwd_allreduce: 0.92 | step: 7.03 15%|█▍ | 1470/10000 [2:18:49<13:00:31, 5.49s/it] {'loss': 0.1847, 'grad_norm': 1.1166385412216187, 'learning_rate': 3.8581187432044436e-05, 'epoch': 1.47} 15%|█▍ | 1470/10000 [2:18:49<13:00:31, 5.49s/it][2025-06-19 15:48:33,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.53 | optimizer_step: 2.72 [2025-06-19 15:48:33,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.49 | bwd_microstep: 3318.56 | bwd_inner_microstep: 3317.71 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.90 [2025-06-19 15:48:33,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.49 | bwd: 3318.58 | bwd_inner: 3317.71 | bwd_allreduce: 0.82 | step: 6.90 15%|█▍ | 1471/10000 [2:18:54<12:59:23, 5.48s/it] {'loss': 0.2072, 'grad_norm': 1.2771081924438477, 'learning_rate': 3.857879022784543e-05, 'epoch': 1.47} 15%|█▍ | 1471/10000 [2:18:54<12:59:23, 5.48s/it][2025-06-19 15:48:39,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:48:39,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.48 | bwd_microstep: 3320.40 | bwd_inner_microstep: 3319.59 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-19 15:48:39,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.48 | bwd: 3320.41 | bwd_inner: 3319.59 | bwd_allreduce: 0.78 | step: 7.27 15%|█▍ | 1472/10000 [2:18:59<12:59:08, 5.48s/it] {'loss': 0.1389, 'grad_norm': 0.942720890045166, 'learning_rate': 3.857639107481756e-05, 'epoch': 1.47} 15%|█▍ | 1472/10000 [2:18:59<12:59:08, 5.48s/it][2025-06-19 15:48:44,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:48:44,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.32 | bwd_microstep: 3373.81 | bwd_inner_microstep: 3372.93 | bwd_allreduce_microstep: 0.81 | step_microstep: 8.11 [2025-06-19 15:48:44,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.32 | bwd: 3373.84 | bwd_inner: 3372.93 | bwd_allreduce: 0.84 | step: 8.12 15%|█▍ | 1473/10000 [2:19:05<13:02:21, 5.50s/it] {'loss': 0.1249, 'grad_norm': 0.6824327707290649, 'learning_rate': 
3.857398997321248e-05, 'epoch': 1.47} 15%|█▍ | 1473/10000 [2:19:05<13:02:21, 5.50s/it][2025-06-19 15:48:50,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:48:50,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.04 | bwd_microstep: 3374.21 | bwd_inner_microstep: 3373.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 15:48:50,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.04 | bwd: 3374.23 | bwd_inner: 3373.43 | bwd_allreduce: 0.76 | step: 6.65 15%|█▍ | 1474/10000 [2:19:11<13:04:19, 5.52s/it] {'loss': 0.1158, 'grad_norm': 0.8713458180427551, 'learning_rate': 3.857158692328206e-05, 'epoch': 1.47} 15%|█▍ | 1474/10000 [2:19:11<13:04:19, 5.52s/it][2025-06-19 15:48:55,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:48:55,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.00 | bwd_microstep: 3327.89 | bwd_inner_microstep: 3326.93 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.53 [2025-06-19 15:48:55,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.00 | bwd: 3327.90 | bwd_inner: 3326.93 | bwd_allreduce: 0.92 | step: 7.53 15%|█▍ | 1475/10000 [2:19:16<13:02:18, 5.51s/it] {'loss': 0.0733, 'grad_norm': 0.4694289267063141, 'learning_rate': 3.856918192527836e-05, 'epoch': 1.48} 15%|█▍ | 1475/10000 [2:19:16<13:02:18, 5.51s/it][2025-06-19 15:49:01,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:49:01,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.71 | bwd_microstep: 3374.97 | bwd_inner_microstep: 3374.04 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.00 [2025-06-19 15:49:01,289] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2132.71 | bwd: 3374.99 | bwd_inner: 3374.04 | bwd_allreduce: 0.91 | step: 7.00 15%|█▍ | 1476/10000 [2:19:22<13:04:09, 5.52s/it] {'loss': 0.0605, 'grad_norm': 0.46073782444000244, 'learning_rate': 3.8566774979453654e-05, 'epoch': 1.48} 15%|█▍ | 1476/10000 [2:19:22<13:04:09, 5.52s/it][2025-06-19 15:49:06,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:49:06,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.74 | bwd_microstep: 3377.49 | bwd_inner_microstep: 3376.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-19 15:49:06,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.74 | bwd: 3377.51 | bwd_inner: 3376.69 | bwd_allreduce: 0.78 | step: 6.94 15%|█▍ | 1477/10000 [2:19:27<13:05:26, 5.53s/it] {'loss': 0.1031, 'grad_norm': 0.9558055400848389, 'learning_rate': 3.856436608606043e-05, 'epoch': 1.48} 15%|█▍ | 1477/10000 [2:19:27<13:05:26, 5.53s/it][2025-06-19 15:49:12,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 15:49:12,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.74 | bwd_microstep: 3328.31 | bwd_inner_microstep: 3327.45 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.88 [2025-06-19 15:49:12,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.74 | bwd: 3328.34 | bwd_inner: 3327.45 | bwd_allreduce: 0.83 | step: 7.89 15%|█▍ | 1478/10000 [2:19:33<13:03:37, 5.52s/it] {'loss': 0.1149, 'grad_norm': 0.9114924073219299, 'learning_rate': 3.856195524535136e-05, 'epoch': 1.48} 15%|█▍ | 1478/10000 [2:19:33<13:03:37, 5.52s/it][2025-06-19 15:49:17,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:49:17,808] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2113.95 | bwd_microstep: 3322.62 | bwd_inner_microstep: 3321.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 15:49:17,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.95 | bwd: 3322.64 | bwd_inner: 3321.83 | bwd_allreduce: 0.76 | step: 6.68 15%|█▍ | 1479/10000 [2:19:38<13:01:51, 5.51s/it] {'loss': 0.1159, 'grad_norm': 0.7750427722930908, 'learning_rate': 3.855954245757934e-05, 'epoch': 1.48} 15%|█▍ | 1479/10000 [2:19:38<13:01:51, 5.51s/it][2025-06-19 15:49:23,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:49:23,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.64 | bwd_microstep: 3317.41 | bwd_inner_microstep: 3316.31 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.17 [2025-06-19 15:49:23,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.65 | bwd: 3317.43 | bwd_inner: 3316.31 | bwd_allreduce: 1.07 | step: 7.18 15%|█▍ | 1480/10000 [2:19:44<13:00:10, 5.49s/it] {'loss': 0.1204, 'grad_norm': 1.0056096315383911, 'learning_rate': 3.855712772299745e-05, 'epoch': 1.48} 15%|█▍ | 1480/10000 [2:19:44<13:00:10, 5.49s/it][2025-06-19 15:49:28,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:49:28,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.65 | bwd_microstep: 3327.23 | bwd_inner_microstep: 3326.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 15:49:28,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.65 | bwd: 3327.25 | bwd_inner: 3326.43 | bwd_allreduce: 0.77 | step: 7.09 15%|█▍ | 1481/10000 [2:19:49<12:59:32, 5.49s/it] {'loss': 0.1078, 'grad_norm': 0.9626979231834412, 'learning_rate': 3.855471104185899e-05, 'epoch': 1.48} 15%|█▍ | 1481/10000 [2:19:49<12:59:32, 5.49s/it][2025-06-19 
15:49:34,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 15:49:34,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.94 | bwd_microstep: 3369.50 | bwd_inner_microstep: 3368.52 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.43 [2025-06-19 15:49:34,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.94 | bwd: 3369.52 | bwd_inner: 3368.52 | bwd_allreduce: 0.95 | step: 7.44 15%|█▍ | 1482/10000 [2:19:55<13:02:06, 5.51s/it] {'loss': 0.1245, 'grad_norm': 0.7409498691558838, 'learning_rate': 3.8552292414417454e-05, 'epoch': 1.48} 15%|█▍ | 1482/10000 [2:19:55<13:02:06, 5.51s/it][2025-06-19 15:49:39,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:49:39,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.86 | bwd_microstep: 3370.42 | bwd_inner_microstep: 3369.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 15:49:39,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.86 | bwd: 3370.43 | bwd_inner: 3369.62 | bwd_allreduce: 0.77 | step: 6.97 15%|█▍ | 1483/10000 [2:20:00<13:03:25, 5.52s/it] {'loss': 0.1675, 'grad_norm': 1.6718806028366089, 'learning_rate': 3.854987184092655e-05, 'epoch': 1.48} 15%|█▍ | 1483/10000 [2:20:00<13:03:25, 5.52s/it][2025-06-19 15:49:45,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:49:45,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.06 | bwd_microstep: 3329.75 | bwd_inner_microstep: 3328.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.26 [2025-06-19 15:49:45,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.06 | bwd: 3329.77 | bwd_inner: 3328.94 | bwd_allreduce: 0.78 | step: 7.26 
15%|█▍ | 1484/10000 [2:20:06<13:01:24, 5.51s/it] {'loss': 0.0672, 'grad_norm': 0.5557634234428406, 'learning_rate': 3.854744932164017e-05, 'epoch': 1.48} 15%|█▍ | 1484/10000 [2:20:06<13:01:24, 5.51s/it][2025-06-19 15:49:50,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:49:50,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.75 | bwd_microstep: 3375.40 | bwd_inner_microstep: 3374.52 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.04 [2025-06-19 15:49:50,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.75 | bwd: 3375.42 | bwd_inner: 3374.52 | bwd_allreduce: 0.84 | step: 7.04 15%|█▍ | 1485/10000 [2:20:11<13:03:08, 5.52s/it] {'loss': 0.1428, 'grad_norm': 1.16493558883667, 'learning_rate': 3.854502485681246e-05, 'epoch': 1.48} 15%|█▍ | 1485/10000 [2:20:11<13:03:08, 5.52s/it][2025-06-19 15:49:56,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:49:56,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.89 | bwd_microstep: 3322.68 | bwd_inner_microstep: 3321.86 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.85 [2025-06-19 15:49:56,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.89 | bwd: 3322.69 | bwd_inner: 3321.86 | bwd_allreduce: 0.78 | step: 6.86 15%|█▍ | 1486/10000 [2:20:17<13:01:09, 5.50s/it] {'loss': 0.051, 'grad_norm': 0.47072359919548035, 'learning_rate': 3.85425984466977e-05, 'epoch': 1.49} 15%|█▍ | 1486/10000 [2:20:17<13:01:09, 5.50s/it][2025-06-19 15:50:01,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:50:01,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.41 | bwd_microstep: 3324.50 | bwd_inner_microstep: 3323.70 | 
bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 15:50:01,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.41 | bwd: 3324.52 | bwd_inner: 3323.70 | bwd_allreduce: 0.77 | step: 6.95 15%|█▍ | 1487/10000 [2:20:22<12:59:35, 5.49s/it] {'loss': 0.1683, 'grad_norm': 0.8313842415809631, 'learning_rate': 3.854017009155042e-05, 'epoch': 1.49} 15%|█▍ | 1487/10000 [2:20:22<12:59:35, 5.49s/it][2025-06-19 15:50:07,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:50:07,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.91 | bwd_microstep: 3323.41 | bwd_inner_microstep: 3322.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 15:50:07,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.91 | bwd: 3323.42 | bwd_inner: 3322.60 | bwd_allreduce: 0.78 | step: 6.98 15%|█▍ | 1488/10000 [2:20:28<12:58:39, 5.49s/it] {'loss': 0.0439, 'grad_norm': 0.3375413119792938, 'learning_rate': 3.8537739791625346e-05, 'epoch': 1.49} 15%|█▍ | 1488/10000 [2:20:28<12:58:39, 5.49s/it][2025-06-19 15:50:12,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.74 [2025-06-19 15:50:12,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.09 | bwd_microstep: 3374.43 | bwd_inner_microstep: 3373.32 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.33 [2025-06-19 15:50:12,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.09 | bwd: 3374.45 | bwd_inner: 3373.32 | bwd_allreduce: 1.08 | step: 7.33 15%|█▍ | 1489/10000 [2:20:33<13:00:51, 5.50s/it] {'loss': 0.1019, 'grad_norm': 0.7099437713623047, 'learning_rate': 3.8535307547177405e-05, 'epoch': 1.49} 15%|█▍ | 1489/10000 [2:20:33<13:00:51, 5.50s/it][2025-06-19 15:50:18,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:50:18,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.27 | bwd_microstep: 3332.38 | bwd_inner_microstep: 3331.54 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.42 [2025-06-19 15:50:18,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.27 | bwd: 3332.40 | bwd_inner: 3331.54 | bwd_allreduce: 0.81 | step: 7.43 15%|█▍ | 1490/10000 [2:20:39<12:59:57, 5.50s/it] {'loss': 0.0679, 'grad_norm': 0.46282488107681274, 'learning_rate': 3.853287335846173e-05, 'epoch': 1.49} 15%|█▍ | 1490/10000 [2:20:39<12:59:57, 5.50s/it][2025-06-19 15:50:23,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:50:23,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.67 | bwd_microstep: 3331.36 | bwd_inner_microstep: 3330.36 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.64 [2025-06-19 15:50:23,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.67 | bwd: 3331.38 | bwd_inner: 3330.36 | bwd_allreduce: 0.97 | step: 7.64 15%|█▍ | 1491/10000 [2:20:44<12:59:23, 5.50s/it] {'loss': 0.1279, 'grad_norm': 0.8140884637832642, 'learning_rate': 3.853043722573366e-05, 'epoch': 1.49} 15%|█▍ | 1491/10000 [2:20:44<12:59:23, 5.50s/it][2025-06-19 15:50:29,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:50:29,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.02 | bwd_microstep: 3368.90 | bwd_inner_microstep: 3368.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 15:50:29,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.02 | bwd: 3368.92 | bwd_inner: 3368.12 | bwd_allreduce: 0.76 | step: 6.77 15%|█▍ | 1492/10000 [2:20:50<13:01:07, 5.51s/it] {'loss': 0.1225, 'grad_norm': 
0.8137186765670776, 'learning_rate': 3.8527999149248715e-05, 'epoch': 1.49} 15%|█▍ | 1492/10000 [2:20:50<13:01:07, 5.51s/it][2025-06-19 15:50:34,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:50:34,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.35 | bwd_microstep: 3321.56 | bwd_inner_microstep: 3320.66 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.93 [2025-06-19 15:50:34,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.35 | bwd: 3321.57 | bwd_inner: 3320.66 | bwd_allreduce: 0.86 | step: 6.93 15%|█▍ | 1493/10000 [2:20:55<12:58:56, 5.49s/it] {'loss': 0.1044, 'grad_norm': 0.5539552569389343, 'learning_rate': 3.852555912926265e-05, 'epoch': 1.49} 15%|█▍ | 1493/10000 [2:20:55<12:58:56, 5.49s/it][2025-06-19 15:50:40,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:50:40,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.38 | bwd_microstep: 3316.25 | bwd_inner_microstep: 3315.40 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.28 [2025-06-19 15:50:40,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.38 | bwd: 3316.26 | bwd_inner: 3315.40 | bwd_allreduce: 0.82 | step: 7.28 15%|█▍ | 1494/10000 [2:21:01<12:57:32, 5.48s/it] {'loss': 0.1431, 'grad_norm': 1.1268627643585205, 'learning_rate': 3.852311716603142e-05, 'epoch': 1.49} 15%|█▍ | 1494/10000 [2:21:01<12:57:32, 5.48s/it][2025-06-19 15:50:45,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:50:45,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.64 | bwd_microstep: 3373.07 | bwd_inner_microstep: 3372.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 15:50:45,811] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.64 | bwd: 3373.08 | bwd_inner: 3372.27 | bwd_allreduce: 0.77 | step: 7.13 15%|█▍ | 1495/10000 [2:21:06<12:59:50, 5.50s/it] {'loss': 0.2357, 'grad_norm': 0.7994155287742615, 'learning_rate': 3.852067325981116e-05, 'epoch': 1.5} 15%|█▍ | 1495/10000 [2:21:06<12:59:50, 5.50s/it][2025-06-19 15:50:51,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:50:51,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.07 | bwd_microstep: 3317.12 | bwd_inner_microstep: 3316.04 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.53 [2025-06-19 15:50:51,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.07 | bwd: 3317.14 | bwd_inner: 3316.04 | bwd_allreduce: 1.03 | step: 7.53 15%|█▍ | 1496/10000 [2:21:12<12:57:59, 5.49s/it] {'loss': 0.0564, 'grad_norm': 0.5527992844581604, 'learning_rate': 3.851822741085824e-05, 'epoch': 1.5} 15%|█▍ | 1496/10000 [2:21:12<12:57:59, 5.49s/it][2025-06-19 15:50:56,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.70 | optimizer_step: 2.73 [2025-06-19 15:50:56,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.03 | bwd_microstep: 3319.88 | bwd_inner_microstep: 3319.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 15:50:56,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.03 | bwd: 3319.90 | bwd_inner: 3319.09 | bwd_allreduce: 0.76 | step: 7.06 15%|█▍ | 1497/10000 [2:21:17<12:56:40, 5.48s/it] {'loss': 0.0812, 'grad_norm': 0.8149316906929016, 'learning_rate': 3.851577961942921e-05, 'epoch': 1.5} 15%|█▍ | 1497/10000 [2:21:17<12:56:40, 5.48s/it][2025-06-19 15:51:02,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:51:02,191] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.95 | bwd_microstep: 3312.51 | bwd_inner_microstep: 3311.71 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-19 15:51:02,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.95 | bwd: 3312.52 | bwd_inner: 3311.71 | bwd_allreduce: 0.77 | step: 6.91 15%|█▍ | 1498/10000 [2:21:22<12:55:40, 5.47s/it] {'loss': 0.1005, 'grad_norm': 0.7287960052490234, 'learning_rate': 3.8513329885780824e-05, 'epoch': 1.5} 15%|█▍ | 1498/10000 [2:21:23<12:55:40, 5.47s/it][2025-06-19 15:51:07,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:51:07,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.28 | bwd_microstep: 3313.01 | bwd_inner_microstep: 3312.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 15:51:07,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.28 | bwd: 3313.03 | bwd_inner: 3312.22 | bwd_allreduce: 0.76 | step: 6.67 15%|█▍ | 1499/10000 [2:21:28<13:00:07, 5.51s/it] {'loss': 0.161, 'grad_norm': 0.9890271425247192, 'learning_rate': 3.851087821017006e-05, 'epoch': 1.5} 15%|█▍ | 1499/10000 [2:21:28<13:00:07, 5.51s/it][2025-06-19 15:51:13,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:51:13,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.44 | bwd_microstep: 3363.14 | bwd_inner_microstep: 3362.32 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.61 [2025-06-19 15:51:13,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.44 | bwd: 3363.16 | bwd_inner: 3362.32 | bwd_allreduce: 0.79 | step: 7.61 15%|█▌ | 1500/10000 [2:21:34<13:01:24, 5.52s/it] {'loss': 0.0722, 'grad_norm': 0.44645941257476807, 'learning_rate': 3.8508424592854085e-05, 'epoch': 1.5} 15%|█▌ | 1500/10000 
[2:21:34<13:01:24, 5.52s/it][2025-06-19 15:51:18,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:51:18,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.14 | bwd_microstep: 3314.11 | bwd_inner_microstep: 3313.27 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.76 [2025-06-19 15:51:18,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.14 | bwd: 3314.13 | bwd_inner: 3313.27 | bwd_allreduce: 0.80 | step: 6.76 15%|█▌ | 1501/10000 [2:21:39<12:59:06, 5.50s/it] {'loss': 0.1273, 'grad_norm': 0.9143158197402954, 'learning_rate': 3.850596903409027e-05, 'epoch': 1.5} 15%|█▌ | 1501/10000 [2:21:39<12:59:06, 5.50s/it][2025-06-19 15:51:24,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:51:24,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.87 | bwd_microstep: 3319.42 | bwd_inner_microstep: 3318.59 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.30 [2025-06-19 15:51:24,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.87 | bwd: 3319.43 | bwd_inner: 3318.59 | bwd_allreduce: 0.79 | step: 7.30 15%|█▌ | 1502/10000 [2:21:45<12:57:14, 5.49s/it] {'loss': 0.1776, 'grad_norm': 1.1250059604644775, 'learning_rate': 3.850351153413619e-05, 'epoch': 1.5} 15%|█▌ | 1502/10000 [2:21:45<12:57:14, 5.49s/it][2025-06-19 15:51:29,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:51:29,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.17 | bwd_microstep: 3322.60 | bwd_inner_microstep: 3321.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 15:51:29,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.17 | bwd: 3322.61 | bwd_inner: 3321.81 
| bwd_allreduce: 0.76 | step: 6.70 15%|█▌ | 1503/10000 [2:21:50<12:56:24, 5.48s/it] {'loss': 0.1059, 'grad_norm': 0.6622875928878784, 'learning_rate': 3.850105209324963e-05, 'epoch': 1.5} 15%|█▌ | 1503/10000 [2:21:50<12:56:24, 5.48s/it][2025-06-19 15:51:35,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:51:35,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.52 | bwd_microstep: 3324.32 | bwd_inner_microstep: 3323.50 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.41 [2025-06-19 15:51:35,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.52 | bwd: 3324.33 | bwd_inner: 3323.50 | bwd_allreduce: 0.79 | step: 7.42 15%|█▌ | 1504/10000 [2:21:55<12:55:37, 5.48s/it] {'loss': 0.1642, 'grad_norm': 1.297500491142273, 'learning_rate': 3.849859071168857e-05, 'epoch': 1.5} 15%|█▌ | 1504/10000 [2:21:55<12:55:37, 5.48s/it][2025-06-19 15:51:40,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:51:40,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.13 | bwd_microstep: 3311.95 | bwd_inner_microstep: 3311.13 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.30 [2025-06-19 15:51:40,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.13 | bwd: 3311.97 | bwd_inner: 3311.13 | bwd_allreduce: 0.79 | step: 7.30 15%|█▌ | 1505/10000 [2:22:01<12:54:40, 5.47s/it] {'loss': 0.1295, 'grad_norm': 0.7148704528808594, 'learning_rate': 3.8496127389711205e-05, 'epoch': 1.5} 15%|█▌ | 1505/10000 [2:22:01<12:54:40, 5.47s/it][2025-06-19 15:51:46,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 15:51:46,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.30 | bwd_microstep: 3311.05 | 
bwd_inner_microstep: 3310.03 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.38 [2025-06-19 15:51:46,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.30 | bwd: 3311.07 | bwd_inner: 3310.03 | bwd_allreduce: 1.00 | step: 7.38 15%|█▌ | 1506/10000 [2:22:06<12:54:03, 5.47s/it] {'loss': 0.1005, 'grad_norm': 0.5293793678283691, 'learning_rate': 3.849366212757591e-05, 'epoch': 1.51} 15%|█▌ | 1506/10000 [2:22:06<12:54:03, 5.47s/it][2025-06-19 15:51:51,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:51:51,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.14 | bwd_microstep: 3310.12 | bwd_inner_microstep: 3309.29 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.47 [2025-06-19 15:51:51,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.14 | bwd: 3310.14 | bwd_inner: 3309.29 | bwd_allreduce: 0.80 | step: 7.47 15%|█▌ | 1507/10000 [2:22:12<12:53:21, 5.46s/it] {'loss': 0.0656, 'grad_norm': 0.6750595569610596, 'learning_rate': 3.84911949255413e-05, 'epoch': 1.51} 15%|█▌ | 1507/10000 [2:22:12<12:53:21, 5.46s/it][2025-06-19 15:51:57,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:51:57,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.44 | bwd_microstep: 3364.83 | bwd_inner_microstep: 3363.84 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.07 [2025-06-19 15:51:57,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.44 | bwd: 3364.84 | bwd_inner: 3363.84 | bwd_allreduce: 0.95 | step: 7.08 15%|█▌ | 1508/10000 [2:22:17<12:56:14, 5.48s/it] {'loss': 0.0546, 'grad_norm': 0.306942880153656, 'learning_rate': 3.8488725783866154e-05, 'epoch': 1.51} 15%|█▌ | 1508/10000 [2:22:17<12:56:14, 5.48s/it][2025-06-19 15:52:02,540] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:52:02,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.29 | bwd_microstep: 3325.03 | bwd_inner_microstep: 3324.22 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 15:52:02,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.29 | bwd: 3325.04 | bwd_inner: 3324.22 | bwd_allreduce: 0.78 | step: 7.29 15%|█▌ | 1509/10000 [2:22:23<12:55:29, 5.48s/it] {'loss': 0.0798, 'grad_norm': 0.8725681304931641, 'learning_rate': 3.8486254702809486e-05, 'epoch': 1.51} 15%|█▌ | 1509/10000 [2:22:23<12:55:29, 5.48s/it][2025-06-19 15:52:08,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:52:08,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.70 | bwd_microstep: 3316.68 | bwd_inner_microstep: 3315.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 15:52:08,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.70 | bwd: 3316.70 | bwd_inner: 3315.89 | bwd_allreduce: 0.76 | step: 6.68 15%|█▌ | 1510/10000 [2:22:28<12:54:32, 5.47s/it] {'loss': 0.1361, 'grad_norm': 1.1362348794937134, 'learning_rate': 3.84837816826305e-05, 'epoch': 1.51} 15%|█▌ | 1510/10000 [2:22:28<12:54:32, 5.47s/it][2025-06-19 15:52:13,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:52:13,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.96 | bwd_microstep: 3314.47 | bwd_inner_microstep: 3313.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 15:52:13,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.96 | bwd: 3314.49 | bwd_inner: 3313.68 | bwd_allreduce: 0.76 | step: 6.68 15%|█▌ | 1511/10000 [2:22:34<12:53:35, 5.47s/it] {'loss': 
0.1187, 'grad_norm': 0.7448562383651733, 'learning_rate': 3.84813067235886e-05, 'epoch': 1.51} 15%|█▌ | 1511/10000 [2:22:34<12:53:35, 5.47s/it][2025-06-19 15:52:18,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:52:18,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.58 | bwd_microstep: 3314.21 | bwd_inner_microstep: 3313.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 15:52:18,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.58 | bwd: 3314.22 | bwd_inner: 3313.41 | bwd_allreduce: 0.77 | step: 7.18 15%|█▌ | 1512/10000 [2:22:39<12:53:13, 5.47s/it] {'loss': 0.0707, 'grad_norm': 0.7569354772567749, 'learning_rate': 3.847882982594339e-05, 'epoch': 1.51} 15%|█▌ | 1512/10000 [2:22:39<12:53:13, 5.47s/it][2025-06-19 15:52:24,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:52:24,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.16 | bwd_microstep: 3323.00 | bwd_inner_microstep: 3321.82 | bwd_allreduce_microstep: 1.13 | step_microstep: 7.10 [2025-06-19 15:52:24,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.16 | bwd: 3323.02 | bwd_inner: 3321.82 | bwd_allreduce: 1.15 | step: 7.11 15%|█▌ | 1513/10000 [2:22:45<12:53:14, 5.47s/it] {'loss': 0.1407, 'grad_norm': 0.9017017483711243, 'learning_rate': 3.84763509899547e-05, 'epoch': 1.51} 15%|█▌ | 1513/10000 [2:22:45<12:53:14, 5.47s/it][2025-06-19 15:52:29,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:52:29,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.33 | bwd_microstep: 3314.15 | bwd_inner_microstep: 3313.27 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.71 [2025-06-19 
15:52:29,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.33 | bwd: 3314.16 | bwd_inner: 3313.27 | bwd_allreduce: 0.84 | step: 7.72 15%|█▌ | 1514/10000 [2:22:50<12:53:17, 5.47s/it] {'loss': 0.1026, 'grad_norm': 0.6655123829841614, 'learning_rate': 3.8473870215882544e-05, 'epoch': 1.51} 15%|█▌ | 1514/10000 [2:22:50<12:53:17, 5.47s/it][2025-06-19 15:52:35,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:52:35,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.77 | bwd_microstep: 3312.58 | bwd_inner_microstep: 3311.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-19 15:52:35,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.77 | bwd: 3312.59 | bwd_inner: 3311.79 | bwd_allreduce: 0.76 | step: 6.81 15%|█▌ | 1515/10000 [2:22:56<12:52:24, 5.46s/it] {'loss': 0.0932, 'grad_norm': 0.5901391506195068, 'learning_rate': 3.847138750398714e-05, 'epoch': 1.52} 15%|█▌ | 1515/10000 [2:22:56<12:52:24, 5.46s/it][2025-06-19 15:52:40,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:52:40,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.72 | bwd_microstep: 3364.06 | bwd_inner_microstep: 3363.28 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 15:52:40,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.72 | bwd: 3364.08 | bwd_inner: 3363.28 | bwd_allreduce: 0.75 | step: 6.64 15%|█▌ | 1516/10000 [2:23:01<12:55:03, 5.48s/it] {'loss': 0.0649, 'grad_norm': 0.5393555760383606, 'learning_rate': 3.846890285452892e-05, 'epoch': 1.52} 15%|█▌ | 1516/10000 [2:23:01<12:55:03, 5.48s/it][2025-06-19 15:52:46,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
15:52:46,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.61 | bwd_microstep: 3313.76 | bwd_inner_microstep: 3312.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 15:52:46,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.61 | bwd: 3313.77 | bwd_inner: 3312.95 | bwd_allreduce: 0.78 | step: 7.18 15%|█▌ | 1517/10000 [2:23:07<12:53:48, 5.47s/it] {'loss': 0.1814, 'grad_norm': 0.9905543923377991, 'learning_rate': 3.8466416267768506e-05, 'epoch': 1.52} 15%|█▌ | 1517/10000 [2:23:07<12:53:48, 5.47s/it][2025-06-19 15:52:51,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 15:52:51,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.23 | bwd_microstep: 3309.37 | bwd_inner_microstep: 3308.32 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.44 [2025-06-19 15:52:51,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.23 | bwd: 3309.40 | bwd_inner: 3308.32 | bwd_allreduce: 1.01 | step: 7.43 15%|█▌ | 1518/10000 [2:23:12<12:53:39, 5.47s/it] {'loss': 0.0436, 'grad_norm': 0.3403603732585907, 'learning_rate': 3.8463927743966734e-05, 'epoch': 1.52} 15%|█▌ | 1518/10000 [2:23:12<12:53:39, 5.47s/it][2025-06-19 15:52:57,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:52:57,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.82 | bwd_microstep: 3317.51 | bwd_inner_microstep: 3316.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 15:52:57,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.82 | bwd: 3317.53 | bwd_inner: 3316.73 | bwd_allreduce: 0.75 | step: 6.57 15%|█▌ | 1519/10000 [2:23:18<12:53:52, 5.47s/it] {'loss': 0.1144, 'grad_norm': 0.7285279631614685, 'learning_rate': 3.846143728338462e-05, 'epoch': 1.52} 
15%|█▌ | 1519/10000 [2:23:18<12:53:52, 5.47s/it][2025-06-19 15:53:02,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:53:02,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.86 | bwd_microstep: 3314.59 | bwd_inner_microstep: 3313.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 15:53:02,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.86 | bwd: 3314.60 | bwd_inner: 3313.81 | bwd_allreduce: 0.76 | step: 6.76 15%|█▌ | 1520/10000 [2:23:23<12:52:54, 5.47s/it] {'loss': 0.1449, 'grad_norm': 0.816053569316864, 'learning_rate': 3.845894488628344e-05, 'epoch': 1.52} 15%|█▌ | 1520/10000 [2:23:23<12:52:54, 5.47s/it][2025-06-19 15:53:08,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:53:08,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.54 | bwd_microstep: 3320.24 | bwd_inner_microstep: 3319.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 15:53:08,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.54 | bwd: 3320.26 | bwd_inner: 3319.44 | bwd_allreduce: 0.78 | step: 7.19 15%|█▌ | 1521/10000 [2:23:28<12:52:35, 5.47s/it] {'loss': 0.0921, 'grad_norm': 0.47707945108413696, 'learning_rate': 3.84564505529246e-05, 'epoch': 1.52} 15%|█▌ | 1521/10000 [2:23:28<12:52:35, 5.47s/it][2025-06-19 15:53:13,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:53:13,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.59 | bwd_microstep: 3372.43 | bwd_inner_microstep: 3371.63 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 15:53:13,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.59 | bwd: 3372.44 
| bwd_inner: 3371.63 | bwd_allreduce: 0.77 | step: 7.18 15%|█▌ | 1522/10000 [2:23:34<12:55:27, 5.49s/it] {'loss': 0.0784, 'grad_norm': 0.5216678380966187, 'learning_rate': 3.845395428356975e-05, 'epoch': 1.52} 15%|█▌ | 1522/10000 [2:23:34<12:55:27, 5.49s/it][2025-06-19 15:53:19,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:53:19,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.82 | bwd_microstep: 3316.02 | bwd_inner_microstep: 3315.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 15:53:19,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.82 | bwd: 3316.03 | bwd_inner: 3315.24 | bwd_allreduce: 0.76 | step: 6.57 15%|█▌ | 1523/10000 [2:23:39<12:53:43, 5.48s/it] {'loss': 0.0769, 'grad_norm': 0.5760040879249573, 'learning_rate': 3.845145607848075e-05, 'epoch': 1.52} 15%|█▌ | 1523/10000 [2:23:39<12:53:43, 5.48s/it][2025-06-19 15:53:24,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:53:24,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.06 | bwd_microstep: 3311.88 | bwd_inner_microstep: 3310.96 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.15 [2025-06-19 15:53:24,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.05 | bwd: 3311.90 | bwd_inner: 3310.96 | bwd_allreduce: 0.88 | step: 7.16 15%|█▌ | 1524/10000 [2:23:45<12:52:34, 5.47s/it] {'loss': 0.0836, 'grad_norm': 0.6344616413116455, 'learning_rate': 3.844895593791964e-05, 'epoch': 1.52} 15%|█▌ | 1524/10000 [2:23:45<12:52:34, 5.47s/it][2025-06-19 15:53:30,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 15:53:30,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.05 | 
bwd_microstep: 3316.01 | bwd_inner_microstep: 3315.15 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.61 [2025-06-19 15:53:30,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.05 | bwd: 3316.03 | bwd_inner: 3315.15 | bwd_allreduce: 0.83 | step: 7.61 15%|█▌ | 1525/10000 [2:23:50<12:52:41, 5.47s/it] {'loss': 0.0451, 'grad_norm': 0.31984061002731323, 'learning_rate': 3.844645386214868e-05, 'epoch': 1.52} 15%|█▌ | 1525/10000 [2:23:50<12:52:41, 5.47s/it][2025-06-19 15:53:35,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 15:53:35,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.10 | bwd_microstep: 3365.04 | bwd_inner_microstep: 3363.92 | bwd_allreduce_microstep: 1.05 | step_microstep: 8.18 [2025-06-19 15:53:35,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.10 | bwd: 3365.06 | bwd_inner: 3363.92 | bwd_allreduce: 1.07 | step: 8.18 15%|█▌ | 1526/10000 [2:23:56<12:55:39, 5.49s/it] {'loss': 0.0979, 'grad_norm': 0.6812931299209595, 'learning_rate': 3.844394985143032e-05, 'epoch': 1.53} 15%|█▌ | 1526/10000 [2:23:56<12:55:39, 5.49s/it][2025-06-19 15:53:41,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:53:41,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.88 | bwd_microstep: 3314.40 | bwd_inner_microstep: 3313.53 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.97 [2025-06-19 15:53:41,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.88 | bwd: 3314.42 | bwd_inner: 3313.53 | bwd_allreduce: 0.82 | step: 7.98 15%|█▌ | 1527/10000 [2:24:01<12:54:46, 5.49s/it] {'loss': 0.086, 'grad_norm': 0.48636728525161743, 'learning_rate': 3.844144390602722e-05, 'epoch': 1.53} 15%|█▌ | 1527/10000 [2:24:01<12:54:46, 5.49s/it][2025-06-19 15:53:46,549] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:53:46,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.56 | bwd_microstep: 3307.21 | bwd_inner_microstep: 3306.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 15:53:46,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.56 | bwd: 3307.23 | bwd_inner: 3306.41 | bwd_allreduce: 0.77 | step: 7.08 15%|█▌ | 1528/10000 [2:24:07<12:54:00, 5.48s/it] {'loss': 0.0909, 'grad_norm': 0.8675611615180969, 'learning_rate': 3.843893602620225e-05, 'epoch': 1.53} 15%|█▌ | 1528/10000 [2:24:07<12:54:00, 5.48s/it][2025-06-19 15:53:52,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:53:52,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.53 | bwd_microstep: 3315.15 | bwd_inner_microstep: 3314.19 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.69 [2025-06-19 15:53:52,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.53 | bwd: 3315.17 | bwd_inner: 3314.19 | bwd_allreduce: 0.93 | step: 7.69 15%|█▌ | 1529/10000 [2:24:12<12:52:48, 5.47s/it] {'loss': 0.1052, 'grad_norm': 0.6417496800422668, 'learning_rate': 3.843642621221847e-05, 'epoch': 1.53} 15%|█▌ | 1529/10000 [2:24:12<12:52:48, 5.47s/it][2025-06-19 15:53:57,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:53:57,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.40 | bwd_microstep: 3361.14 | bwd_inner_microstep: 3360.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 15:53:57,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.40 | bwd: 3361.15 | bwd_inner: 3360.34 | bwd_allreduce: 0.76 | step: 6.69 15%|█▌ | 1530/10000 
[2:24:18<12:54:51, 5.49s/it] {'loss': 0.0552, 'grad_norm': 0.3711996376514435, 'learning_rate': 3.8433914464339135e-05, 'epoch': 1.53} 15%|█▌ | 1530/10000 [2:24:18<12:54:51, 5.49s/it][2025-06-19 15:54:02,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:54:02,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.47 | bwd_microstep: 3313.08 | bwd_inner_microstep: 3312.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-19 15:54:02,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.47 | bwd: 3313.09 | bwd_inner: 3312.26 | bwd_allreduce: 0.78 | step: 7.15 15%|█▌ | 1531/10000 [2:24:23<12:53:02, 5.48s/it] {'loss': 0.1375, 'grad_norm': 0.7255916595458984, 'learning_rate': 3.8431400782827734e-05, 'epoch': 1.53} 15%|█▌ | 1531/10000 [2:24:23<12:53:02, 5.48s/it][2025-06-19 15:54:08,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:54:08,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.43 | bwd_microstep: 3363.84 | bwd_inner_microstep: 3363.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 15:54:08,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.43 | bwd: 3363.86 | bwd_inner: 3363.06 | bwd_allreduce: 0.76 | step: 6.76 15%|█▌ | 1532/10000 [2:24:29<12:55:07, 5.49s/it] {'loss': 0.0794, 'grad_norm': 0.5060424208641052, 'learning_rate': 3.842888516794793e-05, 'epoch': 1.53} 15%|█▌ | 1532/10000 [2:24:29<12:55:07, 5.49s/it][2025-06-19 15:54:13,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:54:13,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2095.22 | bwd_microstep: 3302.24 | bwd_inner_microstep: 3301.45 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 15:54:13,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2095.22 | bwd: 3302.25 | bwd_inner: 3301.45 | bwd_allreduce: 0.76 | step: 6.70 15%|█▌ | 1533/10000 [2:24:34<12:53:05, 5.48s/it] {'loss': 0.0921, 'grad_norm': 0.5960351228713989, 'learning_rate': 3.842636761996361e-05, 'epoch': 1.53} 15%|█▌ | 1533/10000 [2:24:34<12:53:05, 5.48s/it][2025-06-19 15:54:19,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:54:19,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.12 | bwd_microstep: 3354.57 | bwd_inner_microstep: 3353.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 15:54:19,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.12 | bwd: 3354.59 | bwd_inner: 3353.78 | bwd_allreduce: 0.77 | step: 6.98 15%|█▌ | 1534/10000 [2:24:40<12:55:15, 5.49s/it] {'loss': 0.1599, 'grad_norm': 0.9122631549835205, 'learning_rate': 3.8423848139138844e-05, 'epoch': 1.53} 15%|█▌ | 1534/10000 [2:24:40<12:55:15, 5.49s/it][2025-06-19 15:54:25,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:54:25,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.72 | bwd_microstep: 3365.60 | bwd_inner_microstep: 3364.75 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.77 [2025-06-19 15:54:25,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.72 | bwd: 3365.61 | bwd_inner: 3364.74 | bwd_allreduce: 0.82 | step: 6.77 15%|█▌ | 1535/10000 [2:24:45<12:56:41, 5.51s/it] {'loss': 0.1335, 'grad_norm': 0.48464348912239075, 'learning_rate': 3.842132672573791e-05, 'epoch': 1.54} 15%|█▌ | 1535/10000 [2:24:45<12:56:41, 5.51s/it][2025-06-19 15:54:30,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:54:30,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.16 | bwd_microstep: 3314.63 | bwd_inner_microstep: 3313.85 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 15:54:30,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.16 | bwd: 3314.65 | bwd_inner: 3313.85 | bwd_allreduce: 0.76 | step: 6.63 15%|█▌ | 1536/10000 [2:24:51<12:54:36, 5.49s/it] {'loss': 0.0661, 'grad_norm': 0.3069077730178833, 'learning_rate': 3.84188033800253e-05, 'epoch': 1.54} 15%|█▌ | 1536/10000 [2:24:51<12:54:36, 5.49s/it][2025-06-19 15:54:35,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:54:35,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2095.82 | bwd_microstep: 3307.90 | bwd_inner_microstep: 3307.07 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.74 [2025-06-19 15:54:35,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2095.82 | bwd: 3307.92 | bwd_inner: 3307.07 | bwd_allreduce: 0.80 | step: 6.74 15%|█▌ | 1537/10000 [2:24:56<12:52:40, 5.48s/it] {'loss': 0.0759, 'grad_norm': 0.545762300491333, 'learning_rate': 3.84162781022657e-05, 'epoch': 1.54} 15%|█▌ | 1537/10000 [2:24:56<12:52:40, 5.48s/it][2025-06-19 15:54:41,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 15:54:41,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.33 | bwd_microstep: 3323.20 | bwd_inner_microstep: 3322.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.51 [2025-06-19 15:54:41,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.33 | bwd: 3323.21 | bwd_inner: 3322.42 | bwd_allreduce: 0.75 | step: 6.51 15%|█▌ | 1538/10000 [2:25:02<12:52:01, 5.47s/it] {'loss': 0.0875, 'grad_norm': 
0.7194203734397888, 'learning_rate': 3.841375089272401e-05, 'epoch': 1.54} 15%|█▌ | 1538/10000 [2:25:02<12:52:01, 5.47s/it][2025-06-19 15:54:46,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 15:54:46,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.98 | bwd_microstep: 3369.13 | bwd_inner_microstep: 3367.96 | bwd_allreduce_microstep: 1.11 | step_microstep: 8.25 [2025-06-19 15:54:46,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.98 | bwd: 3369.15 | bwd_inner: 3367.96 | bwd_allreduce: 1.14 | step: 8.24 15%|█▌ | 1539/10000 [2:25:07<12:54:50, 5.49s/it] {'loss': 0.1022, 'grad_norm': 0.6372009515762329, 'learning_rate': 3.84112217516653e-05, 'epoch': 1.54} 15%|█▌ | 1539/10000 [2:25:07<12:54:50, 5.49s/it][2025-06-19 15:54:52,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:54:52,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.02 | bwd_microstep: 3320.91 | bwd_inner_microstep: 3320.03 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.02 [2025-06-19 15:54:52,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.02 | bwd: 3320.92 | bwd_inner: 3320.03 | bwd_allreduce: 0.84 | step: 7.02 15%|█▌ | 1540/10000 [2:25:13<12:53:52, 5.49s/it] {'loss': 0.0965, 'grad_norm': 0.9023423790931702, 'learning_rate': 3.8408690679354884e-05, 'epoch': 1.54} 15%|█▌ | 1540/10000 [2:25:13<12:53:52, 5.49s/it][2025-06-19 15:54:57,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:54:57,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.02 | bwd_microstep: 3370.73 | bwd_inner_microstep: 3369.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-19 15:54:57,954] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.02 | bwd: 3370.75 | bwd_inner: 3369.92 | bwd_allreduce: 0.78 | step: 7.30 15%|█▌ | 1541/10000 [2:25:18<12:56:28, 5.51s/it] {'loss': 0.1574, 'grad_norm': 1.0623257160186768, 'learning_rate': 3.840615767605825e-05, 'epoch': 1.54} 15%|█▌ | 1541/10000 [2:25:18<12:56:28, 5.51s/it][2025-06-19 15:55:03,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:55:03,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.90 | bwd_microstep: 3315.29 | bwd_inner_microstep: 3314.48 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 15:55:03,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.90 | bwd: 3315.30 | bwd_inner: 3314.48 | bwd_allreduce: 0.77 | step: 6.57 15%|█▌ | 1542/10000 [2:25:24<12:54:46, 5.50s/it] {'loss': 0.0702, 'grad_norm': 0.6809524297714233, 'learning_rate': 3.8403622742041106e-05, 'epoch': 1.54} 15%|█▌ | 1542/10000 [2:25:24<12:54:46, 5.50s/it][2025-06-19 15:55:08,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:55:08,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.11 | bwd_microstep: 3370.96 | bwd_inner_microstep: 3370.13 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.80 [2025-06-19 15:55:08,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.11 | bwd: 3370.98 | bwd_inner: 3370.13 | bwd_allreduce: 0.79 | step: 6.80 15%|█▌ | 1543/10000 [2:25:29<12:56:25, 5.51s/it] {'loss': 0.0434, 'grad_norm': 0.5402117371559143, 'learning_rate': 3.840108587756934e-05, 'epoch': 1.54} 15%|█▌ | 1543/10000 [2:25:29<12:56:25, 5.51s/it][2025-06-19 15:55:14,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:55:14,430] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.55 | bwd_microstep: 3322.07 | bwd_inner_microstep: 3321.26 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.10 [2025-06-19 15:55:14,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.55 | bwd: 3322.09 | bwd_inner: 3321.26 | bwd_allreduce: 0.79 | step: 7.11 15%|█▌ | 1544/10000 [2:25:35<12:54:38, 5.50s/it] {'loss': 0.054, 'grad_norm': 0.4217831790447235, 'learning_rate': 3.839854708290908e-05, 'epoch': 1.54} 15%|█▌ | 1544/10000 [2:25:35<12:54:38, 5.50s/it][2025-06-19 15:55:19,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:55:19,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.99 | bwd_microstep: 3370.54 | bwd_inner_microstep: 3369.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 15:55:19,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.99 | bwd: 3370.55 | bwd_inner: 3369.75 | bwd_allreduce: 0.77 | step: 6.77 15%|█▌ | 1545/10000 [2:25:40<12:56:28, 5.51s/it] {'loss': 0.0423, 'grad_norm': 0.36003604531288147, 'learning_rate': 3.8396006358326615e-05, 'epoch': 1.54} 15%|█▌ | 1545/10000 [2:25:40<12:56:28, 5.51s/it][2025-06-19 15:55:25,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:55:25,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.21 | bwd_microstep: 3372.68 | bwd_inner_microstep: 3371.82 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.80 [2025-06-19 15:55:25,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.21 | bwd: 3372.70 | bwd_inner: 3371.82 | bwd_allreduce: 0.82 | step: 6.80 15%|█▌ | 1546/10000 [2:25:46<12:57:54, 5.52s/it] {'loss': 0.1306, 'grad_norm': 1.0349900722503662, 'learning_rate': 3.839346370408846e-05, 'epoch': 1.55} 15%|█▌ | 
1546/10000 [2:25:46<12:57:54, 5.52s/it][2025-06-19 15:55:30,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:55:30,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.77 | bwd_microstep: 3329.55 | bwd_inner_microstep: 3328.53 | bwd_allreduce_microstep: 0.97 | step_microstep: 8.02 [2025-06-19 15:55:30,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.77 | bwd: 3329.56 | bwd_inner: 3328.53 | bwd_allreduce: 0.99 | step: 8.04 15%|█▌ | 1547/10000 [2:25:51<12:56:04, 5.51s/it] {'loss': 0.0728, 'grad_norm': 0.5643752217292786, 'learning_rate': 3.839091912046133e-05, 'epoch': 1.55} 15%|█▌ | 1547/10000 [2:25:51<12:56:04, 5.51s/it][2025-06-19 15:55:36,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:55:36,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.53 | bwd_microstep: 3329.63 | bwd_inner_microstep: 3328.80 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.93 [2025-06-19 15:55:36,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.53 | bwd: 3329.65 | bwd_inner: 3328.80 | bwd_allreduce: 0.80 | step: 6.94 15%|█▌ | 1548/10000 [2:25:57<12:55:08, 5.50s/it] {'loss': 0.094, 'grad_norm': 0.6395059823989868, 'learning_rate': 3.838837260771214e-05, 'epoch': 1.55} 15%|█▌ | 1548/10000 [2:25:57<12:55:08, 5.50s/it][2025-06-19 15:55:41,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:55:41,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.79 | bwd_microstep: 3326.25 | bwd_inner_microstep: 3325.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 15:55:41,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.79 | bwd: 3326.27 | 
bwd_inner: 3325.47 | bwd_allreduce: 0.75 | step: 6.61 15%|█▌ | 1549/10000 [2:26:02<12:53:52, 5.49s/it] {'loss': 0.0814, 'grad_norm': 0.6555333137512207, 'learning_rate': 3.8385824166108006e-05, 'epoch': 1.55} 15%|█▌ | 1549/10000 [2:26:02<12:53:52, 5.49s/it][2025-06-19 15:55:47,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:55:47,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.22 | bwd_microstep: 3316.57 | bwd_inner_microstep: 3315.74 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.26 [2025-06-19 15:55:47,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.22 | bwd: 3316.58 | bwd_inner: 3315.74 | bwd_allreduce: 0.80 | step: 7.26 16%|█▌ | 1550/10000 [2:26:08<12:52:43, 5.49s/it] {'loss': 0.1026, 'grad_norm': 0.917736291885376, 'learning_rate': 3.8383273795916245e-05, 'epoch': 1.55} 16%|█▌ | 1550/10000 [2:26:08<12:52:43, 5.49s/it][2025-06-19 15:55:52,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 15:55:52,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.81 | bwd_microstep: 3378.26 | bwd_inner_microstep: 3377.25 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.84 [2025-06-19 15:55:52,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.81 | bwd: 3378.29 | bwd_inner: 3377.25 | bwd_allreduce: 0.97 | step: 7.84 16%|█▌ | 1551/10000 [2:26:13<12:55:46, 5.51s/it] {'loss': 0.0892, 'grad_norm': 0.6609808802604675, 'learning_rate': 3.8380721497404385e-05, 'epoch': 1.55} 16%|█▌ | 1551/10000 [2:26:13<12:55:46, 5.51s/it][2025-06-19 15:55:58,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:55:58,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.39 | 
bwd_microstep: 3382.54 | bwd_inner_microstep: 3381.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 15:55:58,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.39 | bwd: 3382.55 | bwd_inner: 3381.75 | bwd_allreduce: 0.77 | step: 6.78 16%|█▌ | 1552/10000 [2:26:19<12:58:02, 5.53s/it] {'loss': 0.1169, 'grad_norm': 0.987816333770752, 'learning_rate': 3.8378167270840155e-05, 'epoch': 1.55} 16%|█▌ | 1552/10000 [2:26:19<12:58:02, 5.53s/it][2025-06-19 15:56:04,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 15:56:04,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.09 | bwd_microstep: 3319.43 | bwd_inner_microstep: 3318.27 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.13 [2025-06-19 15:56:04,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.09 | bwd: 3319.45 | bwd_inner: 3318.27 | bwd_allreduce: 1.12 | step: 8.13 16%|█▌ | 1553/10000 [2:26:24<12:56:04, 5.51s/it] {'loss': 0.1009, 'grad_norm': 0.9248983263969421, 'learning_rate': 3.837561111649146e-05, 'epoch': 1.55} 16%|█▌ | 1553/10000 [2:26:24<12:56:04, 5.51s/it][2025-06-19 15:56:09,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 15:56:09,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.90 | bwd_microstep: 3371.80 | bwd_inner_microstep: 3371.02 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.61 [2025-06-19 15:56:09,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.90 | bwd: 3371.82 | bwd_inner: 3371.03 | bwd_allreduce: 0.75 | step: 6.62 16%|█▌ | 1554/10000 [2:26:30<12:57:17, 5.52s/it] {'loss': 0.0527, 'grad_norm': 0.4474641680717468, 'learning_rate': 3.8373053034626454e-05, 'epoch': 1.55} 16%|█▌ | 1554/10000 [2:26:30<12:57:17, 5.52s/it][2025-06-19 15:56:15,121] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:56:15,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.34 | bwd_microstep: 3373.58 | bwd_inner_microstep: 3372.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 15:56:15,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.34 | bwd: 3373.59 | bwd_inner: 3372.77 | bwd_allreduce: 0.78 | step: 7.17 16%|█▌ | 1555/10000 [2:26:35<12:57:58, 5.53s/it] {'loss': 0.0693, 'grad_norm': 0.8962613344192505, 'learning_rate': 3.8370493025513455e-05, 'epoch': 1.56} 16%|█▌ | 1555/10000 [2:26:35<12:57:58, 5.53s/it][2025-06-19 15:56:20,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:56:20,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.66 | bwd_microstep: 3317.37 | bwd_inner_microstep: 3316.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 15:56:20,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.66 | bwd: 3317.39 | bwd_inner: 3316.57 | bwd_allreduce: 0.77 | step: 6.75 16%|█▌ | 1556/10000 [2:26:41<12:55:07, 5.51s/it] {'loss': 0.056, 'grad_norm': 0.8418442606925964, 'learning_rate': 3.836793108942099e-05, 'epoch': 1.56} 16%|█▌ | 1556/10000 [2:26:41<12:55:07, 5.51s/it][2025-06-19 15:56:26,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:56:26,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.35 | bwd_microstep: 3376.45 | bwd_inner_microstep: 3375.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 15:56:26,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.35 | bwd: 3376.46 | bwd_inner: 3375.65 | bwd_allreduce: 0.77 | step: 6.69 16%|█▌ | 1557/10000 
[2:26:46<12:56:33, 5.52s/it] {'loss': 0.1092, 'grad_norm': 1.1124258041381836, 'learning_rate': 3.836536722661781e-05, 'epoch': 1.56} 16%|█▌ | 1557/10000 [2:26:46<12:56:33, 5.52s/it][2025-06-19 15:56:31,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:56:31,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.42 | bwd_microstep: 3335.55 | bwd_inner_microstep: 3334.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.40 [2025-06-19 15:56:31,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.42 | bwd: 3335.57 | bwd_inner: 3334.75 | bwd_allreduce: 0.78 | step: 7.40 16%|█▌ | 1558/10000 [2:26:52<12:55:00, 5.51s/it] {'loss': 0.0926, 'grad_norm': 0.9039790034294128, 'learning_rate': 3.836280143737284e-05, 'epoch': 1.56} 16%|█▌ | 1558/10000 [2:26:52<12:55:00, 5.51s/it][2025-06-19 15:56:37,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:56:37,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.99 | bwd_microstep: 3322.98 | bwd_inner_microstep: 3322.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 15:56:37,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.99 | bwd: 3322.99 | bwd_inner: 3322.20 | bwd_allreduce: 0.75 | step: 6.60 16%|█▌ | 1559/10000 [2:26:57<12:53:17, 5.50s/it] {'loss': 0.108, 'grad_norm': 1.8343311548233032, 'learning_rate': 3.836023372195523e-05, 'epoch': 1.56} 16%|█▌ | 1559/10000 [2:26:57<12:53:17, 5.50s/it][2025-06-19 15:56:42,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:56:42,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.85 | bwd_microstep: 3323.26 | bwd_inner_microstep: 3322.45 | bwd_allreduce_microstep: 
0.76 | step_microstep: 7.24 [2025-06-19 15:56:42,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.85 | bwd: 3323.27 | bwd_inner: 3322.45 | bwd_allreduce: 0.78 | step: 7.25 16%|█▌ | 1560/10000 [2:27:03<12:52:00, 5.49s/it] {'loss': 0.1199, 'grad_norm': 1.040116786956787, 'learning_rate': 3.8357664080634306e-05, 'epoch': 1.56} 16%|█▌ | 1560/10000 [2:27:03<12:52:00, 5.49s/it][2025-06-19 15:56:48,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:56:48,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.76 | bwd_microstep: 3380.15 | bwd_inner_microstep: 3379.24 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.50 [2025-06-19 15:56:48,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.76 | bwd: 3380.16 | bwd_inner: 3379.24 | bwd_allreduce: 0.88 | step: 7.50 16%|█▌ | 1561/10000 [2:27:08<12:54:55, 5.51s/it] {'loss': 0.1234, 'grad_norm': 1.1735632419586182, 'learning_rate': 3.835509251367963e-05, 'epoch': 1.56} 16%|█▌ | 1561/10000 [2:27:08<12:54:55, 5.51s/it][2025-06-19 15:56:53,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:56:53,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.95 | bwd_microstep: 3374.33 | bwd_inner_microstep: 3373.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 15:56:53,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.95 | bwd: 3374.34 | bwd_inner: 3373.52 | bwd_allreduce: 0.77 | step: 7.13 16%|█▌ | 1562/10000 [2:27:14<12:56:44, 5.52s/it] {'loss': 0.0711, 'grad_norm': 0.5076600909233093, 'learning_rate': 3.8352519021360925e-05, 'epoch': 1.56} 16%|█▌ | 1562/10000 [2:27:14<12:56:44, 5.52s/it][2025-06-19 15:56:59,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.49 | optimizer_step: 2.73 [2025-06-19 15:56:59,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.34 | bwd_microstep: 3315.84 | bwd_inner_microstep: 3315.05 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 15:56:59,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.34 | bwd: 3315.85 | bwd_inner: 3315.05 | bwd_allreduce: 0.76 | step: 6.65 16%|█▌ | 1563/10000 [2:27:19<12:54:12, 5.51s/it] {'loss': 0.1129, 'grad_norm': 0.8403694033622742, 'learning_rate': 3.834994360394817e-05, 'epoch': 1.56} 16%|█▌ | 1563/10000 [2:27:19<12:54:12, 5.51s/it][2025-06-19 15:57:04,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:57:04,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.22 | bwd_microstep: 3322.98 | bwd_inner_microstep: 3322.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 15:57:04,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.22 | bwd: 3322.99 | bwd_inner: 3322.20 | bwd_allreduce: 0.75 | step: 6.57 16%|█▌ | 1564/10000 [2:27:25<12:52:22, 5.49s/it] {'loss': 0.1108, 'grad_norm': 0.7518206238746643, 'learning_rate': 3.8347366261711475e-05, 'epoch': 1.56} 16%|█▌ | 1564/10000 [2:27:25<12:52:22, 5.49s/it][2025-06-19 15:57:10,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:57:10,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.00 | bwd_microstep: 3323.02 | bwd_inner_microstep: 3322.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-19 15:57:10,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.00 | bwd: 3323.04 | bwd_inner: 3322.22 | bwd_allreduce: 0.77 | step: 7.13 16%|█▌ | 1565/10000 [2:27:30<12:51:23, 5.49s/it] {'loss': 0.1573, 'grad_norm': 1.2268086671829224, 'learning_rate': 
3.834478699492122e-05, 'epoch': 1.56} 16%|█▌ | 1565/10000 [2:27:30<12:51:23, 5.49s/it][2025-06-19 15:57:15,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:57:15,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.55 | bwd_microstep: 3371.82 | bwd_inner_microstep: 3371.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 15:57:15,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.55 | bwd: 3371.83 | bwd_inner: 3371.04 | bwd_allreduce: 0.75 | step: 6.54 16%|█▌ | 1566/10000 [2:27:36<12:53:18, 5.50s/it] {'loss': 0.1617, 'grad_norm': 1.042487621307373, 'learning_rate': 3.8342205803847956e-05, 'epoch': 1.57} 16%|█▌ | 1566/10000 [2:27:36<12:53:18, 5.50s/it][2025-06-19 15:57:21,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:57:21,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.31 | bwd_microstep: 3384.65 | bwd_inner_microstep: 3383.59 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.27 [2025-06-19 15:57:21,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.31 | bwd: 3384.67 | bwd_inner: 3383.59 | bwd_allreduce: 1.03 | step: 7.27 16%|█▌ | 1567/10000 [2:27:41<12:55:33, 5.52s/it] {'loss': 0.0868, 'grad_norm': 1.0468907356262207, 'learning_rate': 3.833962268876243e-05, 'epoch': 1.57} 16%|█▌ | 1567/10000 [2:27:41<12:55:33, 5.52s/it][2025-06-19 15:57:26,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:57:26,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.96 | bwd_microstep: 3375.32 | bwd_inner_microstep: 3374.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 15:57:26,705] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2131.96 | bwd: 3375.33 | bwd_inner: 3374.51 | bwd_allreduce: 0.78 | step: 7.06 16%|█▌ | 1568/10000 [2:27:47<12:56:41, 5.53s/it] {'loss': 0.0787, 'grad_norm': 0.9684480428695679, 'learning_rate': 3.8337037649935594e-05, 'epoch': 1.57} 16%|█▌ | 1568/10000 [2:27:47<12:56:41, 5.53s/it][2025-06-19 15:57:32,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 15:57:32,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.14 | bwd_microstep: 3370.28 | bwd_inner_microstep: 3369.47 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.74 [2025-06-19 15:57:32,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.15 | bwd: 3370.30 | bwd_inner: 3369.47 | bwd_allreduce: 0.79 | step: 6.75 16%|█▌ | 1569/10000 [2:27:53<12:57:14, 5.53s/it] {'loss': 0.0455, 'grad_norm': 0.4122813045978546, 'learning_rate': 3.833445068763862e-05, 'epoch': 1.57} 16%|█▌ | 1569/10000 [2:27:53<12:57:14, 5.53s/it][2025-06-19 15:57:37,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.88 [2025-06-19 15:57:37,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.44 | bwd_microstep: 3329.60 | bwd_inner_microstep: 3328.79 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.83 [2025-06-19 15:57:37,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.44 | bwd: 3329.61 | bwd_inner: 3328.79 | bwd_allreduce: 0.78 | step: 6.83 16%|█▌ | 1570/10000 [2:27:58<12:55:01, 5.52s/it] {'loss': 0.0871, 'grad_norm': 0.49296891689300537, 'learning_rate': 3.833186180214286e-05, 'epoch': 1.57} 16%|█▌ | 1570/10000 [2:27:58<12:55:01, 5.52s/it][2025-06-19 15:57:43,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:57:43,204] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2108.77 | bwd_microstep: 3330.78 | bwd_inner_microstep: 3330.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 15:57:43,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.77 | bwd: 3330.79 | bwd_inner: 3330.00 | bwd_allreduce: 0.75 | step: 6.62 16%|█▌ | 1571/10000 [2:28:04<12:53:11, 5.50s/it] {'loss': 0.0958, 'grad_norm': 1.0317144393920898, 'learning_rate': 3.8329270993719874e-05, 'epoch': 1.57} 16%|█▌ | 1571/10000 [2:28:04<12:53:11, 5.50s/it][2025-06-19 15:57:48,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:57:48,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.50 | bwd_microstep: 3329.17 | bwd_inner_microstep: 3328.36 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.19 [2025-06-19 15:57:48,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.50 | bwd: 3329.19 | bwd_inner: 3328.36 | bwd_allreduce: 0.79 | step: 7.20 16%|█▌ | 1572/10000 [2:28:09<12:52:03, 5.50s/it] {'loss': 0.0522, 'grad_norm': 0.5935760140419006, 'learning_rate': 3.832667826264144e-05, 'epoch': 1.57} 16%|█▌ | 1572/10000 [2:28:09<12:52:03, 5.50s/it][2025-06-19 15:57:54,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:57:54,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.00 | bwd_microstep: 3332.77 | bwd_inner_microstep: 3331.95 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.32 [2025-06-19 15:57:54,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.00 | bwd: 3332.78 | bwd_inner: 3331.95 | bwd_allreduce: 0.79 | step: 7.32 16%|█▌ | 1573/10000 [2:28:14<12:51:43, 5.49s/it] {'loss': 0.1091, 'grad_norm': 1.0092599391937256, 'learning_rate': 3.8324083609179504e-05, 'epoch': 1.57} 16%|█▌ | 1573/10000 [2:28:14<12:51:43, 5.49s/it][2025-06-19 
15:57:59,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:57:59,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.17 | bwd_microstep: 3327.73 | bwd_inner_microstep: 3326.93 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 15:57:59,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.17 | bwd: 3327.74 | bwd_inner: 3326.93 | bwd_allreduce: 0.77 | step: 7.18 16%|█▌ | 1574/10000 [2:28:20<12:51:26, 5.49s/it] {'loss': 0.0853, 'grad_norm': 0.8258552551269531, 'learning_rate': 3.832148703360624e-05, 'epoch': 1.57} 16%|█▌ | 1574/10000 [2:28:20<12:51:26, 5.49s/it][2025-06-19 15:58:05,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 15:58:05,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.45 | bwd_microstep: 3328.81 | bwd_inner_microstep: 3327.85 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.63 [2025-06-19 15:58:05,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.45 | bwd: 3328.82 | bwd_inner: 3327.85 | bwd_allreduce: 0.93 | step: 7.63 16%|█▌ | 1575/10000 [2:28:25<12:50:51, 5.49s/it] {'loss': 0.0953, 'grad_norm': 1.1219655275344849, 'learning_rate': 3.8318888536194025e-05, 'epoch': 1.57} 16%|█▌ | 1575/10000 [2:28:25<12:50:51, 5.49s/it][2025-06-19 15:58:10,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:58:10,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.18 | bwd_microstep: 3323.34 | bwd_inner_microstep: 3322.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 15:58:10,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.18 | bwd: 3323.36 | bwd_inner: 3322.56 | bwd_allreduce: 0.75 | step: 6.52 
16%|█▌ | 1576/10000 [2:28:31<12:50:13, 5.49s/it] {'loss': 0.139, 'grad_norm': 0.8141900300979614, 'learning_rate': 3.831628811721542e-05, 'epoch': 1.58} 16%|█▌ | 1576/10000 [2:28:31<12:50:13, 5.49s/it][2025-06-19 15:58:16,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:58:16,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.81 | bwd_microstep: 3371.63 | bwd_inner_microstep: 3370.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.10 [2025-06-19 15:58:16,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.81 | bwd: 3371.64 | bwd_inner: 3370.83 | bwd_allreduce: 0.77 | step: 7.10 16%|█▌ | 1577/10000 [2:28:36<12:52:26, 5.50s/it] {'loss': 0.1154, 'grad_norm': 0.8715096116065979, 'learning_rate': 3.83136857769432e-05, 'epoch': 1.58} 16%|█▌ | 1577/10000 [2:28:36<12:52:26, 5.50s/it][2025-06-19 15:58:21,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:58:21,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.45 | bwd_microstep: 3327.37 | bwd_inner_microstep: 3326.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 15:58:21,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.45 | bwd: 3327.38 | bwd_inner: 3326.57 | bwd_allreduce: 0.76 | step: 6.70 16%|█▌ | 1578/10000 [2:28:42<12:51:27, 5.50s/it] {'loss': 0.0635, 'grad_norm': 0.4298282563686371, 'learning_rate': 3.8311081515650336e-05, 'epoch': 1.58} 16%|█▌ | 1578/10000 [2:28:42<12:51:27, 5.50s/it][2025-06-19 15:58:27,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:58:27,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.00 | bwd_microstep: 3317.23 | bwd_inner_microstep: 3316.29 | 
bwd_allreduce_microstep: 0.89 | step_microstep: 7.26 [2025-06-19 15:58:27,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.00 | bwd: 3317.25 | bwd_inner: 3316.29 | bwd_allreduce: 0.91 | step: 7.27 16%|█▌ | 1579/10000 [2:28:47<12:49:49, 5.49s/it] {'loss': 0.0504, 'grad_norm': 0.42180952429771423, 'learning_rate': 3.830847533361001e-05, 'epoch': 1.58} 16%|█▌ | 1579/10000 [2:28:47<12:49:49, 5.49s/it][2025-06-19 15:58:32,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:58:32,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.31 | bwd_microstep: 3375.66 | bwd_inner_microstep: 3374.84 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.93 [2025-06-19 15:58:32,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.31 | bwd: 3375.67 | bwd_inner: 3374.84 | bwd_allreduce: 0.79 | step: 6.94 16%|█▌ | 1580/10000 [2:28:53<12:52:05, 5.50s/it] {'loss': 0.0899, 'grad_norm': 0.7625680565834045, 'learning_rate': 3.83058672310956e-05, 'epoch': 1.58} 16%|█▌ | 1580/10000 [2:28:53<12:52:05, 5.50s/it][2025-06-19 15:58:38,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 15:58:38,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.30 | bwd_microstep: 3374.68 | bwd_inner_microstep: 3373.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 15:58:38,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.30 | bwd: 3374.69 | bwd_inner: 3373.89 | bwd_allreduce: 0.76 | step: 6.63 16%|█▌ | 1581/10000 [2:28:58<12:53:38, 5.51s/it] {'loss': 0.1385, 'grad_norm': 1.2056218385696411, 'learning_rate': 3.830325720838066e-05, 'epoch': 1.58} 16%|█▌ | 1581/10000 [2:28:58<12:53:38, 5.51s/it][2025-06-19 15:58:43,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.52 | optimizer_step: 2.89 [2025-06-19 15:58:43,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.20 | bwd_microstep: 3324.21 | bwd_inner_microstep: 3323.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 15:58:43,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.20 | bwd: 3324.22 | bwd_inner: 3323.40 | bwd_allreduce: 0.78 | step: 7.10 16%|█▌ | 1582/10000 [2:29:04<12:51:24, 5.50s/it] {'loss': 0.0634, 'grad_norm': 0.49106448888778687, 'learning_rate': 3.8300645265738996e-05, 'epoch': 1.58} 16%|█▌ | 1582/10000 [2:29:04<12:51:24, 5.50s/it][2025-06-19 15:58:49,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 15:58:49,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.52 | bwd_microstep: 3315.83 | bwd_inner_microstep: 3314.76 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.22 [2025-06-19 15:58:49,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.52 | bwd: 3315.85 | bwd_inner: 3314.76 | bwd_allreduce: 1.04 | step: 7.22 16%|█▌ | 1583/10000 [2:29:09<12:49:37, 5.49s/it] {'loss': 0.1269, 'grad_norm': 1.312484860420227, 'learning_rate': 3.829803140344458e-05, 'epoch': 1.58} 16%|█▌ | 1583/10000 [2:29:09<12:49:37, 5.49s/it][2025-06-19 15:58:54,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 15:58:54,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.76 | bwd_microstep: 3324.79 | bwd_inner_microstep: 3323.97 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-19 15:58:54,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.76 | bwd: 3324.80 | bwd_inner: 3323.97 | bwd_allreduce: 0.79 | step: 7.23 16%|█▌ | 1584/10000 [2:29:15<12:49:22, 5.49s/it] {'loss': 0.114, 'grad_norm': 
0.6115177869796753, 'learning_rate': 3.8295415621771596e-05, 'epoch': 1.58} 16%|█▌ | 1584/10000 [2:29:15<12:49:22, 5.49s/it][2025-06-19 15:59:00,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 15:59:00,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.99 | bwd_microstep: 3366.00 | bwd_inner_microstep: 3364.97 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.08 [2025-06-19 15:59:00,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.99 | bwd: 3366.02 | bwd_inner: 3364.97 | bwd_allreduce: 1.00 | step: 7.08 16%|█▌ | 1585/10000 [2:29:20<12:51:24, 5.50s/it] {'loss': 0.1289, 'grad_norm': 1.2822344303131104, 'learning_rate': 3.8292797920994426e-05, 'epoch': 1.58} 16%|█▌ | 1585/10000 [2:29:20<12:51:24, 5.50s/it][2025-06-19 15:59:05,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:59:05,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.33 | bwd_microstep: 3321.24 | bwd_inner_microstep: 3320.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 15:59:05,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.33 | bwd: 3321.25 | bwd_inner: 3320.44 | bwd_allreduce: 0.77 | step: 6.62 16%|█▌ | 1586/10000 [2:29:26<12:49:51, 5.49s/it] {'loss': 0.0685, 'grad_norm': 0.5087626576423645, 'learning_rate': 3.8290178301387646e-05, 'epoch': 1.59} 16%|█▌ | 1586/10000 [2:29:26<12:49:51, 5.49s/it][2025-06-19 15:59:11,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:59:11,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.73 | bwd_microstep: 3370.83 | bwd_inner_microstep: 3370.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.09 [2025-06-19 15:59:11,126] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.73 | bwd: 3370.84 | bwd_inner: 3370.04 | bwd_allreduce: 0.76 | step: 7.10 16%|█▌ | 1587/10000 [2:29:31<12:51:42, 5.50s/it] {'loss': 0.1046, 'grad_norm': 0.5480300188064575, 'learning_rate': 3.828755676322605e-05, 'epoch': 1.59} 16%|█▌ | 1587/10000 [2:29:31<12:51:42, 5.50s/it][2025-06-19 15:59:16,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:59:16,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.20 | bwd_microstep: 3330.26 | bwd_inner_microstep: 3329.45 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.13 [2025-06-19 15:59:16,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.20 | bwd: 3330.28 | bwd_inner: 3329.45 | bwd_allreduce: 0.79 | step: 7.14 16%|█▌ | 1588/10000 [2:29:37<12:50:41, 5.50s/it] {'loss': 0.1, 'grad_norm': 0.8024443984031677, 'learning_rate': 3.8284933306784634e-05, 'epoch': 1.59} 16%|█▌ | 1588/10000 [2:29:37<12:50:41, 5.50s/it][2025-06-19 15:59:22,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:59:22,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.22 | bwd_microstep: 3332.12 | bwd_inner_microstep: 3331.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 15:59:22,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.22 | bwd: 3332.13 | bwd_inner: 3331.33 | bwd_allreduce: 0.76 | step: 6.57 16%|█▌ | 1589/10000 [2:29:42<12:50:02, 5.49s/it] {'loss': 0.1459, 'grad_norm': 1.3692011833190918, 'learning_rate': 3.8282307932338574e-05, 'epoch': 1.59} 16%|█▌ | 1589/10000 [2:29:42<12:50:02, 5.49s/it][2025-06-19 15:59:27,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:59:27,573] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.25 | bwd_microstep: 3329.93 | bwd_inner_microstep: 3329.06 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.48 [2025-06-19 15:59:27,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.25 | bwd: 3329.95 | bwd_inner: 3329.06 | bwd_allreduce: 0.83 | step: 7.49 16%|█▌ | 1590/10000 [2:29:48<12:49:36, 5.49s/it] {'loss': 0.1022, 'grad_norm': 0.7022486329078674, 'learning_rate': 3.827968064016326e-05, 'epoch': 1.59} 16%|█▌ | 1590/10000 [2:29:48<12:49:36, 5.49s/it][2025-06-19 15:59:33,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:59:33,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.48 | bwd_microstep: 3372.16 | bwd_inner_microstep: 3371.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 15:59:33,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.48 | bwd: 3372.17 | bwd_inner: 3371.38 | bwd_allreduce: 0.75 | step: 6.64 16%|█▌ | 1591/10000 [2:29:53<12:51:39, 5.51s/it] {'loss': 0.0497, 'grad_norm': 0.3578729033470154, 'learning_rate': 3.827705143053428e-05, 'epoch': 1.59} 16%|█▌ | 1591/10000 [2:29:53<12:51:39, 5.51s/it][2025-06-19 15:59:38,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 15:59:38,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.79 | bwd_microstep: 3317.23 | bwd_inner_microstep: 3316.28 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.05 [2025-06-19 15:59:38,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.79 | bwd: 3317.24 | bwd_inner: 3316.28 | bwd_allreduce: 0.92 | step: 7.06 16%|█▌ | 1592/10000 [2:29:59<12:49:27, 5.49s/it] {'loss': 0.0942, 'grad_norm': 0.8146766424179077, 'learning_rate': 3.827442030372744e-05, 'epoch': 1.59} 16%|█▌ | 1592/10000 
[2:29:59<12:49:27, 5.49s/it][2025-06-19 15:59:44,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:59:44,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.12 | bwd_microstep: 3321.05 | bwd_inner_microstep: 3320.19 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.87 [2025-06-19 15:59:44,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.12 | bwd: 3321.06 | bwd_inner: 3320.19 | bwd_allreduce: 0.83 | step: 6.87 16%|█▌ | 1593/10000 [2:30:04<12:48:48, 5.49s/it] {'loss': 0.0834, 'grad_norm': 0.7261694669723511, 'learning_rate': 3.8271787260018725e-05, 'epoch': 1.59} 16%|█▌ | 1593/10000 [2:30:04<12:48:48, 5.49s/it][2025-06-19 15:59:49,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 15:59:49,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.37 | bwd_microstep: 3332.15 | bwd_inner_microstep: 3331.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 15:59:49,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.37 | bwd: 3332.16 | bwd_inner: 3331.35 | bwd_allreduce: 0.76 | step: 6.65 16%|█▌ | 1594/10000 [2:30:10<12:48:42, 5.49s/it] {'loss': 0.1267, 'grad_norm': 0.7593182325363159, 'learning_rate': 3.826915229968433e-05, 'epoch': 1.59} 16%|█▌ | 1594/10000 [2:30:10<12:48:42, 5.49s/it][2025-06-19 15:59:55,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 15:59:55,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.20 | bwd_microstep: 3324.10 | bwd_inner_microstep: 3323.30 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.34 [2025-06-19 15:59:55,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.20 | bwd: 3324.11 | bwd_inner: 
3323.30 | bwd_allreduce: 0.77 | step: 7.34 16%|█▌ | 1595/10000 [2:30:15<12:47:48, 5.48s/it] {'loss': 0.0777, 'grad_norm': 0.7257760167121887, 'learning_rate': 3.8266515423000635e-05, 'epoch': 1.59} 16%|█▌ | 1595/10000 [2:30:15<12:47:48, 5.48s/it][2025-06-19 16:00:00,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:00:00,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.22 | bwd_microstep: 3325.31 | bwd_inner_microstep: 3324.26 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.26 [2025-06-19 16:00:00,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.22 | bwd: 3325.33 | bwd_inner: 3324.26 | bwd_allreduce: 1.02 | step: 7.26 16%|█▌ | 1596/10000 [2:30:21<12:47:47, 5.48s/it] {'loss': 0.0863, 'grad_norm': 1.4398233890533447, 'learning_rate': 3.826387663024426e-05, 'epoch': 1.6} 16%|█▌ | 1596/10000 [2:30:21<12:47:47, 5.48s/it][2025-06-19 16:00:06,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:00:06,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.04 | bwd_microstep: 3374.68 | bwd_inner_microstep: 3373.88 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.65 [2025-06-19 16:00:06,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.04 | bwd: 3374.69 | bwd_inner: 3373.88 | bwd_allreduce: 0.77 | step: 6.65 16%|█▌ | 1597/10000 [2:30:26<12:50:15, 5.50s/it] {'loss': 0.1085, 'grad_norm': 0.998599112033844, 'learning_rate': 3.826123592169199e-05, 'epoch': 1.6} 16%|█▌ | 1597/10000 [2:30:26<12:50:15, 5.50s/it][2025-06-19 16:00:11,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:00:11,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.27 | bwd_microstep: 
3311.89 | bwd_inner_microstep: 3311.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.63 [2025-06-19 16:00:11,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.27 | bwd: 3311.90 | bwd_inner: 3311.08 | bwd_allreduce: 0.78 | step: 6.64 16%|█▌ | 1598/10000 [2:30:32<12:48:15, 5.49s/it] {'loss': 0.1143, 'grad_norm': 1.3785936832427979, 'learning_rate': 3.825859329762082e-05, 'epoch': 1.6} 16%|█▌ | 1598/10000 [2:30:32<12:48:15, 5.49s/it][2025-06-19 16:00:17,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:00:17,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.36 | bwd_microstep: 3362.90 | bwd_inner_microstep: 3362.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 16:00:17,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.36 | bwd: 3362.91 | bwd_inner: 3362.10 | bwd_allreduce: 0.77 | step: 6.71 16%|█▌ | 1599/10000 [2:30:37<12:49:47, 5.50s/it] {'loss': 0.09, 'grad_norm': 0.8160290122032166, 'learning_rate': 3.8255948758307966e-05, 'epoch': 1.6} 16%|█▌ | 1599/10000 [2:30:37<12:49:47, 5.50s/it][2025-06-19 16:00:22,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 16:00:22,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.69 | bwd_microstep: 3322.36 | bwd_inner_microstep: 3321.42 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.32 [2025-06-19 16:00:22,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.69 | bwd: 3322.38 | bwd_inner: 3321.42 | bwd_allreduce: 0.91 | step: 7.32 16%|█▌ | 1600/10000 [2:30:43<12:48:31, 5.49s/it] {'loss': 0.0823, 'grad_norm': 0.9850350618362427, 'learning_rate': 3.825330230403081e-05, 'epoch': 1.6} 16%|█▌ | 1600/10000 [2:30:43<12:48:31, 5.49s/it][2025-06-19 16:00:28,018] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:00:28,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.15 | bwd_microstep: 3371.62 | bwd_inner_microstep: 3370.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 16:00:28,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.15 | bwd: 3371.63 | bwd_inner: 3370.83 | bwd_allreduce: 0.76 | step: 6.65 16%|█▌ | 1601/10000 [2:30:48<12:50:28, 5.50s/it] {'loss': 0.095, 'grad_norm': 0.6299021244049072, 'learning_rate': 3.825065393506695e-05, 'epoch': 1.6} 16%|█▌ | 1601/10000 [2:30:48<12:50:28, 5.50s/it][2025-06-19 16:00:33,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:00:33,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.24 | bwd_microstep: 3365.53 | bwd_inner_microstep: 3364.72 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-19 16:00:33,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.24 | bwd: 3365.55 | bwd_inner: 3364.72 | bwd_allreduce: 0.77 | step: 6.77 16%|█▌ | 1602/10000 [2:30:54<12:51:21, 5.51s/it] {'loss': 0.1256, 'grad_norm': 1.902785301208496, 'learning_rate': 3.8248003651694206e-05, 'epoch': 1.6} 16%|█▌ | 1602/10000 [2:30:54<12:51:21, 5.51s/it][2025-06-19 16:00:39,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:00:39,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.28 | bwd_microstep: 3323.81 | bwd_inner_microstep: 3323.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 16:00:39,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.28 | bwd: 3323.82 | bwd_inner: 3323.01 | bwd_allreduce: 0.77 | step: 6.93 16%|█▌ | 1603/10000 [2:30:59<12:49:37, 5.50s/it] {'loss': 
0.0812, 'grad_norm': 0.8365567922592163, 'learning_rate': 3.824535145419057e-05, 'epoch': 1.6} 16%|█▌ | 1603/10000 [2:30:59<12:49:37, 5.50s/it][2025-06-19 16:00:44,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:00:44,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.94 | bwd_microstep: 3318.84 | bwd_inner_microstep: 3317.84 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.14 [2025-06-19 16:00:44,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.94 | bwd: 3318.86 | bwd_inner: 3317.84 | bwd_allreduce: 0.96 | step: 7.15 16%|█▌ | 1604/10000 [2:31:05<12:47:53, 5.49s/it] {'loss': 0.0805, 'grad_norm': 0.6709797978401184, 'learning_rate': 3.824269734283424e-05, 'epoch': 1.6} 16%|█▌ | 1604/10000 [2:31:05<12:47:53, 5.49s/it][2025-06-19 16:00:49,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:00:49,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.62 | bwd_microstep: 3313.96 | bwd_inner_microstep: 3313.11 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.12 [2025-06-19 16:00:49,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.62 | bwd: 3313.97 | bwd_inner: 3313.11 | bwd_allreduce: 0.81 | step: 7.13 16%|█▌ | 1605/10000 [2:31:10<12:47:09, 5.48s/it] {'loss': 0.204, 'grad_norm': 2.2287709712982178, 'learning_rate': 3.824004131790363e-05, 'epoch': 1.6} 16%|█▌ | 1605/10000 [2:31:10<12:47:09, 5.48s/it][2025-06-19 16:00:55,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.71 | optimizer_step: 2.90 [2025-06-19 16:00:55,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.11 | bwd_microstep: 3391.53 | bwd_inner_microstep: 3390.58 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.31 [2025-06-19 
16:00:55,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.11 | bwd: 3391.54 | bwd_inner: 3390.58 | bwd_allreduce: 0.91 | step: 7.31 16%|█▌ | 1606/10000 [2:31:16<12:50:52, 5.51s/it] {'loss': 0.0444, 'grad_norm': 0.343448668718338, 'learning_rate': 3.8237383379677336e-05, 'epoch': 1.61} 16%|█▌ | 1606/10000 [2:31:16<12:50:52, 5.51s/it][2025-06-19 16:01:01,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:01:01,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.53 | bwd_microstep: 3377.42 | bwd_inner_microstep: 3376.51 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.04 [2025-06-19 16:01:01,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.53 | bwd: 3377.44 | bwd_inner: 3376.51 | bwd_allreduce: 0.88 | step: 7.05 16%|█▌ | 1607/10000 [2:31:21<12:52:27, 5.52s/it] {'loss': 0.1325, 'grad_norm': 1.1296223402023315, 'learning_rate': 3.8234723528434174e-05, 'epoch': 1.61} 16%|█▌ | 1607/10000 [2:31:21<12:52:27, 5.52s/it][2025-06-19 16:01:06,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:01:06,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.20 | bwd_microstep: 3362.94 | bwd_inner_microstep: 3361.97 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.33 [2025-06-19 16:01:06,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.20 | bwd: 3362.96 | bwd_inner: 3361.97 | bwd_allreduce: 0.94 | step: 7.33 16%|█▌ | 1608/10000 [2:31:27<12:52:57, 5.53s/it] {'loss': 0.0678, 'grad_norm': 0.3816368579864502, 'learning_rate': 3.823206176445314e-05, 'epoch': 1.61} 16%|█▌ | 1608/10000 [2:31:27<12:52:57, 5.53s/it][2025-06-19 16:01:12,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
16:01:12,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.86 | bwd_microstep: 3313.30 | bwd_inner_microstep: 3312.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 16:01:12,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.86 | bwd: 3313.31 | bwd_inner: 3312.51 | bwd_allreduce: 0.75 | step: 6.67 16%|█▌ | 1609/10000 [2:31:32<12:49:55, 5.51s/it] {'loss': 0.0844, 'grad_norm': 0.7422218322753906, 'learning_rate': 3.822939808801345e-05, 'epoch': 1.61} 16%|█▌ | 1609/10000 [2:31:32<12:49:55, 5.51s/it][2025-06-19 16:01:17,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:01:17,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.12 | bwd_microstep: 3367.88 | bwd_inner_microstep: 3367.07 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-19 16:01:17,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.12 | bwd: 3367.89 | bwd_inner: 3367.07 | bwd_allreduce: 0.78 | step: 7.27 16%|█▌ | 1610/10000 [2:31:38<12:51:01, 5.51s/it] {'loss': 0.0794, 'grad_norm': 0.5006935596466064, 'learning_rate': 3.82267324993945e-05, 'epoch': 1.61} 16%|█▌ | 1610/10000 [2:31:38<12:51:01, 5.51s/it][2025-06-19 16:01:23,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:01:23,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.33 | bwd_microstep: 3316.72 | bwd_inner_microstep: 3315.87 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.41 [2025-06-19 16:01:23,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.33 | bwd: 3316.73 | bwd_inner: 3315.87 | bwd_allreduce: 0.81 | step: 7.42 16%|█▌ | 1611/10000 [2:31:43<12:48:46, 5.50s/it] {'loss': 0.0845, 'grad_norm': 0.8844860792160034, 'learning_rate': 3.82240649988759e-05, 'epoch': 1.61} 16%|█▌ 
| 1611/10000 [2:31:43<12:48:46, 5.50s/it][2025-06-19 16:01:28,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.75 [2025-06-19 16:01:28,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.20 | bwd_microstep: 3362.38 | bwd_inner_microstep: 3361.27 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.17 [2025-06-19 16:01:28,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.20 | bwd: 3362.40 | bwd_inner: 3361.27 | bwd_allreduce: 1.07 | step: 7.18 16%|█▌ | 1612/10000 [2:31:49<12:49:44, 5.51s/it] {'loss': 0.0645, 'grad_norm': 0.7271406054496765, 'learning_rate': 3.8221395586737465e-05, 'epoch': 1.61} 16%|█▌ | 1612/10000 [2:31:49<12:49:44, 5.51s/it][2025-06-19 16:01:34,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 16:01:34,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.60 | bwd_microstep: 3321.08 | bwd_inner_microstep: 3320.13 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.20 [2025-06-19 16:01:34,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.60 | bwd: 3321.09 | bwd_inner: 3320.13 | bwd_allreduce: 0.91 | step: 7.20 16%|█▌ | 1613/10000 [2:31:54<12:48:03, 5.49s/it] {'loss': 0.1319, 'grad_norm': 0.8369677662849426, 'learning_rate': 3.821872426325921e-05, 'epoch': 1.61} 16%|█▌ | 1613/10000 [2:31:54<12:48:03, 5.49s/it][2025-06-19 16:01:39,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:01:39,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.49 | bwd_microstep: 3325.78 | bwd_inner_microstep: 3324.96 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.26 [2025-06-19 16:01:39,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.49 | bwd: 3325.79 | 
bwd_inner: 3324.96 | bwd_allreduce: 0.78 | step: 7.26 16%|█▌ | 1614/10000 [2:32:00<12:47:00, 5.49s/it] {'loss': 0.1497, 'grad_norm': 0.8914321660995483, 'learning_rate': 3.821605102872132e-05, 'epoch': 1.61} 16%|█▌ | 1614/10000 [2:32:00<12:47:00, 5.49s/it][2025-06-19 16:01:44,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:01:44,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.04 | bwd_microstep: 3315.04 | bwd_inner_microstep: 3314.22 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.75 [2025-06-19 16:01:44,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.04 | bwd: 3315.06 | bwd_inner: 3314.22 | bwd_allreduce: 0.79 | step: 6.75 16%|█▌ | 1615/10000 [2:32:05<12:45:26, 5.48s/it] {'loss': 0.1147, 'grad_norm': 2.4238393306732178, 'learning_rate': 3.821337588340424e-05, 'epoch': 1.61} 16%|█▌ | 1615/10000 [2:32:05<12:45:26, 5.48s/it][2025-06-19 16:01:50,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:01:50,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.30 | bwd_microstep: 3304.27 | bwd_inner_microstep: 3303.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 16:01:50,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.30 | bwd: 3304.29 | bwd_inner: 3303.49 | bwd_allreduce: 0.76 | step: 6.53 16%|█▌ | 1616/10000 [2:32:11<12:43:50, 5.47s/it] {'loss': 0.0502, 'grad_norm': 0.5456822514533997, 'learning_rate': 3.8210698827588545e-05, 'epoch': 1.62} 16%|█▌ | 1616/10000 [2:32:11<12:43:50, 5.47s/it][2025-06-19 16:01:55,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:01:55,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.12 | 
bwd_microstep: 3317.24 | bwd_inner_microstep: 3316.34 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.86 [2025-06-19 16:01:55,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.12 | bwd: 3317.26 | bwd_inner: 3316.34 | bwd_allreduce: 0.87 | step: 6.86 16%|█▌ | 1617/10000 [2:32:16<12:43:09, 5.46s/it] {'loss': 0.0637, 'grad_norm': 0.4625401794910431, 'learning_rate': 3.8208019861555066e-05, 'epoch': 1.62} 16%|█▌ | 1617/10000 [2:32:16<12:43:09, 5.46s/it][2025-06-19 16:02:01,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:02:01,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.56 | bwd_microstep: 3309.68 | bwd_inner_microstep: 3308.86 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.23 [2025-06-19 16:02:01,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.56 | bwd: 3309.70 | bwd_inner: 3308.86 | bwd_allreduce: 0.79 | step: 7.23 16%|█▌ | 1618/10000 [2:32:22<12:42:43, 5.46s/it] {'loss': 0.15, 'grad_norm': 1.0583492517471313, 'learning_rate': 3.820533898558481e-05, 'epoch': 1.62} 16%|█▌ | 1618/10000 [2:32:22<12:42:43, 5.46s/it][2025-06-19 16:02:06,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:02:06,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.07 | bwd_microstep: 3322.69 | bwd_inner_microstep: 3321.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 16:02:06,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.07 | bwd: 3322.70 | bwd_inner: 3321.90 | bwd_allreduce: 0.77 | step: 6.79 16%|█▌ | 1619/10000 [2:32:27<12:43:19, 5.46s/it] {'loss': 0.0683, 'grad_norm': 0.5715667009353638, 'learning_rate': 3.820265619995899e-05, 'epoch': 1.62} 16%|█▌ | 1619/10000 [2:32:27<12:43:19, 5.46s/it][2025-06-19 16:02:12,259] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:02:12,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.82 | bwd_microstep: 3315.98 | bwd_inner_microstep: 3315.00 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.44 [2025-06-19 16:02:12,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.82 | bwd: 3316.00 | bwd_inner: 3315.00 | bwd_allreduce: 0.95 | step: 7.44 16%|█▌ | 1620/10000 [2:32:33<12:42:58, 5.46s/it] {'loss': 0.1423, 'grad_norm': 0.9156417846679688, 'learning_rate': 3.819997150495901e-05, 'epoch': 1.62} 16%|█▌ | 1620/10000 [2:32:33<12:42:58, 5.46s/it][2025-06-19 16:02:17,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:02:17,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.76 | bwd_microstep: 3314.36 | bwd_inner_microstep: 3313.40 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.06 [2025-06-19 16:02:17,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.76 | bwd: 3314.37 | bwd_inner: 3313.40 | bwd_allreduce: 0.93 | step: 7.06 16%|█▌ | 1621/10000 [2:32:38<12:42:27, 5.46s/it] {'loss': 0.0541, 'grad_norm': 0.34874242544174194, 'learning_rate': 3.8197284900866496e-05, 'epoch': 1.62} 16%|█▌ | 1621/10000 [2:32:38<12:42:27, 5.46s/it][2025-06-19 16:02:23,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:02:23,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.07 | bwd_microstep: 3320.67 | bwd_inner_microstep: 3319.70 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.11 [2025-06-19 16:02:23,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.07 | bwd: 3320.69 | bwd_inner: 3319.70 | bwd_allreduce: 0.95 | step: 7.11 16%|█▌ | 1622/10000 
[2:32:43<12:42:38, 5.46s/it] {'loss': 0.0841, 'grad_norm': 0.7162452936172485, 'learning_rate': 3.8194596387963246e-05, 'epoch': 1.62} 16%|█▌ | 1622/10000 [2:32:43<12:42:38, 5.46s/it][2025-06-19 16:02:28,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:02:28,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.80 | bwd_microstep: 3316.82 | bwd_inner_microstep: 3316.00 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 16:02:28,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.80 | bwd: 3316.83 | bwd_inner: 3316.00 | bwd_allreduce: 0.78 | step: 7.12 16%|█▌ | 1623/10000 [2:32:49<12:42:25, 5.46s/it] {'loss': 0.2274, 'grad_norm': 1.0607284307479858, 'learning_rate': 3.819190596653128e-05, 'epoch': 1.62} 16%|█▌ | 1623/10000 [2:32:49<12:42:25, 5.46s/it][2025-06-19 16:02:34,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:02:34,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.73 | bwd_microstep: 3321.56 | bwd_inner_microstep: 3320.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-19 16:02:34,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.73 | bwd: 3321.57 | bwd_inner: 3320.76 | bwd_allreduce: 0.77 | step: 6.86 16%|█▌ | 1624/10000 [2:32:54<12:42:41, 5.46s/it] {'loss': 0.0788, 'grad_norm': 0.48329323530197144, 'learning_rate': 3.8189213636852806e-05, 'epoch': 1.62} 16%|█▌ | 1624/10000 [2:32:54<12:42:41, 5.46s/it][2025-06-19 16:02:39,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:02:39,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.81 | bwd_microstep: 3367.91 | bwd_inner_microstep: 3367.12 | 
bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 16:02:39,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.81 | bwd: 3367.93 | bwd_inner: 3367.12 | bwd_allreduce: 0.77 | step: 7.07 16%|█▋ | 1625/10000 [2:33:00<12:45:33, 5.48s/it] {'loss': 0.1298, 'grad_norm': 0.9885272979736328, 'learning_rate': 3.818651939921025e-05, 'epoch': 1.62} 16%|█▋ | 1625/10000 [2:33:00<12:45:33, 5.48s/it][2025-06-19 16:02:45,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:02:45,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.55 | bwd_microstep: 3316.05 | bwd_inner_microstep: 3315.22 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.76 [2025-06-19 16:02:45,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.55 | bwd: 3316.06 | bwd_inner: 3315.22 | bwd_allreduce: 0.80 | step: 6.76 16%|█▋ | 1626/10000 [2:33:05<12:44:15, 5.48s/it] {'loss': 0.1621, 'grad_norm': 1.1463500261306763, 'learning_rate': 3.818382325388621e-05, 'epoch': 1.63} 16%|█▋ | 1626/10000 [2:33:05<12:44:15, 5.48s/it][2025-06-19 16:02:50,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:02:50,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.02 | bwd_microstep: 3319.57 | bwd_inner_microstep: 3318.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.20 [2025-06-19 16:02:50,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.02 | bwd: 3319.59 | bwd_inner: 3318.77 | bwd_allreduce: 0.77 | step: 7.20 16%|█▋ | 1627/10000 [2:33:11<12:44:06, 5.48s/it] {'loss': 0.1088, 'grad_norm': 0.7976771593093872, 'learning_rate': 3.8181125201163496e-05, 'epoch': 1.63} 16%|█▋ | 1627/10000 [2:33:11<12:44:06, 5.48s/it][2025-06-19 16:02:56,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:02:56,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.12 | bwd_microstep: 3321.55 | bwd_inner_microstep: 3320.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.40 [2025-06-19 16:02:56,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.12 | bwd: 3321.56 | bwd_inner: 3320.73 | bwd_allreduce: 0.78 | step: 7.40 16%|█▋ | 1628/10000 [2:33:16<12:43:43, 5.47s/it] {'loss': 0.0882, 'grad_norm': 0.6872782111167908, 'learning_rate': 3.817842524132514e-05, 'epoch': 1.63} 16%|█▋ | 1628/10000 [2:33:16<12:43:43, 5.47s/it][2025-06-19 16:03:01,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:03:01,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.67 | bwd_microstep: 3326.00 | bwd_inner_microstep: 3325.19 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-19 16:03:01,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.67 | bwd: 3326.02 | bwd_inner: 3325.19 | bwd_allreduce: 0.78 | step: 7.31 16%|█▋ | 1629/10000 [2:33:22<12:43:37, 5.47s/it] {'loss': 0.0863, 'grad_norm': 0.644129753112793, 'learning_rate': 3.817572337465434e-05, 'epoch': 1.63} 16%|█▋ | 1629/10000 [2:33:22<12:43:37, 5.47s/it][2025-06-19 16:03:07,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:03:07,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.64 | bwd_microstep: 3370.38 | bwd_inner_microstep: 3369.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 16:03:07,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.64 | bwd: 3370.39 | bwd_inner: 3369.59 | bwd_allreduce: 0.76 | step: 6.63 16%|█▋ | 1630/10000 [2:33:27<12:46:29, 5.49s/it] {'loss': 0.0761, 'grad_norm': 
0.5011614561080933, 'learning_rate': 3.817301960143452e-05, 'epoch': 1.63} 16%|█▋ | 1630/10000 [2:33:27<12:46:29, 5.49s/it][2025-06-19 16:03:12,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:03:12,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.97 | bwd_microstep: 3368.61 | bwd_inner_microstep: 3367.78 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.75 [2025-06-19 16:03:12,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.97 | bwd: 3368.63 | bwd_inner: 3367.78 | bwd_allreduce: 0.79 | step: 6.75 16%|█▋ | 1631/10000 [2:33:33<12:48:16, 5.51s/it] {'loss': 0.168, 'grad_norm': 0.7405306100845337, 'learning_rate': 3.817031392194928e-05, 'epoch': 1.63} 16%|█▋ | 1631/10000 [2:33:33<12:48:16, 5.51s/it][2025-06-19 16:03:18,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:03:18,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.64 | bwd_microstep: 3363.09 | bwd_inner_microstep: 3362.04 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.77 [2025-06-19 16:03:18,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.64 | bwd: 3363.11 | bwd_inner: 3362.04 | bwd_allreduce: 1.01 | step: 7.77 16%|█▋ | 1632/10000 [2:33:38<12:49:47, 5.52s/it] {'loss': 0.08, 'grad_norm': 0.9474936127662659, 'learning_rate': 3.816760633648245e-05, 'epoch': 1.63} 16%|█▋ | 1632/10000 [2:33:38<12:49:47, 5.52s/it][2025-06-19 16:03:23,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:03:23,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.07 | bwd_microstep: 3312.73 | bwd_inner_microstep: 3311.74 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.55 [2025-06-19 16:03:23,604] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.08 | bwd: 3312.75 | bwd_inner: 3311.74 | bwd_allreduce: 0.96 | step: 7.56 16%|█▋ | 1633/10000 [2:33:44<12:47:16, 5.50s/it] {'loss': 0.0825, 'grad_norm': 0.7175910472869873, 'learning_rate': 3.816489684531802e-05, 'epoch': 1.63} 16%|█▋ | 1633/10000 [2:33:44<12:47:16, 5.50s/it][2025-06-19 16:03:29,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.89 [2025-06-19 16:03:29,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.75 | bwd_microstep: 3322.67 | bwd_inner_microstep: 3321.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 16:03:29,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.75 | bwd: 3322.68 | bwd_inner: 3321.88 | bwd_allreduce: 0.76 | step: 6.79 16%|█▋ | 1634/10000 [2:33:49<12:46:10, 5.49s/it] {'loss': 0.107, 'grad_norm': 0.596250593662262, 'learning_rate': 3.8162185448740225e-05, 'epoch': 1.63} 16%|█▋ | 1634/10000 [2:33:49<12:46:10, 5.49s/it][2025-06-19 16:03:34,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:03:34,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.76 | bwd_microstep: 3333.39 | bwd_inner_microstep: 3332.55 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.42 [2025-06-19 16:03:34,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.76 | bwd: 3333.41 | bwd_inner: 3332.55 | bwd_allreduce: 0.81 | step: 7.43 16%|█▋ | 1635/10000 [2:33:55<12:45:17, 5.49s/it] {'loss': 0.069, 'grad_norm': 0.6593199372291565, 'learning_rate': 3.815947214703347e-05, 'epoch': 1.64} 16%|█▋ | 1635/10000 [2:33:55<12:45:17, 5.49s/it][2025-06-19 16:03:40,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:03:40,037] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.20 | bwd_microstep: 3323.19 | bwd_inner_microstep: 3322.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 16:03:40,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.20 | bwd: 3323.20 | bwd_inner: 3322.41 | bwd_allreduce: 0.75 | step: 6.64 16%|█▋ | 1636/10000 [2:34:00<12:44:41, 5.49s/it] {'loss': 0.097, 'grad_norm': 0.6925155520439148, 'learning_rate': 3.8156756940482365e-05, 'epoch': 1.64} 16%|█▋ | 1636/10000 [2:34:00<12:44:41, 5.49s/it][2025-06-19 16:03:45,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:03:45,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.16 | bwd_microstep: 3318.04 | bwd_inner_microstep: 3317.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.22 [2025-06-19 16:03:45,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.16 | bwd: 3318.05 | bwd_inner: 3317.24 | bwd_allreduce: 0.77 | step: 7.22 16%|█▋ | 1637/10000 [2:34:06<12:43:46, 5.48s/it] {'loss': 0.07, 'grad_norm': 0.48758822679519653, 'learning_rate': 3.815403982937173e-05, 'epoch': 1.64} 16%|█▋ | 1637/10000 [2:34:06<12:43:46, 5.48s/it][2025-06-19 16:03:51,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:03:51,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.38 | bwd_microstep: 3377.02 | bwd_inner_microstep: 3376.08 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.52 [2025-06-19 16:03:51,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.38 | bwd: 3377.04 | bwd_inner: 3376.08 | bwd_allreduce: 0.91 | step: 7.53 16%|█▋ | 1638/10000 [2:34:11<12:46:25, 5.50s/it] {'loss': 0.119, 'grad_norm': 0.6010791659355164, 'learning_rate': 3.815132081398657e-05, 'epoch': 1.64} 16%|█▋ | 1638/10000 
[2:34:11<12:46:25, 5.50s/it][2025-06-19 16:03:56,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:03:56,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.88 | bwd_microstep: 3381.16 | bwd_inner_microstep: 3380.29 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.99 [2025-06-19 16:03:56,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.88 | bwd: 3381.18 | bwd_inner: 3380.29 | bwd_allreduce: 0.83 | step: 7.00 16%|█▋ | 1639/10000 [2:34:17<12:49:07, 5.52s/it] {'loss': 0.2214, 'grad_norm': 1.701865792274475, 'learning_rate': 3.81485998946121e-05, 'epoch': 1.64} 16%|█▋ | 1639/10000 [2:34:17<12:49:07, 5.52s/it][2025-06-19 16:04:02,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:04:02,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.41 | bwd_microstep: 3324.04 | bwd_inner_microstep: 3323.23 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 16:04:02,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.41 | bwd: 3324.06 | bwd_inner: 3323.23 | bwd_allreduce: 0.78 | step: 7.13 16%|█▋ | 1640/10000 [2:34:22<12:47:28, 5.51s/it] {'loss': 0.101, 'grad_norm': 0.507463276386261, 'learning_rate': 3.814587707153373e-05, 'epoch': 1.64} 16%|█▋ | 1640/10000 [2:34:22<12:47:28, 5.51s/it][2025-06-19 16:04:07,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:04:07,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.86 | bwd_microstep: 3317.53 | bwd_inner_microstep: 3316.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 16:04:07,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.86 | bwd: 3317.55 | bwd_inner: 3316.74 | 
bwd_allreduce: 0.77 | step: 6.82 16%|█▋ | 1641/10000 [2:34:28<12:45:52, 5.50s/it] {'loss': 0.1999, 'grad_norm': 1.156554937362671, 'learning_rate': 3.814315234503707e-05, 'epoch': 1.64} 16%|█▋ | 1641/10000 [2:34:28<12:45:52, 5.50s/it][2025-06-19 16:04:13,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:04:13,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.55 | bwd_microstep: 3334.71 | bwd_inner_microstep: 3333.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 16:04:13,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.55 | bwd: 3334.73 | bwd_inner: 3333.89 | bwd_allreduce: 0.79 | step: 7.21 16%|█▋ | 1642/10000 [2:34:33<12:45:36, 5.50s/it] {'loss': 0.1387, 'grad_norm': 0.5003355145454407, 'learning_rate': 3.814042571540794e-05, 'epoch': 1.64} 16%|█▋ | 1642/10000 [2:34:33<12:45:36, 5.50s/it][2025-06-19 16:04:18,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:04:18,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.37 | bwd_microstep: 3328.78 | bwd_inner_microstep: 3327.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-19 16:04:18,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.37 | bwd: 3328.80 | bwd_inner: 3327.99 | bwd_allreduce: 0.77 | step: 7.08 16%|█▋ | 1643/10000 [2:34:39<12:44:47, 5.49s/it] {'loss': 0.2297, 'grad_norm': 1.1473137140274048, 'learning_rate': 3.813769718293234e-05, 'epoch': 1.64} 16%|█▋ | 1643/10000 [2:34:39<12:44:47, 5.49s/it][2025-06-19 16:04:24,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:04:24,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.97 | bwd_microstep: 3375.23 | 
bwd_inner_microstep: 3374.38 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.81 [2025-06-19 16:04:24,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.97 | bwd: 3375.24 | bwd_inner: 3374.38 | bwd_allreduce: 0.82 | step: 6.81 16%|█▋ | 1644/10000 [2:34:44<12:46:55, 5.51s/it] {'loss': 0.0852, 'grad_norm': 0.4007410407066345, 'learning_rate': 3.813496674789649e-05, 'epoch': 1.64} 16%|█▋ | 1644/10000 [2:34:44<12:46:55, 5.51s/it][2025-06-19 16:04:29,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:04:29,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.22 | bwd_microstep: 3326.44 | bwd_inner_microstep: 3325.64 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 16:04:29,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.22 | bwd: 3326.46 | bwd_inner: 3325.64 | bwd_allreduce: 0.77 | step: 7.12 16%|█▋ | 1645/10000 [2:34:50<12:45:44, 5.50s/it] {'loss': 0.1597, 'grad_norm': 0.7830600142478943, 'learning_rate': 3.813223441058679e-05, 'epoch': 1.65} 16%|█▋ | 1645/10000 [2:34:50<12:45:44, 5.50s/it][2025-06-19 16:04:35,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:04:35,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.54 | bwd_microstep: 3330.66 | bwd_inner_microstep: 3329.73 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.69 [2025-06-19 16:04:35,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.54 | bwd: 3330.67 | bwd_inner: 3329.73 | bwd_allreduce: 0.89 | step: 6.69 16%|█▋ | 1646/10000 [2:34:55<12:44:33, 5.49s/it] {'loss': 0.1095, 'grad_norm': 0.41231393814086914, 'learning_rate': 3.812950017128986e-05, 'epoch': 1.65} 16%|█▋ | 1646/10000 [2:34:55<12:44:33, 5.49s/it][2025-06-19 16:04:40,599] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:04:40,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.93 | bwd_microstep: 3380.65 | bwd_inner_microstep: 3379.79 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.39 [2025-06-19 16:04:40,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.93 | bwd: 3380.66 | bwd_inner: 3379.79 | bwd_allreduce: 0.82 | step: 7.39 16%|█▋ | 1647/10000 [2:35:01<12:47:28, 5.51s/it] {'loss': 0.0959, 'grad_norm': 0.5547117590904236, 'learning_rate': 3.81267640302925e-05, 'epoch': 1.65} 16%|█▋ | 1647/10000 [2:35:01<12:47:28, 5.51s/it][2025-06-19 16:04:46,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:04:46,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.69 | bwd_microstep: 3325.91 | bwd_inner_microstep: 3324.91 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.12 [2025-06-19 16:04:46,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.69 | bwd: 3325.93 | bwd_inner: 3324.91 | bwd_allreduce: 0.97 | step: 7.12 16%|█▋ | 1648/10000 [2:35:06<12:46:04, 5.50s/it] {'loss': 0.0785, 'grad_norm': 0.4569953382015228, 'learning_rate': 3.812402598788172e-05, 'epoch': 1.65} 16%|█▋ | 1648/10000 [2:35:06<12:46:04, 5.50s/it][2025-06-19 16:04:51,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:04:51,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.91 | bwd_microstep: 3333.09 | bwd_inner_microstep: 3332.15 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.88 [2025-06-19 16:04:51,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.91 | bwd: 3333.11 | bwd_inner: 3332.15 | bwd_allreduce: 0.90 | step: 7.88 16%|█▋ | 1649/10000 [2:35:12<12:45:18, 5.50s/it] {'loss': 
0.1016, 'grad_norm': 0.5435368418693542, 'learning_rate': 3.812128604434473e-05, 'epoch': 1.65} 16%|█▋ | 1649/10000 [2:35:12<12:45:18, 5.50s/it][2025-06-19 16:04:57,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:04:57,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.60 | bwd_microstep: 3375.07 | bwd_inner_microstep: 3374.13 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.55 [2025-06-19 16:04:57,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.60 | bwd: 3375.08 | bwd_inner: 3374.13 | bwd_allreduce: 0.91 | step: 6.55 16%|█▋ | 1650/10000 [2:35:17<12:47:24, 5.51s/it] {'loss': 0.0615, 'grad_norm': 0.4457736015319824, 'learning_rate': 3.811854419996894e-05, 'epoch': 1.65} 16%|█▋ | 1650/10000 [2:35:17<12:47:24, 5.51s/it][2025-06-19 16:05:02,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:05:02,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.63 | bwd_microstep: 3340.89 | bwd_inner_microstep: 3340.02 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.55 [2025-06-19 16:05:02,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.63 | bwd: 3340.91 | bwd_inner: 3340.02 | bwd_allreduce: 0.82 | step: 7.55 17%|█▋ | 1651/10000 [2:35:23<12:46:29, 5.51s/it] {'loss': 0.081, 'grad_norm': 0.33735719323158264, 'learning_rate': 3.8115800455041954e-05, 'epoch': 1.65} 17%|█▋ | 1651/10000 [2:35:23<12:46:29, 5.51s/it][2025-06-19 16:05:08,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:05:08,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.21 | bwd_microstep: 3372.34 | bwd_inner_microstep: 3371.53 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.25 
[2025-06-19 16:05:08,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.21 | bwd: 3372.36 | bwd_inner: 3371.53 | bwd_allreduce: 0.79 | step: 7.26 17%|█▋ | 1652/10000 [2:35:28<12:48:05, 5.52s/it] {'loss': 0.1147, 'grad_norm': 0.46020498871803284, 'learning_rate': 3.8113054809851576e-05, 'epoch': 1.65} 17%|█▋ | 1652/10000 [2:35:28<12:48:05, 5.52s/it][2025-06-19 16:05:13,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:05:13,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.53 | bwd_microstep: 3330.71 | bwd_inner_microstep: 3329.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 16:05:13,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.53 | bwd: 3330.72 | bwd_inner: 3329.93 | bwd_allreduce: 0.76 | step: 6.67 17%|█▋ | 1653/10000 [2:35:34<12:46:20, 5.51s/it] {'loss': 0.0676, 'grad_norm': 0.4695010185241699, 'learning_rate': 3.8110307264685815e-05, 'epoch': 1.65} 17%|█▋ | 1653/10000 [2:35:34<12:46:20, 5.51s/it][2025-06-19 16:05:19,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:05:19,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.71 | bwd_microstep: 3374.72 | bwd_inner_microstep: 3373.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-19 16:05:19,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.71 | bwd: 3374.74 | bwd_inner: 3373.93 | bwd_allreduce: 0.77 | step: 7.13 17%|█▋ | 1654/10000 [2:35:39<12:47:47, 5.52s/it] {'loss': 0.0978, 'grad_norm': 0.3811401426792145, 'learning_rate': 3.810755781983288e-05, 'epoch': 1.65} 17%|█▋ | 1654/10000 [2:35:39<12:47:47, 5.52s/it][2025-06-19 16:05:24,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 
[2025-06-19 16:05:24,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.63 | bwd_microstep: 3327.27 | bwd_inner_microstep: 3326.46 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.11 [2025-06-19 16:05:24,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.63 | bwd: 3327.29 | bwd_inner: 3326.46 | bwd_allreduce: 0.78 | step: 7.11 17%|█▋ | 1655/10000 [2:35:45<12:45:49, 5.51s/it] {'loss': 0.1509, 'grad_norm': 0.895278811454773, 'learning_rate': 3.810480647558116e-05, 'epoch': 1.66} 17%|█▋ | 1655/10000 [2:35:45<12:45:49, 5.51s/it][2025-06-19 16:05:30,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:05:30,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.05 | bwd_microstep: 3342.78 | bwd_inner_microstep: 3341.71 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.39 [2025-06-19 16:05:30,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.05 | bwd: 3342.80 | bwd_inner: 3341.71 | bwd_allreduce: 1.04 | step: 7.39 17%|█▋ | 1656/10000 [2:35:50<12:45:24, 5.50s/it] {'loss': 0.1417, 'grad_norm': 0.8274675011634827, 'learning_rate': 3.8102053232219274e-05, 'epoch': 1.66} 17%|█▋ | 1656/10000 [2:35:50<12:45:24, 5.50s/it][2025-06-19 16:05:35,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:05:35,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.21 | bwd_microstep: 3320.94 | bwd_inner_microstep: 3320.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 16:05:35,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.21 | bwd: 3320.96 | bwd_inner: 3320.14 | bwd_allreduce: 0.78 | step: 7.06 17%|█▋ | 1657/10000 [2:35:56<12:43:56, 5.49s/it] {'loss': 0.0708, 'grad_norm': 0.3817285895347595, 'learning_rate': 3.8099298090036015e-05, 
'epoch': 1.66} 17%|█▋ | 1657/10000 [2:35:56<12:43:56, 5.49s/it][2025-06-19 16:05:41,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:05:41,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.04 | bwd_microstep: 3320.30 | bwd_inner_microstep: 3319.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 16:05:41,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.04 | bwd: 3320.31 | bwd_inner: 3319.50 | bwd_allreduce: 0.77 | step: 6.80 17%|█▋ | 1658/10000 [2:36:01<12:42:38, 5.49s/it] {'loss': 0.0962, 'grad_norm': 0.5472739338874817, 'learning_rate': 3.809654104932039e-05, 'epoch': 1.66} 17%|█▋ | 1658/10000 [2:36:01<12:42:38, 5.49s/it][2025-06-19 16:05:46,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:05:46,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.63 | bwd_microstep: 3327.33 | bwd_inner_microstep: 3326.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 16:05:46,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.63 | bwd: 3327.34 | bwd_inner: 3326.52 | bwd_allreduce: 0.77 | step: 6.95 17%|█▋ | 1659/10000 [2:36:07<12:42:18, 5.48s/it] {'loss': 0.1678, 'grad_norm': 1.012858271598816, 'learning_rate': 3.80937821103616e-05, 'epoch': 1.66} 17%|█▋ | 1659/10000 [2:36:07<12:42:18, 5.48s/it][2025-06-19 16:05:52,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:05:52,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.61 | bwd_microstep: 3374.15 | bwd_inner_microstep: 3373.04 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.77 [2025-06-19 16:05:52,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.62 
| bwd: 3374.17 | bwd_inner: 3373.04 | bwd_allreduce: 1.08 | step: 7.77 17%|█▋ | 1660/10000 [2:36:12<12:45:05, 5.50s/it] {'loss': 0.0959, 'grad_norm': 0.5555728673934937, 'learning_rate': 3.809102127344904e-05, 'epoch': 1.66} 17%|█▋ | 1660/10000 [2:36:12<12:45:05, 5.50s/it][2025-06-19 16:05:57,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:05:57,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.62 | bwd_microstep: 3377.18 | bwd_inner_microstep: 3376.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 16:05:57,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.62 | bwd: 3377.20 | bwd_inner: 3376.37 | bwd_allreduce: 0.78 | step: 7.07 17%|█▋ | 1661/10000 [2:36:18<12:47:02, 5.52s/it] {'loss': 0.0649, 'grad_norm': 0.26497215032577515, 'learning_rate': 3.808825853887231e-05, 'epoch': 1.66} 17%|█▋ | 1661/10000 [2:36:18<12:47:02, 5.52s/it][2025-06-19 16:06:03,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 16:06:03,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2163.32 | bwd_microstep: 3374.21 | bwd_inner_microstep: 3372.87 | bwd_allreduce_microstep: 1.25 | step_microstep: 8.42 [2025-06-19 16:06:03,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2163.32 | bwd: 3374.24 | bwd_inner: 3372.87 | bwd_allreduce: 1.30 | step: 8.42 17%|█▋ | 1662/10000 [2:36:24<12:49:32, 5.54s/it] {'loss': 0.0743, 'grad_norm': 0.33059459924697876, 'learning_rate': 3.808549390692121e-05, 'epoch': 1.66} 17%|█▋ | 1662/10000 [2:36:24<12:49:32, 5.54s/it][2025-06-19 16:06:08,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:06:08,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2107.23 | bwd_microstep: 3314.20 | bwd_inner_microstep: 3313.34 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.18 [2025-06-19 16:06:08,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.23 | bwd: 3314.22 | bwd_inner: 3313.34 | bwd_allreduce: 0.82 | step: 7.18 17%|█▋ | 1663/10000 [2:36:29<12:46:48, 5.52s/it] {'loss': 0.1938, 'grad_norm': 1.5834052562713623, 'learning_rate': 3.808272737788574e-05, 'epoch': 1.66} 17%|█▋ | 1663/10000 [2:36:29<12:46:48, 5.52s/it][2025-06-19 16:06:14,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:06:14,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.62 | bwd_microstep: 3377.65 | bwd_inner_microstep: 3376.83 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.99 [2025-06-19 16:06:14,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.62 | bwd: 3377.67 | bwd_inner: 3376.83 | bwd_allreduce: 0.79 | step: 6.99 17%|█▋ | 1664/10000 [2:36:35<12:48:01, 5.53s/it] {'loss': 0.0767, 'grad_norm': 0.4299964904785156, 'learning_rate': 3.807995895205609e-05, 'epoch': 1.66} 17%|█▋ | 1664/10000 [2:36:35<12:48:01, 5.53s/it][2025-06-19 16:06:19,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 16:06:19,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.35 | bwd_microstep: 3322.38 | bwd_inner_microstep: 3321.34 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.87 [2025-06-19 16:06:19,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.35 | bwd: 3322.40 | bwd_inner: 3321.34 | bwd_allreduce: 1.00 | step: 7.88 17%|█▋ | 1665/10000 [2:36:40<12:45:52, 5.51s/it] {'loss': 0.0709, 'grad_norm': 0.5207922458648682, 'learning_rate': 3.8077188629722656e-05, 'epoch': 1.67} 17%|█▋ | 1665/10000 [2:36:40<12:45:52, 5.51s/it][2025-06-19 
16:06:25,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:06:25,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.86 | bwd_microstep: 3367.18 | bwd_inner_microstep: 3366.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.50 [2025-06-19 16:06:25,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.86 | bwd: 3367.20 | bwd_inner: 3366.37 | bwd_allreduce: 0.78 | step: 7.51 17%|█▋ | 1666/10000 [2:36:46<12:46:52, 5.52s/it] {'loss': 0.0933, 'grad_norm': 1.1364774703979492, 'learning_rate': 3.807441641117604e-05, 'epoch': 1.67} 17%|█▋ | 1666/10000 [2:36:46<12:46:52, 5.52s/it][2025-06-19 16:06:30,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:06:30,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.95 | bwd_microstep: 3321.24 | bwd_inner_microstep: 3320.42 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-19 16:06:30,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.95 | bwd: 3321.25 | bwd_inner: 3320.42 | bwd_allreduce: 0.79 | step: 7.31 17%|█▋ | 1667/10000 [2:36:51<12:44:27, 5.50s/it] {'loss': 0.0983, 'grad_norm': 0.6633018255233765, 'learning_rate': 3.807164229670701e-05, 'epoch': 1.67} 17%|█▋ | 1667/10000 [2:36:51<12:44:27, 5.50s/it][2025-06-19 16:06:36,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:06:36,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.12 | bwd_microstep: 3407.89 | bwd_inner_microstep: 3407.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.46 [2025-06-19 16:06:36,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.12 | bwd: 3407.90 | bwd_inner: 3407.08 | bwd_allreduce: 0.78 | step: 7.46 
17%|█▋ | 1668/10000 [2:36:57<12:47:59, 5.53s/it] {'loss': 0.0738, 'grad_norm': 0.42810091376304626, 'learning_rate': 3.80688662866066e-05, 'epoch': 1.67} 17%|█▋ | 1668/10000 [2:36:57<12:47:59, 5.53s/it][2025-06-19 16:06:41,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:06:41,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.32 | bwd_microstep: 3324.37 | bwd_inner_microstep: 3323.56 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-19 16:06:41,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.32 | bwd: 3324.39 | bwd_inner: 3323.56 | bwd_allreduce: 0.78 | step: 7.27 17%|█▋ | 1669/10000 [2:37:02<12:45:20, 5.51s/it] {'loss': 0.1106, 'grad_norm': 0.8850563168525696, 'learning_rate': 3.806608838116596e-05, 'epoch': 1.67} 17%|█▋ | 1669/10000 [2:37:02<12:45:20, 5.51s/it][2025-06-19 16:06:47,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:06:47,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.40 | bwd_microstep: 3376.17 | bwd_inner_microstep: 3375.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 16:06:47,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.40 | bwd: 3376.18 | bwd_inner: 3375.38 | bwd_allreduce: 0.76 | step: 6.68 17%|█▋ | 1670/10000 [2:37:08<12:46:45, 5.52s/it] {'loss': 0.0853, 'grad_norm': 0.46600016951560974, 'learning_rate': 3.806330858067651e-05, 'epoch': 1.67} 17%|█▋ | 1670/10000 [2:37:08<12:46:45, 5.52s/it][2025-06-19 16:06:52,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:06:52,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.69 | bwd_microstep: 3400.62 | bwd_inner_microstep: 3399.80 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 7.23 [2025-06-19 16:06:52,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.69 | bwd: 3400.63 | bwd_inner: 3399.80 | bwd_allreduce: 0.79 | step: 7.24 17%|█▋ | 1671/10000 [2:37:13<12:49:00, 5.54s/it] {'loss': 0.0932, 'grad_norm': 0.5884866714477539, 'learning_rate': 3.8060526885429815e-05, 'epoch': 1.67} 17%|█▋ | 1671/10000 [2:37:13<12:49:00, 5.54s/it][2025-06-19 16:06:58,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:06:58,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.11 | bwd_microstep: 3374.64 | bwd_inner_microstep: 3373.68 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.06 [2025-06-19 16:06:58,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.11 | bwd: 3374.65 | bwd_inner: 3373.68 | bwd_allreduce: 0.93 | step: 7.06 17%|█▋ | 1672/10000 [2:37:19<12:49:10, 5.54s/it] {'loss': 0.0856, 'grad_norm': 0.4817354083061218, 'learning_rate': 3.805774329571767e-05, 'epoch': 1.67} 17%|█▋ | 1672/10000 [2:37:19<12:49:10, 5.54s/it][2025-06-19 16:07:03,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:07:03,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.30 | bwd_microstep: 3321.83 | bwd_inner_microstep: 3321.02 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.13 [2025-06-19 16:07:03,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.30 | bwd: 3321.85 | bwd_inner: 3321.02 | bwd_allreduce: 0.78 | step: 7.13 17%|█▋ | 1673/10000 [2:37:24<12:45:59, 5.52s/it] {'loss': 0.0683, 'grad_norm': 0.3415270149707794, 'learning_rate': 3.805495781183207e-05, 'epoch': 1.67} 17%|█▋ | 1673/10000 [2:37:24<12:45:59, 5.52s/it][2025-06-19 16:07:09,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:07:09,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.06 | bwd_microstep: 3318.46 | bwd_inner_microstep: 3317.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 16:07:09,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.06 | bwd: 3318.48 | bwd_inner: 3317.68 | bwd_allreduce: 0.76 | step: 6.59 17%|█▋ | 1674/10000 [2:37:30<12:43:15, 5.50s/it] {'loss': 0.0834, 'grad_norm': 0.6361677646636963, 'learning_rate': 3.8052170434065184e-05, 'epoch': 1.67} 17%|█▋ | 1674/10000 [2:37:30<12:43:15, 5.50s/it][2025-06-19 16:07:14,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:07:14,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.54 | bwd_microstep: 3334.09 | bwd_inner_microstep: 3333.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 16:07:14,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.54 | bwd: 3334.10 | bwd_inner: 3333.30 | bwd_allreduce: 0.76 | step: 6.79 17%|█▋ | 1675/10000 [2:37:35<12:42:02, 5.49s/it] {'loss': 0.0552, 'grad_norm': 0.3485189974308014, 'learning_rate': 3.80493811627094e-05, 'epoch': 1.68} 17%|█▋ | 1675/10000 [2:37:35<12:42:02, 5.49s/it][2025-06-19 16:07:20,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:07:20,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.04 | bwd_microstep: 3323.88 | bwd_inner_microstep: 3323.05 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.19 [2025-06-19 16:07:20,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.04 | bwd: 3323.89 | bwd_inner: 3323.05 | bwd_allreduce: 0.80 | step: 7.19 17%|█▋ | 1676/10000 [2:37:41<12:40:52, 5.48s/it] {'loss': 0.0952, 'grad_norm': 
0.45332464575767517, 'learning_rate': 3.804658999805731e-05, 'epoch': 1.68} 17%|█▋ | 1676/10000 [2:37:41<12:40:52, 5.48s/it][2025-06-19 16:07:25,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:07:25,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.66 | bwd_microstep: 3406.36 | bwd_inner_microstep: 3405.54 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.89 [2025-06-19 16:07:25,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.66 | bwd: 3406.38 | bwd_inner: 3405.54 | bwd_allreduce: 0.79 | step: 6.90 17%|█▋ | 1677/10000 [2:37:46<12:45:21, 5.52s/it] {'loss': 0.069, 'grad_norm': 0.501589298248291, 'learning_rate': 3.804379694040168e-05, 'epoch': 1.68} 17%|█▋ | 1677/10000 [2:37:46<12:45:21, 5.52s/it][2025-06-19 16:07:31,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:07:31,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.72 | bwd_microstep: 3321.51 | bwd_inner_microstep: 3320.58 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.09 [2025-06-19 16:07:31,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.72 | bwd: 3321.53 | bwd_inner: 3320.58 | bwd_allreduce: 0.91 | step: 7.09 17%|█▋ | 1678/10000 [2:37:52<12:43:34, 5.51s/it] {'loss': 0.083, 'grad_norm': 0.48599550127983093, 'learning_rate': 3.8041001990035494e-05, 'epoch': 1.68} 17%|█▋ | 1678/10000 [2:37:52<12:43:34, 5.51s/it][2025-06-19 16:07:36,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:07:36,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.25 | bwd_microstep: 3318.07 | bwd_inner_microstep: 3317.26 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.80 [2025-06-19 16:07:36,913] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.25 | bwd: 3318.08 | bwd_inner: 3317.26 | bwd_allreduce: 0.79 | step: 6.80 17%|█▋ | 1679/10000 [2:37:57<12:42:03, 5.49s/it] {'loss': 0.1165, 'grad_norm': 0.8998072147369385, 'learning_rate': 3.803820514725193e-05, 'epoch': 1.68} 17%|█▋ | 1679/10000 [2:37:57<12:42:03, 5.49s/it][2025-06-19 16:07:42,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:07:42,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.53 | bwd_microstep: 3324.47 | bwd_inner_microstep: 3323.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 16:07:42,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.53 | bwd: 3324.48 | bwd_inner: 3323.68 | bwd_allreduce: 0.77 | step: 6.82 17%|█▋ | 1680/10000 [2:38:03<12:40:44, 5.49s/it] {'loss': 0.0974, 'grad_norm': 0.6985829472541809, 'learning_rate': 3.8035406412344375e-05, 'epoch': 1.68} 17%|█▋ | 1680/10000 [2:38:03<12:40:44, 5.49s/it][2025-06-19 16:07:47,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:07:47,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.58 | bwd_microstep: 3315.98 | bwd_inner_microstep: 3315.18 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-19 16:07:47,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.58 | bwd: 3315.99 | bwd_inner: 3315.18 | bwd_allreduce: 0.77 | step: 6.86 17%|█▋ | 1681/10000 [2:38:08<12:39:20, 5.48s/it] {'loss': 0.1588, 'grad_norm': 0.9509894251823425, 'learning_rate': 3.803260578560638e-05, 'epoch': 1.68} 17%|█▋ | 1681/10000 [2:38:08<12:39:20, 5.48s/it][2025-06-19 16:07:53,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:07:53,309] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.19 | bwd_microstep: 3322.92 | bwd_inner_microstep: 3322.08 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.91 [2025-06-19 16:07:53,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.19 | bwd: 3322.94 | bwd_inner: 3322.08 | bwd_allreduce: 0.81 | step: 6.92 17%|█▋ | 1682/10000 [2:38:14<12:39:13, 5.48s/it] {'loss': 0.1379, 'grad_norm': 0.6531897187232971, 'learning_rate': 3.802980326733174e-05, 'epoch': 1.68} 17%|█▋ | 1682/10000 [2:38:14<12:39:13, 5.48s/it][2025-06-19 16:07:58,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 16:07:58,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.67 | bwd_microstep: 3320.92 | bwd_inner_microstep: 3319.85 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.37 [2025-06-19 16:07:58,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.67 | bwd: 3320.94 | bwd_inner: 3319.85 | bwd_allreduce: 1.03 | step: 7.38 17%|█▋ | 1683/10000 [2:38:19<12:38:59, 5.48s/it] {'loss': 0.1909, 'grad_norm': 1.1044431924819946, 'learning_rate': 3.8026998857814404e-05, 'epoch': 1.68} 17%|█▋ | 1683/10000 [2:38:19<12:38:59, 5.48s/it][2025-06-19 16:08:04,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:08:04,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.75 | bwd_microstep: 3401.80 | bwd_inner_microstep: 3400.99 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.75 [2025-06-19 16:08:04,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.76 | bwd: 3401.82 | bwd_inner: 3400.99 | bwd_allreduce: 0.79 | step: 6.75 17%|█▋ | 1684/10000 [2:38:25<12:43:19, 5.51s/it] {'loss': 0.1067, 'grad_norm': 0.9264492392539978, 'learning_rate': 3.8024192557348566e-05, 'epoch': 1.68} 17%|█▋ | 
1684/10000 [2:38:25<12:43:19, 5.51s/it][2025-06-19 16:08:09,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:08:09,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.30 | bwd_microstep: 3314.75 | bwd_inner_microstep: 3313.88 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.99 [2025-06-19 16:08:09,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.30 | bwd: 3314.76 | bwd_inner: 3313.88 | bwd_allreduce: 0.84 | step: 7.00 17%|█▋ | 1685/10000 [2:38:30<12:41:04, 5.49s/it] {'loss': 0.0834, 'grad_norm': 0.81728595495224, 'learning_rate': 3.802138436622857e-05, 'epoch': 1.69} 17%|█▋ | 1685/10000 [2:38:30<12:41:04, 5.49s/it][2025-06-19 16:08:15,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:08:15,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.59 | bwd_microstep: 3366.06 | bwd_inner_microstep: 3365.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 16:08:15,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.59 | bwd: 3366.07 | bwd_inner: 3365.08 | bwd_allreduce: 0.76 | step: 6.83 17%|█▋ | 1686/10000 [2:38:36<12:42:49, 5.51s/it] {'loss': 0.0931, 'grad_norm': 0.5032832026481628, 'learning_rate': 3.8018574284749e-05, 'epoch': 1.69} 17%|█▋ | 1686/10000 [2:38:36<12:42:49, 5.51s/it][2025-06-19 16:08:20,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:08:20,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.56 | bwd_microstep: 3318.58 | bwd_inner_microstep: 3317.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.84 [2025-06-19 16:08:20,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.56 | bwd: 3318.60 | 
bwd_inner: 3317.80 | bwd_allreduce: 0.75 | step: 6.84 17%|█▋ | 1687/10000 [2:38:41<12:41:00, 5.49s/it] {'loss': 0.1258, 'grad_norm': 0.8178762793540955, 'learning_rate': 3.8015762313204614e-05, 'epoch': 1.69} 17%|█▋ | 1687/10000 [2:38:41<12:41:00, 5.49s/it][2025-06-19 16:08:26,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:08:26,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.08 | bwd_microstep: 3364.29 | bwd_inner_microstep: 3363.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 16:08:26,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.08 | bwd: 3364.31 | bwd_inner: 3363.51 | bwd_allreduce: 0.75 | step: 6.56 17%|█▋ | 1688/10000 [2:38:47<12:42:33, 5.50s/it] {'loss': 0.0769, 'grad_norm': 0.4595077633857727, 'learning_rate': 3.801294845189038e-05, 'epoch': 1.69} 17%|█▋ | 1688/10000 [2:38:47<12:42:33, 5.50s/it][2025-06-19 16:08:31,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:08:31,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.74 | bwd_microstep: 3370.83 | bwd_inner_microstep: 3370.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 16:08:31,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.74 | bwd: 3370.84 | bwd_inner: 3370.04 | bwd_allreduce: 0.76 | step: 6.61 17%|█▋ | 1689/10000 [2:38:52<12:43:41, 5.51s/it] {'loss': 0.0637, 'grad_norm': 0.48639115691185, 'learning_rate': 3.801013270110145e-05, 'epoch': 1.69} 17%|█▋ | 1689/10000 [2:38:52<12:43:41, 5.51s/it][2025-06-19 16:08:37,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:08:37,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.87 | 
bwd_microstep: 3398.05 | bwd_inner_microstep: 3397.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 16:08:37,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.87 | bwd: 3398.06 | bwd_inner: 3397.27 | bwd_allreduce: 0.76 | step: 6.56 17%|█▋ | 1690/10000 [2:38:58<12:46:13, 5.53s/it] {'loss': 0.0742, 'grad_norm': 0.5455163717269897, 'learning_rate': 3.800731506113318e-05, 'epoch': 1.69} 17%|█▋ | 1690/10000 [2:38:58<12:46:13, 5.53s/it][2025-06-19 16:08:43,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:08:43,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.69 | bwd_microstep: 3394.03 | bwd_inner_microstep: 3393.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 16:08:43,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.69 | bwd: 3394.04 | bwd_inner: 3393.25 | bwd_allreduce: 0.75 | step: 6.55 17%|█▋ | 1691/10000 [2:39:03<12:47:32, 5.54s/it] {'loss': 0.0663, 'grad_norm': 0.36387404799461365, 'learning_rate': 3.800449553228114e-05, 'epoch': 1.69} 17%|█▋ | 1691/10000 [2:39:03<12:47:32, 5.54s/it][2025-06-19 16:08:48,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:08:48,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.29 | bwd_microstep: 3369.62 | bwd_inner_microstep: 3368.67 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.07 [2025-06-19 16:08:48,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.29 | bwd: 3369.64 | bwd_inner: 3368.67 | bwd_allreduce: 0.93 | step: 7.08 17%|█▋ | 1692/10000 [2:39:09<12:47:24, 5.54s/it] {'loss': 0.1017, 'grad_norm': 0.7093840837478638, 'learning_rate': 3.800167411484108e-05, 'epoch': 1.69} 17%|█▋ | 1692/10000 [2:39:09<12:47:24, 5.54s/it][2025-06-19 16:08:54,146] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:08:54,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.10 | bwd_microstep: 3399.90 | bwd_inner_microstep: 3399.08 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-19 16:08:54,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.10 | bwd: 3399.92 | bwd_inner: 3399.08 | bwd_allreduce: 0.79 | step: 7.19 17%|█▋ | 1693/10000 [2:39:14<12:48:54, 5.55s/it] {'loss': 0.053, 'grad_norm': 0.39162829518318176, 'learning_rate': 3.799885080910896e-05, 'epoch': 1.69} 17%|█▋ | 1693/10000 [2:39:14<12:48:54, 5.55s/it][2025-06-19 16:08:59,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:08:59,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.86 | bwd_microstep: 3314.94 | bwd_inner_microstep: 3314.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 16:08:59,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.86 | bwd: 3314.95 | bwd_inner: 3314.16 | bwd_allreduce: 0.76 | step: 6.56 17%|█▋ | 1694/10000 [2:39:20<12:44:36, 5.52s/it] {'loss': 0.1361, 'grad_norm': 0.9834218621253967, 'learning_rate': 3.799602561538092e-05, 'epoch': 1.69} 17%|█▋ | 1694/10000 [2:39:20<12:44:36, 5.52s/it][2025-06-19 16:09:05,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:09:05,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.47 | bwd_microstep: 3307.93 | bwd_inner_microstep: 3307.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.55 [2025-06-19 16:09:05,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.47 | bwd: 3307.95 | bwd_inner: 3307.14 | bwd_allreduce: 0.76 | step: 6.56 17%|█▋ | 1695/10000 
[2:39:25<12:41:14, 5.50s/it] {'loss': 0.1155, 'grad_norm': 0.8781118392944336, 'learning_rate': 3.7993198533953316e-05, 'epoch': 1.69} 17%|█▋ | 1695/10000 [2:39:25<12:41:14, 5.50s/it][2025-06-19 16:09:10,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:09:10,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.77 | bwd_microstep: 3374.57 | bwd_inner_microstep: 3373.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 16:09:10,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.77 | bwd: 3374.58 | bwd_inner: 3373.79 | bwd_allreduce: 0.75 | step: 6.57 17%|█▋ | 1696/10000 [2:39:31<12:43:14, 5.51s/it] {'loss': 0.0734, 'grad_norm': 0.544341504573822, 'learning_rate': 3.7990369565122686e-05, 'epoch': 1.7} 17%|█▋ | 1696/10000 [2:39:31<12:43:14, 5.51s/it][2025-06-19 16:09:16,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:09:16,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.69 | bwd_microstep: 3320.68 | bwd_inner_microstep: 3319.76 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.15 [2025-06-19 16:09:16,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.69 | bwd: 3320.70 | bwd_inner: 3319.76 | bwd_allreduce: 0.88 | step: 7.15 17%|█▋ | 1697/10000 [2:39:36<12:41:14, 5.50s/it] {'loss': 0.1976, 'grad_norm': 1.537965178489685, 'learning_rate': 3.7987538709185795e-05, 'epoch': 1.7} 17%|█▋ | 1697/10000 [2:39:36<12:41:14, 5.50s/it][2025-06-19 16:09:21,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:09:21,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.08 | bwd_microstep: 3365.37 | bwd_inner_microstep: 3364.55 | bwd_allreduce_microstep: 
0.78 | step_microstep: 7.24 [2025-06-19 16:09:21,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.08 | bwd: 3365.39 | bwd_inner: 3364.55 | bwd_allreduce: 0.79 | step: 7.25 17%|█▋ | 1698/10000 [2:39:42<12:42:41, 5.51s/it] {'loss': 0.0971, 'grad_norm': 0.8215379118919373, 'learning_rate': 3.798470596643957e-05, 'epoch': 1.7} 17%|█▋ | 1698/10000 [2:39:42<12:42:41, 5.51s/it][2025-06-19 16:09:27,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:09:27,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.22 | bwd_microstep: 3315.84 | bwd_inner_microstep: 3314.87 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.44 [2025-06-19 16:09:27,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.22 | bwd: 3315.86 | bwd_inner: 3314.87 | bwd_allreduce: 0.93 | step: 7.44 17%|█▋ | 1699/10000 [2:39:47<12:40:42, 5.50s/it] {'loss': 0.1079, 'grad_norm': 0.7907950282096863, 'learning_rate': 3.798187133718117e-05, 'epoch': 1.7} 17%|█▋ | 1699/10000 [2:39:47<12:40:42, 5.50s/it][2025-06-19 16:09:32,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:09:32,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.55 | bwd_microstep: 3311.51 | bwd_inner_microstep: 3310.69 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 16:09:32,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.55 | bwd: 3311.52 | bwd_inner: 3310.69 | bwd_allreduce: 0.79 | step: 7.13 17%|█▋ | 1700/10000 [2:39:53<12:39:00, 5.49s/it] {'loss': 0.0521, 'grad_norm': 0.276897132396698, 'learning_rate': 3.797903482170791e-05, 'epoch': 1.7} 17%|█▋ | 1700/10000 [2:39:53<12:39:00, 5.49s/it][2025-06-19 16:09:38,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | 
optimizer_step: 2.84 [2025-06-19 16:09:38,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.36 | bwd_microstep: 3368.18 | bwd_inner_microstep: 3367.19 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.71 [2025-06-19 16:09:38,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.37 | bwd: 3368.20 | bwd_inner: 3367.19 | bwd_allreduce: 0.95 | step: 7.71 17%|█▋ | 1701/10000 [2:39:58<12:40:52, 5.50s/it] {'loss': 0.1168, 'grad_norm': 0.72447270154953, 'learning_rate': 3.7976196420317345e-05, 'epoch': 1.7} 17%|█▋ | 1701/10000 [2:39:58<12:40:52, 5.50s/it][2025-06-19 16:09:43,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:09:43,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.36 | bwd_microstep: 3315.15 | bwd_inner_microstep: 3314.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-19 16:09:43,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.36 | bwd: 3315.17 | bwd_inner: 3314.36 | bwd_allreduce: 0.77 | step: 7.00 17%|█▋ | 1702/10000 [2:40:04<12:38:53, 5.49s/it] {'loss': 0.1707, 'grad_norm': 0.9671508073806763, 'learning_rate': 3.797335613330721e-05, 'epoch': 1.7} 17%|█▋ | 1702/10000 [2:40:04<12:38:53, 5.49s/it][2025-06-19 16:09:49,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:09:49,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.46 | bwd_microstep: 3374.33 | bwd_inner_microstep: 3373.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 16:09:49,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.46 | bwd: 3374.34 | bwd_inner: 3373.54 | bwd_allreduce: 0.76 | step: 6.63 17%|█▋ | 1703/10000 [2:40:09<12:41:01, 5.50s/it] {'loss': 0.0869, 'grad_norm': 0.6070665717124939, 'learning_rate': 
3.797051396097543e-05, 'epoch': 1.7} 17%|█▋ | 1703/10000 [2:40:09<12:41:01, 5.50s/it][2025-06-19 16:09:54,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:09:54,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.67 | bwd_microstep: 3368.43 | bwd_inner_microstep: 3367.59 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.92 [2025-06-19 16:09:54,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.67 | bwd: 3368.45 | bwd_inner: 3367.59 | bwd_allreduce: 0.81 | step: 6.92 17%|█▋ | 1704/10000 [2:40:15<12:42:22, 5.51s/it] {'loss': 0.112, 'grad_norm': 0.8911504149436951, 'learning_rate': 3.7967669903620136e-05, 'epoch': 1.7} 17%|█▋ | 1704/10000 [2:40:15<12:42:22, 5.51s/it][2025-06-19 16:10:00,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 16:10:00,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.83 | bwd_microstep: 3367.39 | bwd_inner_microstep: 3366.40 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.76 [2025-06-19 16:10:00,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.83 | bwd: 3367.41 | bwd_inner: 3366.40 | bwd_allreduce: 0.96 | step: 7.76 17%|█▋ | 1705/10000 [2:40:20<12:43:14, 5.52s/it] {'loss': 0.0739, 'grad_norm': 0.7049466967582703, 'learning_rate': 3.796482396153966e-05, 'epoch': 1.71} 17%|█▋ | 1705/10000 [2:40:20<12:43:14, 5.52s/it][2025-06-19 16:10:05,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:10:05,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.21 | bwd_microstep: 3315.63 | bwd_inner_microstep: 3314.71 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.04 [2025-06-19 16:10:05,599] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2104.21 | bwd: 3315.64 | bwd_inner: 3314.71 | bwd_allreduce: 0.89 | step: 7.04 17%|█▋ | 1706/10000 [2:40:26<12:40:45, 5.50s/it] {'loss': 0.1313, 'grad_norm': 0.7787736058235168, 'learning_rate': 3.796197613503253e-05, 'epoch': 1.71} 17%|█▋ | 1706/10000 [2:40:26<12:40:45, 5.50s/it][2025-06-19 16:10:11,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:10:11,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.12 | bwd_microstep: 3313.58 | bwd_inner_microstep: 3312.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 16:10:11,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.12 | bwd: 3313.59 | bwd_inner: 3312.80 | bwd_allreduce: 0.75 | step: 6.65 17%|█▋ | 1707/10000 [2:40:31<12:38:38, 5.49s/it] {'loss': 0.1038, 'grad_norm': 0.7345353364944458, 'learning_rate': 3.7959126424397474e-05, 'epoch': 1.71} 17%|█▋ | 1707/10000 [2:40:31<12:38:38, 5.49s/it][2025-06-19 16:10:16,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:10:16,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.55 | bwd_microstep: 3314.40 | bwd_inner_microstep: 3313.38 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.06 [2025-06-19 16:10:16,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.55 | bwd: 3314.41 | bwd_inner: 3313.38 | bwd_allreduce: 0.99 | step: 7.06 17%|█▋ | 1708/10000 [2:40:37<12:37:10, 5.48s/it] {'loss': 0.1012, 'grad_norm': 0.7714665532112122, 'learning_rate': 3.79562748299334e-05, 'epoch': 1.71} 17%|█▋ | 1708/10000 [2:40:37<12:37:10, 5.48s/it][2025-06-19 16:10:21,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 16:10:21,973] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2101.95 | bwd_microstep: 3310.42 | bwd_inner_microstep: 3309.28 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.89 [2025-06-19 16:10:21,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.95 | bwd: 3310.44 | bwd_inner: 3309.28 | bwd_allreduce: 1.10 | step: 7.90 17%|█▋ | 1709/10000 [2:40:42<12:36:34, 5.48s/it] {'loss': 0.0412, 'grad_norm': 0.3584963083267212, 'learning_rate': 3.795342135193943e-05, 'epoch': 1.71} 17%|█▋ | 1709/10000 [2:40:42<12:36:34, 5.48s/it][2025-06-19 16:10:27,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:10:27,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.19 | bwd_microstep: 3374.69 | bwd_inner_microstep: 3373.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-19 16:10:27,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.19 | bwd: 3374.70 | bwd_inner: 3373.89 | bwd_allreduce: 0.77 | step: 7.00 17%|█▋ | 1710/10000 [2:40:48<12:39:49, 5.50s/it] {'loss': 0.0749, 'grad_norm': 0.5591940879821777, 'learning_rate': 3.7950565990714895e-05, 'epoch': 1.71} 17%|█▋ | 1710/10000 [2:40:48<12:39:49, 5.50s/it][2025-06-19 16:10:33,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:10:33,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.64 | bwd_microstep: 3399.52 | bwd_inner_microstep: 3398.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.17 [2025-06-19 16:10:33,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.64 | bwd: 3399.54 | bwd_inner: 3398.73 | bwd_allreduce: 0.77 | step: 7.18 17%|█▋ | 1711/10000 [2:40:53<12:42:46, 5.52s/it] {'loss': 0.1578, 'grad_norm': 0.9306183457374573, 'learning_rate': 3.794770874655929e-05, 'epoch': 1.71} 17%|█▋ | 1711/10000 [2:40:53<12:42:46, 5.52s/it][2025-06-19 
16:10:38,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:10:38,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.50 | bwd_microstep: 3322.85 | bwd_inner_microstep: 3322.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 16:10:38,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.50 | bwd: 3322.86 | bwd_inner: 3322.06 | bwd_allreduce: 0.76 | step: 6.68 17%|█▋ | 1712/10000 [2:40:59<12:40:13, 5.50s/it] {'loss': 0.0847, 'grad_norm': 0.5159459114074707, 'learning_rate': 3.794484961977234e-05, 'epoch': 1.71} 17%|█▋ | 1712/10000 [2:40:59<12:40:13, 5.50s/it][2025-06-19 16:10:44,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:10:44,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.28 | bwd_microstep: 3319.55 | bwd_inner_microstep: 3318.65 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.69 [2025-06-19 16:10:44,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.28 | bwd: 3319.57 | bwd_inner: 3318.65 | bwd_allreduce: 0.86 | step: 7.69 17%|█▋ | 1713/10000 [2:41:04<12:38:38, 5.49s/it] {'loss': 0.2581, 'grad_norm': 1.6878262758255005, 'learning_rate': 3.794198861065395e-05, 'epoch': 1.71} 17%|█▋ | 1713/10000 [2:41:04<12:38:38, 5.49s/it][2025-06-19 16:10:49,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 16:10:49,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.49 | bwd_microstep: 3329.83 | bwd_inner_microstep: 3328.98 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.05 [2025-06-19 16:10:49,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.49 | bwd: 3329.85 | bwd_inner: 3328.98 | bwd_allreduce: 0.82 | step: 7.06 
17%|█▋ | 1714/10000 [2:41:10<12:38:42, 5.49s/it] {'loss': 0.0674, 'grad_norm': 0.3910602331161499, 'learning_rate': 3.793912571950422e-05, 'epoch': 1.71} 17%|█▋ | 1714/10000 [2:41:10<12:38:42, 5.49s/it][2025-06-19 16:10:54,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:10:54,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.73 | bwd_microstep: 3321.57 | bwd_inner_microstep: 3320.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 16:10:54,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.73 | bwd: 3321.59 | bwd_inner: 3320.77 | bwd_allreduce: 0.77 | step: 7.06 17%|█▋ | 1715/10000 [2:41:15<12:37:31, 5.49s/it] {'loss': 0.0567, 'grad_norm': 0.3844646215438843, 'learning_rate': 3.7936260946623465e-05, 'epoch': 1.71} 17%|█▋ | 1715/10000 [2:41:15<12:37:31, 5.49s/it][2025-06-19 16:11:00,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 16:11:00,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.26 | bwd_microstep: 3313.06 | bwd_inner_microstep: 3312.13 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.08 [2025-06-19 16:11:00,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.26 | bwd: 3313.08 | bwd_inner: 3312.13 | bwd_allreduce: 0.90 | step: 7.09 17%|█▋ | 1716/10000 [2:41:21<12:36:30, 5.48s/it] {'loss': 0.0729, 'grad_norm': 0.46862703561782837, 'learning_rate': 3.7933394292312183e-05, 'epoch': 1.72} 17%|█▋ | 1716/10000 [2:41:21<12:36:30, 5.48s/it][2025-06-19 16:11:05,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:11:05,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.35 | bwd_microstep: 3312.12 | bwd_inner_microstep: 3311.21 | 
bwd_allreduce_microstep: 0.86 | step_microstep: 7.02 [2025-06-19 16:11:05,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.35 | bwd: 3312.13 | bwd_inner: 3311.21 | bwd_allreduce: 0.88 | step: 7.02 17%|█▋ | 1717/10000 [2:41:26<12:35:38, 5.47s/it] {'loss': 0.1026, 'grad_norm': 1.1077810525894165, 'learning_rate': 3.793052575687107e-05, 'epoch': 1.72} 17%|█▋ | 1717/10000 [2:41:26<12:35:38, 5.47s/it][2025-06-19 16:11:11,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:11:11,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.95 | bwd_microstep: 3364.58 | bwd_inner_microstep: 3363.74 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.28 [2025-06-19 16:11:11,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.96 | bwd: 3364.60 | bwd_inner: 3363.74 | bwd_allreduce: 0.81 | step: 7.29 17%|█▋ | 1718/10000 [2:41:32<12:38:18, 5.49s/it] {'loss': 0.1518, 'grad_norm': 1.008781909942627, 'learning_rate': 3.7927655340601025e-05, 'epoch': 1.72} 17%|█▋ | 1718/10000 [2:41:32<12:38:18, 5.49s/it][2025-06-19 16:11:17,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:11:17,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.68 | bwd_microstep: 3371.68 | bwd_inner_microstep: 3370.71 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.26 [2025-06-19 16:11:17,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.68 | bwd: 3371.70 | bwd_inner: 3370.71 | bwd_allreduce: 0.94 | step: 7.27 17%|█▋ | 1719/10000 [2:41:37<12:40:10, 5.51s/it] {'loss': 0.229, 'grad_norm': 1.1843761205673218, 'learning_rate': 3.792478304380313e-05, 'epoch': 1.72} 17%|█▋ | 1719/10000 [2:41:37<12:40:10, 5.51s/it][2025-06-19 16:11:22,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:11:22,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.64 | bwd_microstep: 3331.62 | bwd_inner_microstep: 3330.79 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.87 [2025-06-19 16:11:22,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.64 | bwd: 3331.64 | bwd_inner: 3330.79 | bwd_allreduce: 0.80 | step: 6.87 17%|█▋ | 1720/10000 [2:41:43<12:39:05, 5.50s/it] {'loss': 0.1146, 'grad_norm': 0.9866184592247009, 'learning_rate': 3.79219088667787e-05, 'epoch': 1.72} 17%|█▋ | 1720/10000 [2:41:43<12:39:05, 5.50s/it][2025-06-19 16:11:27,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:11:27,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.19 | bwd_microstep: 3316.25 | bwd_inner_microstep: 3315.44 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.35 [2025-06-19 16:11:27,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.19 | bwd: 3316.27 | bwd_inner: 3315.44 | bwd_allreduce: 0.78 | step: 7.36 17%|█▋ | 1721/10000 [2:41:48<12:37:42, 5.49s/it] {'loss': 0.1092, 'grad_norm': 0.9338295459747314, 'learning_rate': 3.7919032809829195e-05, 'epoch': 1.72} 17%|█▋ | 1721/10000 [2:41:48<12:37:42, 5.49s/it][2025-06-19 16:11:33,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:11:33,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.37 | bwd_microstep: 3317.03 | bwd_inner_microstep: 3316.21 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 16:11:33,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.37 | bwd: 3317.04 | bwd_inner: 3316.21 | bwd_allreduce: 0.79 | step: 7.13 17%|█▋ | 1722/10000 [2:41:54<12:36:55, 5.49s/it] {'loss': 0.1112, 'grad_norm': 0.6818576455116272, 
'learning_rate': 3.791615487325632e-05, 'epoch': 1.72} 17%|█▋ | 1722/10000 [2:41:54<12:36:55, 5.49s/it][2025-06-19 16:11:38,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:11:38,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.40 | bwd_microstep: 3316.00 | bwd_inner_microstep: 3315.08 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.07 [2025-06-19 16:11:38,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.40 | bwd: 3316.02 | bwd_inner: 3315.08 | bwd_allreduce: 0.89 | step: 7.07 17%|█▋ | 1723/10000 [2:41:59<12:35:32, 5.48s/it] {'loss': 0.1248, 'grad_norm': 0.9217079281806946, 'learning_rate': 3.7913275057361945e-05, 'epoch': 1.72} 17%|█▋ | 1723/10000 [2:41:59<12:35:32, 5.48s/it][2025-06-19 16:11:44,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:11:44,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.97 | bwd_microstep: 3394.46 | bwd_inner_microstep: 3393.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.32 [2025-06-19 16:11:44,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.97 | bwd: 3394.48 | bwd_inner: 3393.65 | bwd_allreduce: 0.78 | step: 7.33 17%|█▋ | 1724/10000 [2:42:05<12:39:32, 5.51s/it] {'loss': 0.0538, 'grad_norm': 0.4075261950492859, 'learning_rate': 3.791039336244816e-05, 'epoch': 1.72} 17%|█▋ | 1724/10000 [2:42:05<12:39:32, 5.51s/it][2025-06-19 16:11:49,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:11:49,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.30 | bwd_microstep: 3365.12 | bwd_inner_microstep: 3364.35 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.64 [2025-06-19 16:11:49,992] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.30 | bwd: 3365.14 | bwd_inner: 3364.35 | bwd_allreduce: 0.75 | step: 6.65 17%|█▋ | 1725/10000 [2:42:10<12:40:23, 5.51s/it] {'loss': 0.1978, 'grad_norm': 1.2313286066055298, 'learning_rate': 3.7907509788817234e-05, 'epoch': 1.73} 17%|█▋ | 1725/10000 [2:42:10<12:40:23, 5.51s/it][2025-06-19 16:11:55,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:11:55,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.22 | bwd_microstep: 3366.13 | bwd_inner_microstep: 3365.31 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-19 16:11:55,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.21 | bwd: 3366.14 | bwd_inner: 3365.31 | bwd_allreduce: 0.78 | step: 7.24 17%|█▋ | 1726/10000 [2:42:16<12:40:56, 5.52s/it] {'loss': 0.1518, 'grad_norm': 1.4881571531295776, 'learning_rate': 3.790462433677164e-05, 'epoch': 1.73} 17%|█▋ | 1726/10000 [2:42:16<12:40:56, 5.52s/it][2025-06-19 16:12:00,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:12:00,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.63 | bwd_microstep: 3321.92 | bwd_inner_microstep: 3320.85 | bwd_allreduce_microstep: 1.01 | step_microstep: 8.20 [2025-06-19 16:12:00,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.63 | bwd: 3321.95 | bwd_inner: 3320.85 | bwd_allreduce: 1.04 | step: 8.21 17%|█▋ | 1727/10000 [2:42:21<12:38:47, 5.50s/it] {'loss': 0.1233, 'grad_norm': 0.7403527498245239, 'learning_rate': 3.790173700661405e-05, 'epoch': 1.73} 17%|█▋ | 1727/10000 [2:42:21<12:38:47, 5.50s/it][2025-06-19 16:12:06,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:12:06,454] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.08 | bwd_microstep: 3318.68 | bwd_inner_microstep: 3317.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 16:12:06,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.08 | bwd: 3318.69 | bwd_inner: 3317.90 | bwd_allreduce: 0.75 | step: 6.58 17%|█▋ | 1728/10000 [2:42:27<12:37:06, 5.49s/it] {'loss': 0.0897, 'grad_norm': 1.0576015710830688, 'learning_rate': 3.789884779864734e-05, 'epoch': 1.73} 17%|█▋ | 1728/10000 [2:42:27<12:37:06, 5.49s/it][2025-06-19 16:12:11,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:12:11,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.08 | bwd_microstep: 3319.29 | bwd_inner_microstep: 3318.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 16:12:11,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.08 | bwd: 3319.30 | bwd_inner: 3318.49 | bwd_allreduce: 0.76 | step: 6.60 17%|█▋ | 1729/10000 [2:42:32<12:35:52, 5.48s/it] {'loss': 0.1338, 'grad_norm': 1.453324794769287, 'learning_rate': 3.7895956713174566e-05, 'epoch': 1.73} 17%|█▋ | 1729/10000 [2:42:32<12:35:52, 5.48s/it][2025-06-19 16:12:17,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:12:17,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.89 | bwd_microstep: 3324.50 | bwd_inner_microstep: 3323.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 16:12:17,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.89 | bwd: 3324.52 | bwd_inner: 3323.69 | bwd_allreduce: 0.78 | step: 7.11 17%|█▋ | 1730/10000 [2:42:38<12:35:33, 5.48s/it] {'loss': 0.0683, 'grad_norm': 0.4852989614009857, 'learning_rate': 3.789306375049899e-05, 'epoch': 1.73} 17%|█▋ | 1730/10000 
[2:42:38<12:35:33, 5.48s/it][2025-06-19 16:12:22,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:12:22,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.92 | bwd_microstep: 3320.67 | bwd_inner_microstep: 3319.73 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.06 [2025-06-19 16:12:22,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.92 | bwd: 3320.69 | bwd_inner: 3319.73 | bwd_allreduce: 0.91 | step: 7.06 17%|█▋ | 1731/10000 [2:42:43<12:34:43, 5.48s/it] {'loss': 0.0839, 'grad_norm': 0.44868552684783936, 'learning_rate': 3.7890168910924064e-05, 'epoch': 1.73} 17%|█▋ | 1731/10000 [2:42:43<12:34:43, 5.48s/it][2025-06-19 16:12:28,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:12:28,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.00 | bwd_microstep: 3327.69 | bwd_inner_microstep: 3326.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 16:12:28,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.00 | bwd: 3327.70 | bwd_inner: 3326.89 | bwd_allreduce: 0.77 | step: 7.07 17%|█▋ | 1732/10000 [2:42:49<12:34:36, 5.48s/it] {'loss': 0.0957, 'grad_norm': 0.6612005829811096, 'learning_rate': 3.7887272194753456e-05, 'epoch': 1.73} 17%|█▋ | 1732/10000 [2:42:49<12:34:36, 5.48s/it][2025-06-19 16:12:33,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:12:33,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.08 | bwd_microstep: 3374.80 | bwd_inner_microstep: 3374.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 16:12:33,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.08 | bwd: 3374.82 | bwd_inner: 
3374.01 | bwd_allreduce: 0.77 | step: 6.81 17%|█▋ | 1733/10000 [2:42:54<12:37:07, 5.50s/it] {'loss': 0.1298, 'grad_norm': 1.0136332511901855, 'learning_rate': 3.788437360229101e-05, 'epoch': 1.73} 17%|█▋ | 1733/10000 [2:42:54<12:37:07, 5.50s/it][2025-06-19 16:12:39,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:12:39,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.53 | bwd_microstep: 3379.96 | bwd_inner_microstep: 3378.97 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.86 [2025-06-19 16:12:39,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.53 | bwd: 3379.98 | bwd_inner: 3378.97 | bwd_allreduce: 0.96 | step: 7.86 17%|█▋ | 1734/10000 [2:43:00<12:39:31, 5.51s/it] {'loss': 0.057, 'grad_norm': 0.29999223351478577, 'learning_rate': 3.788147313384077e-05, 'epoch': 1.73} 17%|█▋ | 1734/10000 [2:43:00<12:39:31, 5.51s/it][h264 @ 0xb6c6100] Reference 5 >= 5 [h264 @ 0xb6c6100] error while decoding MB 15 42, bytestream 9292 [h264 @ 0xcc49c40] left block unavailable for requested intra mode [h264 @ 0xcc49c40] error while decoding MB 0 25, bytestream 45493 [h264 @ 0xcc3ad00] Reference 5 >= 5 [h264 @ 0xcc3ad00] error while decoding MB 15 42, bytestream 9292 [h264 @ 0xcc3ad00] left block unavailable for requested intra mode [h264 @ 0xcc3ad00] error while decoding MB 0 25, bytestream 45493 [2025-06-19 16:12:44,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:12:44,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.09 | bwd_microstep: 3323.20 | bwd_inner_microstep: 3322.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.45 [2025-06-19 16:12:44,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.09 | bwd: 3323.22 | bwd_inner: 3322.38 | bwd_allreduce: 0.79 | step: 7.45 17%|█▋ | 
1735/10000 [2:43:05<12:37:43, 5.50s/it] {'loss': 0.0647, 'grad_norm': 0.3648589253425598, 'learning_rate': 3.7878570789707e-05, 'epoch': 1.73} 17%|█▋ | 1735/10000 [2:43:05<12:37:43, 5.50s/it][2025-06-19 16:12:50,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:12:50,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.70 | bwd_microstep: 3329.00 | bwd_inner_microstep: 3328.05 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.20 [2025-06-19 16:12:50,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.70 | bwd: 3329.01 | bwd_inner: 3328.05 | bwd_allreduce: 0.92 | step: 7.20 17%|█▋ | 1736/10000 [2:43:11<12:36:47, 5.49s/it] {'loss': 0.0978, 'grad_norm': 0.7415056824684143, 'learning_rate': 3.7875666570194134e-05, 'epoch': 1.74} 17%|█▋ | 1736/10000 [2:43:11<12:36:47, 5.49s/it][2025-06-19 16:12:55,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:12:55,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.85 | bwd_microstep: 3373.84 | bwd_inner_microstep: 3373.02 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.30 [2025-06-19 16:12:55,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.85 | bwd: 3373.85 | bwd_inner: 3373.02 | bwd_allreduce: 0.79 | step: 7.30 17%|█▋ | 1737/10000 [2:43:16<12:39:05, 5.51s/it] {'loss': 0.0857, 'grad_norm': 0.43648383021354675, 'learning_rate': 3.7872760475606796e-05, 'epoch': 1.74} 17%|█▋ | 1737/10000 [2:43:16<12:39:05, 5.51s/it][2025-06-19 16:13:01,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:13:01,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.79 | bwd_microstep: 3386.12 | bwd_inner_microstep: 3385.23 | 
bwd_allreduce_microstep: 0.84 | step_microstep: 7.49 [2025-06-19 16:13:01,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.79 | bwd: 3386.13 | bwd_inner: 3385.23 | bwd_allreduce: 0.86 | step: 7.50 17%|█▋ | 1738/10000 [2:43:22<12:41:15, 5.53s/it] {'loss': 0.0797, 'grad_norm': 0.5669593214988708, 'learning_rate': 3.7869852506249844e-05, 'epoch': 1.74} 17%|█▋ | 1738/10000 [2:43:22<12:41:15, 5.53s/it][2025-06-19 16:13:06,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:13:06,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.64 | bwd_microstep: 3329.55 | bwd_inner_microstep: 3328.71 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.79 [2025-06-19 16:13:06,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.64 | bwd: 3329.57 | bwd_inner: 3328.71 | bwd_allreduce: 0.82 | step: 6.79 17%|█▋ | 1739/10000 [2:43:27<12:39:00, 5.51s/it] {'loss': 0.0711, 'grad_norm': 0.4930022060871124, 'learning_rate': 3.78669426624283e-05, 'epoch': 1.74} 17%|█▋ | 1739/10000 [2:43:27<12:39:00, 5.51s/it][2025-06-19 16:13:12,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:13:12,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.01 | bwd_microstep: 3316.54 | bwd_inner_microstep: 3315.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 16:13:12,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.01 | bwd: 3316.55 | bwd_inner: 3315.75 | bwd_allreduce: 0.76 | step: 6.62 17%|█▋ | 1740/10000 [2:43:33<12:36:42, 5.50s/it] {'loss': 0.103, 'grad_norm': 0.5339618921279907, 'learning_rate': 3.7864030944447385e-05, 'epoch': 1.74} 17%|█▋ | 1740/10000 [2:43:33<12:36:42, 5.50s/it][2025-06-19 16:13:17,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:13:17,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.37 | bwd_microstep: 3385.16 | bwd_inner_microstep: 3384.32 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.16 [2025-06-19 16:13:17,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.37 | bwd: 3385.17 | bwd_inner: 3384.32 | bwd_allreduce: 0.81 | step: 7.16 17%|█▋ | 1741/10000 [2:43:38<12:39:09, 5.52s/it] {'loss': 0.0554, 'grad_norm': 0.23462097346782684, 'learning_rate': 3.786111735261254e-05, 'epoch': 1.74} 17%|█▋ | 1741/10000 [2:43:38<12:39:09, 5.52s/it][2025-06-19 16:13:23,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:13:23,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.37 | bwd_microstep: 3407.18 | bwd_inner_microstep: 3406.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 16:13:23,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.37 | bwd: 3407.19 | bwd_inner: 3406.38 | bwd_allreduce: 0.77 | step: 6.65 17%|█▋ | 1742/10000 [2:43:44<12:42:18, 5.54s/it] {'loss': 0.0884, 'grad_norm': 0.5033847093582153, 'learning_rate': 3.785820188722938e-05, 'epoch': 1.74} 17%|█▋ | 1742/10000 [2:43:44<12:42:18, 5.54s/it][2025-06-19 16:13:29,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:13:29,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.19 | bwd_microstep: 3327.01 | bwd_inner_microstep: 3326.05 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.31 [2025-06-19 16:13:29,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.19 | bwd: 3327.02 | bwd_inner: 3326.05 | bwd_allreduce: 0.92 | step: 7.31 17%|█▋ | 1743/10000 [2:43:49<12:39:30, 5.52s/it] {'loss': 0.0623, 'grad_norm': 
0.25618085265159607, 'learning_rate': 3.7855284548603724e-05, 'epoch': 1.74} 17%|█▋ | 1743/10000 [2:43:49<12:39:30, 5.52s/it][2025-06-19 16:13:34,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:13:34,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.28 | bwd_microstep: 3327.69 | bwd_inner_microstep: 3326.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.27 [2025-06-19 16:13:34,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.28 | bwd: 3327.70 | bwd_inner: 3326.90 | bwd_allreduce: 0.76 | step: 7.29 17%|█▋ | 1744/10000 [2:43:55<12:37:58, 5.51s/it] {'loss': 0.1669, 'grad_norm': 0.7829123735427856, 'learning_rate': 3.785236533704159e-05, 'epoch': 1.74} 17%|█▋ | 1744/10000 [2:43:55<12:37:58, 5.51s/it][2025-06-19 16:13:40,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:13:40,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.99 | bwd_microstep: 3331.43 | bwd_inner_microstep: 3330.43 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.43 [2025-06-19 16:13:40,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.00 | bwd: 3331.45 | bwd_inner: 3330.43 | bwd_allreduce: 0.97 | step: 7.44 17%|█▋ | 1745/10000 [2:44:00<12:36:44, 5.50s/it] {'loss': 0.1345, 'grad_norm': 0.6796208024024963, 'learning_rate': 3.784944425284918e-05, 'epoch': 1.75} 17%|█▋ | 1745/10000 [2:44:00<12:36:44, 5.50s/it][2025-06-19 16:13:45,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:13:45,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.14 | bwd_microstep: 3383.24 | bwd_inner_microstep: 3382.28 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.15 [2025-06-19 16:13:45,591] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.14 | bwd: 3383.25 | bwd_inner: 3382.28 | bwd_allreduce: 0.93 | step: 7.15 17%|█▋ | 1746/10000 [2:44:06<12:39:18, 5.52s/it] {'loss': 0.07, 'grad_norm': 0.2866515517234802, 'learning_rate': 3.784652129633291e-05, 'epoch': 1.75} 17%|█▋ | 1746/10000 [2:44:06<12:39:18, 5.52s/it][2025-06-19 16:13:51,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:13:51,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.52 | bwd_microstep: 3378.68 | bwd_inner_microstep: 3377.74 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.21 [2025-06-19 16:13:51,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.52 | bwd: 3378.69 | bwd_inner: 3377.74 | bwd_allreduce: 0.91 | step: 7.21 17%|█▋ | 1747/10000 [2:44:11<12:40:30, 5.53s/it] {'loss': 0.144, 'grad_norm': 0.5701783299446106, 'learning_rate': 3.78435964677994e-05, 'epoch': 1.75} 17%|█▋ | 1747/10000 [2:44:11<12:40:30, 5.53s/it][2025-06-19 16:13:56,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:13:56,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.13 | bwd_microstep: 3342.54 | bwd_inner_microstep: 3341.42 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.99 [2025-06-19 16:13:56,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.13 | bwd: 3342.56 | bwd_inner: 3341.42 | bwd_allreduce: 1.08 | step: 7.99 17%|█▋ | 1748/10000 [2:44:17<12:39:22, 5.52s/it] {'loss': 0.0554, 'grad_norm': 0.3026469051837921, 'learning_rate': 3.784066976755542e-05, 'epoch': 1.75} 17%|█▋ | 1748/10000 [2:44:17<12:39:22, 5.52s/it][2025-06-19 16:14:02,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:14:02,211] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.43 | bwd_microstep: 3376.33 | bwd_inner_microstep: 3375.41 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.56 [2025-06-19 16:14:02,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.43 | bwd: 3376.34 | bwd_inner: 3375.41 | bwd_allreduce: 0.89 | step: 7.57 17%|█▋ | 1749/10000 [2:44:23<12:41:11, 5.54s/it] {'loss': 0.1001, 'grad_norm': 0.5013868808746338, 'learning_rate': 3.783774119590799e-05, 'epoch': 1.75} 17%|█▋ | 1749/10000 [2:44:23<12:41:11, 5.54s/it][2025-06-19 16:14:07,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:14:07,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.51 | bwd_microstep: 3330.94 | bwd_inner_microstep: 3329.99 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.07 [2025-06-19 16:14:07,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.51 | bwd: 3330.95 | bwd_inner: 3329.99 | bwd_allreduce: 0.92 | step: 7.08 18%|█▊ | 1750/10000 [2:44:28<12:39:23, 5.52s/it] {'loss': 0.077, 'grad_norm': 0.8280020356178284, 'learning_rate': 3.783481075316429e-05, 'epoch': 1.75} 18%|█▊ | 1750/10000 [2:44:28<12:39:23, 5.52s/it][2025-06-19 16:14:13,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:14:13,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.85 | bwd_microstep: 3380.45 | bwd_inner_microstep: 3379.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 16:14:13,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.85 | bwd: 3380.47 | bwd_inner: 3379.66 | bwd_allreduce: 0.76 | step: 6.63 18%|█▊ | 1751/10000 [2:44:34<12:40:31, 5.53s/it] {'loss': 0.0526, 'grad_norm': 0.2650567591190338, 'learning_rate': 3.783187843963172e-05, 'epoch': 1.75} 18%|█▊ | 1751/10000 
[2:44:34<12:40:31, 5.53s/it][2025-06-19 16:14:18,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:14:18,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.21 | bwd_microstep: 3381.27 | bwd_inner_microstep: 3380.45 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.13 [2025-06-19 16:14:18,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.21 | bwd: 3381.28 | bwd_inner: 3380.45 | bwd_allreduce: 0.78 | step: 7.13 18%|█▊ | 1752/10000 [2:44:39<12:41:45, 5.54s/it] {'loss': 0.0711, 'grad_norm': 0.5598441362380981, 'learning_rate': 3.782894425561786e-05, 'epoch': 1.75} 18%|█▊ | 1752/10000 [2:44:39<12:41:45, 5.54s/it][2025-06-19 16:14:24,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:14:24,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.27 | bwd_microstep: 3319.25 | bwd_inner_microstep: 3318.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 16:14:24,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.27 | bwd: 3319.27 | bwd_inner: 3318.47 | bwd_allreduce: 0.76 | step: 6.65 18%|█▊ | 1753/10000 [2:44:45<12:38:43, 5.52s/it] {'loss': 0.0697, 'grad_norm': 0.33541712164878845, 'learning_rate': 3.7826008201430495e-05, 'epoch': 1.75} 18%|█▊ | 1753/10000 [2:44:45<12:38:43, 5.52s/it][2025-06-19 16:14:29,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 3.02 [2025-06-19 16:14:29,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.26 | bwd_microstep: 3334.11 | bwd_inner_microstep: 3333.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.99 [2025-06-19 16:14:29,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.26 | bwd: 3334.12 | bwd_inner: 
3333.33 | bwd_allreduce: 0.75 | step: 7.00 18%|█▊ | 1754/10000 [2:44:50<12:37:19, 5.51s/it] {'loss': 0.0926, 'grad_norm': 0.9523545503616333, 'learning_rate': 3.78230702773776e-05, 'epoch': 1.75} 18%|█▊ | 1754/10000 [2:44:50<12:37:19, 5.51s/it][2025-06-19 16:14:35,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:14:35,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.21 | bwd_microstep: 3322.41 | bwd_inner_microstep: 3321.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.56 [2025-06-19 16:14:35,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.21 | bwd: 3322.42 | bwd_inner: 3321.62 | bwd_allreduce: 0.76 | step: 6.57 18%|█▊ | 1755/10000 [2:44:56<12:35:22, 5.50s/it] {'loss': 0.1079, 'grad_norm': 0.7798405885696411, 'learning_rate': 3.782013048376736e-05, 'epoch': 1.75} 18%|█▊ | 1755/10000 [2:44:56<12:35:22, 5.50s/it][2025-06-19 16:14:40,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:14:40,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.49 | bwd_microstep: 3380.04 | bwd_inner_microstep: 3379.20 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.86 [2025-06-19 16:14:40,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.49 | bwd: 3380.05 | bwd_inner: 3379.20 | bwd_allreduce: 0.80 | step: 6.86 18%|█▊ | 1756/10000 [2:45:01<12:37:34, 5.51s/it] {'loss': 0.1029, 'grad_norm': 0.7141259908676147, 'learning_rate': 3.7817188820908135e-05, 'epoch': 1.76} 18%|█▊ | 1756/10000 [2:45:01<12:37:34, 5.51s/it][2025-06-19 16:14:46,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:14:46,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.85 | bwd_microstep: 
3373.35 | bwd_inner_microstep: 3372.54 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-19 16:14:46,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.85 | bwd: 3373.37 | bwd_inner: 3372.54 | bwd_allreduce: 0.78 | step: 6.85 18%|█▊ | 1757/10000 [2:45:07<12:39:12, 5.53s/it] {'loss': 0.0504, 'grad_norm': 0.2429877072572708, 'learning_rate': 3.781424528910849e-05, 'epoch': 1.76} 18%|█▊ | 1757/10000 [2:45:07<12:39:12, 5.53s/it][2025-06-19 16:14:51,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:14:51,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.65 | bwd_microstep: 3321.27 | bwd_inner_microstep: 3320.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 16:14:51,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.65 | bwd: 3321.28 | bwd_inner: 3320.47 | bwd_allreduce: 0.76 | step: 6.91 18%|█▊ | 1758/10000 [2:45:12<12:36:59, 5.51s/it] {'loss': 0.1073, 'grad_norm': 0.6468959450721741, 'learning_rate': 3.781129988867719e-05, 'epoch': 1.76} 18%|█▊ | 1758/10000 [2:45:12<12:36:59, 5.51s/it][2025-06-19 16:14:57,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:14:57,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.93 | bwd_microstep: 3334.73 | bwd_inner_microstep: 3333.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 16:14:57,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.93 | bwd: 3334.75 | bwd_inner: 3333.95 | bwd_allreduce: 0.76 | step: 6.56 18%|█▊ | 1759/10000 [2:45:18<12:35:30, 5.50s/it] {'loss': 0.2189, 'grad_norm': 1.5633796453475952, 'learning_rate': 3.780835261992321e-05, 'epoch': 1.76} 18%|█▊ | 1759/10000 [2:45:18<12:35:30, 5.50s/it][2025-06-19 16:15:02,798] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:15:02,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.72 | bwd_microstep: 3331.32 | bwd_inner_microstep: 3330.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 16:15:02,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.72 | bwd: 3331.33 | bwd_inner: 3330.52 | bwd_allreduce: 0.77 | step: 6.91 18%|█▊ | 1760/10000 [2:45:23<12:35:01, 5.50s/it] {'loss': 0.2015, 'grad_norm': 0.7589941620826721, 'learning_rate': 3.780540348315569e-05, 'epoch': 1.76} 18%|█▊ | 1760/10000 [2:45:23<12:35:01, 5.50s/it][2025-06-19 16:15:08,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:15:08,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.47 | bwd_microstep: 3332.24 | bwd_inner_microstep: 3331.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 16:15:08,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.47 | bwd: 3332.26 | bwd_inner: 3331.46 | bwd_allreduce: 0.75 | step: 6.59 18%|█▊ | 1761/10000 [2:45:29<12:34:50, 5.50s/it] {'loss': 0.0832, 'grad_norm': 0.4875127375125885, 'learning_rate': 3.780245247868397e-05, 'epoch': 1.76} 18%|█▊ | 1761/10000 [2:45:29<12:34:50, 5.50s/it][2025-06-19 16:15:13,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:15:13,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.87 | bwd_microstep: 3380.20 | bwd_inner_microstep: 3379.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-19 16:15:13,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.87 | bwd: 3380.21 | bwd_inner: 3379.40 | bwd_allreduce: 0.77 | step: 6.79 18%|█▊ | 1762/10000 [2:45:34<12:36:49, 5.51s/it] 
{'loss': 0.0794, 'grad_norm': 0.49891021847724915, 'learning_rate': 3.7799499606817615e-05, 'epoch': 1.76} 18%|█▊ | 1762/10000 [2:45:34<12:36:49, 5.51s/it][2025-06-19 16:15:19,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:15:19,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.94 | bwd_microstep: 3324.05 | bwd_inner_microstep: 3323.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-19 16:15:19,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.94 | bwd: 3324.07 | bwd_inner: 3323.26 | bwd_allreduce: 0.77 | step: 6.87 18%|█▊ | 1763/10000 [2:45:40<12:35:00, 5.50s/it] {'loss': 0.1385, 'grad_norm': 0.9414947628974915, 'learning_rate': 3.779654486786636e-05, 'epoch': 1.76} 18%|█▊ | 1763/10000 [2:45:40<12:35:00, 5.50s/it][2025-06-19 16:15:24,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:15:24,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.74 | bwd_microstep: 3320.86 | bwd_inner_microstep: 3319.99 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.69 [2025-06-19 16:15:24,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.74 | bwd: 3320.88 | bwd_inner: 3319.99 | bwd_allreduce: 0.83 | step: 7.70 18%|█▊ | 1764/10000 [2:45:45<12:34:26, 5.50s/it] {'loss': 0.1266, 'grad_norm': 0.8713100552558899, 'learning_rate': 3.779358826214015e-05, 'epoch': 1.76} 18%|█▊ | 1764/10000 [2:45:45<12:34:26, 5.50s/it][2025-06-19 16:15:30,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 16:15:30,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.35 | bwd_microstep: 3373.78 | bwd_inner_microstep: 3372.71 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.30 
[2025-06-19 16:15:30,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.35 | bwd: 3373.80 | bwd_inner: 3372.71 | bwd_allreduce: 1.04 | step: 7.30 18%|█▊ | 1765/10000 [2:45:51<12:37:34, 5.52s/it] {'loss': 0.081, 'grad_norm': 0.47259366512298584, 'learning_rate': 3.77906297899491e-05, 'epoch': 1.77} 18%|█▊ | 1765/10000 [2:45:51<12:37:34, 5.52s/it][2025-06-19 16:15:35,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:15:35,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.59 | bwd_microstep: 3375.00 | bwd_inner_microstep: 3374.13 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.03 [2025-06-19 16:15:35,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.59 | bwd: 3375.02 | bwd_inner: 3374.13 | bwd_allreduce: 0.84 | step: 7.03 18%|█▊ | 1766/10000 [2:45:56<12:39:27, 5.53s/it] {'loss': 0.0952, 'grad_norm': 0.49973052740097046, 'learning_rate': 3.7787669451603564e-05, 'epoch': 1.77} 18%|█▊ | 1766/10000 [2:45:56<12:39:27, 5.53s/it][2025-06-19 16:15:41,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 16:15:41,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.02 | bwd_microstep: 3319.47 | bwd_inner_microstep: 3318.46 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.35 [2025-06-19 16:15:41,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.02 | bwd: 3319.49 | bwd_inner: 3318.46 | bwd_allreduce: 0.98 | step: 7.35 18%|█▊ | 1767/10000 [2:46:02<12:36:50, 5.52s/it] {'loss': 0.1334, 'grad_norm': 0.8897166848182678, 'learning_rate': 3.7784707247414065e-05, 'epoch': 1.77} 18%|█▊ | 1767/10000 [2:46:02<12:36:50, 5.52s/it][2025-06-19 16:15:46,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 
[2025-06-19 16:15:46,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.46 | bwd_microstep: 3382.86 | bwd_inner_microstep: 3381.95 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.60 [2025-06-19 16:15:46,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.46 | bwd: 3382.87 | bwd_inner: 3381.95 | bwd_allreduce: 0.88 | step: 7.60 18%|█▊ | 1768/10000 [2:46:07<12:38:30, 5.53s/it] {'loss': 0.059, 'grad_norm': 0.44967415928840637, 'learning_rate': 3.7781743177691306e-05, 'epoch': 1.77} 18%|█▊ | 1768/10000 [2:46:07<12:38:30, 5.53s/it][2025-06-19 16:15:52,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:15:52,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.49 | bwd_microstep: 3326.37 | bwd_inner_microstep: 3325.56 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-19 16:15:52,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.49 | bwd: 3326.38 | bwd_inner: 3325.56 | bwd_allreduce: 0.78 | step: 7.25 18%|█▊ | 1769/10000 [2:46:13<12:36:19, 5.51s/it] {'loss': 0.1364, 'grad_norm': 1.1090840101242065, 'learning_rate': 3.7778777242746214e-05, 'epoch': 1.77} 18%|█▊ | 1769/10000 [2:46:13<12:36:19, 5.51s/it][2025-06-19 16:15:57,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:15:57,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.52 | bwd_microstep: 3331.06 | bwd_inner_microstep: 3330.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 16:15:57,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.52 | bwd: 3331.07 | bwd_inner: 3330.27 | bwd_allreduce: 0.76 | step: 6.63 18%|█▊ | 1770/10000 [2:46:18<12:34:41, 5.50s/it] {'loss': 0.1047, 'grad_norm': 0.7797141671180725, 'learning_rate': 3.777580944288991e-05, 
'epoch': 1.77} 18%|█▊ | 1770/10000 [2:46:18<12:34:41, 5.50s/it][2025-06-19 16:16:03,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:16:03,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.52 | bwd_microstep: 3315.27 | bwd_inner_microstep: 3314.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 16:16:03,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.52 | bwd: 3315.29 | bwd_inner: 3314.47 | bwd_allreduce: 0.77 | step: 6.74 18%|█▊ | 1771/10000 [2:46:24<12:32:41, 5.49s/it] {'loss': 0.0686, 'grad_norm': 0.5564988255500793, 'learning_rate': 3.777283977843369e-05, 'epoch': 1.77} 18%|█▊ | 1771/10000 [2:46:24<12:32:41, 5.49s/it][2025-06-19 16:16:08,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:16:08,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.52 | bwd_microstep: 3368.41 | bwd_inner_microstep: 3367.59 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.28 [2025-06-19 16:16:08,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.52 | bwd: 3368.43 | bwd_inner: 3367.59 | bwd_allreduce: 0.79 | step: 7.28 18%|█▊ | 1772/10000 [2:46:29<12:34:45, 5.50s/it] {'loss': 0.0618, 'grad_norm': 0.41804561018943787, 'learning_rate': 3.776986824968907e-05, 'epoch': 1.77} 18%|█▊ | 1772/10000 [2:46:29<12:34:45, 5.50s/it][2025-06-19 16:16:14,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:16:14,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.21 | bwd_microstep: 3310.90 | bwd_inner_microstep: 3309.96 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.97 [2025-06-19 16:16:14,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2107.21 | bwd: 3310.91 | bwd_inner: 3309.96 | bwd_allreduce: 0.91 | step: 6.98 18%|█▊ | 1773/10000 [2:46:35<12:32:48, 5.49s/it] {'loss': 0.114, 'grad_norm': 1.128479242324829, 'learning_rate': 3.776689485696774e-05, 'epoch': 1.77} 18%|█▊ | 1773/10000 [2:46:35<12:32:48, 5.49s/it][2025-06-19 16:16:19,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 16:16:19,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.84 | bwd_microstep: 3359.38 | bwd_inner_microstep: 3358.55 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.27 [2025-06-19 16:16:19,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.84 | bwd: 3359.39 | bwd_inner: 3358.55 | bwd_allreduce: 0.80 | step: 7.28 18%|█▊ | 1774/10000 [2:46:40<12:34:14, 5.50s/it] {'loss': 0.128, 'grad_norm': 0.6284888386726379, 'learning_rate': 3.77639196005816e-05, 'epoch': 1.77} 18%|█▊ | 1774/10000 [2:46:40<12:34:14, 5.50s/it][2025-06-19 16:16:25,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:16:25,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.75 | bwd_microstep: 3312.51 | bwd_inner_microstep: 3311.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 16:16:25,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.75 | bwd: 3312.53 | bwd_inner: 3311.72 | bwd_allreduce: 0.76 | step: 6.69 18%|█▊ | 1775/10000 [2:46:46<12:32:38, 5.49s/it] {'loss': 0.1173, 'grad_norm': 0.6931294202804565, 'learning_rate': 3.776094248084273e-05, 'epoch': 1.77} 18%|█▊ | 1775/10000 [2:46:46<12:32:38, 5.49s/it][2025-06-19 16:16:30,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 16:16:30,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2105.65 | bwd_microstep: 3329.22 | bwd_inner_microstep: 3328.30 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.38 [2025-06-19 16:16:30,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.65 | bwd: 3329.23 | bwd_inner: 3328.30 | bwd_allreduce: 0.88 | step: 7.38 18%|█▊ | 1776/10000 [2:46:51<12:31:57, 5.49s/it] {'loss': 0.1703, 'grad_norm': 0.9368542432785034, 'learning_rate': 3.7757963498063436e-05, 'epoch': 1.78} 18%|█▊ | 1776/10000 [2:46:51<12:31:57, 5.49s/it][2025-06-19 16:16:36,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 16:16:36,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.40 | bwd_microstep: 3320.81 | bwd_inner_microstep: 3319.80 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.28 [2025-06-19 16:16:36,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.40 | bwd: 3320.83 | bwd_inner: 3319.81 | bwd_allreduce: 0.97 | step: 7.28 18%|█▊ | 1777/10000 [2:46:57<12:31:08, 5.48s/it] {'loss': 0.0891, 'grad_norm': 0.45875313878059387, 'learning_rate': 3.7754982652556186e-05, 'epoch': 1.78} 18%|█▊ | 1777/10000 [2:46:57<12:31:08, 5.48s/it][2025-06-19 16:16:41,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:16:41,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.48 | bwd_microstep: 3374.76 | bwd_inner_microstep: 3373.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 16:16:41,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.48 | bwd: 3374.77 | bwd_inner: 3373.95 | bwd_allreduce: 0.78 | step: 7.23 18%|█▊ | 1778/10000 [2:47:02<12:33:39, 5.50s/it] {'loss': 0.0812, 'grad_norm': 0.5911890864372253, 'learning_rate': 3.775199994463365e-05, 'epoch': 1.78} 18%|█▊ | 1778/10000 [2:47:02<12:33:39, 5.50s/it][2025-06-19 
16:16:47,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:16:47,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.59 | bwd_microstep: 3378.85 | bwd_inner_microstep: 3378.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 16:16:47,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.59 | bwd: 3378.86 | bwd_inner: 3378.07 | bwd_allreduce: 0.75 | step: 6.61 18%|█▊ | 1779/10000 [2:47:08<12:35:35, 5.51s/it] {'loss': 0.1592, 'grad_norm': 0.7834187150001526, 'learning_rate': 3.774901537460872e-05, 'epoch': 1.78} 18%|█▊ | 1779/10000 [2:47:08<12:35:35, 5.51s/it][2025-06-19 16:16:52,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 16:16:52,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.63 | bwd_microstep: 3323.29 | bwd_inner_microstep: 3322.10 | bwd_allreduce_microstep: 1.11 | step_microstep: 8.22 [2025-06-19 16:16:52,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.63 | bwd: 3323.31 | bwd_inner: 3322.10 | bwd_allreduce: 1.14 | step: 8.22 18%|█▊ | 1780/10000 [2:47:13<12:33:43, 5.50s/it] {'loss': 0.0928, 'grad_norm': 0.5640586018562317, 'learning_rate': 3.774602894279445e-05, 'epoch': 1.78} 18%|█▊ | 1780/10000 [2:47:13<12:33:43, 5.50s/it][2025-06-19 16:16:58,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 16:16:58,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.19 | bwd_microstep: 3315.45 | bwd_inner_microstep: 3314.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 16:16:58,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.19 | bwd: 3315.46 | bwd_inner: 3314.66 | bwd_allreduce: 0.76 | step: 6.63 
18%|█▊ | 1781/10000 [2:47:19<12:32:06, 5.49s/it] {'loss': 0.0891, 'grad_norm': 0.6878653764724731, 'learning_rate': 3.774304064950411e-05, 'epoch': 1.78} 18%|█▊ | 1781/10000 [2:47:19<12:32:06, 5.49s/it][2025-06-19 16:17:03,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:17:03,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.90 | bwd_microstep: 3368.26 | bwd_inner_microstep: 3367.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 16:17:03,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.90 | bwd: 3368.28 | bwd_inner: 3367.44 | bwd_allreduce: 0.78 | step: 7.10 18%|█▊ | 1782/10000 [2:47:24<12:33:51, 5.50s/it] {'loss': 0.1014, 'grad_norm': 0.6254920959472656, 'learning_rate': 3.774005049505114e-05, 'epoch': 1.78} 18%|█▊ | 1782/10000 [2:47:24<12:33:51, 5.50s/it][2025-06-19 16:17:09,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:17:09,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.71 | bwd_microstep: 3313.59 | bwd_inner_microstep: 3312.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.61 [2025-06-19 16:17:09,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.71 | bwd: 3313.60 | bwd_inner: 3312.79 | bwd_allreduce: 0.76 | step: 6.61 18%|█▊ | 1783/10000 [2:47:30<12:31:52, 5.49s/it] {'loss': 0.0778, 'grad_norm': 0.8234845995903015, 'learning_rate': 3.7737058479749216e-05, 'epoch': 1.78} 18%|█▊ | 1783/10000 [2:47:30<12:31:52, 5.49s/it][2025-06-19 16:17:14,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:17:14,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.36 | bwd_microstep: 3315.83 | bwd_inner_microstep: 3315.02 | 
bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 16:17:14,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.36 | bwd: 3315.84 | bwd_inner: 3315.02 | bwd_allreduce: 0.78 | step: 6.77 18%|█▊ | 1784/10000 [2:47:35<12:30:41, 5.48s/it] {'loss': 0.0964, 'grad_norm': 0.8176344633102417, 'learning_rate': 3.773406460391218e-05, 'epoch': 1.78} 18%|█▊ | 1784/10000 [2:47:35<12:30:41, 5.48s/it][2025-06-19 16:17:20,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:17:20,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.04 | bwd_microstep: 3364.02 | bwd_inner_microstep: 3363.20 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-19 16:17:20,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.04 | bwd: 3364.03 | bwd_inner: 3363.20 | bwd_allreduce: 0.79 | step: 7.31 18%|█▊ | 1785/10000 [2:47:41<12:33:15, 5.50s/it] {'loss': 0.1526, 'grad_norm': 0.8780355453491211, 'learning_rate': 3.773106886785407e-05, 'epoch': 1.79} 18%|█▊ | 1785/10000 [2:47:41<12:33:15, 5.50s/it][2025-06-19 16:17:25,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:17:25,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.28 | bwd_microstep: 3325.11 | bwd_inner_microstep: 3324.28 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.22 [2025-06-19 16:17:25,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.28 | bwd: 3325.12 | bwd_inner: 3324.28 | bwd_allreduce: 0.79 | step: 7.22 18%|█▊ | 1786/10000 [2:47:46<12:31:47, 5.49s/it] {'loss': 0.1112, 'grad_norm': 0.6864232420921326, 'learning_rate': 3.772807127188913e-05, 'epoch': 1.79} 18%|█▊ | 1786/10000 [2:47:46<12:31:47, 5.49s/it][2025-06-19 16:17:31,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:17:31,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.76 | bwd_microstep: 3314.79 | bwd_inner_microstep: 3313.83 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.05 [2025-06-19 16:17:31,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.77 | bwd: 3314.81 | bwd_inner: 3313.83 | bwd_allreduce: 0.93 | step: 7.05 18%|█▊ | 1787/10000 [2:47:52<12:30:04, 5.48s/it] {'loss': 0.1721, 'grad_norm': 1.2295653820037842, 'learning_rate': 3.7725071816331794e-05, 'epoch': 1.79} 18%|█▊ | 1787/10000 [2:47:52<12:30:04, 5.48s/it][2025-06-19 16:17:36,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:17:36,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.43 | bwd_microstep: 3316.07 | bwd_inner_microstep: 3315.20 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.99 [2025-06-19 16:17:36,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.44 | bwd: 3316.09 | bwd_inner: 3315.20 | bwd_allreduce: 0.85 | step: 6.99 18%|█▊ | 1788/10000 [2:47:57<12:29:32, 5.48s/it] {'loss': 0.1012, 'grad_norm': 0.6994677186012268, 'learning_rate': 3.7722070501496685e-05, 'epoch': 1.79} 18%|█▊ | 1788/10000 [2:47:57<12:29:32, 5.48s/it][2025-06-19 16:17:42,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:17:42,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.92 | bwd_microstep: 3361.79 | bwd_inner_microstep: 3360.96 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.49 [2025-06-19 16:17:42,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.92 | bwd: 3361.81 | bwd_inner: 3360.96 | bwd_allreduce: 0.81 | step: 7.49 18%|█▊ | 1789/10000 [2:48:03<12:31:41, 5.49s/it] {'loss': 0.0815, 'grad_norm': 
0.31538236141204834, 'learning_rate': 3.7719067327698635e-05, 'epoch': 1.79} 18%|█▊ | 1789/10000 [2:48:03<12:31:41, 5.49s/it][2025-06-19 16:17:47,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:17:47,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.11 | bwd_microstep: 3365.66 | bwd_inner_microstep: 3364.86 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.71 [2025-06-19 16:17:47,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.11 | bwd: 3365.68 | bwd_inner: 3364.86 | bwd_allreduce: 0.77 | step: 6.71 18%|█▊ | 1790/10000 [2:48:08<12:33:37, 5.51s/it] {'loss': 0.0815, 'grad_norm': 0.6561642289161682, 'learning_rate': 3.771606229525265e-05, 'epoch': 1.79} 18%|█▊ | 1790/10000 [2:48:08<12:33:37, 5.51s/it][2025-06-19 16:17:53,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:17:53,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.40 | bwd_microstep: 3310.72 | bwd_inner_microstep: 3309.90 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.21 [2025-06-19 16:17:53,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.40 | bwd: 3310.74 | bwd_inner: 3309.90 | bwd_allreduce: 0.79 | step: 7.21 18%|█▊ | 1791/10000 [2:48:14<12:31:18, 5.49s/it] {'loss': 0.0931, 'grad_norm': 0.5309993028640747, 'learning_rate': 3.771305540447397e-05, 'epoch': 1.79} 18%|█▊ | 1791/10000 [2:48:14<12:31:18, 5.49s/it][2025-06-19 16:17:58,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:17:58,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.61 | bwd_microstep: 3311.28 | bwd_inner_microstep: 3310.33 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.93 [2025-06-19 16:17:58,726] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.61 | bwd: 3311.30 | bwd_inner: 3310.33 | bwd_allreduce: 0.92 | step: 6.94 18%|█▊ | 1792/10000 [2:48:19<12:29:55, 5.48s/it] {'loss': 0.0983, 'grad_norm': 0.5654146075248718, 'learning_rate': 3.771004665567797e-05, 'epoch': 1.79} 18%|█▊ | 1792/10000 [2:48:19<12:29:55, 5.48s/it][2025-06-19 16:18:04,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:18:04,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.32 | bwd_microstep: 3361.94 | bwd_inner_microstep: 3361.11 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.74 [2025-06-19 16:18:04,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.32 | bwd: 3361.95 | bwd_inner: 3361.11 | bwd_allreduce: 0.79 | step: 6.75 18%|█▊ | 1793/10000 [2:48:25<12:31:44, 5.50s/it] {'loss': 0.0654, 'grad_norm': 0.38315027952194214, 'learning_rate': 3.770703604918027e-05, 'epoch': 1.79} 18%|█▊ | 1793/10000 [2:48:25<12:31:44, 5.50s/it][2025-06-19 16:18:09,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:18:09,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.35 | bwd_microstep: 3365.30 | bwd_inner_microstep: 3364.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 16:18:09,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.35 | bwd: 3365.31 | bwd_inner: 3364.50 | bwd_allreduce: 0.76 | step: 6.65 18%|█▊ | 1794/10000 [2:48:30<12:33:15, 5.51s/it] {'loss': 0.1716, 'grad_norm': 1.0507360696792603, 'learning_rate': 3.770402358529668e-05, 'epoch': 1.79} 18%|█▊ | 1794/10000 [2:48:30<12:33:15, 5.51s/it][2025-06-19 16:18:15,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.92 [2025-06-19 16:18:15,250] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.16 | bwd_microstep: 3317.00 | bwd_inner_microstep: 3316.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-19 16:18:15,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.16 | bwd: 3317.02 | bwd_inner: 3316.20 | bwd_allreduce: 0.77 | step: 7.30 18%|█▊ | 1795/10000 [2:48:36<12:31:15, 5.49s/it] {'loss': 0.0701, 'grad_norm': 0.44860631227493286, 'learning_rate': 3.770100926434318e-05, 'epoch': 1.79} 18%|█▊ | 1795/10000 [2:48:36<12:31:15, 5.49s/it][2025-06-19 16:18:20,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:18:20,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.48 | bwd_microstep: 3327.11 | bwd_inner_microstep: 3326.28 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.68 [2025-06-19 16:18:20,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.48 | bwd: 3327.12 | bwd_inner: 3326.28 | bwd_allreduce: 0.80 | step: 6.69 18%|█▊ | 1796/10000 [2:48:41<12:30:12, 5.49s/it] {'loss': 0.1003, 'grad_norm': 0.640450119972229, 'learning_rate': 3.7697993086635964e-05, 'epoch': 1.8} 18%|█▊ | 1796/10000 [2:48:41<12:30:12, 5.49s/it][2025-06-19 16:18:26,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:18:26,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.50 | bwd_microstep: 3358.18 | bwd_inner_microstep: 3357.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 16:18:26,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.50 | bwd: 3358.20 | bwd_inner: 3357.38 | bwd_allreduce: 0.78 | step: 7.10 18%|█▊ | 1797/10000 [2:48:47<12:31:46, 5.50s/it] {'loss': 0.0855, 'grad_norm': 0.6290597915649414, 'learning_rate': 3.769497505249141e-05, 'epoch': 1.8} 18%|█▊ | 1797/10000 
[2:48:47<12:31:46, 5.50s/it][2025-06-19 16:18:31,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:18:31,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.48 | bwd_microstep: 3365.24 | bwd_inner_microstep: 3364.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 16:18:31,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.48 | bwd: 3365.25 | bwd_inner: 3364.46 | bwd_allreduce: 0.76 | step: 6.62 18%|█▊ | 1798/10000 [2:48:52<12:33:13, 5.51s/it] {'loss': 0.0921, 'grad_norm': 0.6313319206237793, 'learning_rate': 3.7691955162226096e-05, 'epoch': 1.8} 18%|█▊ | 1798/10000 [2:48:52<12:33:13, 5.51s/it][2025-06-19 16:18:37,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.73 [2025-06-19 16:18:37,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.88 | bwd_microstep: 3356.09 | bwd_inner_microstep: 3355.22 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.43 [2025-06-19 16:18:37,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.88 | bwd: 3356.11 | bwd_inner: 3355.22 | bwd_allreduce: 0.84 | step: 7.43 18%|█▊ | 1799/10000 [2:48:58<12:33:38, 5.51s/it] {'loss': 0.051, 'grad_norm': 0.26014068722724915, 'learning_rate': 3.7688933416156795e-05, 'epoch': 1.8} 18%|█▊ | 1799/10000 [2:48:58<12:33:38, 5.51s/it][2025-06-19 16:18:42,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:18:42,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.29 | bwd_microstep: 3308.90 | bwd_inner_microstep: 3308.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 16:18:42,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.29 | bwd: 3308.91 | bwd_inner: 
3308.12 | bwd_allreduce: 0.75 | step: 6.58 18%|█▊ | 1800/10000 [2:49:03<12:30:47, 5.49s/it] {'loss': 0.0516, 'grad_norm': 0.35314181447029114, 'learning_rate': 3.768590981460047e-05, 'epoch': 1.8} 18%|█▊ | 1800/10000 [2:49:03<12:30:47, 5.49s/it][2025-06-19 16:18:48,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:18:48,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.20 | bwd_microstep: 3314.49 | bwd_inner_microstep: 3313.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 16:18:48,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.20 | bwd: 3314.50 | bwd_inner: 3313.70 | bwd_allreduce: 0.76 | step: 6.63 18%|█▊ | 1801/10000 [2:49:09<12:29:19, 5.48s/it] {'loss': 0.1063, 'grad_norm': 0.8634037971496582, 'learning_rate': 3.76828843578743e-05, 'epoch': 1.8} 18%|█▊ | 1801/10000 [2:49:09<12:29:19, 5.48s/it][2025-06-19 16:18:53,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:18:53,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.19 | bwd_microstep: 3368.40 | bwd_inner_microstep: 3367.53 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.21 [2025-06-19 16:18:53,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.19 | bwd: 3368.42 | bwd_inner: 3367.53 | bwd_allreduce: 0.83 | step: 7.21 18%|█▊ | 1802/10000 [2:49:14<12:31:38, 5.50s/it] {'loss': 0.1089, 'grad_norm': 0.7221509218215942, 'learning_rate': 3.767985704629562e-05, 'epoch': 1.8} 18%|█▊ | 1802/10000 [2:49:14<12:31:38, 5.50s/it][2025-06-19 16:18:59,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 16:18:59,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.20 | bwd_microstep: 
3370.01 | bwd_inner_microstep: 3369.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 16:18:59,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.20 | bwd: 3370.02 | bwd_inner: 3369.23 | bwd_allreduce: 0.76 | step: 6.63 18%|█▊ | 1803/10000 [2:49:20<12:32:58, 5.51s/it] {'loss': 0.0708, 'grad_norm': 0.33242955803871155, 'learning_rate': 3.7676827880182e-05, 'epoch': 1.8} 18%|█▊ | 1803/10000 [2:49:20<12:32:58, 5.51s/it][2025-06-19 16:19:04,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 16:19:04,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.84 | bwd_microstep: 3370.53 | bwd_inner_microstep: 3369.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 16:19:04,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.84 | bwd: 3370.54 | bwd_inner: 3369.74 | bwd_allreduce: 0.76 | step: 6.68 18%|█▊ | 1804/10000 [2:49:25<12:34:03, 5.52s/it] {'loss': 0.1195, 'grad_norm': 0.5631278157234192, 'learning_rate': 3.767379685985117e-05, 'epoch': 1.8} 18%|█▊ | 1804/10000 [2:49:25<12:34:03, 5.52s/it][2025-06-19 16:19:10,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:19:10,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.40 | bwd_microstep: 3316.45 | bwd_inner_microstep: 3315.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 16:19:10,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.40 | bwd: 3316.47 | bwd_inner: 3315.67 | bwd_allreduce: 0.75 | step: 6.54 18%|█▊ | 1805/10000 [2:49:31<12:31:19, 5.50s/it] {'loss': 0.1247, 'grad_norm': 0.7455305457115173, 'learning_rate': 3.767076398562108e-05, 'epoch': 1.81} 18%|█▊ | 1805/10000 [2:49:31<12:31:19, 5.50s/it][2025-06-19 16:19:15,743] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:19:15,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.20 | bwd_microstep: 3316.75 | bwd_inner_microstep: 3315.97 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 16:19:15,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.20 | bwd: 3316.77 | bwd_inner: 3315.97 | bwd_allreduce: 0.75 | step: 6.58 18%|█▊ | 1806/10000 [2:49:36<12:29:22, 5.49s/it] {'loss': 0.0869, 'grad_norm': 0.6661253571510315, 'learning_rate': 3.766772925780986e-05, 'epoch': 1.81} 18%|█▊ | 1806/10000 [2:49:36<12:29:22, 5.49s/it][2025-06-19 16:19:21,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:19:21,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.93 | bwd_microstep: 3330.26 | bwd_inner_microstep: 3329.25 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.52 [2025-06-19 16:19:21,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.93 | bwd: 3330.28 | bwd_inner: 3329.25 | bwd_allreduce: 0.97 | step: 7.52 18%|█▊ | 1807/10000 [2:49:42<12:28:43, 5.48s/it] {'loss': 0.1368, 'grad_norm': 0.5687354803085327, 'learning_rate': 3.766469267673584e-05, 'epoch': 1.81} 18%|█▊ | 1807/10000 [2:49:42<12:28:43, 5.48s/it][2025-06-19 16:19:26,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:19:26,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.79 | bwd_microstep: 3363.09 | bwd_inner_microstep: 3362.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 16:19:26,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.79 | bwd: 3363.10 | bwd_inner: 3362.30 | bwd_allreduce: 0.76 | step: 6.65 18%|█▊ | 1808/10000 [2:49:47<12:30:43, 5.50s/it] {'loss': 
0.0952, 'grad_norm': 0.3725615441799164, 'learning_rate': 3.766165424271754e-05, 'epoch': 1.81} 18%|█▊ | 1808/10000 [2:49:47<12:30:43, 5.50s/it][2025-06-19 16:19:32,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:19:32,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.00 | bwd_microstep: 3304.45 | bwd_inner_microstep: 3303.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 16:19:32,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.00 | bwd: 3304.47 | bwd_inner: 3303.67 | bwd_allreduce: 0.75 | step: 6.55 18%|█▊ | 1809/10000 [2:49:52<12:28:38, 5.48s/it] {'loss': 0.0321, 'grad_norm': 0.24728074669837952, 'learning_rate': 3.7658613956073675e-05, 'epoch': 1.81} 18%|█▊ | 1809/10000 [2:49:53<12:28:38, 5.48s/it][2025-06-19 16:19:37,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:19:37,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.95 | bwd_microstep: 3325.58 | bwd_inner_microstep: 3324.76 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.65 [2025-06-19 16:19:37,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.95 | bwd: 3325.59 | bwd_inner: 3324.76 | bwd_allreduce: 0.79 | step: 6.65 18%|█▊ | 1810/10000 [2:49:58<12:28:22, 5.48s/it] {'loss': 0.0607, 'grad_norm': 0.4412047564983368, 'learning_rate': 3.765557181712317e-05, 'epoch': 1.81} 18%|█▊ | 1810/10000 [2:49:58<12:28:22, 5.48s/it][2025-06-19 16:19:43,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.79 [2025-06-19 16:19:43,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.77 | bwd_microstep: 3362.89 | bwd_inner_microstep: 3362.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 
[2025-06-19 16:19:43,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.77 | bwd: 3362.90 | bwd_inner: 3362.11 | bwd_allreduce: 0.75 | step: 6.71 18%|█▊ | 1811/10000 [2:50:04<12:30:04, 5.50s/it] {'loss': 0.0628, 'grad_norm': 0.5383286476135254, 'learning_rate': 3.765252782618512e-05, 'epoch': 1.81} 18%|█▊ | 1811/10000 [2:50:04<12:30:04, 5.50s/it][2025-06-19 16:19:48,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:19:48,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.17 | bwd_microstep: 3378.22 | bwd_inner_microstep: 3377.31 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.91 [2025-06-19 16:19:48,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.17 | bwd: 3378.24 | bwd_inner: 3377.31 | bwd_allreduce: 0.88 | step: 6.92 18%|█▊ | 1812/10000 [2:50:09<12:32:23, 5.51s/it] {'loss': 0.0754, 'grad_norm': 0.4984656572341919, 'learning_rate': 3.764948198357883e-05, 'epoch': 1.81} 18%|█▊ | 1812/10000 [2:50:09<12:32:23, 5.51s/it][2025-06-19 16:19:54,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:19:54,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.81 | bwd_microstep: 3320.47 | bwd_inner_microstep: 3319.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 16:19:54,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.81 | bwd: 3320.49 | bwd_inner: 3319.68 | bwd_allreduce: 0.76 | step: 6.99 18%|█▊ | 1813/10000 [2:50:15<12:30:37, 5.50s/it] {'loss': 0.0662, 'grad_norm': 0.43751898407936096, 'learning_rate': 3.76464342896238e-05, 'epoch': 1.81} 18%|█▊ | 1813/10000 [2:50:15<12:30:37, 5.50s/it][2025-06-19 16:19:59,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 
[2025-06-19 16:19:59,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.14 | bwd_microstep: 3309.92 | bwd_inner_microstep: 3309.01 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.84 [2025-06-19 16:19:59,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.14 | bwd: 3309.93 | bwd_inner: 3309.01 | bwd_allreduce: 0.88 | step: 6.85 18%|█▊ | 1814/10000 [2:50:20<12:28:16, 5.48s/it] {'loss': 0.0776, 'grad_norm': 0.584882915019989, 'learning_rate': 3.7643384744639704e-05, 'epoch': 1.81} 18%|█▊ | 1814/10000 [2:50:20<12:28:16, 5.48s/it][2025-06-19 16:20:05,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:20:05,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.40 | bwd_microstep: 3318.97 | bwd_inner_microstep: 3318.03 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.00 [2025-06-19 16:20:05,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.40 | bwd: 3318.98 | bwd_inner: 3318.03 | bwd_allreduce: 0.90 | step: 7.00 18%|█▊ | 1815/10000 [2:50:25<12:27:52, 5.48s/it] {'loss': 0.0443, 'grad_norm': 0.28125521540641785, 'learning_rate': 3.7640333348946434e-05, 'epoch': 1.81} 18%|█▊ | 1815/10000 [2:50:25<12:27:52, 5.48s/it][2025-06-19 16:20:10,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 16:20:10,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.57 | bwd_microstep: 3308.87 | bwd_inner_microstep: 3308.09 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.71 [2025-06-19 16:20:10,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.57 | bwd: 3308.88 | bwd_inner: 3308.09 | bwd_allreduce: 0.75 | step: 6.71 18%|█▊ | 1816/10000 [2:50:31<12:26:38, 5.47s/it] {'loss': 0.0813, 'grad_norm': 0.7293397188186646, 'learning_rate': 3.763728010286407e-05, 
'epoch': 1.82} 18%|█▊ | 1816/10000 [2:50:31<12:26:38, 5.47s/it][2025-06-19 16:20:16,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:20:16,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.30 | bwd_microstep: 3360.01 | bwd_inner_microstep: 3359.23 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.68 [2025-06-19 16:20:16,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.30 | bwd: 3360.02 | bwd_inner: 3359.23 | bwd_allreduce: 0.75 | step: 6.68 18%|█▊ | 1817/10000 [2:50:36<12:28:23, 5.49s/it] {'loss': 0.0827, 'grad_norm': 0.5009747743606567, 'learning_rate': 3.763422500671288e-05, 'epoch': 1.82} 18%|█▊ | 1817/10000 [2:50:36<12:28:23, 5.49s/it][2025-06-19 16:20:21,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 16:20:21,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.69 | bwd_microstep: 3364.17 | bwd_inner_microstep: 3363.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.50 [2025-06-19 16:20:21,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.69 | bwd: 3364.19 | bwd_inner: 3363.40 | bwd_allreduce: 0.75 | step: 6.51 18%|█▊ | 1818/10000 [2:50:42<12:30:18, 5.50s/it] {'loss': 0.0444, 'grad_norm': 0.287382572889328, 'learning_rate': 3.7631168060813326e-05, 'epoch': 1.82} 18%|█▊ | 1818/10000 [2:50:42<12:30:18, 5.50s/it][2025-06-19 16:20:27,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:20:27,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.68 | bwd_microstep: 3366.47 | bwd_inner_microstep: 3365.64 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.68 [2025-06-19 16:20:27,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2128.68 | bwd: 3366.49 | bwd_inner: 3365.64 | bwd_allreduce: 0.79 | step: 6.69 18%|█▊ | 1819/10000 [2:50:47<12:31:30, 5.51s/it] {'loss': 0.1494, 'grad_norm': 1.5400446653366089, 'learning_rate': 3.7628109265486076e-05, 'epoch': 1.82} 18%|█▊ | 1819/10000 [2:50:47<12:31:30, 5.51s/it][2025-06-19 16:20:32,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:20:32,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.45 | bwd_microstep: 3317.94 | bwd_inner_microstep: 3317.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.70 [2025-06-19 16:20:32,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.45 | bwd: 3317.96 | bwd_inner: 3317.13 | bwd_allreduce: 0.78 | step: 6.70 18%|█▊ | 1820/10000 [2:50:53<12:29:27, 5.50s/it] {'loss': 0.112, 'grad_norm': 0.9426852464675903, 'learning_rate': 3.762504862105198e-05, 'epoch': 1.82} 18%|█▊ | 1820/10000 [2:50:53<12:29:27, 5.50s/it][2025-06-19 16:20:38,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:20:38,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.59 | bwd_microstep: 3324.63 | bwd_inner_microstep: 3323.60 | bwd_allreduce_microstep: 0.98 | step_microstep: 6.94 [2025-06-19 16:20:38,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.59 | bwd: 3324.64 | bwd_inner: 3323.60 | bwd_allreduce: 1.00 | step: 6.94 18%|█▊ | 1821/10000 [2:50:58<12:28:14, 5.49s/it] {'loss': 0.0902, 'grad_norm': 0.6624624133110046, 'learning_rate': 3.762198612783208e-05, 'epoch': 1.82} 18%|█▊ | 1821/10000 [2:50:58<12:28:14, 5.49s/it][2025-06-19 16:20:43,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:20:43,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2109.29 | bwd_microstep: 3316.10 | bwd_inner_microstep: 3315.25 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.16 [2025-06-19 16:20:43,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.29 | bwd: 3316.11 | bwd_inner: 3315.25 | bwd_allreduce: 0.82 | step: 7.16 18%|█▊ | 1822/10000 [2:51:04<12:27:09, 5.48s/it] {'loss': 0.1532, 'grad_norm': 0.9476511478424072, 'learning_rate': 3.761892178614762e-05, 'epoch': 1.82} 18%|█▊ | 1822/10000 [2:51:04<12:27:09, 5.48s/it][2025-06-19 16:20:49,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:20:49,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.54 | bwd_microstep: 3306.44 | bwd_inner_microstep: 3305.62 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 16:20:49,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.54 | bwd: 3306.46 | bwd_inner: 3305.63 | bwd_allreduce: 0.78 | step: 6.75 18%|█▊ | 1823/10000 [2:51:09<12:25:42, 5.47s/it] {'loss': 0.1092, 'grad_norm': 0.6985800862312317, 'learning_rate': 3.761585559632004e-05, 'epoch': 1.82} 18%|█▊ | 1823/10000 [2:51:09<12:25:42, 5.47s/it][2025-06-19 16:20:54,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:20:54,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.18 | bwd_microstep: 3319.64 | bwd_inner_microstep: 3318.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 16:20:54,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.18 | bwd: 3319.66 | bwd_inner: 3318.86 | bwd_allreduce: 0.75 | step: 6.61 18%|█▊ | 1824/10000 [2:51:15<12:25:26, 5.47s/it] {'loss': 0.0907, 'grad_norm': 0.6547413468360901, 'learning_rate': 3.7612787558670966e-05, 'epoch': 1.82} 18%|█▊ | 1824/10000 [2:51:15<12:25:26, 5.47s/it][2025-06-19 
16:20:59,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 16:20:59,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.03 | bwd_microstep: 3310.48 | bwd_inner_microstep: 3309.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 16:20:59,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.03 | bwd: 3310.69 | bwd_inner: 3309.68 | bwd_allreduce: 0.77 | step: 6.62 18%|█▊ | 1825/10000 [2:51:20<12:24:39, 5.47s/it] {'loss': 0.1449, 'grad_norm': 1.0798207521438599, 'learning_rate': 3.760971767352222e-05, 'epoch': 1.82} 18%|█▊ | 1825/10000 [2:51:20<12:24:39, 5.47s/it][2025-06-19 16:21:05,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:21:05,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.09 | bwd_microstep: 3312.58 | bwd_inner_microstep: 3311.62 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.28 [2025-06-19 16:21:05,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.09 | bwd: 3312.59 | bwd_inner: 3311.62 | bwd_allreduce: 0.92 | step: 7.28 18%|█▊ | 1826/10000 [2:51:26<12:24:15, 5.46s/it] {'loss': 0.0697, 'grad_norm': 0.4262545704841614, 'learning_rate': 3.7606645941195814e-05, 'epoch': 1.83} 18%|█▊ | 1826/10000 [2:51:26<12:24:15, 5.46s/it][2025-06-19 16:21:10,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:21:10,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.33 | bwd_microstep: 3317.53 | bwd_inner_microstep: 3316.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 16:21:10,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.33 | bwd: 3317.54 | bwd_inner: 3316.73 | bwd_allreduce: 0.76 | step: 6.66 
18%|█▊ | 1827/10000 [2:51:31<12:24:03, 5.46s/it] {'loss': 0.0321, 'grad_norm': 0.22006778419017792, 'learning_rate': 3.760357236201397e-05, 'epoch': 1.83} 18%|█▊ | 1827/10000 [2:51:31<12:24:03, 5.46s/it][2025-06-19 16:21:16,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:21:16,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.85 | bwd_microstep: 3311.44 | bwd_inner_microstep: 3310.58 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.56 [2025-06-19 16:21:16,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.86 | bwd: 3311.46 | bwd_inner: 3310.58 | bwd_allreduce: 0.82 | step: 7.56 18%|█▊ | 1828/10000 [2:51:37<12:23:35, 5.46s/it] {'loss': 0.0655, 'grad_norm': 0.6895474791526794, 'learning_rate': 3.760049693629909e-05, 'epoch': 1.83} 18%|█▊ | 1828/10000 [2:51:37<12:23:35, 5.46s/it][2025-06-19 16:21:21,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:21:21,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.35 | bwd_microstep: 3372.76 | bwd_inner_microstep: 3371.90 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.87 [2025-06-19 16:21:21,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.35 | bwd: 3372.78 | bwd_inner: 3371.90 | bwd_allreduce: 0.84 | step: 6.88 18%|█▊ | 1829/10000 [2:51:42<12:27:16, 5.49s/it] {'loss': 0.258, 'grad_norm': 1.405765414237976, 'learning_rate': 3.759741966437376e-05, 'epoch': 1.83} 18%|█▊ | 1829/10000 [2:51:42<12:27:16, 5.49s/it][2025-06-19 16:21:27,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:21:27,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.24 | bwd_microstep: 3312.17 | bwd_inner_microstep: 3311.22 | 
bwd_allreduce_microstep: 0.90 | step_microstep: 6.90 [2025-06-19 16:21:27,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.24 | bwd: 3312.19 | bwd_inner: 3311.22 | bwd_allreduce: 0.92 | step: 6.90 18%|█▊ | 1830/10000 [2:51:48<12:26:02, 5.48s/it] {'loss': 0.0873, 'grad_norm': 0.7710313200950623, 'learning_rate': 3.759434054656078e-05, 'epoch': 1.83} 18%|█▊ | 1830/10000 [2:51:48<12:26:02, 5.48s/it][2025-06-19 16:21:32,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:21:32,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.98 | bwd_microstep: 3319.81 | bwd_inner_microstep: 3318.87 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.01 [2025-06-19 16:21:32,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.98 | bwd: 3319.82 | bwd_inner: 3318.87 | bwd_allreduce: 0.90 | step: 7.01 18%|█▊ | 1831/10000 [2:51:53<12:25:37, 5.48s/it] {'loss': 0.0915, 'grad_norm': 0.7893099188804626, 'learning_rate': 3.759125958318314e-05, 'epoch': 1.83} 18%|█▊ | 1831/10000 [2:51:53<12:25:37, 5.48s/it][2025-06-19 16:21:38,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:21:38,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.49 | bwd_microstep: 3323.85 | bwd_inner_microstep: 3322.87 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.51 [2025-06-19 16:21:38,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.49 | bwd: 3323.87 | bwd_inner: 3322.87 | bwd_allreduce: 0.95 | step: 7.52 18%|█▊ | 1832/10000 [2:51:59<12:25:27, 5.48s/it] {'loss': 0.0807, 'grad_norm': 0.4903467297554016, 'learning_rate': 3.7588176774564e-05, 'epoch': 1.83} 18%|█▊ | 1832/10000 [2:51:59<12:25:27, 5.48s/it][2025-06-19 16:21:43,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:21:43,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.76 | bwd_microstep: 3370.34 | bwd_inner_microstep: 3369.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 16:21:43,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.76 | bwd: 3370.36 | bwd_inner: 3369.55 | bwd_allreduce: 0.77 | step: 6.95 18%|█▊ | 1833/10000 [2:52:04<12:28:07, 5.50s/it] {'loss': 0.1473, 'grad_norm': 1.0546108484268188, 'learning_rate': 3.758509212102676e-05, 'epoch': 1.83} 18%|█▊ | 1833/10000 [2:52:04<12:28:07, 5.50s/it][2025-06-19 16:21:49,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:21:49,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.64 | bwd_microstep: 3368.86 | bwd_inner_microstep: 3368.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 16:21:49,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.64 | bwd: 3368.88 | bwd_inner: 3368.07 | bwd_allreduce: 0.76 | step: 6.73 18%|█▊ | 1834/10000 [2:52:10<12:29:44, 5.51s/it] {'loss': 0.0937, 'grad_norm': 0.5129223465919495, 'learning_rate': 3.7582005622894965e-05, 'epoch': 1.83} 18%|█▊ | 1834/10000 [2:52:10<12:29:44, 5.51s/it][2025-06-19 16:21:54,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:21:54,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.11 | bwd_microstep: 3312.47 | bwd_inner_microstep: 3311.49 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.03 [2025-06-19 16:21:54,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.11 | bwd: 3312.49 | bwd_inner: 3311.49 | bwd_allreduce: 0.95 | step: 7.03 18%|█▊ | 1835/10000 [2:52:15<12:27:47, 5.50s/it] {'loss': 0.1049, 'grad_norm': 
1.4166728258132935, 'learning_rate': 3.7578917280492386e-05, 'epoch': 1.83} 18%|█▊ | 1835/10000 [2:52:15<12:27:47, 5.50s/it][2025-06-19 16:22:00,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 16:22:00,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.63 | bwd_microstep: 3315.68 | bwd_inner_microstep: 3314.52 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.55 [2025-06-19 16:22:00,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.63 | bwd: 3315.70 | bwd_inner: 3314.52 | bwd_allreduce: 1.12 | step: 7.56 18%|█▊ | 1836/10000 [2:52:21<12:26:23, 5.49s/it] {'loss': 0.1487, 'grad_norm': 0.729154109954834, 'learning_rate': 3.757582709414297e-05, 'epoch': 1.84} 18%|█▊ | 1836/10000 [2:52:21<12:26:23, 5.49s/it][2025-06-19 16:22:05,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:22:05,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.19 | bwd_microstep: 3326.35 | bwd_inner_microstep: 3325.27 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.81 [2025-06-19 16:22:05,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.19 | bwd: 3326.36 | bwd_inner: 3325.27 | bwd_allreduce: 1.05 | step: 7.81 18%|█▊ | 1837/10000 [2:52:26<12:25:54, 5.48s/it] {'loss': 0.0956, 'grad_norm': 0.6662809252738953, 'learning_rate': 3.757273506417085e-05, 'epoch': 1.84} 18%|█▊ | 1837/10000 [2:52:26<12:25:54, 5.48s/it][2025-06-19 16:22:11,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:22:11,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.73 | bwd_microstep: 3370.83 | bwd_inner_microstep: 3370.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 16:22:11,336] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.73 | bwd: 3370.85 | bwd_inner: 3370.04 | bwd_allreduce: 0.77 | step: 6.79 18%|█▊ | 1838/10000 [2:52:32<12:28:50, 5.50s/it] {'loss': 0.0955, 'grad_norm': 1.0867880582809448, 'learning_rate': 3.7569641190900395e-05, 'epoch': 1.84} 18%|█▊ | 1838/10000 [2:52:32<12:28:50, 5.50s/it][2025-06-19 16:22:16,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 16:22:16,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.83 | bwd_microstep: 3318.29 | bwd_inner_microstep: 3317.39 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.91 [2025-06-19 16:22:16,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.83 | bwd: 3318.30 | bwd_inner: 3317.39 | bwd_allreduce: 0.88 | step: 6.91 18%|█▊ | 1839/10000 [2:52:37<12:27:23, 5.49s/it] {'loss': 0.1408, 'grad_norm': 1.0159680843353271, 'learning_rate': 3.756654547465612e-05, 'epoch': 1.84} 18%|█▊ | 1839/10000 [2:52:37<12:27:23, 5.49s/it][2025-06-19 16:22:22,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:22:22,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.04 | bwd_microstep: 3321.49 | bwd_inner_microstep: 3320.68 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 16:22:22,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.04 | bwd: 3321.50 | bwd_inner: 3320.68 | bwd_allreduce: 0.78 | step: 7.22 18%|█▊ | 1840/10000 [2:52:43<12:26:14, 5.49s/it] {'loss': 0.1438, 'grad_norm': 0.7682169079780579, 'learning_rate': 3.756344791576275e-05, 'epoch': 1.84} 18%|█▊ | 1840/10000 [2:52:43<12:26:14, 5.49s/it][2025-06-19 16:22:27,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:22:27,736] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.62 | bwd_microstep: 3316.64 | bwd_inner_microstep: 3315.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 16:22:27,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.62 | bwd: 3316.66 | bwd_inner: 3315.86 | bwd_allreduce: 0.76 | step: 6.61 18%|█▊ | 1841/10000 [2:52:48<12:24:59, 5.48s/it] {'loss': 0.1277, 'grad_norm': 0.7883320450782776, 'learning_rate': 3.7560348514545205e-05, 'epoch': 1.84} 18%|█▊ | 1841/10000 [2:52:48<12:24:59, 5.48s/it][2025-06-19 16:22:33,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:22:33,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.89 | bwd_microstep: 3317.32 | bwd_inner_microstep: 3316.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 16:22:33,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.89 | bwd: 3317.33 | bwd_inner: 3316.53 | bwd_allreduce: 0.76 | step: 6.63 18%|█▊ | 1842/10000 [2:52:53<12:23:57, 5.47s/it] {'loss': 0.069, 'grad_norm': 0.4746003746986389, 'learning_rate': 3.755724727132861e-05, 'epoch': 1.84} 18%|█▊ | 1842/10000 [2:52:53<12:23:57, 5.47s/it][2025-06-19 16:22:38,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:22:38,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.66 | bwd_microstep: 3367.02 | bwd_inner_microstep: 3366.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 16:22:38,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.66 | bwd: 3367.04 | bwd_inner: 3366.23 | bwd_allreduce: 0.76 | step: 6.64 18%|█▊ | 1843/10000 [2:52:59<12:26:20, 5.49s/it] {'loss': 0.0985, 'grad_norm': 0.8922785520553589, 'learning_rate': 3.7554144186438254e-05, 'epoch': 1.84} 18%|█▊ | 
1843/10000 [2:52:59<12:26:20, 5.49s/it][2025-06-19 16:22:44,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:22:44,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.91 | bwd_microstep: 3324.93 | bwd_inner_microstep: 3323.80 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.28 [2025-06-19 16:22:44,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.91 | bwd: 3324.96 | bwd_inner: 3323.80 | bwd_allreduce: 1.09 | step: 8.29 18%|█▊ | 1844/10000 [2:53:04<12:25:44, 5.49s/it] {'loss': 0.1159, 'grad_norm': 0.6965133547782898, 'learning_rate': 3.7551039260199645e-05, 'epoch': 1.84} 18%|█▊ | 1844/10000 [2:53:05<12:25:44, 5.49s/it][2025-06-19 16:22:49,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.75 [2025-06-19 16:22:49,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.78 | bwd_microstep: 3366.89 | bwd_inner_microstep: 3365.96 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.98 [2025-06-19 16:22:49,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.78 | bwd: 3366.91 | bwd_inner: 3365.96 | bwd_allreduce: 0.90 | step: 6.98 18%|█▊ | 1845/10000 [2:53:10<12:28:04, 5.50s/it] {'loss': 0.1385, 'grad_norm': 0.6779653429985046, 'learning_rate': 3.7547932492938475e-05, 'epoch': 1.84} 18%|█▊ | 1845/10000 [2:53:10<12:28:04, 5.50s/it][2025-06-19 16:22:55,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.81 [2025-06-19 16:22:55,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.31 | bwd_microstep: 3369.76 | bwd_inner_microstep: 3368.97 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 16:22:55,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.31 | bwd: 3369.77 | 
bwd_inner: 3368.97 | bwd_allreduce: 0.76 | step: 6.67 18%|█▊ | 1846/10000 [2:53:16<12:29:36, 5.52s/it] {'loss': 0.2002, 'grad_norm': 0.9034785032272339, 'learning_rate': 3.754482388498063e-05, 'epoch': 1.85} 18%|█▊ | 1846/10000 [2:53:16<12:29:36, 5.52s/it][2025-06-19 16:23:00,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:23:00,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.30 | bwd_microstep: 3318.09 | bwd_inner_microstep: 3317.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 16:23:00,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.30 | bwd: 3318.10 | bwd_inner: 3317.30 | bwd_allreduce: 0.75 | step: 6.55 18%|█▊ | 1847/10000 [2:53:21<12:27:03, 5.50s/it] {'loss': 0.1828, 'grad_norm': 0.9164761900901794, 'learning_rate': 3.754171343665219e-05, 'epoch': 1.85} 18%|█▊ | 1847/10000 [2:53:21<12:27:03, 5.50s/it][2025-06-19 16:23:06,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:23:06,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.90 | bwd_microstep: 3323.49 | bwd_inner_microstep: 3322.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 16:23:06,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.90 | bwd: 3323.50 | bwd_inner: 3322.70 | bwd_allreduce: 0.75 | step: 6.62 18%|█▊ | 1848/10000 [2:53:27<12:25:41, 5.49s/it] {'loss': 0.0696, 'grad_norm': 1.2572842836380005, 'learning_rate': 3.753860114827942e-05, 'epoch': 1.85} 18%|█▊ | 1848/10000 [2:53:27<12:25:41, 5.49s/it][2025-06-19 16:23:11,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:23:11,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.87 | 
bwd_microstep: 3373.35 | bwd_inner_microstep: 3372.52 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.03 [2025-06-19 16:23:11,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.87 | bwd: 3373.37 | bwd_inner: 3372.52 | bwd_allreduce: 0.80 | step: 7.03 18%|█▊ | 1849/10000 [2:53:32<12:27:52, 5.51s/it] {'loss': 0.0786, 'grad_norm': 0.47361427545547485, 'learning_rate': 3.753548702018879e-05, 'epoch': 1.85} 18%|█▊ | 1849/10000 [2:53:32<12:27:52, 5.51s/it][2025-06-19 16:23:17,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:23:17,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.51 | bwd_microstep: 3314.13 | bwd_inner_microstep: 3313.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 16:23:17,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.51 | bwd: 3314.14 | bwd_inner: 3313.33 | bwd_allreduce: 0.77 | step: 6.95 18%|█▊ | 1850/10000 [2:53:38<12:26:02, 5.49s/it] {'loss': 0.1342, 'grad_norm': 0.7824318408966064, 'learning_rate': 3.753237105270696e-05, 'epoch': 1.85} 18%|█▊ | 1850/10000 [2:53:38<12:26:02, 5.49s/it][2025-06-19 16:23:22,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:23:22,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.60 | bwd_microstep: 3327.76 | bwd_inner_microstep: 3326.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 16:23:22,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.60 | bwd: 3327.77 | bwd_inner: 3326.96 | bwd_allreduce: 0.77 | step: 6.65 19%|█▊ | 1851/10000 [2:53:43<12:25:27, 5.49s/it] {'loss': 0.1166, 'grad_norm': 0.9330167770385742, 'learning_rate': 3.752925324616077e-05, 'epoch': 1.85} 19%|█▊ | 1851/10000 [2:53:43<12:25:27, 5.49s/it][2025-06-19 16:23:28,158] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 16:23:28,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.59 | bwd_microstep: 3317.50 | bwd_inner_microstep: 3316.60 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.56 [2025-06-19 16:23:28,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.59 | bwd: 3317.52 | bwd_inner: 3316.60 | bwd_allreduce: 0.87 | step: 7.56 19%|█▊ | 1852/10000 [2:53:48<12:24:16, 5.48s/it] {'loss': 0.1118, 'grad_norm': 0.8415570855140686, 'learning_rate': 3.7526133600877275e-05, 'epoch': 1.85} 19%|█▊ | 1852/10000 [2:53:48<12:24:16, 5.48s/it][2025-06-19 16:23:33,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:23:33,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.99 | bwd_microstep: 3371.28 | bwd_inner_microstep: 3370.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 16:23:33,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.99 | bwd: 3371.29 | bwd_inner: 3370.50 | bwd_allreduce: 0.75 | step: 6.66 19%|█▊ | 1853/10000 [2:53:54<12:27:25, 5.50s/it] {'loss': 0.1484, 'grad_norm': 1.202713131904602, 'learning_rate': 3.75230121171837e-05, 'epoch': 1.85} 19%|█▊ | 1853/10000 [2:53:54<12:27:25, 5.50s/it][2025-06-19 16:23:39,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:23:39,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.93 | bwd_microstep: 3315.97 | bwd_inner_microstep: 3314.93 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.61 [2025-06-19 16:23:39,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.93 | bwd: 3315.99 | bwd_inner: 3314.93 | bwd_allreduce: 1.01 | step: 7.62 19%|█▊ | 1854/10000 
[2:54:00<12:26:26, 5.50s/it] {'loss': 0.0521, 'grad_norm': 0.6506326794624329, 'learning_rate': 3.7519888795407496e-05, 'epoch': 1.85} 19%|█▊ | 1854/10000 [2:54:00<12:26:26, 5.50s/it][2025-06-19 16:23:44,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:23:44,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.26 | bwd_microstep: 3327.45 | bwd_inner_microstep: 3326.61 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.27 [2025-06-19 16:23:44,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.26 | bwd: 3327.47 | bwd_inner: 3326.61 | bwd_allreduce: 0.80 | step: 7.27 19%|█▊ | 1855/10000 [2:54:05<12:25:37, 5.49s/it] {'loss': 0.1056, 'grad_norm': 1.148486852645874, 'learning_rate': 3.751676363587625e-05, 'epoch': 1.85} 19%|█▊ | 1855/10000 [2:54:05<12:25:37, 5.49s/it][2025-06-19 16:23:50,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:23:50,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.91 | bwd_microstep: 3365.66 | bwd_inner_microstep: 3364.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.92 [2025-06-19 16:23:50,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.91 | bwd: 3365.68 | bwd_inner: 3364.87 | bwd_allreduce: 0.76 | step: 6.92 19%|█▊ | 1856/10000 [2:54:11<12:27:25, 5.51s/it] {'loss': 0.1159, 'grad_norm': 0.5265842080116272, 'learning_rate': 3.751363663891782e-05, 'epoch': 1.86} 19%|█▊ | 1856/10000 [2:54:11<12:27:25, 5.51s/it][2025-06-19 16:23:55,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:23:55,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.48 | bwd_microstep: 3339.16 | bwd_inner_microstep: 3338.21 | bwd_allreduce_microstep: 
0.91 | step_microstep: 7.61 [2025-06-19 16:23:55,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.48 | bwd: 3339.18 | bwd_inner: 3338.21 | bwd_allreduce: 0.93 | step: 7.61 19%|█▊ | 1857/10000 [2:54:16<12:26:42, 5.50s/it] {'loss': 0.1548, 'grad_norm': 0.8571770787239075, 'learning_rate': 3.7510507804860165e-05, 'epoch': 1.86} 19%|█▊ | 1857/10000 [2:54:16<12:26:42, 5.50s/it][2025-06-19 16:24:01,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:24:01,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.76 | bwd_microstep: 3375.91 | bwd_inner_microstep: 3374.93 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.08 [2025-06-19 16:24:01,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.76 | bwd: 3375.93 | bwd_inner: 3374.93 | bwd_allreduce: 0.95 | step: 7.09 19%|█▊ | 1858/10000 [2:54:22<12:28:51, 5.52s/it] {'loss': 0.0835, 'grad_norm': 0.8168399333953857, 'learning_rate': 3.750737713403152e-05, 'epoch': 1.86} 19%|█▊ | 1858/10000 [2:54:22<12:28:51, 5.52s/it][2025-06-19 16:24:06,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:24:06,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.25 | bwd_microstep: 3318.38 | bwd_inner_microstep: 3317.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 16:24:06,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.25 | bwd: 3318.39 | bwd_inner: 3317.60 | bwd_allreduce: 0.76 | step: 6.69 19%|█▊ | 1859/10000 [2:54:27<12:26:35, 5.50s/it] {'loss': 0.1081, 'grad_norm': 0.8101257085800171, 'learning_rate': 3.7504244626760274e-05, 'epoch': 1.86} 19%|█▊ | 1859/10000 [2:54:27<12:26:35, 5.50s/it][2025-06-19 16:24:12,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.49 | optimizer_step: 2.73 [2025-06-19 16:24:12,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.35 | bwd_microstep: 3378.51 | bwd_inner_microstep: 3377.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 16:24:12,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.35 | bwd: 3378.53 | bwd_inner: 3377.73 | bwd_allreduce: 0.75 | step: 6.57 19%|█▊ | 1860/10000 [2:54:33<12:28:28, 5.52s/it] {'loss': 0.0883, 'grad_norm': 0.521431565284729, 'learning_rate': 3.7501110283375e-05, 'epoch': 1.86} 19%|█▊ | 1860/10000 [2:54:33<12:28:28, 5.52s/it][2025-06-19 16:24:17,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:24:17,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.83 | bwd_microstep: 3331.89 | bwd_inner_microstep: 3331.09 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 16:24:17,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.83 | bwd: 3331.90 | bwd_inner: 3331.09 | bwd_allreduce: 0.77 | step: 7.07 19%|█▊ | 1861/10000 [2:54:38<12:27:30, 5.51s/it] {'loss': 0.0995, 'grad_norm': 0.7250348329544067, 'learning_rate': 3.749797410420448e-05, 'epoch': 1.86} 19%|█▊ | 1861/10000 [2:54:38<12:27:30, 5.51s/it][2025-06-19 16:24:23,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:24:23,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.36 | bwd_microstep: 3340.04 | bwd_inner_microstep: 3338.96 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.38 [2025-06-19 16:24:23,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.36 | bwd: 3340.06 | bwd_inner: 3338.96 | bwd_allreduce: 1.05 | step: 7.38 19%|█▊ | 1862/10000 [2:54:44<12:27:02, 5.51s/it] {'loss': 0.0767, 'grad_norm': 0.6419216394424438, 'learning_rate': 
3.7494836089577696e-05, 'epoch': 1.86} 19%|█▊ | 1862/10000 [2:54:44<12:27:02, 5.51s/it][2025-06-19 16:24:28,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:24:28,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.48 | bwd_microstep: 3322.63 | bwd_inner_microstep: 3321.79 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.80 [2025-06-19 16:24:28,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.48 | bwd: 3322.65 | bwd_inner: 3321.79 | bwd_allreduce: 0.81 | step: 6.80 19%|█▊ | 1863/10000 [2:54:49<12:25:43, 5.50s/it] {'loss': 0.1179, 'grad_norm': 0.7225522994995117, 'learning_rate': 3.7491696239823794e-05, 'epoch': 1.86} 19%|█▊ | 1863/10000 [2:54:49<12:25:43, 5.50s/it][2025-06-19 16:24:34,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:24:34,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.55 | bwd_microstep: 3375.80 | bwd_inner_microstep: 3374.75 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.42 [2025-06-19 16:24:34,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.55 | bwd: 3375.82 | bwd_inner: 3374.75 | bwd_allreduce: 1.01 | step: 7.42 19%|█▊ | 1864/10000 [2:54:55<12:27:28, 5.51s/it] {'loss': 0.1035, 'grad_norm': 0.5415908098220825, 'learning_rate': 3.748855455527214e-05, 'epoch': 1.86} 19%|█▊ | 1864/10000 [2:54:55<12:27:28, 5.51s/it][2025-06-19 16:24:39,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 16:24:39,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.04 | bwd_microstep: 3331.07 | bwd_inner_microstep: 3330.19 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.92 [2025-06-19 16:24:39,795] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2115.04 | bwd: 3331.08 | bwd_inner: 3330.19 | bwd_allreduce: 0.84 | step: 6.93 19%|█▊ | 1865/10000 [2:55:00<12:26:27, 5.51s/it] {'loss': 0.0612, 'grad_norm': 0.37905991077423096, 'learning_rate': 3.748541103625228e-05, 'epoch': 1.86} 19%|█▊ | 1865/10000 [2:55:00<12:26:27, 5.51s/it][2025-06-19 16:24:45,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:24:45,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.62 | bwd_microstep: 3337.33 | bwd_inner_microstep: 3336.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.12 [2025-06-19 16:24:45,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.62 | bwd: 3337.34 | bwd_inner: 3336.53 | bwd_allreduce: 0.77 | step: 7.13 19%|█▊ | 1866/10000 [2:55:06<12:26:12, 5.50s/it] {'loss': 0.0639, 'grad_norm': 0.41431742906570435, 'learning_rate': 3.748226568309396e-05, 'epoch': 1.87} 19%|█▊ | 1866/10000 [2:55:06<12:26:12, 5.50s/it][2025-06-19 16:24:50,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:24:50,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.11 | bwd_microstep: 3328.87 | bwd_inner_microstep: 3328.05 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.73 [2025-06-19 16:24:50,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.11 | bwd: 3328.88 | bwd_inner: 3328.05 | bwd_allreduce: 0.78 | step: 6.73 19%|█▊ | 1867/10000 [2:55:11<12:24:59, 5.50s/it] {'loss': 0.0613, 'grad_norm': 0.4232182204723358, 'learning_rate': 3.747911849612711e-05, 'epoch': 1.87} 19%|█▊ | 1867/10000 [2:55:11<12:24:59, 5.50s/it][2025-06-19 16:24:56,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:24:56,363] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2139.51 | bwd_microstep: 3410.70 | bwd_inner_microstep: 3409.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.35 [2025-06-19 16:24:56,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.51 | bwd: 3410.71 | bwd_inner: 3409.90 | bwd_allreduce: 0.77 | step: 7.36 19%|█▊ | 1868/10000 [2:55:17<12:28:44, 5.52s/it] {'loss': 0.1321, 'grad_norm': 0.8377297520637512, 'learning_rate': 3.747596947568184e-05, 'epoch': 1.87} 19%|█▊ | 1868/10000 [2:55:17<12:28:44, 5.52s/it][2025-06-19 16:25:01,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:25:01,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.04 | bwd_microstep: 3331.50 | bwd_inner_microstep: 3330.68 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.86 [2025-06-19 16:25:01,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.04 | bwd: 3331.52 | bwd_inner: 3330.68 | bwd_allreduce: 0.79 | step: 6.86 19%|█▊ | 1869/10000 [2:55:22<12:27:01, 5.51s/it] {'loss': 0.094, 'grad_norm': 0.4522547721862793, 'learning_rate': 3.747281862208849e-05, 'epoch': 1.87} 19%|█▊ | 1869/10000 [2:55:22<12:27:01, 5.51s/it][2025-06-19 16:25:07,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:25:07,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.19 | bwd_microstep: 3335.52 | bwd_inner_microstep: 3334.72 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.70 [2025-06-19 16:25:07,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.19 | bwd: 3335.54 | bwd_inner: 3334.72 | bwd_allreduce: 0.78 | step: 6.71 19%|█▊ | 1870/10000 [2:55:28<12:25:58, 5.51s/it] {'loss': 0.1003, 'grad_norm': 0.5171376466751099, 'learning_rate': 3.746966593567755e-05, 'epoch': 1.87} 19%|█▊ | 1870/10000 [2:55:28<12:25:58, 5.51s/it][2025-06-19 
16:25:12,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:25:12,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.45 | bwd_microstep: 3340.46 | bwd_inner_microstep: 3339.54 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.01 [2025-06-19 16:25:12,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.45 | bwd: 3340.47 | bwd_inner: 3339.54 | bwd_allreduce: 0.89 | step: 7.02 19%|█▊ | 1871/10000 [2:55:33<12:25:42, 5.50s/it] {'loss': 0.1114, 'grad_norm': 0.729476273059845, 'learning_rate': 3.7466511416779744e-05, 'epoch': 1.87} 19%|█▊ | 1871/10000 [2:55:33<12:25:42, 5.50s/it][2025-06-19 16:25:18,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:25:18,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.49 | bwd_microstep: 3411.32 | bwd_inner_microstep: 3410.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 16:25:18,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.49 | bwd: 3411.34 | bwd_inner: 3410.52 | bwd_allreduce: 0.77 | step: 7.03 19%|█▊ | 1872/10000 [2:55:39<12:29:30, 5.53s/it] {'loss': 0.0911, 'grad_norm': 0.41314756870269775, 'learning_rate': 3.746335506572595e-05, 'epoch': 1.87} 19%|█▊ | 1872/10000 [2:55:39<12:29:30, 5.53s/it][2025-06-19 16:25:23,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:25:23,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.71 | bwd_microstep: 3322.90 | bwd_inner_microstep: 3321.96 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.28 [2025-06-19 16:25:23,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.71 | bwd: 3322.91 | bwd_inner: 3321.96 | bwd_allreduce: 0.91 | step: 7.28 
19%|█▊ | 1873/10000 [2:55:44<12:26:45, 5.51s/it] {'loss': 0.1326, 'grad_norm': 0.4993039071559906, 'learning_rate': 3.746019688284726e-05, 'epoch': 1.87} 19%|█▊ | 1873/10000 [2:55:44<12:26:45, 5.51s/it][2025-06-19 16:25:29,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 16:25:29,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.54 | bwd_microstep: 3405.43 | bwd_inner_microstep: 3404.58 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.92 [2025-06-19 16:25:29,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.54 | bwd: 3405.45 | bwd_inner: 3404.58 | bwd_allreduce: 0.82 | step: 6.92 19%|█▊ | 1874/10000 [2:55:50<12:29:34, 5.53s/it] {'loss': 0.139, 'grad_norm': 0.7472573518753052, 'learning_rate': 3.745703686847495e-05, 'epoch': 1.87} 19%|█▊ | 1874/10000 [2:55:50<12:29:34, 5.53s/it][2025-06-19 16:25:34,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:25:34,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.51 | bwd_microstep: 3338.24 | bwd_inner_microstep: 3337.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.32 [2025-06-19 16:25:34,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.51 | bwd: 3338.26 | bwd_inner: 3337.43 | bwd_allreduce: 0.78 | step: 7.32 19%|█▉ | 1875/10000 [2:55:55<12:27:32, 5.52s/it] {'loss': 0.1274, 'grad_norm': 0.5436829924583435, 'learning_rate': 3.7453875022940494e-05, 'epoch': 1.88} 19%|█▉ | 1875/10000 [2:55:55<12:27:32, 5.52s/it][2025-06-19 16:25:40,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:25:40,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.23 | bwd_microstep: 3371.82 | bwd_inner_microstep: 3371.03 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 16:25:40,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.23 | bwd: 3371.84 | bwd_inner: 3371.03 | bwd_allreduce: 0.76 | step: 6.63 19%|█▉ | 1876/10000 [2:56:01<12:28:15, 5.53s/it] {'loss': 0.129, 'grad_norm': 0.6550260186195374, 'learning_rate': 3.745071134657556e-05, 'epoch': 1.88} 19%|█▉ | 1876/10000 [2:56:01<12:28:15, 5.53s/it][2025-06-19 16:25:45,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:25:45,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.16 | bwd_microstep: 3328.37 | bwd_inner_microstep: 3327.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-19 16:25:45,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.16 | bwd: 3328.39 | bwd_inner: 3327.57 | bwd_allreduce: 0.78 | step: 7.27 19%|█▉ | 1877/10000 [2:56:06<12:26:03, 5.51s/it] {'loss': 0.1779, 'grad_norm': 1.234065294265747, 'learning_rate': 3.7447545839711994e-05, 'epoch': 1.88} 19%|█▉ | 1877/10000 [2:56:06<12:26:03, 5.51s/it][2025-06-19 16:25:51,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:25:51,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.45 | bwd_microstep: 3325.62 | bwd_inner_microstep: 3324.68 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.79 [2025-06-19 16:25:51,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.45 | bwd: 3325.63 | bwd_inner: 3324.68 | bwd_allreduce: 0.91 | step: 6.80 19%|█▉ | 1878/10000 [2:56:12<12:24:16, 5.50s/it] {'loss': 0.1398, 'grad_norm': 0.6019582152366638, 'learning_rate': 3.744437850268184e-05, 'epoch': 1.88} 19%|█▉ | 1878/10000 [2:56:12<12:24:16, 5.50s/it][2025-06-19 16:25:56,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:25:56,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.06 | bwd_microstep: 3325.84 | bwd_inner_microstep: 3324.91 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.22 [2025-06-19 16:25:56,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.06 | bwd: 3325.85 | bwd_inner: 3324.92 | bwd_allreduce: 0.89 | step: 7.22 19%|█▉ | 1879/10000 [2:56:17<12:23:02, 5.49s/it] {'loss': 0.1375, 'grad_norm': 0.678249180316925, 'learning_rate': 3.7441209335817347e-05, 'epoch': 1.88} 19%|█▉ | 1879/10000 [2:56:17<12:23:02, 5.49s/it][2025-06-19 16:26:02,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:26:02,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.60 | bwd_microstep: 3369.63 | bwd_inner_microstep: 3368.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 16:26:02,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.60 | bwd: 3369.64 | bwd_inner: 3368.84 | bwd_allreduce: 0.76 | step: 6.66 19%|█▉ | 1880/10000 [2:56:23<12:25:07, 5.51s/it] {'loss': 0.078, 'grad_norm': 0.43725964426994324, 'learning_rate': 3.743803833945094e-05, 'epoch': 1.88} 19%|█▉ | 1880/10000 [2:56:23<12:25:07, 5.51s/it][2025-06-19 16:26:07,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:26:07,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.07 | bwd_microstep: 3327.05 | bwd_inner_microstep: 3326.22 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.14 [2025-06-19 16:26:07,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.07 | bwd: 3327.06 | bwd_inner: 3326.22 | bwd_allreduce: 0.80 | step: 7.14 19%|█▉ | 1881/10000 [2:56:28<12:23:49, 5.50s/it] {'loss': 0.1023, 'grad_norm': 0.6786683201789856, 
'learning_rate': 3.743486551391525e-05, 'epoch': 1.88} 19%|█▉ | 1881/10000 [2:56:28<12:23:49, 5.50s/it][2025-06-19 16:26:13,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:26:13,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.38 | bwd_microstep: 3382.24 | bwd_inner_microstep: 3381.42 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 16:26:13,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.38 | bwd: 3382.25 | bwd_inner: 3381.42 | bwd_allreduce: 0.79 | step: 7.29 19%|█▉ | 1882/10000 [2:56:34<12:26:13, 5.52s/it] {'loss': 0.0909, 'grad_norm': 0.572394609451294, 'learning_rate': 3.743169085954308e-05, 'epoch': 1.88} 19%|█▉ | 1882/10000 [2:56:34<12:26:13, 5.52s/it][2025-06-19 16:26:19,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:26:19,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.81 | bwd_microstep: 3376.57 | bwd_inner_microstep: 3375.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-19 16:26:19,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.81 | bwd: 3376.59 | bwd_inner: 3375.77 | bwd_allreduce: 0.77 | step: 6.81 19%|█▉ | 1883/10000 [2:56:39<12:27:33, 5.53s/it] {'loss': 0.1192, 'grad_norm': 0.5722706913948059, 'learning_rate': 3.742851437666744e-05, 'epoch': 1.88} 19%|█▉ | 1883/10000 [2:56:39<12:27:33, 5.53s/it][2025-06-19 16:26:24,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:26:24,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.71 | bwd_microstep: 3373.59 | bwd_inner_microstep: 3372.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.32 [2025-06-19 16:26:24,600] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.71 | bwd: 3373.61 | bwd_inner: 3372.78 | bwd_allreduce: 0.78 | step: 7.32 19%|█▉ | 1884/10000 [2:56:45<12:28:03, 5.53s/it] {'loss': 0.1059, 'grad_norm': 0.549124002456665, 'learning_rate': 3.7425336065621534e-05, 'epoch': 1.88} 19%|█▉ | 1884/10000 [2:56:45<12:28:03, 5.53s/it][2025-06-19 16:26:30,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.76 [2025-06-19 16:26:30,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.42 | bwd_microstep: 3375.71 | bwd_inner_microstep: 3374.85 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.89 [2025-06-19 16:26:30,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.42 | bwd: 3375.73 | bwd_inner: 3374.85 | bwd_allreduce: 0.83 | step: 6.89 19%|█▉ | 1885/10000 [2:56:50<12:29:04, 5.54s/it] {'loss': 0.0967, 'grad_norm': 0.5197634100914001, 'learning_rate': 3.7422155926738754e-05, 'epoch': 1.89} 19%|█▉ | 1885/10000 [2:56:50<12:29:04, 5.54s/it][2025-06-19 16:26:35,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:26:35,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.51 | bwd_microstep: 3370.55 | bwd_inner_microstep: 3369.73 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 16:26:35,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.51 | bwd: 3370.56 | bwd_inner: 3369.73 | bwd_allreduce: 0.79 | step: 7.30 19%|█▉ | 1886/10000 [2:56:56<12:29:16, 5.54s/it] {'loss': 0.1353, 'grad_norm': 0.8990390300750732, 'learning_rate': 3.741897396035266e-05, 'epoch': 1.89} 19%|█▉ | 1886/10000 [2:56:56<12:29:16, 5.54s/it][2025-06-19 16:26:41,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:26:41,165] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.04 | bwd_microstep: 3316.09 | bwd_inner_microstep: 3315.26 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.24 [2025-06-19 16:26:41,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.04 | bwd: 3316.11 | bwd_inner: 3315.26 | bwd_allreduce: 0.80 | step: 7.24 19%|█▉ | 1887/10000 [2:57:01<12:26:02, 5.52s/it] {'loss': 0.0922, 'grad_norm': 0.5709570050239563, 'learning_rate': 3.741579016679706e-05, 'epoch': 1.89} 19%|█▉ | 1887/10000 [2:57:01<12:26:02, 5.52s/it][2025-06-19 16:26:46,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:26:46,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.08 | bwd_microstep: 3333.92 | bwd_inner_microstep: 3332.94 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.53 [2025-06-19 16:26:46,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.08 | bwd: 3333.94 | bwd_inner: 3332.94 | bwd_allreduce: 0.95 | step: 7.53 19%|█▉ | 1888/10000 [2:57:07<12:24:37, 5.51s/it] {'loss': 0.0768, 'grad_norm': 0.7017905116081238, 'learning_rate': 3.741260454640588e-05, 'epoch': 1.89} 19%|█▉ | 1888/10000 [2:57:07<12:24:37, 5.51s/it][2025-06-19 16:26:52,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.75 | optimizer_step: 2.72 [2025-06-19 16:26:52,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.93 | bwd_microstep: 3368.30 | bwd_inner_microstep: 3367.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.06 [2025-06-19 16:26:52,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.93 | bwd: 3368.31 | bwd_inner: 3367.51 | bwd_allreduce: 0.76 | step: 7.06 19%|█▉ | 1889/10000 [2:57:12<12:26:01, 5.52s/it] {'loss': 0.1557, 'grad_norm': 0.6811864972114563, 'learning_rate': 3.7409417099513315e-05, 'epoch': 1.89} 19%|█▉ | 1889/10000 
[2:57:12<12:26:01, 5.52s/it][2025-06-19 16:26:57,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:26:57,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.56 | bwd_microstep: 3368.09 | bwd_inner_microstep: 3367.15 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.63 [2025-06-19 16:26:57,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.56 | bwd: 3368.11 | bwd_inner: 3367.15 | bwd_allreduce: 0.91 | step: 7.63 19%|█▉ | 1890/10000 [2:57:18<12:26:46, 5.52s/it] {'loss': 0.0953, 'grad_norm': 0.633519172668457, 'learning_rate': 3.7406227826453694e-05, 'epoch': 1.89} 19%|█▉ | 1890/10000 [2:57:18<12:26:46, 5.52s/it][2025-06-19 16:27:03,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:27:03,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.71 | bwd_microstep: 3370.88 | bwd_inner_microstep: 3370.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 16:27:03,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.71 | bwd: 3370.90 | bwd_inner: 3370.10 | bwd_allreduce: 0.76 | step: 6.77 19%|█▉ | 1891/10000 [2:57:24<12:27:27, 5.53s/it] {'loss': 0.159, 'grad_norm': 0.6206760406494141, 'learning_rate': 3.740303672756155e-05, 'epoch': 1.89} 19%|█▉ | 1891/10000 [2:57:24<12:27:27, 5.53s/it][2025-06-19 16:27:08,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:27:08,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.30 | bwd_microstep: 3369.09 | bwd_inner_microstep: 3368.27 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.21 [2025-06-19 16:27:08,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.30 | bwd: 3369.10 | bwd_inner: 
3368.27 | bwd_allreduce: 0.79 | step: 7.22 19%|█▉ | 1892/10000 [2:57:29<12:27:42, 5.53s/it] {'loss': 0.0978, 'grad_norm': 0.5206812620162964, 'learning_rate': 3.739984380317163e-05, 'epoch': 1.89} 19%|█▉ | 1892/10000 [2:57:29<12:27:42, 5.53s/it][2025-06-19 16:27:14,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:27:14,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.96 | bwd_microstep: 3320.79 | bwd_inner_microstep: 3319.93 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.92 [2025-06-19 16:27:14,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.96 | bwd: 3320.81 | bwd_inner: 3319.93 | bwd_allreduce: 0.82 | step: 6.93 19%|█▉ | 1893/10000 [2:57:35<12:24:57, 5.51s/it] {'loss': 0.0674, 'grad_norm': 0.40028268098831177, 'learning_rate': 3.739664905361885e-05, 'epoch': 1.89} 19%|█▉ | 1893/10000 [2:57:35<12:24:57, 5.51s/it][2025-06-19 16:27:19,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:27:19,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.67 | bwd_microstep: 3335.27 | bwd_inner_microstep: 3334.41 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.87 [2025-06-19 16:27:19,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.67 | bwd: 3335.28 | bwd_inner: 3334.41 | bwd_allreduce: 0.83 | step: 6.88 19%|█▉ | 1894/10000 [2:57:40<12:23:58, 5.51s/it] {'loss': 0.0839, 'grad_norm': 0.6786006689071655, 'learning_rate': 3.739345247923832e-05, 'epoch': 1.89} 19%|█▉ | 1894/10000 [2:57:40<12:23:58, 5.51s/it][2025-06-19 16:27:25,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:27:25,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.62 | bwd_microstep: 
3317.24 | bwd_inner_microstep: 3316.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 16:27:25,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.62 | bwd: 3317.26 | bwd_inner: 3316.45 | bwd_allreduce: 0.76 | step: 6.72 19%|█▉ | 1895/10000 [2:57:46<12:21:58, 5.49s/it] {'loss': 0.1028, 'grad_norm': 0.754790186882019, 'learning_rate': 3.7390254080365355e-05, 'epoch': 1.9} 19%|█▉ | 1895/10000 [2:57:46<12:21:58, 5.49s/it][2025-06-19 16:27:30,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:27:30,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.17 | bwd_microstep: 3316.28 | bwd_inner_microstep: 3315.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 16:27:30,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.17 | bwd: 3316.29 | bwd_inner: 3315.48 | bwd_allreduce: 0.77 | step: 6.94 19%|█▉ | 1896/10000 [2:57:51<12:21:03, 5.49s/it] {'loss': 0.0795, 'grad_norm': 0.440253883600235, 'learning_rate': 3.738705385733545e-05, 'epoch': 1.9} 19%|█▉ | 1896/10000 [2:57:51<12:21:03, 5.49s/it][2025-06-19 16:27:36,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:27:36,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.70 | bwd_microstep: 3320.45 | bwd_inner_microstep: 3319.56 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.20 [2025-06-19 16:27:36,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.70 | bwd: 3320.46 | bwd_inner: 3319.56 | bwd_allreduce: 0.86 | step: 7.20 19%|█▉ | 1897/10000 [2:57:56<12:20:09, 5.48s/it] {'loss': 0.0835, 'grad_norm': 0.4383835792541504, 'learning_rate': 3.738385181048428e-05, 'epoch': 1.9} 19%|█▉ | 1897/10000 [2:57:56<12:20:09, 5.48s/it][2025-06-19 16:27:41,667] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:27:41,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.70 | bwd_microstep: 3332.71 | bwd_inner_microstep: 3331.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 16:27:41,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.70 | bwd: 3332.72 | bwd_inner: 3331.90 | bwd_allreduce: 0.78 | step: 7.10 19%|█▉ | 1898/10000 [2:58:02<12:20:33, 5.48s/it] {'loss': 0.0841, 'grad_norm': 0.6400002837181091, 'learning_rate': 3.738064794014775e-05, 'epoch': 1.9} 19%|█▉ | 1898/10000 [2:58:02<12:20:33, 5.48s/it][2025-06-19 16:27:47,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:27:47,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.96 | bwd_microstep: 3306.82 | bwd_inner_microstep: 3306.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 16:27:47,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.96 | bwd: 3306.83 | bwd_inner: 3306.03 | bwd_allreduce: 0.76 | step: 6.72 19%|█▉ | 1899/10000 [2:58:07<12:19:04, 5.47s/it] {'loss': 0.1383, 'grad_norm': 0.7561997771263123, 'learning_rate': 3.7377442246661904e-05, 'epoch': 1.9} 19%|█▉ | 1899/10000 [2:58:07<12:19:04, 5.47s/it][2025-06-19 16:27:52,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:27:52,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.02 | bwd_microstep: 3315.40 | bwd_inner_microstep: 3314.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 16:27:52,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.02 | bwd: 3315.41 | bwd_inner: 3314.61 | bwd_allreduce: 0.76 | step: 6.69 19%|█▉ | 1900/10000 [2:58:13<12:18:13, 5.47s/it] {'loss': 
0.1382, 'grad_norm': 0.6437743306159973, 'learning_rate': 3.737423473036303e-05, 'epoch': 1.9} 19%|█▉ | 1900/10000 [2:58:13<12:18:13, 5.47s/it][2025-06-19 16:27:58,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:27:58,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.74 | bwd_microstep: 3364.60 | bwd_inner_microstep: 3363.49 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.28 [2025-06-19 16:27:58,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.74 | bwd: 3364.63 | bwd_inner: 3363.49 | bwd_allreduce: 1.07 | step: 7.27 19%|█▉ | 1901/10000 [2:58:18<12:20:35, 5.49s/it] {'loss': 0.1399, 'grad_norm': 0.7925355434417725, 'learning_rate': 3.737102539158755e-05, 'epoch': 1.9} 19%|█▉ | 1901/10000 [2:58:18<12:20:35, 5.49s/it][2025-06-19 16:28:03,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:28:03,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.71 | bwd_microstep: 3368.94 | bwd_inner_microstep: 3368.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 16:28:03,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.71 | bwd: 3368.95 | bwd_inner: 3368.13 | bwd_allreduce: 0.78 | step: 7.14 19%|█▉ | 1902/10000 [2:58:24<12:22:57, 5.50s/it] {'loss': 0.1344, 'grad_norm': 0.6384372711181641, 'learning_rate': 3.736781423067215e-05, 'epoch': 1.9} 19%|█▉ | 1902/10000 [2:58:24<12:22:57, 5.50s/it][2025-06-19 16:28:09,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:28:09,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.39 | bwd_microstep: 3309.59 | bwd_inner_microstep: 3308.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 
16:28:09,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.39 | bwd: 3309.61 | bwd_inner: 3308.81 | bwd_allreduce: 0.76 | step: 6.67 19%|█▉ | 1903/10000 [2:58:29<12:20:51, 5.49s/it] {'loss': 0.0894, 'grad_norm': 0.49972933530807495, 'learning_rate': 3.736460124795363e-05, 'epoch': 1.9} 19%|█▉ | 1903/10000 [2:58:29<12:20:51, 5.49s/it][2025-06-19 16:28:14,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:28:14,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.81 | bwd_microstep: 3366.59 | bwd_inner_microstep: 3365.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 16:28:14,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.81 | bwd: 3366.61 | bwd_inner: 3365.81 | bwd_allreduce: 0.76 | step: 6.60 19%|█▉ | 1904/10000 [2:58:35<12:22:19, 5.50s/it] {'loss': 0.1421, 'grad_norm': 0.8446633219718933, 'learning_rate': 3.736138644376904e-05, 'epoch': 1.9} 19%|█▉ | 1904/10000 [2:58:35<12:22:19, 5.50s/it][2025-06-19 16:28:20,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:28:20,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.39 | bwd_microstep: 3327.90 | bwd_inner_microstep: 3326.86 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.11 [2025-06-19 16:28:20,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.39 | bwd: 3327.92 | bwd_inner: 3326.86 | bwd_allreduce: 1.01 | step: 7.12 19%|█▉ | 1905/10000 [2:58:40<12:21:03, 5.49s/it] {'loss': 0.1033, 'grad_norm': 1.000962734222412, 'learning_rate': 3.735816981845558e-05, 'epoch': 1.91} 19%|█▉ | 1905/10000 [2:58:40<12:21:03, 5.49s/it][2025-06-19 16:28:25,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
16:28:25,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.96 | bwd_microstep: 3369.25 | bwd_inner_microstep: 3368.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.73 [2025-06-19 16:28:25,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.96 | bwd: 3369.26 | bwd_inner: 3368.44 | bwd_allreduce: 0.78 | step: 6.73 19%|█▉ | 1906/10000 [2:58:46<12:22:47, 5.51s/it] {'loss': 0.1352, 'grad_norm': 0.8241459131240845, 'learning_rate': 3.735495137235067e-05, 'epoch': 1.91} 19%|█▉ | 1906/10000 [2:58:46<12:22:47, 5.51s/it][2025-06-19 16:28:31,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:28:31,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.00 | bwd_microstep: 3377.89 | bwd_inner_microstep: 3377.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 16:28:31,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.00 | bwd: 3377.90 | bwd_inner: 3377.08 | bwd_allreduce: 0.78 | step: 6.87 19%|█▉ | 1907/10000 [2:58:51<12:24:19, 5.52s/it] {'loss': 0.0914, 'grad_norm': 0.4726261794567108, 'learning_rate': 3.735173110579191e-05, 'epoch': 1.91} 19%|█▉ | 1907/10000 [2:58:51<12:24:19, 5.52s/it][2025-06-19 16:28:36,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:28:36,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.49 | bwd_microstep: 3367.08 | bwd_inner_microstep: 3366.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 16:28:36,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.49 | bwd: 3367.09 | bwd_inner: 3366.28 | bwd_allreduce: 0.77 | step: 6.96 19%|█▉ | 1908/10000 [2:58:57<12:24:50, 5.52s/it] {'loss': 0.0995, 'grad_norm': 0.6272489428520203, 'learning_rate': 3.734850901911709e-05, 'epoch': 1.91} 
19%|█▉ | 1908/10000 [2:58:57<12:24:50, 5.52s/it][2025-06-19 16:28:42,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:28:42,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.74 | bwd_microstep: 3318.73 | bwd_inner_microstep: 3317.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 16:28:42,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.74 | bwd: 3318.75 | bwd_inner: 3317.94 | bwd_allreduce: 0.76 | step: 6.67 19%|█▉ | 1909/10000 [2:59:02<12:22:17, 5.50s/it] {'loss': 0.1142, 'grad_norm': 0.6464602947235107, 'learning_rate': 3.7345285112664186e-05, 'epoch': 1.91} 19%|█▉ | 1909/10000 [2:59:02<12:22:17, 5.50s/it][2025-06-19 16:28:47,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:28:47,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.50 | bwd_microstep: 3315.94 | bwd_inner_microstep: 3315.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 16:28:47,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.50 | bwd: 3315.95 | bwd_inner: 3315.13 | bwd_allreduce: 0.77 | step: 6.99 19%|█▉ | 1910/10000 [2:59:08<12:20:16, 5.49s/it] {'loss': 0.0921, 'grad_norm': 0.5360292792320251, 'learning_rate': 3.734205938677138e-05, 'epoch': 1.91} 19%|█▉ | 1910/10000 [2:59:08<12:20:16, 5.49s/it][2025-06-19 16:28:53,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:28:53,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.32 | bwd_microstep: 3357.41 | bwd_inner_microstep: 3356.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-19 16:28:53,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.32 | bwd: 
3357.42 | bwd_inner: 3356.60 | bwd_allreduce: 0.78 | step: 6.98 19%|█▉ | 1911/10000 [2:59:13<12:21:34, 5.50s/it] {'loss': 0.0908, 'grad_norm': 0.4872797429561615, 'learning_rate': 3.7338831841777034e-05, 'epoch': 1.91} 19%|█▉ | 1911/10000 [2:59:13<12:21:34, 5.50s/it][2025-06-19 16:28:58,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:28:58,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.26 | bwd_microstep: 3314.17 | bwd_inner_microstep: 3312.95 | bwd_allreduce_microstep: 1.16 | step_microstep: 7.27 [2025-06-19 16:28:58,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.26 | bwd: 3314.18 | bwd_inner: 3312.95 | bwd_allreduce: 1.18 | step: 7.27 19%|█▉ | 1912/10000 [2:59:19<12:19:26, 5.49s/it] {'loss': 0.0899, 'grad_norm': 0.5174448490142822, 'learning_rate': 3.733560247801969e-05, 'epoch': 1.91} 19%|█▉ | 1912/10000 [2:59:19<12:19:26, 5.49s/it][2025-06-19 16:29:04,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 16:29:04,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.26 | bwd_microstep: 3375.38 | bwd_inner_microstep: 3374.29 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.49 [2025-06-19 16:29:04,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.26 | bwd: 3375.41 | bwd_inner: 3374.29 | bwd_allreduce: 1.04 | step: 7.49 19%|█▉ | 1913/10000 [2:59:24<12:21:55, 5.50s/it] {'loss': 0.1148, 'grad_norm': 0.7592204213142395, 'learning_rate': 3.733237129583812e-05, 'epoch': 1.91} 19%|█▉ | 1913/10000 [2:59:24<12:21:55, 5.50s/it][2025-06-19 16:29:09,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:29:09,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2136.86 | bwd_microstep: 3404.05 | bwd_inner_microstep: 3403.17 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.17 [2025-06-19 16:29:09,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.87 | bwd: 3404.07 | bwd_inner: 3403.17 | bwd_allreduce: 0.85 | step: 7.18 19%|█▉ | 1914/10000 [2:59:30<12:25:02, 5.53s/it] {'loss': 0.0713, 'grad_norm': 0.3392445147037506, 'learning_rate': 3.732913829557123e-05, 'epoch': 1.91} 19%|█▉ | 1914/10000 [2:59:30<12:25:02, 5.53s/it][2025-06-19 16:29:15,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:29:15,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.34 | bwd_microstep: 3328.12 | bwd_inner_microstep: 3327.30 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.03 [2025-06-19 16:29:15,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.34 | bwd: 3328.13 | bwd_inner: 3327.30 | bwd_allreduce: 0.79 | step: 7.04 19%|█▉ | 1915/10000 [2:59:36<12:22:52, 5.51s/it] {'loss': 0.1137, 'grad_norm': 0.818449854850769, 'learning_rate': 3.7325903477558154e-05, 'epoch': 1.92} 19%|█▉ | 1915/10000 [2:59:36<12:22:52, 5.51s/it][2025-06-19 16:29:20,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:29:20,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.69 | bwd_microstep: 3368.54 | bwd_inner_microstep: 3367.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 16:29:20,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.69 | bwd: 3368.55 | bwd_inner: 3367.74 | bwd_allreduce: 0.76 | step: 6.65 19%|█▉ | 1916/10000 [2:59:41<12:23:24, 5.52s/it] {'loss': 0.1184, 'grad_norm': 0.7835032939910889, 'learning_rate': 3.732266684213822e-05, 'epoch': 1.92} 19%|█▉ | 1916/10000 [2:59:41<12:23:24, 5.52s/it][2025-06-19 16:29:26,292] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:29:26,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.37 | bwd_microstep: 3372.01 | bwd_inner_microstep: 3371.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 16:29:26,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.37 | bwd: 3372.03 | bwd_inner: 3371.22 | bwd_allreduce: 0.76 | step: 6.76 19%|█▉ | 1917/10000 [2:59:47<12:24:02, 5.52s/it] {'loss': 0.1256, 'grad_norm': 0.8892363905906677, 'learning_rate': 3.731942838965094e-05, 'epoch': 1.92} 19%|█▉ | 1917/10000 [2:59:47<12:24:02, 5.52s/it][2025-06-19 16:29:31,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:29:31,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.52 | bwd_microstep: 3312.47 | bwd_inner_microstep: 3311.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 16:29:31,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.52 | bwd: 3312.49 | bwd_inner: 3311.69 | bwd_allreduce: 0.76 | step: 6.67 19%|█▉ | 1918/10000 [2:59:52<12:21:04, 5.50s/it] {'loss': 0.1142, 'grad_norm': 0.4907287657260895, 'learning_rate': 3.7316188120436e-05, 'epoch': 1.92} 19%|█▉ | 1918/10000 [2:59:52<12:21:04, 5.50s/it][2025-06-19 16:29:37,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:29:37,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.37 | bwd_microstep: 3321.87 | bwd_inner_microstep: 3320.89 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.33 [2025-06-19 16:29:37,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.37 | bwd: 3321.88 | bwd_inner: 3320.89 | bwd_allreduce: 0.95 | step: 7.33 19%|█▉ | 1919/10000 
[2:59:57<12:19:09, 5.49s/it] {'loss': 0.1079, 'grad_norm': 0.7902462482452393, 'learning_rate': 3.731294603483329e-05, 'epoch': 1.92} 19%|█▉ | 1919/10000 [2:59:58<12:19:09, 5.49s/it][2025-06-19 16:29:42,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:29:42,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.63 | bwd_microstep: 3322.66 | bwd_inner_microstep: 3321.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 16:29:42,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.63 | bwd: 3322.67 | bwd_inner: 3321.86 | bwd_allreduce: 0.77 | step: 6.84 19%|█▉ | 1920/10000 [3:00:03<12:18:17, 5.48s/it] {'loss': 0.1752, 'grad_norm': 0.8712618947029114, 'learning_rate': 3.73097021331829e-05, 'epoch': 1.92} 19%|█▉ | 1920/10000 [3:00:03<12:18:17, 5.48s/it][2025-06-19 16:29:48,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:29:48,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.71 | bwd_microstep: 3315.25 | bwd_inner_microstep: 3314.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 16:29:48,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.71 | bwd: 3315.26 | bwd_inner: 3314.45 | bwd_allreduce: 0.76 | step: 6.89 19%|█▉ | 1921/10000 [3:00:08<12:17:15, 5.48s/it] {'loss': 0.0856, 'grad_norm': 0.38507986068725586, 'learning_rate': 3.730645641582508e-05, 'epoch': 1.92} 19%|█▉ | 1921/10000 [3:00:08<12:17:15, 5.48s/it][2025-06-19 16:29:53,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:29:53,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.89 | bwd_microstep: 3394.69 | bwd_inner_microstep: 3393.90 | bwd_allreduce_microstep: 
0.74 | step_microstep: 6.64 [2025-06-19 16:29:53,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.89 | bwd: 3394.70 | bwd_inner: 3393.90 | bwd_allreduce: 0.76 | step: 6.64 19%|█▉ | 1922/10000 [3:00:14<12:20:51, 5.50s/it] {'loss': 0.0706, 'grad_norm': 0.6776022911071777, 'learning_rate': 3.730320888310031e-05, 'epoch': 1.92} 19%|█▉ | 1922/10000 [3:00:14<12:20:51, 5.50s/it][2025-06-19 16:29:59,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:29:59,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.12 | bwd_microstep: 3328.05 | bwd_inner_microstep: 3327.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 16:29:59,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.12 | bwd: 3328.07 | bwd_inner: 3327.27 | bwd_allreduce: 0.76 | step: 6.65 19%|█▉ | 1923/10000 [3:00:19<12:19:21, 5.49s/it] {'loss': 0.0814, 'grad_norm': 0.4248637855052948, 'learning_rate': 3.729995953534924e-05, 'epoch': 1.92} 19%|█▉ | 1923/10000 [3:00:19<12:19:21, 5.49s/it][2025-06-19 16:30:04,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:30:04,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.96 | bwd_microstep: 3369.50 | bwd_inner_microstep: 3368.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 16:30:04,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.96 | bwd: 3369.52 | bwd_inner: 3368.70 | bwd_allreduce: 0.77 | step: 7.22 19%|█▉ | 1924/10000 [3:00:25<12:21:15, 5.51s/it] {'loss': 0.1207, 'grad_norm': 0.8427236080169678, 'learning_rate': 3.72967083729127e-05, 'epoch': 1.92} 19%|█▉ | 1924/10000 [3:00:25<12:21:15, 5.51s/it][2025-06-19 16:30:10,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.50 | optimizer_step: 2.73 [2025-06-19 16:30:10,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.26 | bwd_microstep: 3364.21 | bwd_inner_microstep: 3363.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 16:30:10,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.26 | bwd: 3364.23 | bwd_inner: 3363.43 | bwd_allreduce: 0.76 | step: 6.62 19%|█▉ | 1925/10000 [3:00:31<12:22:10, 5.51s/it] {'loss': 0.1203, 'grad_norm': 0.5274036526679993, 'learning_rate': 3.729345539613173e-05, 'epoch': 1.93} 19%|█▉ | 1925/10000 [3:00:31<12:22:10, 5.51s/it][2025-06-19 16:30:15,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:30:15,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.41 | bwd_microstep: 3310.98 | bwd_inner_microstep: 3310.08 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.41 [2025-06-19 16:30:15,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.41 | bwd: 3311.00 | bwd_inner: 3310.08 | bwd_allreduce: 0.87 | step: 7.41 19%|█▉ | 1926/10000 [3:00:36<12:19:33, 5.50s/it] {'loss': 0.1925, 'grad_norm': 1.6113826036453247, 'learning_rate': 3.729020060534755e-05, 'epoch': 1.93} 19%|█▉ | 1926/10000 [3:00:36<12:19:33, 5.50s/it][2025-06-19 16:30:21,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:30:21,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.14 | bwd_microstep: 3365.12 | bwd_inner_microstep: 3364.12 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.46 [2025-06-19 16:30:21,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.14 | bwd: 3365.14 | bwd_inner: 3364.12 | bwd_allreduce: 0.97 | step: 7.47 19%|█▉ | 1927/10000 [3:00:42<12:21:01, 5.51s/it] {'loss': 0.1086, 'grad_norm': 0.6883846521377563, 'learning_rate': 
3.728694400090157e-05, 'epoch': 1.93} 19%|█▉ | 1927/10000 [3:00:42<12:21:01, 5.51s/it][2025-06-19 16:30:26,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:30:26,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.99 | bwd_microstep: 3371.05 | bwd_inner_microstep: 3370.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 16:30:26,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.99 | bwd: 3371.07 | bwd_inner: 3370.26 | bwd_allreduce: 0.76 | step: 6.69 19%|█▉ | 1928/10000 [3:00:47<12:22:27, 5.52s/it] {'loss': 0.1373, 'grad_norm': 1.0344579219818115, 'learning_rate': 3.72836855831354e-05, 'epoch': 1.93} 19%|█▉ | 1928/10000 [3:00:47<12:22:27, 5.52s/it][2025-06-19 16:30:32,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:30:32,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.51 | bwd_microstep: 3314.73 | bwd_inner_microstep: 3313.92 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.25 [2025-06-19 16:30:32,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.51 | bwd: 3314.75 | bwd_inner: 3313.92 | bwd_allreduce: 0.78 | step: 7.25 19%|█▉ | 1929/10000 [3:00:53<12:19:39, 5.50s/it] {'loss': 0.1002, 'grad_norm': 0.6208576560020447, 'learning_rate': 3.728042535239083e-05, 'epoch': 1.93} 19%|█▉ | 1929/10000 [3:00:53<12:19:39, 5.50s/it][2025-06-19 16:30:37,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:30:37,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.22 | bwd_microstep: 3368.50 | bwd_inner_microstep: 3367.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 16:30:37,749] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2124.22 | bwd: 3368.51 | bwd_inner: 3367.71 | bwd_allreduce: 0.76 | step: 6.63 19%|█▉ | 1930/10000 [3:00:58<12:20:46, 5.51s/it] {'loss': 0.0696, 'grad_norm': 0.31800517439842224, 'learning_rate': 3.727716330900984e-05, 'epoch': 1.93} 19%|█▉ | 1930/10000 [3:00:58<12:20:46, 5.51s/it][2025-06-19 16:30:43,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:30:43,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.71 | bwd_microstep: 3358.68 | bwd_inner_microstep: 3357.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 16:30:43,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.71 | bwd: 3358.70 | bwd_inner: 3357.89 | bwd_allreduce: 0.76 | step: 6.64 19%|█▉ | 1931/10000 [3:01:04<12:21:04, 5.51s/it] {'loss': 0.1063, 'grad_norm': 0.6642578840255737, 'learning_rate': 3.72738994533346e-05, 'epoch': 1.93} 19%|█▉ | 1931/10000 [3:01:04<12:21:04, 5.51s/it][2025-06-19 16:30:48,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:30:48,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.72 | bwd_microstep: 3308.17 | bwd_inner_microstep: 3307.05 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.40 [2025-06-19 16:30:48,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.72 | bwd: 3308.19 | bwd_inner: 3307.05 | bwd_allreduce: 1.08 | step: 7.40 19%|█▉ | 1932/10000 [3:01:09<12:18:36, 5.49s/it] {'loss': 0.1321, 'grad_norm': 0.4590264856815338, 'learning_rate': 3.727063378570748e-05, 'epoch': 1.93} 19%|█▉ | 1932/10000 [3:01:09<12:18:36, 5.49s/it][2025-06-19 16:30:54,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 16:30:54,163] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2100.36 | bwd_microstep: 3305.36 | bwd_inner_microstep: 3304.58 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.56 [2025-06-19 16:30:54,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.36 | bwd: 3305.37 | bwd_inner: 3304.58 | bwd_allreduce: 0.75 | step: 6.57 19%|█▉ | 1933/10000 [3:01:14<12:16:36, 5.48s/it] {'loss': 0.0908, 'grad_norm': 0.5031557679176331, 'learning_rate': 3.726736630647104e-05, 'epoch': 1.93} 19%|█▉ | 1933/10000 [3:01:14<12:16:36, 5.48s/it][2025-06-19 16:30:59,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 16:30:59,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.86 | bwd_microstep: 3310.46 | bwd_inner_microstep: 3309.44 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.79 [2025-06-19 16:30:59,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.86 | bwd: 3310.48 | bwd_inner: 3309.44 | bwd_allreduce: 0.99 | step: 7.79 19%|█▉ | 1934/10000 [3:01:20<12:15:45, 5.47s/it] {'loss': 0.105, 'grad_norm': 0.8666167855262756, 'learning_rate': 3.7264097015968e-05, 'epoch': 1.93} 19%|█▉ | 1934/10000 [3:01:20<12:15:45, 5.47s/it][2025-06-19 16:31:05,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:31:05,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.80 | bwd_microstep: 3311.73 | bwd_inner_microstep: 3310.79 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.06 [2025-06-19 16:31:05,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.80 | bwd: 3311.75 | bwd_inner: 3310.79 | bwd_allreduce: 0.92 | step: 7.06 19%|█▉ | 1935/10000 [3:01:25<12:15:01, 5.47s/it] {'loss': 0.0841, 'grad_norm': 0.6625946164131165, 'learning_rate': 3.726082591454132e-05, 'epoch': 1.94} 19%|█▉ | 1935/10000 [3:01:25<12:15:01, 5.47s/it][2025-06-19 
16:31:10,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:31:10,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.29 | bwd_microstep: 3354.18 | bwd_inner_microstep: 3353.07 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.28 [2025-06-19 16:31:10,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.29 | bwd: 3354.20 | bwd_inner: 3353.07 | bwd_allreduce: 1.07 | step: 7.27 19%|█▉ | 1936/10000 [3:01:31<12:17:12, 5.49s/it] {'loss': 0.1044, 'grad_norm': 0.5361285209655762, 'learning_rate': 3.7257553002534104e-05, 'epoch': 1.94} 19%|█▉ | 1936/10000 [3:01:31<12:17:12, 5.49s/it][2025-06-19 16:31:16,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:31:16,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.26 | bwd_microstep: 3309.60 | bwd_inner_microstep: 3308.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 16:31:16,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.26 | bwd: 3309.61 | bwd_inner: 3308.80 | bwd_allreduce: 0.77 | step: 6.69 19%|█▉ | 1937/10000 [3:01:36<12:15:48, 5.48s/it] {'loss': 0.1015, 'grad_norm': 0.7102497816085815, 'learning_rate': 3.725427828028968e-05, 'epoch': 1.94} 19%|█▉ | 1937/10000 [3:01:36<12:15:48, 5.48s/it][2025-06-19 16:31:21,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:31:21,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.61 | bwd_microstep: 3378.09 | bwd_inner_microstep: 3377.26 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.23 [2025-06-19 16:31:21,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.61 | bwd: 3378.10 | bwd_inner: 3377.26 | bwd_allreduce: 0.80 | step: 7.23 
19%|█▉ | 1938/10000 [3:01:42<12:18:34, 5.50s/it] {'loss': 0.1068, 'grad_norm': 0.5034220218658447, 'learning_rate': 3.725100174815154e-05, 'epoch': 1.94} 19%|█▉ | 1938/10000 [3:01:42<12:18:34, 5.50s/it][2025-06-19 16:31:27,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:31:27,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.49 | bwd_microstep: 3320.24 | bwd_inner_microstep: 3319.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 16:31:27,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.49 | bwd: 3320.25 | bwd_inner: 3319.45 | bwd_allreduce: 0.75 | step: 6.68 19%|█▉ | 1939/10000 [3:01:47<12:17:00, 5.49s/it] {'loss': 0.1246, 'grad_norm': 0.5582828521728516, 'learning_rate': 3.724772340646338e-05, 'epoch': 1.94} 19%|█▉ | 1939/10000 [3:01:47<12:17:00, 5.49s/it][2025-06-19 16:31:32,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:31:32,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.41 | bwd_microstep: 3320.65 | bwd_inner_microstep: 3319.67 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.23 [2025-06-19 16:31:32,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.41 | bwd: 3320.67 | bwd_inner: 3319.67 | bwd_allreduce: 0.94 | step: 7.23 19%|█▉ | 1940/10000 [3:01:53<12:15:53, 5.48s/it] {'loss': 0.0805, 'grad_norm': 0.3866191804409027, 'learning_rate': 3.724444325556908e-05, 'epoch': 1.94} 19%|█▉ | 1940/10000 [3:01:53<12:15:53, 5.48s/it][2025-06-19 16:31:38,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:31:38,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.41 | bwd_microstep: 3373.97 | bwd_inner_microstep: 3373.15 | 
bwd_allreduce_microstep: 0.77 | step_microstep: 7.04 [2025-06-19 16:31:38,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.41 | bwd: 3373.98 | bwd_inner: 3373.15 | bwd_allreduce: 0.78 | step: 7.04 19%|█▉ | 1941/10000 [3:01:58<12:18:30, 5.50s/it] {'loss': 0.072, 'grad_norm': 0.27579399943351746, 'learning_rate': 3.724116129581273e-05, 'epoch': 1.94} 19%|█▉ | 1941/10000 [3:01:58<12:18:30, 5.50s/it][2025-06-19 16:31:43,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:31:43,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.98 | bwd_microstep: 3323.16 | bwd_inner_microstep: 3322.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 16:31:43,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.98 | bwd: 3323.17 | bwd_inner: 3322.36 | bwd_allreduce: 0.77 | step: 6.80 19%|█▉ | 1942/10000 [3:02:04<12:17:15, 5.49s/it] {'loss': 0.1531, 'grad_norm': 0.9349495768547058, 'learning_rate': 3.7237877527538566e-05, 'epoch': 1.94} 19%|█▉ | 1942/10000 [3:02:04<12:17:15, 5.49s/it][2025-06-19 16:31:49,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:31:49,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.61 | bwd_microstep: 3322.81 | bwd_inner_microstep: 3321.99 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.02 [2025-06-19 16:31:49,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.61 | bwd: 3322.83 | bwd_inner: 3321.99 | bwd_allreduce: 0.79 | step: 7.03 19%|█▉ | 1943/10000 [3:02:09<12:16:13, 5.48s/it] {'loss': 0.26, 'grad_norm': 1.3901740312576294, 'learning_rate': 3.723459195109106e-05, 'epoch': 1.94} 19%|█▉ | 1943/10000 [3:02:09<12:16:13, 5.48s/it][2025-06-19 16:31:54,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:31:54,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.65 | bwd_microstep: 3318.37 | bwd_inner_microstep: 3317.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 16:31:54,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.65 | bwd: 3318.39 | bwd_inner: 3317.57 | bwd_allreduce: 0.78 | step: 7.00 19%|█▉ | 1944/10000 [3:02:15<12:15:06, 5.48s/it] {'loss': 0.076, 'grad_norm': 0.42458799481391907, 'learning_rate': 3.723130456681484e-05, 'epoch': 1.94} 19%|█▉ | 1944/10000 [3:02:15<12:15:06, 5.48s/it][2025-06-19 16:32:00,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:32:00,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.31 | bwd_microstep: 3373.64 | bwd_inner_microstep: 3372.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 16:32:00,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.31 | bwd: 3373.66 | bwd_inner: 3372.83 | bwd_allreduce: 0.78 | step: 7.18 19%|█▉ | 1945/10000 [3:02:20<12:17:46, 5.50s/it] {'loss': 0.0724, 'grad_norm': 0.5801758766174316, 'learning_rate': 3.722801537505475e-05, 'epoch': 1.94} 19%|█▉ | 1945/10000 [3:02:20<12:17:46, 5.50s/it][2025-06-19 16:32:05,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:32:05,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.27 | bwd_microstep: 3320.10 | bwd_inner_microstep: 3319.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 16:32:05,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.27 | bwd: 3320.11 | bwd_inner: 3319.31 | bwd_allreduce: 0.76 | step: 6.73 19%|█▉ | 1946/10000 [3:02:26<12:16:21, 5.49s/it] {'loss': 0.0531, 'grad_norm': 
0.39890944957733154, 'learning_rate': 3.72247243761558e-05, 'epoch': 1.95} 19%|█▉ | 1946/10000 [3:02:26<12:16:21, 5.49s/it][2025-06-19 16:32:10,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:32:10,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.94 | bwd_microstep: 3317.56 | bwd_inner_microstep: 3316.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 16:32:10,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.94 | bwd: 3317.58 | bwd_inner: 3316.78 | bwd_allreduce: 0.75 | step: 6.54 19%|█▉ | 1947/10000 [3:02:31<12:15:04, 5.48s/it] {'loss': 0.0988, 'grad_norm': 0.4416445791721344, 'learning_rate': 3.7221431570463206e-05, 'epoch': 1.95} 19%|█▉ | 1947/10000 [3:02:31<12:15:04, 5.48s/it][2025-06-19 16:32:16,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.73 [2025-06-19 16:32:16,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.53 | bwd_microstep: 3320.51 | bwd_inner_microstep: 3319.67 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.50 [2025-06-19 16:32:16,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.53 | bwd: 3320.53 | bwd_inner: 3319.67 | bwd_allreduce: 0.80 | step: 7.50 19%|█▉ | 1948/10000 [3:02:37<12:14:24, 5.47s/it] {'loss': 0.1042, 'grad_norm': 0.7660871148109436, 'learning_rate': 3.7218136958322364e-05, 'epoch': 1.95} 19%|█▉ | 1948/10000 [3:02:37<12:14:24, 5.47s/it][2025-06-19 16:32:21,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:32:21,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.60 | bwd_microstep: 3321.80 | bwd_inner_microstep: 3321.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 16:32:21,854] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.60 | bwd: 3321.81 | bwd_inner: 3321.00 | bwd_allreduce: 0.77 | step: 6.81 19%|█▉ | 1949/10000 [3:02:42<12:14:07, 5.47s/it] {'loss': 0.1079, 'grad_norm': 0.4746681749820709, 'learning_rate': 3.721484054007888e-05, 'epoch': 1.95} 19%|█▉ | 1949/10000 [3:02:42<12:14:07, 5.47s/it][2025-06-19 16:32:27,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:32:27,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.56 | bwd_microstep: 3319.84 | bwd_inner_microstep: 3319.02 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 16:32:27,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.56 | bwd: 3319.85 | bwd_inner: 3319.02 | bwd_allreduce: 0.78 | step: 7.13 20%|█▉ | 1950/10000 [3:02:48<12:13:56, 5.47s/it] {'loss': 0.0529, 'grad_norm': 0.2358231395483017, 'learning_rate': 3.7211542316078506e-05, 'epoch': 1.95} 20%|█▉ | 1950/10000 [3:02:48<12:13:56, 5.47s/it][2025-06-19 16:32:32,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:32:32,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.48 | bwd_microstep: 3382.95 | bwd_inner_microstep: 3382.07 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.21 [2025-06-19 16:32:32,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.48 | bwd: 3382.96 | bwd_inner: 3382.07 | bwd_allreduce: 0.84 | step: 7.22 20%|█▉ | 1951/10000 [3:02:53<12:17:30, 5.50s/it] {'loss': 0.0825, 'grad_norm': 0.3250937759876251, 'learning_rate': 3.720824228666723e-05, 'epoch': 1.95} 20%|█▉ | 1951/10000 [3:02:53<12:17:30, 5.50s/it][2025-06-19 16:32:38,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:32:38,471] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.86 | bwd_microstep: 3399.25 | bwd_inner_microstep: 3398.45 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 16:32:38,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.86 | bwd: 3399.27 | bwd_inner: 3398.45 | bwd_allreduce: 0.78 | step: 7.10 20%|█▉ | 1952/10000 [3:02:59<12:21:01, 5.52s/it] {'loss': 0.077, 'grad_norm': 0.3251618444919586, 'learning_rate': 3.7204940452191205e-05, 'epoch': 1.95} 20%|█▉ | 1952/10000 [3:02:59<12:21:01, 5.52s/it][2025-06-19 16:32:44,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:32:44,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.12 | bwd_microstep: 3369.49 | bwd_inner_microstep: 3368.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.27 [2025-06-19 16:32:44,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.12 | bwd: 3369.50 | bwd_inner: 3368.70 | bwd_allreduce: 0.76 | step: 7.27 20%|█▉ | 1953/10000 [3:03:04<12:21:55, 5.53s/it] {'loss': 0.1941, 'grad_norm': 0.8915385007858276, 'learning_rate': 3.7201636812996776e-05, 'epoch': 1.95} 20%|█▉ | 1953/10000 [3:03:04<12:21:55, 5.53s/it][2025-06-19 16:32:49,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:32:49,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.23 | bwd_microstep: 3379.32 | bwd_inner_microstep: 3378.49 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.94 [2025-06-19 16:32:49,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.23 | bwd: 3379.34 | bwd_inner: 3378.49 | bwd_allreduce: 0.80 | step: 6.94 20%|█▉ | 1954/10000 [3:03:10<12:22:56, 5.54s/it] {'loss': 0.0937, 'grad_norm': 0.4213869273662567, 'learning_rate': 3.7198331369430476e-05, 'epoch': 1.95} 20%|█▉ | 
1954/10000 [3:03:10<12:22:56, 5.54s/it][2025-06-19 16:32:55,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:32:55,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.59 | bwd_microstep: 3384.58 | bwd_inner_microstep: 3383.57 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.20 [2025-06-19 16:32:55,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.59 | bwd: 3384.59 | bwd_inner: 3383.57 | bwd_allreduce: 0.98 | step: 7.20 20%|█▉ | 1955/10000 [3:03:15<12:24:24, 5.55s/it] {'loss': 0.0536, 'grad_norm': 0.2519681751728058, 'learning_rate': 3.7195024121839046e-05, 'epoch': 1.96} 20%|█▉ | 1955/10000 [3:03:15<12:24:24, 5.55s/it][2025-06-19 16:33:00,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 16:33:00,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.57 | bwd_microstep: 3329.65 | bwd_inner_microstep: 3328.66 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.80 [2025-06-19 16:33:00,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.58 | bwd: 3329.66 | bwd_inner: 3328.66 | bwd_allreduce: 0.96 | step: 7.81 20%|█▉ | 1956/10000 [3:03:21<12:21:26, 5.53s/it] {'loss': 0.136, 'grad_norm': 0.703655481338501, 'learning_rate': 3.7191715070569384e-05, 'epoch': 1.96} 20%|█▉ | 1956/10000 [3:03:21<12:21:26, 5.53s/it][2025-06-19 16:33:06,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:33:06,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.39 | bwd_microstep: 3375.01 | bwd_inner_microstep: 3374.22 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 16:33:06,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.39 | bwd: 3375.02 | 
bwd_inner: 3374.22 | bwd_allreduce: 0.76 | step: 6.57 20%|█▉ | 1957/10000 [3:03:27<12:22:49, 5.54s/it] {'loss': 0.0966, 'grad_norm': 0.5111676454544067, 'learning_rate': 3.71884042159686e-05, 'epoch': 1.96} 20%|█▉ | 1957/10000 [3:03:27<12:22:49, 5.54s/it][2025-06-19 16:33:11,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:33:11,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.46 | bwd_microstep: 3384.78 | bwd_inner_microstep: 3383.98 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 16:33:11,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.46 | bwd: 3384.80 | bwd_inner: 3383.98 | bwd_allreduce: 0.77 | step: 6.86 20%|█▉ | 1958/10000 [3:03:32<12:23:07, 5.54s/it] {'loss': 0.0684, 'grad_norm': 0.24960610270500183, 'learning_rate': 3.7185091558383986e-05, 'epoch': 1.96} 20%|█▉ | 1958/10000 [3:03:32<12:23:07, 5.54s/it][2025-06-19 16:33:17,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:33:17,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.44 | bwd_microstep: 3325.03 | bwd_inner_microstep: 3324.11 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.58 [2025-06-19 16:33:17,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.44 | bwd: 3325.06 | bwd_inner: 3324.11 | bwd_allreduce: 0.88 | step: 7.57 20%|█▉ | 1959/10000 [3:03:38<12:20:20, 5.52s/it] {'loss': 0.123, 'grad_norm': 0.7495672106742859, 'learning_rate': 3.718177709816303e-05, 'epoch': 1.96} 20%|█▉ | 1959/10000 [3:03:38<12:20:20, 5.52s/it][2025-06-19 16:33:22,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:33:22,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.42 | 
bwd_microstep: 3329.17 | bwd_inner_microstep: 3328.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 16:33:22,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.43 | bwd: 3329.18 | bwd_inner: 3328.39 | bwd_allreduce: 0.75 | step: 6.60 20%|█▉ | 1960/10000 [3:03:43<12:18:22, 5.51s/it] {'loss': 0.0813, 'grad_norm': 0.44997209310531616, 'learning_rate': 3.71784608356534e-05, 'epoch': 1.96} 20%|█▉ | 1960/10000 [3:03:43<12:18:22, 5.51s/it][2025-06-19 16:33:28,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:33:28,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.26 | bwd_microstep: 3376.39 | bwd_inner_microstep: 3375.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 16:33:28,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.26 | bwd: 3376.41 | bwd_inner: 3375.61 | bwd_allreduce: 0.76 | step: 6.58 20%|█▉ | 1961/10000 [3:03:49<12:20:16, 5.53s/it] {'loss': 0.0805, 'grad_norm': 0.3727801442146301, 'learning_rate': 3.717514277120296e-05, 'epoch': 1.96} 20%|█▉ | 1961/10000 [3:03:49<12:20:16, 5.53s/it][2025-06-19 16:33:33,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:33:33,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.49 | bwd_microstep: 3328.28 | bwd_inner_microstep: 3327.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 16:33:33,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.49 | bwd: 3328.29 | bwd_inner: 3327.49 | bwd_allreduce: 0.76 | step: 6.56 20%|█▉ | 1962/10000 [3:03:54<12:18:03, 5.51s/it] {'loss': 0.0944, 'grad_norm': 0.9375348687171936, 'learning_rate': 3.717182290515974e-05, 'epoch': 1.96} 20%|█▉ | 1962/10000 [3:03:54<12:18:03, 5.51s/it][2025-06-19 16:33:39,220] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:33:39,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.92 | bwd_microstep: 3326.03 | bwd_inner_microstep: 3325.21 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.80 [2025-06-19 16:33:39,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.92 | bwd: 3326.05 | bwd_inner: 3325.21 | bwd_allreduce: 0.79 | step: 6.86 20%|█▉ | 1963/10000 [3:04:00<12:16:41, 5.50s/it] {'loss': 0.1225, 'grad_norm': 0.9091466665267944, 'learning_rate': 3.7168501237872e-05, 'epoch': 1.96} 20%|█▉ | 1963/10000 [3:04:00<12:16:41, 5.50s/it][2025-06-19 16:33:44,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.71 | optimizer_step: 2.73 [2025-06-19 16:33:44,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.88 | bwd_microstep: 3381.57 | bwd_inner_microstep: 3380.58 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.75 [2025-06-19 16:33:44,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.88 | bwd: 3381.59 | bwd_inner: 3380.58 | bwd_allreduce: 0.96 | step: 7.75 20%|█▉ | 1964/10000 [3:04:05<12:19:11, 5.52s/it] {'loss': 0.0721, 'grad_norm': 0.5066200494766235, 'learning_rate': 3.716517776968817e-05, 'epoch': 1.96} 20%|█▉ | 1964/10000 [3:04:05<12:19:11, 5.52s/it][2025-06-19 16:33:50,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:33:50,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.70 | bwd_microstep: 3379.00 | bwd_inner_microstep: 3378.19 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 16:33:50,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.71 | bwd: 3379.01 | bwd_inner: 3378.19 | bwd_allreduce: 0.78 | step: 7.04 20%|█▉ | 1965/10000 
[3:04:11<12:20:12, 5.53s/it] {'loss': 0.0993, 'grad_norm': 0.6930442452430725, 'learning_rate': 3.716185250095685e-05, 'epoch': 1.96} 20%|█▉ | 1965/10000 [3:04:11<12:20:12, 5.53s/it][2025-06-19 16:33:55,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:33:55,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.78 | bwd_microstep: 3333.17 | bwd_inner_microstep: 3332.27 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.88 [2025-06-19 16:33:55,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.78 | bwd: 3333.18 | bwd_inner: 3332.27 | bwd_allreduce: 0.87 | step: 6.88 20%|█▉ | 1966/10000 [3:04:16<12:18:28, 5.52s/it] {'loss': 0.1573, 'grad_norm': 1.4848241806030273, 'learning_rate': 3.715852543202686e-05, 'epoch': 1.97} 20%|█▉ | 1966/10000 [3:04:16<12:18:28, 5.52s/it][2025-06-19 16:34:01,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:34:01,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.73 | bwd_microstep: 3326.49 | bwd_inner_microstep: 3325.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 16:34:01,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.73 | bwd: 3326.50 | bwd_inner: 3325.71 | bwd_allreduce: 0.75 | step: 6.55 20%|█▉ | 1967/10000 [3:04:22<12:16:48, 5.50s/it] {'loss': 0.1046, 'grad_norm': 0.5935189127922058, 'learning_rate': 3.715519656324718e-05, 'epoch': 1.97} 20%|█▉ | 1967/10000 [3:04:22<12:16:48, 5.50s/it][2025-06-19 16:34:06,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:34:06,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.50 | bwd_microstep: 3374.24 | bwd_inner_microstep: 3373.37 | bwd_allreduce_microstep: 
0.81 | step_microstep: 6.95 [2025-06-19 16:34:06,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.51 | bwd: 3374.27 | bwd_inner: 3373.37 | bwd_allreduce: 0.83 | step: 6.96 20%|█▉ | 1968/10000 [3:04:27<12:18:36, 5.52s/it] {'loss': 0.0667, 'grad_norm': 0.4590344727039337, 'learning_rate': 3.7151865894967006e-05, 'epoch': 1.97} 20%|█▉ | 1968/10000 [3:04:27<12:18:36, 5.52s/it][2025-06-19 16:34:12,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:34:12,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.86 | bwd_microstep: 3330.91 | bwd_inner_microstep: 3330.07 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.84 [2025-06-19 16:34:12,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.86 | bwd: 3330.92 | bwd_inner: 3330.07 | bwd_allreduce: 0.81 | step: 6.84 20%|█▉ | 1969/10000 [3:04:33<12:17:09, 5.51s/it] {'loss': 0.0765, 'grad_norm': 0.36360153555870056, 'learning_rate': 3.71485334275357e-05, 'epoch': 1.97} 20%|█▉ | 1969/10000 [3:04:33<12:17:09, 5.51s/it][2025-06-19 16:34:17,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:34:17,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.94 | bwd_microstep: 3376.55 | bwd_inner_microstep: 3375.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 16:34:17,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.94 | bwd: 3376.57 | bwd_inner: 3375.75 | bwd_allreduce: 0.77 | step: 6.97 20%|█▉ | 1970/10000 [3:04:38<12:18:45, 5.52s/it] {'loss': 0.0757, 'grad_norm': 0.34710976481437683, 'learning_rate': 3.714519916130283e-05, 'epoch': 1.97} 20%|█▉ | 1970/10000 [3:04:38<12:18:45, 5.52s/it][2025-06-19 16:34:23,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.52 | optimizer_step: 2.73 [2025-06-19 16:34:23,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.79 | bwd_microstep: 3376.94 | bwd_inner_microstep: 3376.12 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.86 [2025-06-19 16:34:23,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.79 | bwd: 3376.96 | bwd_inner: 3376.12 | bwd_allreduce: 0.79 | step: 6.86 20%|█▉ | 1971/10000 [3:04:44<12:19:41, 5.53s/it] {'loss': 0.0554, 'grad_norm': 0.3737884759902954, 'learning_rate': 3.714186309661814e-05, 'epoch': 1.97} 20%|█▉ | 1971/10000 [3:04:44<12:19:41, 5.53s/it][2025-06-19 16:34:28,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:34:28,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.79 | bwd_microstep: 3382.92 | bwd_inner_microstep: 3382.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 16:34:28,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.79 | bwd: 3382.93 | bwd_inner: 3382.12 | bwd_allreduce: 0.77 | step: 6.76 20%|█▉ | 1972/10000 [3:04:49<12:20:43, 5.54s/it] {'loss': 0.1969, 'grad_norm': 0.76936936378479, 'learning_rate': 3.713852523383157e-05, 'epoch': 1.97} 20%|█▉ | 1972/10000 [3:04:49<12:20:43, 5.54s/it][2025-06-19 16:34:34,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:34:34,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.68 | bwd_microstep: 3382.23 | bwd_inner_microstep: 3381.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 16:34:34,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.68 | bwd: 3382.25 | bwd_inner: 3381.44 | bwd_allreduce: 0.77 | step: 6.81 20%|█▉ | 1973/10000 [3:04:55<12:21:11, 5.54s/it] {'loss': 0.1073, 'grad_norm': 0.5937730073928833, 'learning_rate': 
3.713518557329324e-05, 'epoch': 1.97} 20%|█▉ | 1973/10000 [3:04:55<12:21:11, 5.54s/it][2025-06-19 16:34:40,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:34:40,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.46 | bwd_microstep: 3334.48 | bwd_inner_microstep: 3333.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 16:34:40,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.46 | bwd: 3334.49 | bwd_inner: 3333.67 | bwd_allreduce: 0.77 | step: 6.97 20%|█▉ | 1974/10000 [3:05:00<12:18:48, 5.52s/it] {'loss': 0.1445, 'grad_norm': 0.7914083003997803, 'learning_rate': 3.7131844115353476e-05, 'epoch': 1.97} 20%|█▉ | 1974/10000 [3:05:00<12:18:48, 5.52s/it][2025-06-19 16:34:45,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:34:45,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.66 | bwd_microstep: 3319.33 | bwd_inner_microstep: 3318.32 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.39 [2025-06-19 16:34:45,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.66 | bwd: 3319.35 | bwd_inner: 3318.32 | bwd_allreduce: 0.98 | step: 7.39 20%|█▉ | 1975/10000 [3:05:06<12:16:35, 5.51s/it] {'loss': 0.0829, 'grad_norm': 0.4233263432979584, 'learning_rate': 3.7128500860362775e-05, 'epoch': 1.98} 20%|█▉ | 1975/10000 [3:05:06<12:16:35, 5.51s/it][2025-06-19 16:34:51,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:34:51,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.25 | bwd_microstep: 3379.26 | bwd_inner_microstep: 3378.45 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.65 [2025-06-19 16:34:51,038] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2133.25 | bwd: 3379.27 | bwd_inner: 3378.45 | bwd_allreduce: 0.78 | step: 6.65 20%|█▉ | 1976/10000 [3:05:11<12:18:19, 5.52s/it] {'loss': 0.1137, 'grad_norm': 0.7343443036079407, 'learning_rate': 3.7125155808671826e-05, 'epoch': 1.98} 20%|█▉ | 1976/10000 [3:05:11<12:18:19, 5.52s/it][2025-06-19 16:34:56,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:34:56,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.81 | bwd_microstep: 3331.37 | bwd_inner_microstep: 3330.41 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.15 [2025-06-19 16:34:56,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.81 | bwd: 3331.38 | bwd_inner: 3330.41 | bwd_allreduce: 0.93 | step: 7.15 20%|█▉ | 1977/10000 [3:05:17<12:16:39, 5.51s/it] {'loss': 0.1159, 'grad_norm': 0.5687777400016785, 'learning_rate': 3.7121808960631513e-05, 'epoch': 1.98} 20%|█▉ | 1977/10000 [3:05:17<12:16:39, 5.51s/it][2025-06-19 16:35:01,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:35:01,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.02 | bwd_microstep: 3322.10 | bwd_inner_microstep: 3320.95 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.39 [2025-06-19 16:35:01,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.02 | bwd: 3322.12 | bwd_inner: 3320.95 | bwd_allreduce: 1.11 | step: 7.39 20%|█▉ | 1978/10000 [3:05:22<12:15:10, 5.50s/it] {'loss': 0.1932, 'grad_norm': 0.9013243317604065, 'learning_rate': 3.711846031659291e-05, 'epoch': 1.98} 20%|█▉ | 1978/10000 [3:05:22<12:15:10, 5.50s/it][2025-06-19 16:35:07,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:35:07,473] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2108.92 | bwd_microstep: 3331.69 | bwd_inner_microstep: 3330.88 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 16:35:07,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.92 | bwd: 3331.71 | bwd_inner: 3330.88 | bwd_allreduce: 0.78 | step: 6.78 20%|█▉ | 1979/10000 [3:05:28<12:14:18, 5.49s/it] {'loss': 0.1448, 'grad_norm': 1.0167149305343628, 'learning_rate': 3.711510987690726e-05, 'epoch': 1.98} 20%|█▉ | 1979/10000 [3:05:28<12:14:18, 5.49s/it][2025-06-19 16:35:13,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.86 [2025-06-19 16:35:13,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.29 | bwd_microstep: 3382.70 | bwd_inner_microstep: 3381.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.32 [2025-06-19 16:35:13,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.29 | bwd: 3382.72 | bwd_inner: 3381.91 | bwd_allreduce: 0.76 | step: 7.32 20%|█▉ | 1980/10000 [3:05:33<12:16:41, 5.51s/it] {'loss': 0.2527, 'grad_norm': 1.0284929275512695, 'learning_rate': 3.711175764192603e-05, 'epoch': 1.98} 20%|█▉ | 1980/10000 [3:05:33<12:16:41, 5.51s/it][2025-06-19 16:35:18,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.86 [2025-06-19 16:35:18,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.64 | bwd_microstep: 3375.44 | bwd_inner_microstep: 3374.63 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 16:35:18,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.64 | bwd: 3375.45 | bwd_inner: 3374.63 | bwd_allreduce: 0.77 | step: 7.01 20%|█▉ | 1981/10000 [3:05:39<12:17:43, 5.52s/it] {'loss': 0.07, 'grad_norm': 0.3165569007396698, 'learning_rate': 3.7108403612000834e-05, 'epoch': 1.98} 20%|█▉ | 1981/10000 [3:05:39<12:17:43, 5.52s/it][2025-06-19 
16:35:24,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 16:35:24,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.05 | bwd_microstep: 3322.60 | bwd_inner_microstep: 3321.64 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.84 [2025-06-19 16:35:24,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.05 | bwd: 3322.62 | bwd_inner: 3321.64 | bwd_allreduce: 0.92 | step: 7.85 20%|█▉ | 1982/10000 [3:05:44<12:15:40, 5.51s/it] {'loss': 0.1367, 'grad_norm': 0.6066577434539795, 'learning_rate': 3.71050477874835e-05, 'epoch': 1.98} 20%|█▉ | 1982/10000 [3:05:44<12:15:40, 5.51s/it][2025-06-19 16:35:29,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:35:29,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.92 | bwd_microstep: 3381.61 | bwd_inner_microstep: 3380.70 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.94 [2025-06-19 16:35:29,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.92 | bwd: 3381.62 | bwd_inner: 3380.71 | bwd_allreduce: 0.87 | step: 6.94 20%|█▉ | 1983/10000 [3:05:50<12:17:50, 5.52s/it] {'loss': 0.1227, 'grad_norm': 0.6974902749061584, 'learning_rate': 3.7101690168726046e-05, 'epoch': 1.98} 20%|█▉ | 1983/10000 [3:05:50<12:17:50, 5.52s/it][2025-06-19 16:35:35,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:35:35,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.58 | bwd_microstep: 3321.15 | bwd_inner_microstep: 3320.32 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.85 [2025-06-19 16:35:35,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.58 | bwd: 3321.16 | bwd_inner: 3320.32 | bwd_allreduce: 0.80 | step: 6.86 
20%|█▉ | 1984/10000 [3:05:55<12:16:02, 5.51s/it] {'loss': 0.0911, 'grad_norm': 0.4125991463661194, 'learning_rate': 3.7098330756080656e-05, 'epoch': 1.98} 20%|█▉ | 1984/10000 [3:05:55<12:16:02, 5.51s/it][2025-06-19 16:35:40,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:35:40,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.08 | bwd_microstep: 3369.35 | bwd_inner_microstep: 3368.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.59 [2025-06-19 16:35:40,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.08 | bwd: 3369.36 | bwd_inner: 3368.56 | bwd_allreduce: 0.76 | step: 6.60 20%|█▉ | 1985/10000 [3:06:01<12:17:21, 5.52s/it] {'loss': 0.0973, 'grad_norm': 0.6580559611320496, 'learning_rate': 3.709496954989973e-05, 'epoch': 1.98} 20%|█▉ | 1985/10000 [3:06:01<12:17:21, 5.52s/it][2025-06-19 16:35:46,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:35:46,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.46 | bwd_microstep: 3373.31 | bwd_inner_microstep: 3372.49 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.11 [2025-06-19 16:35:46,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.46 | bwd: 3373.33 | bwd_inner: 3372.49 | bwd_allreduce: 0.79 | step: 7.11 20%|█▉ | 1986/10000 [3:06:06<12:18:04, 5.53s/it] {'loss': 0.106, 'grad_norm': 0.3905705511569977, 'learning_rate': 3.7091606550535846e-05, 'epoch': 1.99} 20%|█▉ | 1986/10000 [3:06:06<12:18:04, 5.53s/it][2025-06-19 16:35:51,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:35:51,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.12 | bwd_microstep: 3405.95 | bwd_inner_microstep: 3405.16 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 16:35:51,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.12 | bwd: 3405.97 | bwd_inner: 3405.16 | bwd_allreduce: 0.76 | step: 6.75 20%|█▉ | 1987/10000 [3:06:12<12:20:21, 5.54s/it] {'loss': 0.0952, 'grad_norm': 0.5285655856132507, 'learning_rate': 3.708824175834175e-05, 'epoch': 1.99} 20%|█▉ | 1987/10000 [3:06:12<12:20:21, 5.54s/it][2025-06-19 16:35:57,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:35:57,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.03 | bwd_microstep: 3321.53 | bwd_inner_microstep: 3320.46 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.90 [2025-06-19 16:35:57,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.03 | bwd: 3321.55 | bwd_inner: 3320.46 | bwd_allreduce: 1.04 | step: 7.90 20%|█▉ | 1988/10000 [3:06:18<12:17:21, 5.52s/it] {'loss': 0.1441, 'grad_norm': 0.716923713684082, 'learning_rate': 3.7084875173670404e-05, 'epoch': 1.99} 20%|█▉ | 1988/10000 [3:06:18<12:17:21, 5.52s/it][2025-06-19 16:36:02,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:36:02,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.67 | bwd_microstep: 3400.47 | bwd_inner_microstep: 3399.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 16:36:02,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.67 | bwd: 3400.49 | bwd_inner: 3399.69 | bwd_allreduce: 0.76 | step: 6.66 20%|█▉ | 1989/10000 [3:06:23<12:19:40, 5.54s/it] {'loss': 0.1359, 'grad_norm': 0.8380409479141235, 'learning_rate': 3.7081506796874946e-05, 'epoch': 1.99} 20%|█▉ | 1989/10000 [3:06:23<12:19:40, 5.54s/it][2025-06-19 16:36:08,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.50 | optimizer_step: 2.79 [2025-06-19 16:36:08,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.88 | bwd_microstep: 3366.14 | bwd_inner_microstep: 3365.20 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.79 [2025-06-19 16:36:08,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.88 | bwd: 3366.15 | bwd_inner: 3365.20 | bwd_allreduce: 0.91 | step: 6.79 20%|█▉ | 1990/10000 [3:06:29<12:19:04, 5.54s/it] {'loss': 0.0802, 'grad_norm': 0.35124459862709045, 'learning_rate': 3.707813662830871e-05, 'epoch': 1.99} 20%|█▉ | 1990/10000 [3:06:29<12:19:04, 5.54s/it][2025-06-19 16:36:13,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:36:13,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.38 | bwd_microstep: 3337.87 | bwd_inner_microstep: 3336.89 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.03 [2025-06-19 16:36:13,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.38 | bwd: 3337.89 | bwd_inner: 3336.89 | bwd_allreduce: 0.96 | step: 7.04 20%|█▉ | 1991/10000 [3:06:34<12:16:46, 5.52s/it] {'loss': 0.1144, 'grad_norm': 0.608684241771698, 'learning_rate': 3.707476466832519e-05, 'epoch': 1.99} 20%|█▉ | 1991/10000 [3:06:34<12:16:46, 5.52s/it][2025-06-19 16:36:19,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 16:36:19,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.67 | bwd_microstep: 3327.32 | bwd_inner_microstep: 3326.12 | bwd_allreduce_microstep: 1.13 | step_microstep: 8.19 [2025-06-19 16:36:19,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.68 | bwd: 3327.34 | bwd_inner: 3326.12 | bwd_allreduce: 1.16 | step: 8.20 20%|█▉ | 1992/10000 [3:06:40<12:15:31, 5.51s/it] {'loss': 0.0893, 'grad_norm': 
0.6102568507194519, 'learning_rate': 3.707139091727811e-05, 'epoch': 1.99} 20%|█▉ | 1992/10000 [3:06:40<12:15:31, 5.51s/it][2025-06-19 16:36:24,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:36:24,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.31 | bwd_microstep: 3317.72 | bwd_inner_microstep: 3316.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-19 16:36:24,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.31 | bwd: 3317.74 | bwd_inner: 3316.92 | bwd_allreduce: 0.77 | step: 6.88 20%|█▉ | 1993/10000 [3:06:45<12:13:45, 5.50s/it] {'loss': 0.189, 'grad_norm': 0.7104586362838745, 'learning_rate': 3.7068015375521363e-05, 'epoch': 1.99} 20%|█▉ | 1993/10000 [3:06:45<12:13:45, 5.50s/it][2025-06-19 16:36:30,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:36:30,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.73 | bwd_microstep: 3319.77 | bwd_inner_microstep: 3318.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 16:36:30,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.73 | bwd: 3319.78 | bwd_inner: 3318.98 | bwd_allreduce: 0.76 | step: 6.56 20%|█▉ | 1994/10000 [3:06:51<12:12:20, 5.49s/it] {'loss': 0.1848, 'grad_norm': 0.5080738663673401, 'learning_rate': 3.7064638043409003e-05, 'epoch': 1.99} 20%|█▉ | 1994/10000 [3:06:51<12:12:20, 5.49s/it][2025-06-19 16:36:35,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:36:35,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.04 | bwd_microstep: 3375.42 | bwd_inner_microstep: 3374.36 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.43 [2025-06-19 16:36:35,774] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.04 | bwd: 3375.45 | bwd_inner: 3374.36 | bwd_allreduce: 1.02 | step: 7.43 20%|█▉ | 1995/10000 [3:06:56<12:14:20, 5.50s/it] {'loss': 0.1147, 'grad_norm': 0.6139086484909058, 'learning_rate': 3.7061258921295316e-05, 'epoch': 2.0} 20%|█▉ | 1995/10000 [3:06:56<12:14:20, 5.50s/it][2025-06-19 16:36:41,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:36:41,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.45 | bwd_microstep: 3364.32 | bwd_inner_microstep: 3363.46 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.97 [2025-06-19 16:36:41,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.45 | bwd: 3364.34 | bwd_inner: 3363.46 | bwd_allreduce: 0.83 | step: 6.97 20%|█▉ | 1996/10000 [3:07:02<12:15:36, 5.51s/it] {'loss': 0.091, 'grad_norm': 0.5708485841751099, 'learning_rate': 3.7057878009534755e-05, 'epoch': 2.0} 20%|█▉ | 1996/10000 [3:07:02<12:15:36, 5.51s/it][2025-06-19 16:36:46,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 16:36:46,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.77 | bwd_microstep: 3373.27 | bwd_inner_microstep: 3372.23 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.20 [2025-06-19 16:36:46,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.77 | bwd: 3373.29 | bwd_inner: 3372.23 | bwd_allreduce: 1.01 | step: 8.20 20%|█▉ | 1997/10000 [3:07:07<12:16:43, 5.52s/it] {'loss': 0.1218, 'grad_norm': 0.6038703918457031, 'learning_rate': 3.7054495308481954e-05, 'epoch': 2.0} 20%|█▉ | 1997/10000 [3:07:07<12:16:43, 5.52s/it][2025-06-19 16:36:52,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:36:52,401] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.87 | bwd_microstep: 3372.41 | bwd_inner_microstep: 3371.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 16:36:52,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.87 | bwd: 3372.43 | bwd_inner: 3371.62 | bwd_allreduce: 0.76 | step: 6.77 20%|█▉ | 1998/10000 [3:07:13<12:17:23, 5.53s/it] {'loss': 0.1176, 'grad_norm': 0.5377686023712158, 'learning_rate': 3.7051110818491754e-05, 'epoch': 2.0} 20%|█▉ | 1998/10000 [3:07:13<12:17:23, 5.53s/it][2025-06-19 16:36:57,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:36:57,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.76 | bwd_microstep: 3365.71 | bwd_inner_microstep: 3364.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 16:36:57,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.76 | bwd: 3365.73 | bwd_inner: 3364.93 | bwd_allreduce: 0.76 | step: 6.64 20%|█▉ | 1999/10000 [3:07:18<12:17:24, 5.53s/it] {'loss': 0.077, 'grad_norm': 0.30576056241989136, 'learning_rate': 3.704772453991916e-05, 'epoch': 2.0} 20%|█▉ | 1999/10000 [3:07:18<12:17:24, 5.53s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images.
[2025-06-19 16:37:05,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.73 [2025-06-19 16:37:05,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2088.83 | bwd_microstep: 3307.15 | bwd_inner_microstep: 3306.29 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.41 [2025-06-19 16:37:05,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2088.83 | bwd: 3307.17 | bwd_inner: 3306.29 | bwd_allreduce: 0.83 | step: 7.42 20%|██ | 2000/10000 [3:07:26<13:35:30, 6.12s/it] {'loss': 0.0938, 'grad_norm': 0.5594139695167542, 'learning_rate': 3.7044336473119386e-05, 'epoch': 2.0} 20%|██ | 2000/10000 [3:07:26<13:35:30, 6.12s/it]evaluate! [INFO|trainer.py:3910] 2025-06-19 16:37:15,595 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 16:37:15,601 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 16:37:15,601 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 16:38:06,312 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 16:38:06,315 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 16:38:06,316 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 16:38:06,316 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json evaluate! 
[INFO|trainer.py:3910] 2025-06-19 16:38:20,038 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 16:38:20,043 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 16:38:20,043 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 16:39:16,963 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 16:39:16,967 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 16:39:16,968 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 16:39:16,968 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json [2025-06-19 16:39:21,627] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 16:39:27,428] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 16:39:33,159] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto 
detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 16:39:39,011] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 16:39:56,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 16:39:56,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2093.65 | bwd_microstep: 3308.39 | bwd_inner_microstep: 3307.48 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.87 [2025-06-19 16:39:56,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.65 | bwd: 3308.42 | bwd_inner: 3307.48 | bwd_allreduce: 0.87 | step: 7.86 20%|██ | 2001/10000 [3:10:17<123:51:13, 55.74s/it] {'loss': 0.0966, 'grad_norm': 0.44088467955589294, 'learning_rate': 3.7040946618447824e-05, 'epoch': 2.0} 20%|██ | 2001/10000 [3:10:17<123:51:13, 55.74s/it][2025-06-19 16:40:02,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:40:02,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.57 | bwd_microstep: 3323.84 | bwd_inner_microstep: 3323.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 16:40:02,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.57 | bwd: 3323.85 | bwd_inner: 3323.04 | bwd_allreduce: 0.77 | step: 
6.80 20%|██ | 2002/10000 [3:10:23<90:20:01, 40.66s/it] {'loss': 0.0586, 'grad_norm': 0.2254749834537506, 'learning_rate': 3.703755497626005e-05, 'epoch': 2.0} 20%|██ | 2002/10000 [3:10:23<90:20:01, 40.66s/it][2025-06-19 16:40:07,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:40:07,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2079.69 | bwd_microstep: 3280.83 | bwd_inner_microstep: 3280.02 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-19 16:40:07,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2079.69 | bwd: 3280.85 | bwd_inner: 3280.02 | bwd_allreduce: 0.79 | step: 7.18 20%|██ | 2003/10000 [3:10:28<66:49:54, 30.09s/it] {'loss': 0.0782, 'grad_norm': 0.3078458309173584, 'learning_rate': 3.7034161546911825e-05, 'epoch': 2.0} 20%|██ | 2003/10000 [3:10:28<66:49:54, 30.09s/it][2025-06-19 16:40:13,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:40:13,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.66 | bwd_microstep: 3330.33 | bwd_inner_microstep: 3329.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-19 16:40:13,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.66 | bwd: 3330.34 | bwd_inner: 3329.52 | bwd_allreduce: 0.77 | step: 6.82 20%|██ | 2004/10000 [3:10:34<50:25:40, 22.70s/it] {'loss': 0.0662, 'grad_norm': 0.383975625038147, 'learning_rate': 3.703076633075912e-05, 'epoch': 2.0} 20%|██ | 2004/10000 [3:10:34<50:25:40, 22.70s/it][2025-06-19 16:40:18,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:40:18,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2088.73 | bwd_microstep: 3298.55 | bwd_inner_microstep: 3297.75 
| bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 16:40:18,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2088.73 | bwd: 3298.56 | bwd_inner: 3297.75 | bwd_allreduce: 0.76 | step: 6.73 20%|██ | 2005/10000 [3:10:39<38:54:32, 17.52s/it] {'loss': 0.0866, 'grad_norm': 0.42484891414642334, 'learning_rate': 3.7027369328158066e-05, 'epoch': 2.0} 20%|██ | 2005/10000 [3:10:39<38:54:32, 17.52s/it][2025-06-19 16:40:24,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:40:24,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.31 | bwd_microstep: 3355.65 | bwd_inner_microstep: 3354.66 | bwd_allreduce_microstep: 0.94 | step_microstep: 8.24 [2025-06-19 16:40:24,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.31 | bwd: 3355.66 | bwd_inner: 3354.66 | bwd_allreduce: 0.96 | step: 8.25 20%|██ | 2006/10000 [3:10:45<30:54:21, 13.92s/it] {'loss': 0.0703, 'grad_norm': 0.35099363327026367, 'learning_rate': 3.702397053946499e-05, 'epoch': 2.01} 20%|██ | 2006/10000 [3:10:45<30:54:21, 13.92s/it][2025-06-19 16:40:29,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.73 [2025-06-19 16:40:29,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.16 | bwd_microstep: 3356.93 | bwd_inner_microstep: 3356.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 16:40:29,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.16 | bwd: 3356.95 | bwd_inner: 3356.13 | bwd_allreduce: 0.77 | step: 6.82 20%|██ | 2007/10000 [3:10:50<25:18:46, 11.40s/it] {'loss': 0.0616, 'grad_norm': 0.2926812469959259, 'learning_rate': 3.7020569965036425e-05, 'epoch': 2.01} 20%|██ | 2007/10000 [3:10:50<25:18:46, 11.40s/it][2025-06-19 16:40:35,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:40:35,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.77 | bwd_microstep: 3349.55 | bwd_inner_microstep: 3348.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 16:40:35,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.77 | bwd: 3349.56 | bwd_inner: 3348.76 | bwd_allreduce: 0.76 | step: 6.65 20%|██ | 2008/10000 [3:10:56<21:23:19, 9.63s/it] {'loss': 0.0739, 'grad_norm': 0.3268858790397644, 'learning_rate': 3.701716760522907e-05, 'epoch': 2.01} 20%|██ | 2008/10000 [3:10:56<21:23:19, 9.63s/it][2025-06-19 16:40:40,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:40:40,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.84 | bwd_microstep: 3346.36 | bwd_inner_microstep: 3345.41 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.52 [2025-06-19 16:40:40,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.84 | bwd: 3346.38 | bwd_inner: 3345.41 | bwd_allreduce: 0.91 | step: 7.51 20%|██ | 2009/10000 [3:11:01<18:38:04, 8.39s/it] {'loss': 0.097, 'grad_norm': 0.5387699604034424, 'learning_rate': 3.70137634603998e-05, 'epoch': 2.01} 20%|██ | 2009/10000 [3:11:01<18:38:04, 8.39s/it][2025-06-19 16:40:46,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:40:46,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2090.66 | bwd_microstep: 3290.35 | bwd_inner_microstep: 3289.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.21 [2025-06-19 16:40:46,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.66 | bwd: 3290.37 | bwd_inner: 3289.57 | bwd_allreduce: 0.76 | step: 7.23 20%|██ | 2010/10000 [3:11:07<16:39:21, 7.50s/it] {'loss': 0.1072, 
'grad_norm': 0.6414551138877869, 'learning_rate': 3.7010357530905706e-05, 'epoch': 2.01} 20%|██ | 2010/10000 [3:11:07<16:39:21, 7.50s/it][2025-06-19 16:40:51,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:40:51,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2091.19 | bwd_microstep: 3290.11 | bwd_inner_microstep: 3289.31 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.09 [2025-06-19 16:40:51,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2091.19 | bwd: 3290.13 | bwd_inner: 3289.31 | bwd_allreduce: 0.78 | step: 7.09 20%|██ | 2011/10000 [3:11:12<15:15:54, 6.88s/it] {'loss': 0.0667, 'grad_norm': 0.3650943338871002, 'learning_rate': 3.700694981710406e-05, 'epoch': 2.01} 20%|██ | 2011/10000 [3:11:12<15:15:54, 6.88s/it][2025-06-19 16:40:57,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:40:57,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.52 | bwd_microstep: 3374.09 | bwd_inner_microstep: 3373.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 16:40:57,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.52 | bwd: 3374.10 | bwd_inner: 3373.30 | bwd_allreduce: 0.76 | step: 6.66 20%|██ | 2012/10000 [3:11:17<14:21:44, 6.47s/it] {'loss': 0.0554, 'grad_norm': 0.22613459825515747, 'learning_rate': 3.7003540319352314e-05, 'epoch': 2.01} 20%|██ | 2012/10000 [3:11:17<14:21:44, 6.47s/it][2025-06-19 16:41:02,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:41:02,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.26 | bwd_microstep: 3340.47 | bwd_inner_microstep: 3339.66 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.28 [2025-06-19 
16:41:02,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.26 | bwd: 3340.48 | bwd_inner: 3339.66 | bwd_allreduce: 0.78 | step: 7.28 20%|██ | 2013/10000 [3:11:23<13:42:15, 6.18s/it] {'loss': 0.0972, 'grad_norm': 0.31975916028022766, 'learning_rate': 3.7000129038008094e-05, 'epoch': 2.01} 20%|██ | 2013/10000 [3:11:23<13:42:15, 6.18s/it][2025-06-19 16:41:08,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:41:08,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2086.35 | bwd_microstep: 3300.77 | bwd_inner_microstep: 3299.96 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.37 [2025-06-19 16:41:08,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2086.35 | bwd: 3300.79 | bwd_inner: 3299.96 | bwd_allreduce: 0.78 | step: 7.37 20%|██ | 2014/10000 [3:11:28<13:12:25, 5.95s/it] {'loss': 0.0502, 'grad_norm': 0.174695685505867, 'learning_rate': 3.699671597342925e-05, 'epoch': 2.01} 20%|██ | 2014/10000 [3:11:28<13:12:25, 5.95s/it][2025-06-19 16:41:13,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:41:13,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2089.93 | bwd_microstep: 3290.84 | bwd_inner_microstep: 3290.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 16:41:13,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2089.93 | bwd: 3290.86 | bwd_inner: 3290.04 | bwd_allreduce: 0.77 | step: 7.02 20%|██ | 2015/10000 [3:11:34<12:51:02, 5.79s/it] {'loss': 0.1101, 'grad_norm': 0.5085338950157166, 'learning_rate': 3.6993301125973775e-05, 'epoch': 2.02} 20%|██ | 2015/10000 [3:11:34<12:51:02, 5.79s/it][2025-06-19 16:41:19,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
16:41:19,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.70 | bwd_microstep: 3374.52 | bwd_inner_microstep: 3373.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 16:41:19,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.70 | bwd: 3374.54 | bwd_inner: 3373.74 | bwd_allreduce: 0.76 | step: 6.78 20%|██ | 2016/10000 [3:11:39<12:40:34, 5.72s/it] {'loss': 0.0922, 'grad_norm': 0.4987410008907318, 'learning_rate': 3.698988449599988e-05, 'epoch': 2.02} 20%|██ | 2016/10000 [3:11:39<12:40:34, 5.72s/it][2025-06-19 16:41:24,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:41:24,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.69 | bwd_microstep: 3386.89 | bwd_inner_microstep: 3386.07 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.41 [2025-06-19 16:41:24,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.69 | bwd: 3386.90 | bwd_inner: 3386.07 | bwd_allreduce: 0.78 | step: 7.41 20%|██ | 2017/10000 [3:11:45<12:33:44, 5.67s/it] {'loss': 0.0802, 'grad_norm': 0.40102362632751465, 'learning_rate': 3.6986466083865955e-05, 'epoch': 2.02} 20%|██ | 2017/10000 [3:11:45<12:33:44, 5.67s/it][2025-06-19 16:41:30,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:41:30,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.73 | bwd_microstep: 3355.25 | bwd_inner_microstep: 3354.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 16:41:30,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.73 | bwd: 3355.27 | bwd_inner: 3354.47 | bwd_allreduce: 0.76 | step: 6.69 20%|██ | 2018/10000 [3:11:50<12:27:44, 5.62s/it] {'loss': 0.0755, 'grad_norm': 0.41626814007759094, 'learning_rate': 3.698304588993058e-05, 'epoch': 2.02} 
20%|██ | 2018/10000 [3:11:50<12:27:44, 5.62s/it][2025-06-19 16:41:35,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:41:35,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.18 | bwd_microstep: 3316.61 | bwd_inner_microstep: 3315.56 | bwd_allreduce_microstep: 0.98 | step_microstep: 8.02 [2025-06-19 16:41:35,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.18 | bwd: 3316.64 | bwd_inner: 3315.56 | bwd_allreduce: 1.01 | step: 8.03 20%|██ | 2019/10000 [3:11:56<12:21:23, 5.57s/it] {'loss': 0.0651, 'grad_norm': 0.30835381150245667, 'learning_rate': 3.697962391455251e-05, 'epoch': 2.02} 20%|██ | 2019/10000 [3:11:56<12:21:23, 5.57s/it][2025-06-19 16:41:41,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:41:41,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.94 | bwd_microstep: 3316.24 | bwd_inner_microstep: 3315.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.04 [2025-06-19 16:41:41,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.94 | bwd: 3316.25 | bwd_inner: 3315.45 | bwd_allreduce: 0.76 | step: 7.04 20%|██ | 2020/10000 [3:12:01<12:17:16, 5.54s/it] {'loss': 0.1511, 'grad_norm': 0.7517143487930298, 'learning_rate': 3.697620015809069e-05, 'epoch': 2.02} 20%|██ | 2020/10000 [3:12:01<12:17:16, 5.54s/it][2025-06-19 16:41:46,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:41:46,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.78 | bwd_microstep: 3329.27 | bwd_inner_microstep: 3328.40 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.17 [2025-06-19 16:41:46,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.79 | bwd: 
3329.29 | bwd_inner: 3328.40 | bwd_allreduce: 0.83 | step: 7.17 20%|██ | 2021/10000 [3:12:07<12:14:42, 5.52s/it] {'loss': 0.123, 'grad_norm': 0.803368866443634, 'learning_rate': 3.697277462090427e-05, 'epoch': 2.02} 20%|██ | 2021/10000 [3:12:07<12:14:42, 5.52s/it][2025-06-19 16:41:52,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.70 | optimizer_step: 2.73 [2025-06-19 16:41:52,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.04 | bwd_microstep: 3387.43 | bwd_inner_microstep: 3386.20 | bwd_allreduce_microstep: 1.11 | step_microstep: 10.66 [2025-06-19 16:41:52,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.04 | bwd: 3387.48 | bwd_inner: 3386.20 | bwd_allreduce: 1.17 | step: 10.67 20%|██ | 2022/10000 [3:12:12<12:17:19, 5.55s/it] {'loss': 0.0771, 'grad_norm': 0.7126859426498413, 'learning_rate': 3.6969347303352565e-05, 'epoch': 2.02} 20%|██ | 2022/10000 [3:12:12<12:17:19, 5.55s/it][2025-06-19 16:41:57,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:41:57,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2159.03 | bwd_microstep: 3383.41 | bwd_inner_microstep: 3382.52 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.59 [2025-06-19 16:41:57,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2159.03 | bwd: 3383.44 | bwd_inner: 3382.52 | bwd_allreduce: 0.84 | step: 7.60 20%|██ | 2023/10000 [3:12:18<12:19:23, 5.56s/it] {'loss': 0.0972, 'grad_norm': 0.8134093880653381, 'learning_rate': 3.696591820579508e-05, 'epoch': 2.02} 20%|██ | 2023/10000 [3:12:18<12:19:23, 5.56s/it][2025-06-19 16:42:03,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:42:03,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2151.68 | bwd_microstep: 3338.58 | bwd_inner_microstep: 3337.68 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.39 [2025-06-19 16:42:03,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.68 | bwd: 3338.60 | bwd_inner: 3337.68 | bwd_allreduce: 0.86 | step: 7.39 20%|██ | 2024/10000 [3:12:24<12:18:26, 5.55s/it] {'loss': 0.1767, 'grad_norm': 1.2942203283309937, 'learning_rate': 3.696248732859152e-05, 'epoch': 2.02} 20%|██ | 2024/10000 [3:12:24<12:18:26, 5.55s/it][2025-06-19 16:42:08,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:42:08,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.76 | bwd_microstep: 3319.52 | bwd_inner_microstep: 3318.62 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.90 [2025-06-19 16:42:08,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.76 | bwd: 3319.54 | bwd_inner: 3318.62 | bwd_allreduce: 0.87 | step: 6.90 20%|██ | 2025/10000 [3:12:29<12:16:15, 5.54s/it] {'loss': 0.0581, 'grad_norm': 0.48841673135757446, 'learning_rate': 3.695905467210176e-05, 'epoch': 2.02} 20%|██ | 2025/10000 [3:12:29<12:16:15, 5.54s/it][2025-06-19 16:42:14,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:42:14,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.49 | bwd_microstep: 3376.63 | bwd_inner_microstep: 3375.74 | bwd_allreduce_microstep: 0.81 | step_microstep: 8.03 [2025-06-19 16:42:14,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.49 | bwd: 3376.65 | bwd_inner: 3375.74 | bwd_allreduce: 0.84 | step: 8.03 20%|██ | 2026/10000 [3:12:35<12:17:21, 5.55s/it] {'loss': 0.1273, 'grad_norm': 0.9158210754394531, 'learning_rate': 3.695562023668588e-05, 'epoch': 2.03} 20%|██ | 2026/10000 [3:12:35<12:17:21, 5.55s/it][2025-06-19 16:42:19,891] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-19 16:42:19,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.17 | bwd_microstep: 3381.64 | bwd_inner_microstep: 3380.42 | bwd_allreduce_microstep: 1.11 | step_microstep: 8.35 [2025-06-19 16:42:19,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.17 | bwd: 3381.68 | bwd_inner: 3380.42 | bwd_allreduce: 1.16 | step: 8.34 20%|██ | 2027/10000 [3:12:40<12:18:09, 5.55s/it] {'loss': 0.0886, 'grad_norm': 0.414700984954834, 'learning_rate': 3.6952184022704127e-05, 'epoch': 2.03} 20%|██ | 2027/10000 [3:12:40<12:18:09, 5.55s/it][2025-06-19 16:42:25,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 16:42:25,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.39 | bwd_microstep: 3337.45 | bwd_inner_microstep: 3336.49 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.13 [2025-06-19 16:42:25,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.39 | bwd: 3337.46 | bwd_inner: 3336.49 | bwd_allreduce: 0.92 | step: 7.13 20%|██ | 2028/10000 [3:12:46<12:15:59, 5.54s/it] {'loss': 0.1067, 'grad_norm': 0.8524466156959534, 'learning_rate': 3.6948746030516944e-05, 'epoch': 2.03} 20%|██ | 2028/10000 [3:12:46<12:15:59, 5.54s/it][2025-06-19 16:42:30,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:42:30,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.03 | bwd_microstep: 3335.59 | bwd_inner_microstep: 3334.76 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.78 [2025-06-19 16:42:30,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.03 | bwd: 3335.61 | bwd_inner: 3334.76 | bwd_allreduce: 0.80 | step: 7.80 20%|██ | 2029/10000 
[3:12:51<12:14:20, 5.53s/it] {'loss': 0.1192, 'grad_norm': 1.1110588312149048, 'learning_rate': 3.694530626048497e-05, 'epoch': 2.03} 20%|██ | 2029/10000 [3:12:51<12:14:20, 5.53s/it][2025-06-19 16:42:36,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 16:42:36,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.50 | bwd_microstep: 3330.73 | bwd_inner_microstep: 3329.60 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.34 [2025-06-19 16:42:36,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.50 | bwd: 3330.75 | bwd_inner: 3329.60 | bwd_allreduce: 1.10 | step: 7.34 20%|██ | 2030/10000 [3:12:57<12:12:40, 5.52s/it] {'loss': 0.0562, 'grad_norm': 0.3442155420780182, 'learning_rate': 3.694186471296902e-05, 'epoch': 2.03} 20%|██ | 2030/10000 [3:12:57<12:12:40, 5.52s/it][2025-06-19 16:42:41,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-19 16:42:41,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.73 | bwd_microstep: 3340.28 | bwd_inner_microstep: 3339.16 | bwd_allreduce_microstep: 1.03 | step_microstep: 8.49 [2025-06-19 16:42:41,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.73 | bwd: 3340.32 | bwd_inner: 3339.16 | bwd_allreduce: 1.07 | step: 8.50 20%|██ | 2031/10000 [3:13:02<12:12:23, 5.51s/it] {'loss': 0.0442, 'grad_norm': 0.24895700812339783, 'learning_rate': 3.6938421388330084e-05, 'epoch': 2.03} 20%|██ | 2031/10000 [3:13:02<12:12:23, 5.51s/it][2025-06-19 16:42:47,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 16:42:47,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.23 | bwd_microstep: 3340.06 | bwd_inner_microstep: 3338.92 | 
bwd_allreduce_microstep: 1.07 | step_microstep: 8.29 [2025-06-19 16:42:47,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.23 | bwd: 3340.08 | bwd_inner: 3338.92 | bwd_allreduce: 1.10 | step: 8.33 20%|██ | 2032/10000 [3:13:08<12:12:27, 5.52s/it] {'loss': 0.1075, 'grad_norm': 0.6420744061470032, 'learning_rate': 3.693497628692936e-05, 'epoch': 2.03} 20%|██ | 2032/10000 [3:13:08<12:12:27, 5.52s/it][2025-06-19 16:42:52,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:42:52,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.47 | bwd_microstep: 3334.80 | bwd_inner_microstep: 3333.85 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.10 [2025-06-19 16:42:52,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.47 | bwd: 3334.82 | bwd_inner: 3333.85 | bwd_allreduce: 0.93 | step: 7.11 20%|██ | 2033/10000 [3:13:13<12:12:15, 5.51s/it] {'loss': 0.0916, 'grad_norm': 0.3889232873916626, 'learning_rate': 3.693152940912822e-05, 'epoch': 2.03} 20%|██ | 2033/10000 [3:13:13<12:12:15, 5.51s/it][2025-06-19 16:42:58,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.77 | optimizer_step: 2.72 [2025-06-19 16:42:58,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.43 | bwd_microstep: 3371.85 | bwd_inner_microstep: 3370.51 | bwd_allreduce_microstep: 1.26 | step_microstep: 8.47 [2025-06-19 16:42:58,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.43 | bwd: 3371.88 | bwd_inner: 3370.51 | bwd_allreduce: 1.30 | step: 8.46 20%|██ | 2034/10000 [3:13:19<12:13:50, 5.53s/it] {'loss': 0.0846, 'grad_norm': 0.43916621804237366, 'learning_rate': 3.692808075528822e-05, 'epoch': 2.03} 20%|██ | 2034/10000 [3:13:19<12:13:50, 5.53s/it][2025-06-19 16:43:03,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:43:03,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.80 | bwd_microstep: 3325.19 | bwd_inner_microstep: 3324.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 16:43:03,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.80 | bwd: 3325.20 | bwd_inner: 3324.38 | bwd_allreduce: 0.78 | step: 6.77 20%|██ | 2035/10000 [3:13:24<12:12:02, 5.51s/it] {'loss': 0.0659, 'grad_norm': 0.5474081039428711, 'learning_rate': 3.6924630325771126e-05, 'epoch': 2.04} 20%|██ | 2035/10000 [3:13:24<12:12:02, 5.51s/it][2025-06-19 16:43:09,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:43:09,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.02 | bwd_microstep: 3329.37 | bwd_inner_microstep: 3328.54 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.49 [2025-06-19 16:43:09,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.02 | bwd: 3329.39 | bwd_inner: 3328.54 | bwd_allreduce: 0.80 | step: 7.50 20%|██ | 2036/10000 [3:13:30<12:11:16, 5.51s/it] {'loss': 0.0373, 'grad_norm': 0.2202329933643341, 'learning_rate': 3.692117812093886e-05, 'epoch': 2.04} 20%|██ | 2036/10000 [3:13:30<12:11:16, 5.51s/it][2025-06-19 16:43:14,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 16:43:14,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.94 | bwd_microstep: 3335.00 | bwd_inner_microstep: 3334.06 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.04 [2025-06-19 16:43:14,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.94 | bwd: 3335.01 | bwd_inner: 3334.06 | bwd_allreduce: 0.90 | step: 7.04 20%|██ | 2037/10000 [3:13:35<12:10:55, 5.51s/it] {'loss': 0.1125, 'grad_norm': 
0.7822904586791992, 'learning_rate': 3.6917724141153534e-05, 'epoch': 2.04} 20%|██ | 2037/10000 [3:13:35<12:10:55, 5.51s/it][2025-06-19 16:43:20,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:43:20,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.44 | bwd_microstep: 3330.90 | bwd_inner_microstep: 3329.91 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.14 [2025-06-19 16:43:20,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.44 | bwd: 3330.92 | bwd_inner: 3329.91 | bwd_allreduce: 0.96 | step: 7.14 20%|██ | 2038/10000 [3:13:41<12:10:13, 5.50s/it] {'loss': 0.0772, 'grad_norm': 0.5054899454116821, 'learning_rate': 3.691426838677746e-05, 'epoch': 2.04} 20%|██ | 2038/10000 [3:13:41<12:10:13, 5.50s/it][2025-06-19 16:43:25,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:43:25,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.76 | bwd_microstep: 3339.38 | bwd_inner_microstep: 3338.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-19 16:43:25,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.75 | bwd: 3339.40 | bwd_inner: 3338.59 | bwd_allreduce: 0.77 | step: 7.15 20%|██ | 2039/10000 [3:13:46<12:09:51, 5.50s/it] {'loss': 0.0895, 'grad_norm': 0.6154398322105408, 'learning_rate': 3.691081085817315e-05, 'epoch': 2.04} 20%|██ | 2039/10000 [3:13:46<12:09:51, 5.50s/it][2025-06-19 16:43:31,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:43:31,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.38 | bwd_microstep: 3399.09 | bwd_inner_microstep: 3398.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.67 [2025-06-19 16:43:31,537] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.38 | bwd: 3399.10 | bwd_inner: 3398.27 | bwd_allreduce: 0.78 | step: 6.67 20%|██ | 2040/10000 [3:13:52<12:13:02, 5.53s/it] {'loss': 0.0827, 'grad_norm': 0.5837743282318115, 'learning_rate': 3.6907351555703254e-05, 'epoch': 2.04} 20%|██ | 2040/10000 [3:13:52<12:13:02, 5.53s/it][2025-06-19 16:43:37,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:43:37,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.91 | bwd_microstep: 3343.86 | bwd_inner_microstep: 3342.98 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.79 [2025-06-19 16:43:37,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.91 | bwd: 3343.88 | bwd_inner: 3342.98 | bwd_allreduce: 0.85 | step: 6.80 20%|██ | 2041/10000 [3:13:57<12:12:26, 5.52s/it] {'loss': 0.0451, 'grad_norm': 0.31248387694358826, 'learning_rate': 3.690389047973065e-05, 'epoch': 2.04} 20%|██ | 2041/10000 [3:13:57<12:12:26, 5.52s/it][2025-06-19 16:43:42,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:43:42,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.88 | bwd_microstep: 3339.22 | bwd_inner_microstep: 3338.24 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.00 [2025-06-19 16:43:42,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.88 | bwd: 3339.23 | bwd_inner: 3338.24 | bwd_allreduce: 0.95 | step: 7.00 20%|██ | 2042/10000 [3:14:03<12:11:16, 5.51s/it] {'loss': 0.0516, 'grad_norm': 0.6386900544166565, 'learning_rate': 3.69004276306184e-05, 'epoch': 2.04} 20%|██ | 2042/10000 [3:14:03<12:11:16, 5.51s/it][2025-06-19 16:43:48,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:43:48,029] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.96 | bwd_microstep: 3328.50 | bwd_inner_microstep: 3327.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 16:43:48,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.96 | bwd: 3328.52 | bwd_inner: 3327.71 | bwd_allreduce: 0.77 | step: 6.96 20%|██ | 2043/10000 [3:14:08<12:09:53, 5.50s/it] {'loss': 0.0627, 'grad_norm': 0.4966859519481659, 'learning_rate': 3.689696300872971e-05, 'epoch': 2.04} 20%|██ | 2043/10000 [3:14:08<12:09:53, 5.50s/it][2025-06-19 16:43:53,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:43:53,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.59 | bwd_microstep: 3330.29 | bwd_inner_microstep: 3329.16 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.42 [2025-06-19 16:43:53,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.59 | bwd: 3330.32 | bwd_inner: 3329.16 | bwd_allreduce: 1.10 | step: 7.42 20%|██ | 2044/10000 [3:14:14<12:09:04, 5.50s/it] {'loss': 0.0303, 'grad_norm': 0.23901009559631348, 'learning_rate': 3.689349661442804e-05, 'epoch': 2.04} 20%|██ | 2044/10000 [3:14:14<12:09:04, 5.50s/it][2025-06-19 16:43:59,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:43:59,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.92 | bwd_microstep: 3334.41 | bwd_inner_microstep: 3333.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-19 16:43:59,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.92 | bwd: 3334.43 | bwd_inner: 3333.60 | bwd_allreduce: 0.78 | step: 6.84 20%|██ | 2045/10000 [3:14:19<12:08:49, 5.50s/it] {'loss': 0.112, 'grad_norm': 0.8951483964920044, 'learning_rate': 3.689002844807697e-05, 'epoch': 2.04} 20%|██ | 2045/10000 
[3:14:19<12:08:49, 5.50s/it][2025-06-19 16:44:04,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:44:04,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.74 | bwd_microstep: 3389.87 | bwd_inner_microstep: 3388.79 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.36 [2025-06-19 16:44:04,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.74 | bwd: 3389.89 | bwd_inner: 3388.79 | bwd_allreduce: 1.05 | step: 7.37 20%|██ | 2046/10000 [3:14:25<12:11:52, 5.52s/it] {'loss': 0.0343, 'grad_norm': 0.3058314919471741, 'learning_rate': 3.6886558510040305e-05, 'epoch': 2.05} 20%|██ | 2046/10000 [3:14:25<12:11:52, 5.52s/it][2025-06-19 16:44:10,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:44:10,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.25 | bwd_microstep: 3324.94 | bwd_inner_microstep: 3324.16 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.77 [2025-06-19 16:44:10,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.25 | bwd: 3324.95 | bwd_inner: 3324.16 | bwd_allreduce: 0.75 | step: 6.77 20%|██ | 2047/10000 [3:14:30<12:09:57, 5.51s/it] {'loss': 0.1381, 'grad_norm': 1.7012590169906616, 'learning_rate': 3.6883086800682025e-05, 'epoch': 2.05} 20%|██ | 2047/10000 [3:14:30<12:09:57, 5.51s/it][2025-06-19 16:44:15,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:44:15,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.50 | bwd_microstep: 3382.36 | bwd_inner_microstep: 3381.40 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.10 [2025-06-19 16:44:15,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.50 | bwd: 3382.37 | bwd_inner: 
3381.40 | bwd_allreduce: 0.92 | step: 7.10 20%|██ | 2048/10000 [3:14:36<12:11:55, 5.52s/it] {'loss': 0.0822, 'grad_norm': 0.573839008808136, 'learning_rate': 3.68796133203663e-05, 'epoch': 2.05} 20%|██ | 2048/10000 [3:14:36<12:11:55, 5.52s/it][2025-06-19 16:44:21,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:44:21,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.94 | bwd_microstep: 3328.26 | bwd_inner_microstep: 3327.44 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.08 [2025-06-19 16:44:21,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.94 | bwd: 3328.27 | bwd_inner: 3327.44 | bwd_allreduce: 0.79 | step: 7.09 20%|██ | 2049/10000 [3:14:41<12:10:27, 5.51s/it] {'loss': 0.1102, 'grad_norm': 0.7511575818061829, 'learning_rate': 3.6876138069457476e-05, 'epoch': 2.05} 20%|██ | 2049/10000 [3:14:41<12:10:27, 5.51s/it][2025-06-19 16:44:26,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 16:44:26,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.74 | bwd_microstep: 3383.17 | bwd_inner_microstep: 3382.19 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.18 [2025-06-19 16:44:26,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.74 | bwd: 3383.18 | bwd_inner: 3382.19 | bwd_allreduce: 0.95 | step: 7.19 20%|██ | 2050/10000 [3:14:47<12:12:22, 5.53s/it] {'loss': 0.1998, 'grad_norm': 1.1493068933486938, 'learning_rate': 3.6872661048320096e-05, 'epoch': 2.05} 20%|██ | 2050/10000 [3:14:47<12:12:22, 5.53s/it][2025-06-19 16:44:32,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:44:32,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.48 | bwd_microstep: 
3379.73 | bwd_inner_microstep: 3378.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 16:44:32,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.48 | bwd: 3379.75 | bwd_inner: 3378.95 | bwd_allreduce: 0.76 | step: 6.65 21%|██ | 2051/10000 [3:14:53<12:13:22, 5.54s/it] {'loss': 0.0533, 'grad_norm': 0.3837531805038452, 'learning_rate': 3.686918225731888e-05, 'epoch': 2.05} 21%|██ | 2051/10000 [3:14:53<12:13:22, 5.54s/it][2025-06-19 16:44:37,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:44:37,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.81 | bwd_microstep: 3383.17 | bwd_inner_microstep: 3382.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 16:44:37,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.81 | bwd: 3383.18 | bwd_inner: 3382.37 | bwd_allreduce: 0.77 | step: 7.04 21%|██ | 2052/10000 [3:14:58<12:13:56, 5.54s/it] {'loss': 0.0638, 'grad_norm': 0.6790914535522461, 'learning_rate': 3.686570169681873e-05, 'epoch': 2.05} 21%|██ | 2052/10000 [3:14:58<12:13:56, 5.54s/it][2025-06-19 16:44:43,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:44:43,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.06 | bwd_microstep: 3333.99 | bwd_inner_microstep: 3333.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 16:44:43,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.06 | bwd: 3334.01 | bwd_inner: 3333.20 | bwd_allreduce: 0.76 | step: 6.81 21%|██ | 2053/10000 [3:15:04<12:11:15, 5.52s/it] {'loss': 0.0648, 'grad_norm': 0.8745989799499512, 'learning_rate': 3.686221936718476e-05, 'epoch': 2.05} 21%|██ | 2053/10000 [3:15:04<12:11:15, 5.52s/it][2025-06-19 16:44:48,838] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:44:48,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.13 | bwd_microstep: 3410.84 | bwd_inner_microstep: 3409.93 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.14 [2025-06-19 16:44:48,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.13 | bwd: 3410.86 | bwd_inner: 3409.93 | bwd_allreduce: 0.88 | step: 7.15 21%|██ | 2054/10000 [3:15:09<12:13:48, 5.54s/it] {'loss': 0.0813, 'grad_norm': 0.5793017148971558, 'learning_rate': 3.685873526878223e-05, 'epoch': 2.05} 21%|██ | 2054/10000 [3:15:09<12:13:48, 5.54s/it][2025-06-19 16:44:54,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:44:54,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.89 | bwd_microstep: 3323.36 | bwd_inner_microstep: 3322.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.04 [2025-06-19 16:44:54,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.89 | bwd: 3323.38 | bwd_inner: 3322.57 | bwd_allreduce: 0.76 | step: 7.04 21%|██ | 2055/10000 [3:15:15<12:11:07, 5.52s/it] {'loss': 0.1284, 'grad_norm': 0.8010075688362122, 'learning_rate': 3.6855249401976615e-05, 'epoch': 2.06} 21%|██ | 2055/10000 [3:15:15<12:11:07, 5.52s/it][2025-06-19 16:44:59,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 16:44:59,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.96 | bwd_microstep: 3323.86 | bwd_inner_microstep: 3323.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 16:44:59,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.96 | bwd: 3323.88 | bwd_inner: 3323.08 | bwd_allreduce: 0.75 | step: 6.66 21%|██ | 2056/10000 [3:15:20<12:09:14, 5.51s/it] 
{'loss': 0.0446, 'grad_norm': 0.33537065982818604, 'learning_rate': 3.685176176713357e-05, 'epoch': 2.06} 21%|██ | 2056/10000 [3:15:20<12:09:14, 5.51s/it][2025-06-19 16:45:05,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:45:05,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.54 | bwd_microstep: 3329.19 | bwd_inner_microstep: 3328.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 16:45:05,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.54 | bwd: 3329.20 | bwd_inner: 3328.39 | bwd_allreduce: 0.77 | step: 6.71 21%|██ | 2057/10000 [3:15:26<12:07:44, 5.50s/it] {'loss': 0.0323, 'grad_norm': 0.28700533509254456, 'learning_rate': 3.684827236461892e-05, 'epoch': 2.06} 21%|██ | 2057/10000 [3:15:26<12:07:44, 5.50s/it][2025-06-19 16:45:10,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:45:10,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.45 | bwd_microstep: 3370.62 | bwd_inner_microstep: 3369.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 16:45:10,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.45 | bwd: 3370.64 | bwd_inner: 3369.82 | bwd_allreduce: 0.77 | step: 6.90 21%|██ | 2058/10000 [3:15:31<12:09:17, 5.51s/it] {'loss': 0.0551, 'grad_norm': 0.3859666883945465, 'learning_rate': 3.68447811947987e-05, 'epoch': 2.06} 21%|██ | 2058/10000 [3:15:31<12:09:17, 5.51s/it][2025-06-19 16:45:16,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:45:16,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.78 | bwd_microstep: 3315.63 | bwd_inner_microstep: 3314.82 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 
[2025-06-19 16:45:16,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.78 | bwd: 3315.64 | bwd_inner: 3314.82 | bwd_allreduce: 0.78 | step: 6.98 21%|██ | 2059/10000 [3:15:37<12:07:14, 5.49s/it] {'loss': 0.0637, 'grad_norm': 0.4772355258464813, 'learning_rate': 3.684128825803911e-05, 'epoch': 2.06} 21%|██ | 2059/10000 [3:15:37<12:07:14, 5.49s/it][2025-06-19 16:45:21,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:45:21,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.26 | bwd_microstep: 3328.33 | bwd_inner_microstep: 3327.40 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.43 [2025-06-19 16:45:21,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.26 | bwd: 3328.35 | bwd_inner: 3327.40 | bwd_allreduce: 0.90 | step: 7.43 21%|██ | 2060/10000 [3:15:42<12:06:40, 5.49s/it] {'loss': 0.0972, 'grad_norm': 0.9893127083778381, 'learning_rate': 3.6837793554706545e-05, 'epoch': 2.06} 21%|██ | 2060/10000 [3:15:42<12:06:40, 5.49s/it][2025-06-19 16:45:27,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:45:27,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.09 | bwd_microstep: 3383.21 | bwd_inner_microstep: 3382.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-19 16:45:27,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.09 | bwd: 3383.22 | bwd_inner: 3382.42 | bwd_allreduce: 0.76 | step: 6.85 21%|██ | 2061/10000 [3:15:48<12:09:13, 5.51s/it] {'loss': 0.0423, 'grad_norm': 0.460936039686203, 'learning_rate': 3.6834297085167576e-05, 'epoch': 2.06} 21%|██ | 2061/10000 [3:15:48<12:09:13, 5.51s/it][2025-06-19 16:45:32,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 
[2025-06-19 16:45:32,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.87 | bwd_microstep: 3329.16 | bwd_inner_microstep: 3328.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 16:45:32,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.87 | bwd: 3329.18 | bwd_inner: 3328.37 | bwd_allreduce: 0.77 | step: 6.88 21%|██ | 2062/10000 [3:15:53<12:08:00, 5.50s/it] {'loss': 0.0717, 'grad_norm': 0.8071036338806152, 'learning_rate': 3.6830798849788983e-05, 'epoch': 2.06} 21%|██ | 2062/10000 [3:15:53<12:08:00, 5.50s/it][2025-06-19 16:45:38,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:45:38,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.76 | bwd_microstep: 3336.70 | bwd_inner_microstep: 3335.49 | bwd_allreduce_microstep: 1.15 | step_microstep: 7.63 [2025-06-19 16:45:38,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.76 | bwd: 3336.72 | bwd_inner: 3335.49 | bwd_allreduce: 1.17 | step: 7.64 21%|██ | 2063/10000 [3:15:59<12:07:34, 5.50s/it] {'loss': 0.073, 'grad_norm': 1.1076931953430176, 'learning_rate': 3.68272988489377e-05, 'epoch': 2.06} 21%|██ | 2063/10000 [3:15:59<12:07:34, 5.50s/it][2025-06-19 16:45:43,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:45:43,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.83 | bwd_microstep: 3396.83 | bwd_inner_microstep: 3396.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 16:45:43,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.84 | bwd: 3396.85 | bwd_inner: 3396.04 | bwd_allreduce: 0.76 | step: 6.63 21%|██ | 2064/10000 [3:16:04<12:10:33, 5.52s/it] {'loss': 0.1735, 'grad_norm': 0.9831295013427734, 'learning_rate': 3.6823797082980867e-05, 'epoch': 
2.06} 21%|██ | 2064/10000 [3:16:04<12:10:33, 5.52s/it][2025-06-19 16:45:49,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:45:49,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.77 | bwd_microstep: 3327.68 | bwd_inner_microstep: 3326.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 16:45:49,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.77 | bwd: 3327.69 | bwd_inner: 3326.87 | bwd_allreduce: 0.78 | step: 6.98 21%|██ | 2065/10000 [3:16:10<12:08:19, 5.51s/it] {'loss': 0.0703, 'grad_norm': 1.352684497833252, 'learning_rate': 3.68202935522858e-05, 'epoch': 2.06} 21%|██ | 2065/10000 [3:16:10<12:08:19, 5.51s/it][2025-06-19 16:45:54,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:45:54,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.16 | bwd_microstep: 3319.46 | bwd_inner_microstep: 3318.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 16:45:54,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.16 | bwd: 3319.48 | bwd_inner: 3318.67 | bwd_allreduce: 0.77 | step: 7.06 21%|██ | 2066/10000 [3:16:15<12:07:01, 5.50s/it] {'loss': 0.1135, 'grad_norm': 1.0530171394348145, 'learning_rate': 3.681678825722001e-05, 'epoch': 2.07} 21%|██ | 2066/10000 [3:16:15<12:07:01, 5.50s/it][2025-06-19 16:46:00,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:46:00,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.40 | bwd_microstep: 3322.64 | bwd_inner_microstep: 3321.79 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.13 [2025-06-19 16:46:00,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.40 | bwd: 
3322.65 | bwd_inner: 3321.79 | bwd_allreduce: 0.81 | step: 7.14 21%|██ | 2067/10000 [3:16:21<12:06:08, 5.49s/it] {'loss': 0.1392, 'grad_norm': 1.0995042324066162, 'learning_rate': 3.681328119815117e-05, 'epoch': 2.07} 21%|██ | 2067/10000 [3:16:21<12:06:08, 5.49s/it][2025-06-19 16:46:05,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:46:05,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.68 | bwd_microstep: 3317.75 | bwd_inner_microstep: 3316.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 16:46:05,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.68 | bwd: 3317.76 | bwd_inner: 3316.96 | bwd_allreduce: 0.76 | step: 6.85 21%|██ | 2068/10000 [3:16:26<12:04:51, 5.48s/it] {'loss': 0.0604, 'grad_norm': 0.6509122848510742, 'learning_rate': 3.680977237544718e-05, 'epoch': 2.07} 21%|██ | 2068/10000 [3:16:26<12:04:51, 5.48s/it][2025-06-19 16:46:11,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:46:11,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.75 | bwd_microstep: 3324.23 | bwd_inner_microstep: 3323.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-19 16:46:11,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.75 | bwd: 3324.24 | bwd_inner: 3323.43 | bwd_allreduce: 0.77 | step: 6.98 21%|██ | 2069/10000 [3:16:32<12:04:03, 5.48s/it] {'loss': 0.0494, 'grad_norm': 0.36890339851379395, 'learning_rate': 3.6806261789476076e-05, 'epoch': 2.07} 21%|██ | 2069/10000 [3:16:32<12:04:03, 5.48s/it][2025-06-19 16:46:16,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:46:16,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2113.10 | bwd_microstep: 3322.95 | bwd_inner_microstep: 3322.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 16:46:16,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.10 | bwd: 3322.96 | bwd_inner: 3322.17 | bwd_allreduce: 0.75 | step: 6.61 21%|██ | 2070/10000 [3:16:37<12:03:45, 5.48s/it] {'loss': 0.0578, 'grad_norm': 0.789481520652771, 'learning_rate': 3.680274944060611e-05, 'epoch': 2.07} 21%|██ | 2070/10000 [3:16:37<12:03:45, 5.48s/it][2025-06-19 16:46:22,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:46:22,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.16 | bwd_microstep: 3322.51 | bwd_inner_microstep: 3321.53 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.48 [2025-06-19 16:46:22,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.16 | bwd: 3322.52 | bwd_inner: 3321.53 | bwd_allreduce: 0.95 | step: 7.48 21%|██ | 2071/10000 [3:16:42<12:03:13, 5.47s/it] {'loss': 0.0607, 'grad_norm': 0.7007958889007568, 'learning_rate': 3.679923532920571e-05, 'epoch': 2.07} 21%|██ | 2071/10000 [3:16:42<12:03:13, 5.47s/it][2025-06-19 16:46:27,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:46:27,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.60 | bwd_microstep: 3309.82 | bwd_inner_microstep: 3309.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 16:46:27,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.60 | bwd: 3309.84 | bwd_inner: 3309.04 | bwd_allreduce: 0.76 | step: 6.66 21%|██ | 2072/10000 [3:16:48<12:02:15, 5.47s/it] {'loss': 0.0406, 'grad_norm': 0.42578110098838806, 'learning_rate': 3.6795719455643494e-05, 'epoch': 2.07} 21%|██ | 2072/10000 [3:16:48<12:02:15, 5.47s/it][2025-06-19 16:46:33,149] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:46:33,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.55 | bwd_microstep: 3377.53 | bwd_inner_microstep: 3376.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 16:46:33,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.55 | bwd: 3377.55 | bwd_inner: 3376.73 | bwd_allreduce: 0.78 | step: 7.18 21%|██ | 2073/10000 [3:16:53<12:05:37, 5.49s/it] {'loss': 0.1162, 'grad_norm': 0.886077344417572, 'learning_rate': 3.679220182028826e-05, 'epoch': 2.07} 21%|██ | 2073/10000 [3:16:53<12:05:37, 5.49s/it][2025-06-19 16:46:38,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:46:38,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.87 | bwd_microstep: 3318.37 | bwd_inner_microstep: 3317.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 16:46:38,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.87 | bwd: 3318.39 | bwd_inner: 3317.58 | bwd_allreduce: 0.76 | step: 6.67 21%|██ | 2074/10000 [3:16:59<12:04:11, 5.48s/it] {'loss': 0.1348, 'grad_norm': 0.7440393567085266, 'learning_rate': 3.678868242350899e-05, 'epoch': 2.07} 21%|██ | 2074/10000 [3:16:59<12:04:11, 5.48s/it][2025-06-19 16:46:44,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:46:44,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.94 | bwd_microstep: 3369.57 | bwd_inner_microstep: 3368.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 16:46:44,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.94 | bwd: 3369.58 | bwd_inner: 3368.78 | bwd_allreduce: 0.76 | step: 6.84 21%|██ | 2075/10000 
[3:17:04<12:06:15, 5.50s/it] {'loss': 0.0426, 'grad_norm': 0.385429322719574, 'learning_rate': 3.6785161265674847e-05, 'epoch': 2.08} 21%|██ | 2075/10000 [3:17:04<12:06:15, 5.50s/it][2025-06-19 16:46:49,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:46:49,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.53 | bwd_microstep: 3309.18 | bwd_inner_microstep: 3308.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 16:46:49,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.53 | bwd: 3309.20 | bwd_inner: 3308.38 | bwd_allreduce: 0.77 | step: 7.18 21%|██ | 2076/10000 [3:17:10<12:04:11, 5.48s/it] {'loss': 0.0778, 'grad_norm': 0.719183087348938, 'learning_rate': 3.67816383471552e-05, 'epoch': 2.08} 21%|██ | 2076/10000 [3:17:10<12:04:11, 5.48s/it][2025-06-19 16:46:55,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:46:55,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.57 | bwd_microstep: 3315.01 | bwd_inner_microstep: 3314.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 16:46:55,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.57 | bwd: 3315.02 | bwd_inner: 3314.23 | bwd_allreduce: 0.75 | step: 6.56 21%|██ | 2077/10000 [3:17:15<12:03:11, 5.48s/it] {'loss': 0.0508, 'grad_norm': 0.49567294120788574, 'learning_rate': 3.677811366831956e-05, 'epoch': 2.08} 21%|██ | 2077/10000 [3:17:15<12:03:11, 5.48s/it][2025-06-19 16:47:00,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:47:00,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.64 | bwd_microstep: 3317.29 | bwd_inner_microstep: 3316.47 | bwd_allreduce_microstep: 
0.76 | step_microstep: 7.17 [2025-06-19 16:47:00,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.64 | bwd: 3317.30 | bwd_inner: 3316.48 | bwd_allreduce: 0.78 | step: 7.17 21%|██ | 2078/10000 [3:17:21<12:02:27, 5.47s/it] {'loss': 0.0888, 'grad_norm': 0.767076313495636, 'learning_rate': 3.6774587229537685e-05, 'epoch': 2.08} 21%|██ | 2078/10000 [3:17:21<12:02:27, 5.47s/it][2025-06-19 16:47:06,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:47:06,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.99 | bwd_microstep: 3367.94 | bwd_inner_microstep: 3367.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 16:47:06,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.99 | bwd: 3367.96 | bwd_inner: 3367.15 | bwd_allreduce: 0.76 | step: 6.77 21%|██ | 2079/10000 [3:17:26<12:04:53, 5.49s/it] {'loss': 0.107, 'grad_norm': 1.2325022220611572, 'learning_rate': 3.677105903117945e-05, 'epoch': 2.08} 21%|██ | 2079/10000 [3:17:26<12:04:53, 5.49s/it][2025-06-19 16:47:11,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:47:11,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.61 | bwd_microstep: 3321.21 | bwd_inner_microstep: 3320.37 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.16 [2025-06-19 16:47:11,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.61 | bwd: 3321.22 | bwd_inner: 3320.37 | bwd_allreduce: 0.80 | step: 7.16 21%|██ | 2080/10000 [3:17:32<12:03:54, 5.48s/it] {'loss': 0.1707, 'grad_norm': 1.4607539176940918, 'learning_rate': 3.676752907361497e-05, 'epoch': 2.08} 21%|██ | 2080/10000 [3:17:32<12:03:54, 5.48s/it][h264 @ 0x2da2be80] Reference 5 >= 5 [h264 @ 0x2da2be80] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x2da12680] left 
block unavailable for requested intra mode [h264 @ 0x2da12680] error while decoding MB 0 25, bytestream 45493 [h264 @ 0x2f07d900] Reference 5 >= 5 [h264 @ 0x2f07d900] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x2f07d900] left block unavailable for requested intra mode [h264 @ 0x2f07d900] error while decoding MB 0 25, bytestream 45493 [2025-06-19 16:47:16,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.78 [2025-06-19 16:47:16,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.10 | bwd_microstep: 3309.97 | bwd_inner_microstep: 3309.02 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.78 [2025-06-19 16:47:16,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.10 | bwd: 3309.99 | bwd_inner: 3309.02 | bwd_allreduce: 0.91 | step: 7.78 21%|██ | 2081/10000 [3:17:37<12:02:49, 5.48s/it] {'loss': 0.11, 'grad_norm': 0.9945054650306702, 'learning_rate': 3.67639973572145e-05, 'epoch': 2.08} 21%|██ | 2081/10000 [3:17:37<12:02:49, 5.48s/it][2025-06-19 16:47:22,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:47:22,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.71 | bwd_microstep: 3322.54 | bwd_inner_microstep: 3321.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 16:47:22,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.71 | bwd: 3322.55 | bwd_inner: 3321.75 | bwd_allreduce: 0.76 | step: 6.55 21%|██ | 2082/10000 [3:17:43<12:02:38, 5.48s/it] {'loss': 0.1272, 'grad_norm': 1.0653835535049438, 'learning_rate': 3.6760463882348516e-05, 'epoch': 2.08} 21%|██ | 2082/10000 [3:17:43<12:02:38, 5.48s/it][2025-06-19 16:47:27,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
16:47:27,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.48 | bwd_microstep: 3313.10 | bwd_inner_microstep: 3312.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 16:47:27,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.48 | bwd: 3313.11 | bwd_inner: 3312.30 | bwd_allreduce: 0.77 | step: 6.86 21%|██ | 2083/10000 [3:17:48<12:01:37, 5.47s/it] {'loss': 0.1328, 'grad_norm': 1.3071014881134033, 'learning_rate': 3.675692864938766e-05, 'epoch': 2.08} 21%|██ | 2083/10000 [3:17:48<12:01:37, 5.47s/it][2025-06-19 16:47:33,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:47:33,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.00 | bwd_microstep: 3373.62 | bwd_inner_microstep: 3372.58 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.38 [2025-06-19 16:47:33,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.00 | bwd: 3373.64 | bwd_inner: 3372.58 | bwd_allreduce: 1.01 | step: 7.38 21%|██ | 2084/10000 [3:17:54<12:04:29, 5.49s/it] {'loss': 0.0974, 'grad_norm': 1.1167097091674805, 'learning_rate': 3.6753391658702756e-05, 'epoch': 2.08} 21%|██ | 2084/10000 [3:17:54<12:04:29, 5.49s/it][2025-06-19 16:47:38,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:47:38,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.86 | bwd_microstep: 3330.96 | bwd_inner_microstep: 3329.99 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.45 [2025-06-19 16:47:38,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.87 | bwd: 3330.98 | bwd_inner: 3329.99 | bwd_allreduce: 0.94 | step: 7.45 21%|██ | 2085/10000 [3:17:59<12:03:53, 5.49s/it] {'loss': 0.1266, 'grad_norm': 1.0238450765609741, 'learning_rate': 3.674985291066482e-05, 'epoch': 2.08} 
21%|██ | 2085/10000 [3:17:59<12:03:53, 5.49s/it][2025-06-19 16:47:44,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:47:44,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.91 | bwd_microstep: 3318.23 | bwd_inner_microstep: 3317.42 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.27 [2025-06-19 16:47:44,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.91 | bwd: 3318.25 | bwd_inner: 3317.42 | bwd_allreduce: 0.78 | step: 7.27 21%|██ | 2086/10000 [3:18:05<12:03:37, 5.49s/it] {'loss': 0.0597, 'grad_norm': 0.5909729599952698, 'learning_rate': 3.6746312405645065e-05, 'epoch': 2.09} 21%|██ | 2086/10000 [3:18:05<12:03:37, 5.49s/it][2025-06-19 16:47:49,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:47:49,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.22 | bwd_microstep: 3366.07 | bwd_inner_microstep: 3365.01 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.16 [2025-06-19 16:47:49,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.22 | bwd: 3366.09 | bwd_inner: 3365.01 | bwd_allreduce: 1.03 | step: 7.17 21%|██ | 2087/10000 [3:18:10<12:05:58, 5.50s/it] {'loss': 0.0659, 'grad_norm': 0.6047276258468628, 'learning_rate': 3.674277014401485e-05, 'epoch': 2.09} 21%|██ | 2087/10000 [3:18:10<12:05:58, 5.50s/it][2025-06-19 16:47:55,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:47:55,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.03 | bwd_microstep: 3366.83 | bwd_inner_microstep: 3366.05 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 16:47:55,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.03 | bwd: 
3366.85 | bwd_inner: 3366.05 | bwd_allreduce: 0.76 | step: 6.63 21%|██ | 2088/10000 [3:18:16<12:07:24, 5.52s/it] {'loss': 0.0618, 'grad_norm': 0.48903653025627136, 'learning_rate': 3.673922612614575e-05, 'epoch': 2.09} 21%|██ | 2088/10000 [3:18:16<12:07:24, 5.52s/it][2025-06-19 16:48:01,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:48:01,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.01 | bwd_microstep: 3391.32 | bwd_inner_microstep: 3390.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 16:48:01,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.01 | bwd: 3391.33 | bwd_inner: 3390.53 | bwd_allreduce: 0.76 | step: 6.61 21%|██ | 2089/10000 [3:18:21<12:09:01, 5.53s/it] {'loss': 0.1218, 'grad_norm': 0.6025627851486206, 'learning_rate': 3.673568035240952e-05, 'epoch': 2.09} 21%|██ | 2089/10000 [3:18:21<12:09:01, 5.53s/it][2025-06-19 16:48:06,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:48:06,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.05 | bwd_microstep: 3364.84 | bwd_inner_microstep: 3364.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.17 [2025-06-19 16:48:06,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.05 | bwd: 3364.85 | bwd_inner: 3364.04 | bwd_allreduce: 0.77 | step: 7.17 21%|██ | 2090/10000 [3:18:27<12:09:10, 5.53s/it] {'loss': 0.0798, 'grad_norm': 0.6175546646118164, 'learning_rate': 3.67321328231781e-05, 'epoch': 2.09} 21%|██ | 2090/10000 [3:18:27<12:09:10, 5.53s/it][2025-06-19 16:48:12,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:48:12,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.95 
| bwd_microstep: 3325.35 | bwd_inner_microstep: 3324.55 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.62 [2025-06-19 16:48:12,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.95 | bwd: 3325.36 | bwd_inner: 3324.55 | bwd_allreduce: 0.77 | step: 6.62 21%|██ | 2091/10000 [3:18:32<12:06:49, 5.51s/it] {'loss': 0.1772, 'grad_norm': 1.2148101329803467, 'learning_rate': 3.67285835388236e-05, 'epoch': 2.09} 21%|██ | 2091/10000 [3:18:32<12:06:49, 5.51s/it][2025-06-19 16:48:17,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:48:17,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.49 | bwd_microstep: 3313.41 | bwd_inner_microstep: 3312.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 16:48:17,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.49 | bwd: 3313.42 | bwd_inner: 3312.62 | bwd_allreduce: 0.76 | step: 6.67 21%|██ | 2092/10000 [3:18:38<12:04:28, 5.50s/it] {'loss': 0.0821, 'grad_norm': 0.6808964610099792, 'learning_rate': 3.672503249971833e-05, 'epoch': 2.09} 21%|██ | 2092/10000 [3:18:38<12:04:28, 5.50s/it][2025-06-19 16:48:22,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:48:22,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.73 | bwd_microstep: 3309.79 | bwd_inner_microstep: 3309.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 16:48:22,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.73 | bwd: 3309.80 | bwd_inner: 3309.00 | bwd_allreduce: 0.76 | step: 6.60 21%|██ | 2093/10000 [3:18:43<12:02:49, 5.49s/it] {'loss': 0.1688, 'grad_norm': 0.8716436624526978, 'learning_rate': 3.672147970623477e-05, 'epoch': 2.09} 21%|██ | 2093/10000 [3:18:43<12:02:49, 5.49s/it][2025-06-19 16:48:28,447] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:48:28,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.15 | bwd_microstep: 3319.58 | bwd_inner_microstep: 3318.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 16:48:28,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.15 | bwd: 3319.59 | bwd_inner: 3318.77 | bwd_allreduce: 0.78 | step: 7.22 21%|██ | 2094/10000 [3:18:49<12:01:55, 5.48s/it] {'loss': 0.0886, 'grad_norm': 0.8515020608901978, 'learning_rate': 3.6717925158745594e-05, 'epoch': 2.09} 21%|██ | 2094/10000 [3:18:49<12:01:55, 5.48s/it][2025-06-19 16:48:34,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:48:34,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.31 | bwd_microstep: 3385.83 | bwd_inner_microstep: 3385.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 16:48:34,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.31 | bwd: 3385.85 | bwd_inner: 3385.04 | bwd_allreduce: 0.76 | step: 6.68 21%|██ | 2095/10000 [3:18:54<12:05:09, 5.50s/it] {'loss': 0.0751, 'grad_norm': 0.5597369074821472, 'learning_rate': 3.671436885762366e-05, 'epoch': 2.1} 21%|██ | 2095/10000 [3:18:54<12:05:09, 5.50s/it][2025-06-19 16:48:39,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 16:48:39,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.12 | bwd_microstep: 3405.97 | bwd_inner_microstep: 3405.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 16:48:39,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.12 | bwd: 3405.99 | bwd_inner: 3405.16 | bwd_allreduce: 0.78 | step: 7.20 21%|██ | 2096/10000 
[3:19:00<12:08:00, 5.53s/it] {'loss': 0.0983, 'grad_norm': 0.8601267337799072, 'learning_rate': 3.6710810803242e-05, 'epoch': 2.1} 21%|██ | 2096/10000 [3:19:00<12:08:00, 5.53s/it][2025-06-19 16:48:45,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:48:45,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.12 | bwd_microstep: 3315.74 | bwd_inner_microstep: 3314.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 16:48:45,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.12 | bwd: 3315.76 | bwd_inner: 3314.95 | bwd_allreduce: 0.77 | step: 6.72 21%|██ | 2097/10000 [3:19:05<12:05:00, 5.50s/it] {'loss': 0.0571, 'grad_norm': 0.42982110381126404, 'learning_rate': 3.6707250995973855e-05, 'epoch': 2.1} 21%|██ | 2097/10000 [3:19:05<12:05:00, 5.50s/it][2025-06-19 16:48:50,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:48:50,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.12 | bwd_microstep: 3369.81 | bwd_inner_microstep: 3368.82 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.41 [2025-06-19 16:48:50,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.12 | bwd: 3369.83 | bwd_inner: 3368.82 | bwd_allreduce: 0.95 | step: 7.41 21%|██ | 2098/10000 [3:19:11<12:06:05, 5.51s/it] {'loss': 0.1702, 'grad_norm': 0.8060614466667175, 'learning_rate': 3.670368943619262e-05, 'epoch': 2.1} 21%|██ | 2098/10000 [3:19:11<12:06:05, 5.51s/it][2025-06-19 16:48:56,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:48:56,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.75 | bwd_microstep: 3311.98 | bwd_inner_microstep: 3311.19 | bwd_allreduce_microstep: 
0.75 | step_microstep: 6.59 [2025-06-19 16:48:56,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.75 | bwd: 3311.99 | bwd_inner: 3311.19 | bwd_allreduce: 0.76 | step: 6.59 21%|██ | 2099/10000 [3:19:16<12:03:44, 5.50s/it] {'loss': 0.0887, 'grad_norm': 1.1720185279846191, 'learning_rate': 3.670012612427188e-05, 'epoch': 2.1} 21%|██ | 2099/10000 [3:19:16<12:03:44, 5.50s/it][2025-06-19 16:49:01,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:49:01,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.97 | bwd_microstep: 3315.43 | bwd_inner_microstep: 3314.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 16:49:01,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.97 | bwd: 3315.44 | bwd_inner: 3314.63 | bwd_allreduce: 0.77 | step: 7.06 21%|██ | 2100/10000 [3:19:22<12:02:13, 5.49s/it] {'loss': 0.0668, 'grad_norm': 0.42144984006881714, 'learning_rate': 3.6696561060585424e-05, 'epoch': 2.1} 21%|██ | 2100/10000 [3:19:22<12:02:13, 5.49s/it][2025-06-19 16:49:06,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:49:06,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.19 | bwd_microstep: 3307.49 | bwd_inner_microstep: 3306.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 16:49:06,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.19 | bwd: 3307.50 | bwd_inner: 3306.71 | bwd_allreduce: 0.75 | step: 6.72 21%|██ | 2101/10000 [3:19:27<12:00:28, 5.47s/it] {'loss': 0.0782, 'grad_norm': 0.5030491352081299, 'learning_rate': 3.669299424550721e-05, 'epoch': 2.1} 21%|██ | 2101/10000 [3:19:27<12:00:28, 5.47s/it][2025-06-19 16:49:12,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.50 | optimizer_step: 2.73 [2025-06-19 16:49:12,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.77 | bwd_microstep: 3314.43 | bwd_inner_microstep: 3313.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 16:49:12,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.77 | bwd: 3314.45 | bwd_inner: 3313.65 | bwd_allreduce: 0.76 | step: 6.61 21%|██ | 2102/10000 [3:19:33<11:59:24, 5.47s/it] {'loss': 0.0785, 'grad_norm': 0.5106540322303772, 'learning_rate': 3.6689425679411364e-05, 'epoch': 2.1} 21%|██ | 2102/10000 [3:19:33<11:59:24, 5.47s/it][2025-06-19 16:49:17,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:49:17,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.97 | bwd_microstep: 3361.12 | bwd_inner_microstep: 3360.30 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.80 [2025-06-19 16:49:17,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.97 | bwd: 3361.14 | bwd_inner: 3360.30 | bwd_allreduce: 0.79 | step: 6.81 21%|██ | 2103/10000 [3:19:38<12:01:33, 5.48s/it] {'loss': 0.1147, 'grad_norm': 1.0875972509384155, 'learning_rate': 3.6685855362672224e-05, 'epoch': 2.1} 21%|██ | 2103/10000 [3:19:38<12:01:33, 5.48s/it][2025-06-19 16:49:23,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:49:23,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.75 | bwd_microstep: 3313.80 | bwd_inner_microstep: 3313.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 16:49:23,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.75 | bwd: 3313.81 | bwd_inner: 3313.01 | bwd_allreduce: 0.76 | step: 6.95 21%|██ | 2104/10000 [3:19:44<12:00:22, 5.47s/it] {'loss': 0.0288, 'grad_norm': 0.30772385001182556, 'learning_rate': 
3.66822832956643e-05, 'epoch': 2.1} 21%|██ | 2104/10000 [3:19:44<12:00:22, 5.47s/it][2025-06-19 16:49:28,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:49:28,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.77 | bwd_microstep: 3309.32 | bwd_inner_microstep: 3308.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 16:49:28,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.77 | bwd: 3309.34 | bwd_inner: 3308.53 | bwd_allreduce: 0.77 | step: 6.76 21%|██ | 2105/10000 [3:19:49<11:59:12, 5.47s/it] {'loss': 0.1263, 'grad_norm': 0.9812391996383667, 'learning_rate': 3.6678709478762276e-05, 'epoch': 2.1} 21%|██ | 2105/10000 [3:19:49<11:59:12, 5.47s/it][2025-06-19 16:49:34,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:49:34,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.45 | bwd_microstep: 3314.94 | bwd_inner_microstep: 3314.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 16:49:34,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.46 | bwd: 3314.95 | bwd_inner: 3314.16 | bwd_allreduce: 0.75 | step: 6.64 21%|██ | 2106/10000 [3:19:55<11:58:37, 5.46s/it] {'loss': 0.0776, 'grad_norm': 0.4827485680580139, 'learning_rate': 3.6675133912341045e-05, 'epoch': 2.11} 21%|██ | 2106/10000 [3:19:55<11:58:37, 5.46s/it][2025-06-19 16:49:39,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:49:39,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.43 | bwd_microstep: 3360.02 | bwd_inner_microstep: 3359.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 16:49:39,781] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2122.43 | bwd: 3360.03 | bwd_inner: 3359.23 | bwd_allreduce: 0.76 | step: 6.62 21%|██ | 2107/10000 [3:20:00<12:00:48, 5.48s/it] {'loss': 0.0702, 'grad_norm': 0.64213627576828, 'learning_rate': 3.6671556596775656e-05, 'epoch': 2.11} 21%|██ | 2107/10000 [3:20:00<12:00:48, 5.48s/it][2025-06-19 16:49:45,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:49:45,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.36 | bwd_microstep: 3313.32 | bwd_inner_microstep: 3312.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 16:49:45,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.36 | bwd: 3313.33 | bwd_inner: 3312.52 | bwd_allreduce: 0.76 | step: 6.70 21%|██ | 2108/10000 [3:20:06<11:59:47, 5.47s/it] {'loss': 0.0765, 'grad_norm': 0.5167406797409058, 'learning_rate': 3.666797753244135e-05, 'epoch': 2.11} 21%|██ | 2108/10000 [3:20:06<11:59:47, 5.47s/it][2025-06-19 16:49:50,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:49:50,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.35 | bwd_microstep: 3318.62 | bwd_inner_microstep: 3317.80 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.98 [2025-06-19 16:49:50,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.35 | bwd: 3318.63 | bwd_inner: 3317.80 | bwd_allreduce: 0.78 | step: 6.98 21%|██ | 2109/10000 [3:20:11<11:59:07, 5.47s/it] {'loss': 0.0522, 'grad_norm': 0.32280614972114563, 'learning_rate': 3.6664396719713565e-05, 'epoch': 2.11} 21%|██ | 2109/10000 [3:20:11<11:59:07, 5.47s/it][2025-06-19 16:49:56,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:49:56,163] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2106.22 | bwd_microstep: 3322.93 | bwd_inner_microstep: 3322.12 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.17 [2025-06-19 16:49:56,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.22 | bwd: 3322.95 | bwd_inner: 3322.12 | bwd_allreduce: 0.78 | step: 7.17 21%|██ | 2110/10000 [3:20:16<11:59:04, 5.47s/it] {'loss': 0.0977, 'grad_norm': 1.3105462789535522, 'learning_rate': 3.66608141589679e-05, 'epoch': 2.11} 21%|██ | 2110/10000 [3:20:16<11:59:04, 5.47s/it][2025-06-19 16:50:01,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:50:01,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.12 | bwd_microstep: 3379.96 | bwd_inner_microstep: 3379.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 16:50:01,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.12 | bwd: 3379.98 | bwd_inner: 3379.17 | bwd_allreduce: 0.76 | step: 6.69 21%|██ | 2111/10000 [3:20:22<12:01:57, 5.49s/it] {'loss': 0.0691, 'grad_norm': 0.6151405572891235, 'learning_rate': 3.665722985058016e-05, 'epoch': 2.11} 21%|██ | 2111/10000 [3:20:22<12:01:57, 5.49s/it][2025-06-19 16:50:07,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:50:07,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.90 | bwd_microstep: 3316.37 | bwd_inner_microstep: 3315.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-19 16:50:07,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.90 | bwd: 3316.38 | bwd_inner: 3315.57 | bwd_allreduce: 0.77 | step: 6.78 21%|██ | 2112/10000 [3:20:27<12:00:17, 5.48s/it] {'loss': 0.1038, 'grad_norm': 0.6066915988922119, 'learning_rate': 3.665364379492632e-05, 'epoch': 2.11} 21%|██ | 2112/10000 [3:20:27<12:00:17, 5.48s/it][2025-06-19 
16:50:12,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:50:12,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.24 | bwd_microstep: 3313.33 | bwd_inner_microstep: 3312.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 16:50:12,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.24 | bwd: 3313.34 | bwd_inner: 3312.53 | bwd_allreduce: 0.76 | step: 6.77 21%|██ | 2113/10000 [3:20:33<11:59:09, 5.47s/it] {'loss': 0.0469, 'grad_norm': 0.4208831489086151, 'learning_rate': 3.6650055992382534e-05, 'epoch': 2.11} 21%|██ | 2113/10000 [3:20:33<11:59:09, 5.47s/it][2025-06-19 16:50:18,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:50:18,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.03 | bwd_microstep: 3331.74 | bwd_inner_microstep: 3330.93 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 16:50:18,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.03 | bwd: 3331.76 | bwd_inner: 3330.93 | bwd_allreduce: 0.78 | step: 7.09 21%|██ | 2114/10000 [3:20:38<11:59:14, 5.47s/it] {'loss': 0.0434, 'grad_norm': 0.6544013619422913, 'learning_rate': 3.664646644332515e-05, 'epoch': 2.11} 21%|██ | 2114/10000 [3:20:38<11:59:14, 5.47s/it][2025-06-19 16:50:23,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:50:23,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.35 | bwd_microstep: 3314.58 | bwd_inner_microstep: 3313.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 16:50:23,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.35 | bwd: 3314.59 | bwd_inner: 3313.78 | bwd_allreduce: 0.76 | step: 6.82 
21%|██ | 2115/10000 [3:20:44<11:58:40, 5.47s/it] {'loss': 0.1079, 'grad_norm': 0.7877567410469055, 'learning_rate': 3.664287514813069e-05, 'epoch': 2.12} 21%|██ | 2115/10000 [3:20:44<11:58:40, 5.47s/it][2025-06-19 16:50:29,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 16:50:29,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.24 | bwd_microstep: 3376.32 | bwd_inner_microstep: 3375.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 16:50:29,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.24 | bwd: 3376.33 | bwd_inner: 3375.53 | bwd_allreduce: 0.76 | step: 7.08 21%|██ | 2116/10000 [3:20:49<12:01:27, 5.49s/it] {'loss': 0.0514, 'grad_norm': 0.42955583333969116, 'learning_rate': 3.663928210717588e-05, 'epoch': 2.12} 21%|██ | 2116/10000 [3:20:49<12:01:27, 5.49s/it][2025-06-19 16:50:34,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:50:34,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.68 | bwd_microstep: 3320.69 | bwd_inner_microstep: 3319.70 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.19 [2025-06-19 16:50:34,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.68 | bwd: 3320.71 | bwd_inner: 3319.70 | bwd_allreduce: 0.96 | step: 7.19 21%|██ | 2117/10000 [3:20:55<12:00:25, 5.48s/it] {'loss': 0.0886, 'grad_norm': 0.7144939303398132, 'learning_rate': 3.66356873208376e-05, 'epoch': 2.12} 21%|██ | 2117/10000 [3:20:55<12:00:25, 5.48s/it][2025-06-19 16:50:40,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.96 [2025-06-19 16:50:40,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.77 | bwd_microstep: 3320.55 | bwd_inner_microstep: 3319.74 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 7.24 [2025-06-19 16:50:40,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.77 | bwd: 3320.56 | bwd_inner: 3319.74 | bwd_allreduce: 0.76 | step: 7.24 21%|██ | 2118/10000 [3:21:00<11:59:43, 5.48s/it] {'loss': 0.0996, 'grad_norm': 0.7528051137924194, 'learning_rate': 3.663209078949292e-05, 'epoch': 2.12} 21%|██ | 2118/10000 [3:21:00<11:59:43, 5.48s/it][2025-06-19 16:50:45,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:50:45,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.03 | bwd_microstep: 3370.80 | bwd_inner_microstep: 3369.92 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.05 [2025-06-19 16:50:45,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.03 | bwd: 3370.82 | bwd_inner: 3369.92 | bwd_allreduce: 0.85 | step: 7.06 21%|██ | 2119/10000 [3:21:06<12:02:03, 5.50s/it] {'loss': 0.1266, 'grad_norm': 1.1344902515411377, 'learning_rate': 3.662849251351911e-05, 'epoch': 2.12} 21%|██ | 2119/10000 [3:21:06<12:02:03, 5.50s/it][2025-06-19 16:50:51,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:50:51,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.31 | bwd_microstep: 3374.75 | bwd_inner_microstep: 3373.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 16:50:51,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.31 | bwd: 3374.77 | bwd_inner: 3373.95 | bwd_allreduce: 0.78 | step: 6.84 21%|██ | 2120/10000 [3:21:11<12:03:48, 5.51s/it] {'loss': 0.0621, 'grad_norm': 0.5473932027816772, 'learning_rate': 3.662489249329362e-05, 'epoch': 2.12} 21%|██ | 2120/10000 [3:21:11<12:03:48, 5.51s/it][2025-06-19 16:50:56,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:50:56,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.33 | bwd_microstep: 3380.95 | bwd_inner_microstep: 3379.86 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.46 [2025-06-19 16:50:56,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.33 | bwd: 3380.97 | bwd_inner: 3379.86 | bwd_allreduce: 1.05 | step: 7.47 21%|██ | 2121/10000 [3:21:17<12:05:16, 5.52s/it] {'loss': 0.057, 'grad_norm': 0.6794443726539612, 'learning_rate': 3.6621290729194056e-05, 'epoch': 2.12} 21%|██ | 2121/10000 [3:21:17<12:05:16, 5.52s/it][2025-06-19 16:51:02,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:51:02,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.89 | bwd_microstep: 3321.29 | bwd_inner_microstep: 3320.49 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 16:51:02,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.89 | bwd: 3321.30 | bwd_inner: 3320.49 | bwd_allreduce: 0.77 | step: 6.72 21%|██ | 2122/10000 [3:21:22<12:03:22, 5.51s/it] {'loss': 0.0864, 'grad_norm': 0.5945124626159668, 'learning_rate': 3.6617687221598243e-05, 'epoch': 2.12} 21%|██ | 2122/10000 [3:21:22<12:03:22, 5.51s/it][2025-06-19 16:51:07,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:51:07,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.62 | bwd_microstep: 3374.27 | bwd_inner_microstep: 3373.48 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 16:51:07,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.62 | bwd: 3374.28 | bwd_inner: 3373.48 | bwd_allreduce: 0.76 | step: 6.60 21%|██ | 2123/10000 [3:21:28<12:04:44, 5.52s/it] {'loss': 0.056, 'grad_norm': 
0.5603172779083252, 'learning_rate': 3.661408197088416e-05, 'epoch': 2.12} 21%|██ | 2123/10000 [3:21:28<12:04:44, 5.52s/it][2025-06-19 16:51:13,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:51:13,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.62 | bwd_microstep: 3326.90 | bwd_inner_microstep: 3326.09 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.41 [2025-06-19 16:51:13,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.62 | bwd: 3326.91 | bwd_inner: 3326.09 | bwd_allreduce: 0.78 | step: 7.41 21%|██ | 2124/10000 [3:21:33<12:03:07, 5.51s/it] {'loss': 0.0664, 'grad_norm': 1.0607564449310303, 'learning_rate': 3.661047497742999e-05, 'epoch': 2.12} 21%|██ | 2124/10000 [3:21:33<12:03:07, 5.51s/it][2025-06-19 16:51:18,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:51:18,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.20 | bwd_microstep: 3328.38 | bwd_inner_microstep: 3327.59 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 16:51:18,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.20 | bwd: 3328.39 | bwd_inner: 3327.59 | bwd_allreduce: 0.76 | step: 6.72 21%|██▏ | 2125/10000 [3:21:39<12:01:50, 5.50s/it] {'loss': 0.082, 'grad_norm': 0.5056087970733643, 'learning_rate': 3.6606866241614085e-05, 'epoch': 2.12} 21%|██▏ | 2125/10000 [3:21:39<12:01:50, 5.50s/it][2025-06-19 16:51:24,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:51:24,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.62 | bwd_microstep: 3378.19 | bwd_inner_microstep: 3377.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-19 16:51:24,190] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.63 | bwd: 3378.21 | bwd_inner: 3377.40 | bwd_allreduce: 0.77 | step: 7.06 21%|██▏ | 2126/10000 [3:21:44<12:03:42, 5.51s/it] {'loss': 0.0514, 'grad_norm': 0.34938591718673706, 'learning_rate': 3.660325576381499e-05, 'epoch': 2.13} 21%|██▏ | 2126/10000 [3:21:44<12:03:42, 5.51s/it][2025-06-19 16:51:29,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:51:29,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.31 | bwd_microstep: 3314.92 | bwd_inner_microstep: 3314.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 16:51:29,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.31 | bwd: 3314.93 | bwd_inner: 3314.13 | bwd_allreduce: 0.76 | step: 6.67 21%|██▏ | 2127/10000 [3:21:50<12:01:19, 5.50s/it] {'loss': 0.058, 'grad_norm': 0.5530058741569519, 'learning_rate': 3.659964354441141e-05, 'epoch': 2.13} 21%|██▏ | 2127/10000 [3:21:50<12:01:19, 5.50s/it][2025-06-19 16:51:35,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:51:35,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.57 | bwd_microstep: 3328.68 | bwd_inner_microstep: 3327.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 16:51:35,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.57 | bwd: 3328.70 | bwd_inner: 3327.87 | bwd_allreduce: 0.78 | step: 7.17 21%|██▏ | 2128/10000 [3:21:55<12:00:20, 5.49s/it] {'loss': 0.0732, 'grad_norm': 1.274125337600708, 'learning_rate': 3.6596029583782276e-05, 'epoch': 2.13} 21%|██▏ | 2128/10000 [3:21:55<12:00:20, 5.49s/it][2025-06-19 16:51:40,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 
16:51:40,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.72 | bwd_microstep: 3321.99 | bwd_inner_microstep: 3321.03 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.55 [2025-06-19 16:51:40,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.72 | bwd: 3322.00 | bwd_inner: 3321.03 | bwd_allreduce: 0.92 | step: 7.56 21%|██▏ | 2129/10000 [3:22:01<11:59:23, 5.48s/it] {'loss': 0.0602, 'grad_norm': 0.44847366213798523, 'learning_rate': 3.6592413882306666e-05, 'epoch': 2.13} 21%|██▏ | 2129/10000 [3:22:01<11:59:23, 5.48s/it][2025-06-19 16:51:46,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:51:46,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.26 | bwd_microstep: 3380.20 | bwd_inner_microstep: 3379.28 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.85 [2025-06-19 16:51:46,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.26 | bwd: 3380.21 | bwd_inner: 3379.28 | bwd_allreduce: 0.89 | step: 6.85 21%|██▏ | 2130/10000 [3:22:06<12:02:17, 5.51s/it] {'loss': 0.0489, 'grad_norm': 0.5875100493431091, 'learning_rate': 3.658879644036384e-05, 'epoch': 2.13} 21%|██▏ | 2130/10000 [3:22:06<12:02:17, 5.51s/it][2025-06-19 16:51:51,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:51:51,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.71 | bwd_microstep: 3402.56 | bwd_inner_microstep: 3401.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 16:51:51,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.71 | bwd: 3402.57 | bwd_inner: 3401.76 | bwd_allreduce: 0.77 | step: 6.68 21%|██▏ | 2131/10000 [3:22:12<12:05:01, 5.53s/it] {'loss': 0.0362, 'grad_norm': 0.32117989659309387, 'learning_rate': 3.658517725833326e-05, 'epoch': 
2.13} 21%|██▏ | 2131/10000 [3:22:12<12:05:01, 5.53s/it][2025-06-19 16:51:57,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:51:57,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.42 | bwd_microstep: 3327.89 | bwd_inner_microstep: 3327.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 16:51:57,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.42 | bwd: 3327.90 | bwd_inner: 3327.09 | bwd_allreduce: 0.77 | step: 7.03 21%|██▏ | 2132/10000 [3:22:18<12:02:59, 5.51s/it] {'loss': 0.0898, 'grad_norm': 0.9542432427406311, 'learning_rate': 3.658155633659456e-05, 'epoch': 2.13} 21%|██▏ | 2132/10000 [3:22:18<12:02:59, 5.51s/it][2025-06-19 16:52:02,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:52:02,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.51 | bwd_microstep: 3335.41 | bwd_inner_microstep: 3334.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 16:52:02,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.51 | bwd: 3335.43 | bwd_inner: 3334.63 | bwd_allreduce: 0.75 | step: 6.58 21%|██▏ | 2133/10000 [3:22:23<12:01:25, 5.50s/it] {'loss': 0.0666, 'grad_norm': 0.7995808124542236, 'learning_rate': 3.657793367552756e-05, 'epoch': 2.13} 21%|██▏ | 2133/10000 [3:22:23<12:01:25, 5.50s/it][2025-06-19 16:52:08,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:52:08,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.33 | bwd_microstep: 3337.66 | bwd_inner_microstep: 3336.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 16:52:08,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.33 | 
bwd: 3337.68 | bwd_inner: 3336.88 | bwd_allreduce: 0.76 | step: 6.72 21%|██▏ | 2134/10000 [3:22:28<12:00:38, 5.50s/it] {'loss': 0.0806, 'grad_norm': 0.6365395188331604, 'learning_rate': 3.657430927551225e-05, 'epoch': 2.13} 21%|██▏ | 2134/10000 [3:22:28<12:00:38, 5.50s/it][2025-06-19 16:52:13,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:52:13,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.97 | bwd_microstep: 3331.87 | bwd_inner_microstep: 3330.93 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.04 [2025-06-19 16:52:13,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.97 | bwd: 3331.88 | bwd_inner: 3330.93 | bwd_allreduce: 0.90 | step: 7.04 21%|██▏ | 2135/10000 [3:22:34<11:59:36, 5.49s/it] {'loss': 0.0648, 'grad_norm': 0.530562698841095, 'learning_rate': 3.657068313692883e-05, 'epoch': 2.13} 21%|██▏ | 2135/10000 [3:22:34<11:59:36, 5.49s/it][2025-06-19 16:52:19,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:52:19,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.46 | bwd_microstep: 3378.54 | bwd_inner_microstep: 3377.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 16:52:19,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.46 | bwd: 3378.55 | bwd_inner: 3377.75 | bwd_allreduce: 0.76 | step: 6.54 21%|██▏ | 2136/10000 [3:22:39<12:02:04, 5.51s/it] {'loss': 0.0999, 'grad_norm': 0.8166413307189941, 'learning_rate': 3.656705526015765e-05, 'epoch': 2.14} 21%|██▏ | 2136/10000 [3:22:39<12:02:04, 5.51s/it][2025-06-19 16:52:24,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 16:52:24,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2117.15 | bwd_microstep: 3334.15 | bwd_inner_microstep: 3333.22 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.64 [2025-06-19 16:52:24,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.15 | bwd: 3334.17 | bwd_inner: 3333.22 | bwd_allreduce: 0.89 | step: 7.64 21%|██▏ | 2137/10000 [3:22:45<12:01:24, 5.50s/it] {'loss': 0.0625, 'grad_norm': 0.796072244644165, 'learning_rate': 3.6563425645579264e-05, 'epoch': 2.14} 21%|██▏ | 2137/10000 [3:22:45<12:01:24, 5.50s/it][2025-06-19 16:52:30,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:52:30,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.21 | bwd_microstep: 3338.72 | bwd_inner_microstep: 3337.89 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.26 [2025-06-19 16:52:30,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.21 | bwd: 3338.74 | bwd_inner: 3337.89 | bwd_allreduce: 0.80 | step: 7.27 21%|██▏ | 2138/10000 [3:22:50<12:00:57, 5.50s/it] {'loss': 0.0524, 'grad_norm': 0.7488252520561218, 'learning_rate': 3.655979429357441e-05, 'epoch': 2.14} 21%|██▏ | 2138/10000 [3:22:50<12:00:57, 5.50s/it][2025-06-19 16:52:35,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:52:35,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.05 | bwd_microstep: 3340.04 | bwd_inner_microstep: 3339.13 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.07 [2025-06-19 16:52:35,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.05 | bwd: 3340.06 | bwd_inner: 3339.13 | bwd_allreduce: 0.88 | step: 7.07 21%|██▏ | 2139/10000 [3:22:56<12:00:35, 5.50s/it] {'loss': 0.1355, 'grad_norm': 1.0364930629730225, 'learning_rate': 3.655616120452398e-05, 'epoch': 2.14} 21%|██▏ | 2139/10000 [3:22:56<12:00:35, 5.50s/it][2025-06-19 
16:52:41,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:52:41,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.52 | bwd_microstep: 3325.92 | bwd_inner_microstep: 3324.90 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.64 [2025-06-19 16:52:41,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.52 | bwd: 3325.93 | bwd_inner: 3324.90 | bwd_allreduce: 0.99 | step: 7.64 21%|██▏ | 2140/10000 [3:23:01<11:59:38, 5.49s/it] {'loss': 0.0731, 'grad_norm': 0.6823224425315857, 'learning_rate': 3.655252637880908e-05, 'epoch': 2.14} 21%|██▏ | 2140/10000 [3:23:01<11:59:38, 5.49s/it][2025-06-19 16:52:46,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:52:46,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.45 | bwd_microstep: 3327.98 | bwd_inner_microstep: 3327.08 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.70 [2025-06-19 16:52:46,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.45 | bwd: 3327.99 | bwd_inner: 3327.08 | bwd_allreduce: 0.87 | step: 6.71 21%|██▏ | 2141/10000 [3:23:07<11:58:40, 5.49s/it] {'loss': 0.0354, 'grad_norm': 0.4664429724216461, 'learning_rate': 3.654888981681099e-05, 'epoch': 2.14} 21%|██▏ | 2141/10000 [3:23:07<11:58:40, 5.49s/it][2025-06-19 16:52:52,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:52:52,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.46 | bwd_microstep: 3333.93 | bwd_inner_microstep: 3333.08 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.43 [2025-06-19 16:52:52,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.46 | bwd: 3333.95 | bwd_inner: 3333.08 | bwd_allreduce: 0.81 | step: 
7.43 21%|██▏ | 2142/10000 [3:23:12<11:58:37, 5.49s/it] {'loss': 0.0473, 'grad_norm': 0.5111781358718872, 'learning_rate': 3.6545251518911164e-05, 'epoch': 2.14} 21%|██▏ | 2142/10000 [3:23:12<11:58:37, 5.49s/it][2025-06-19 16:52:57,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:52:57,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.53 | bwd_microstep: 3377.58 | bwd_inner_microstep: 3376.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 16:52:57,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.53 | bwd: 3377.60 | bwd_inner: 3376.80 | bwd_allreduce: 0.75 | step: 6.68 21%|██▏ | 2143/10000 [3:23:18<12:01:03, 5.51s/it] {'loss': 0.0983, 'grad_norm': 0.9302878379821777, 'learning_rate': 3.654161148549124e-05, 'epoch': 2.14} 21%|██▏ | 2143/10000 [3:23:18<12:01:03, 5.51s/it][2025-06-19 16:53:03,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:53:03,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.56 | bwd_microstep: 3332.98 | bwd_inner_microstep: 3332.09 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.16 [2025-06-19 16:53:03,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.56 | bwd: 3332.99 | bwd_inner: 3332.09 | bwd_allreduce: 0.86 | step: 7.16 21%|██▏ | 2144/10000 [3:23:23<12:00:09, 5.50s/it] {'loss': 0.1189, 'grad_norm': 0.8684513568878174, 'learning_rate': 3.653796971693305e-05, 'epoch': 2.14} 21%|██▏ | 2144/10000 [3:23:23<12:00:09, 5.50s/it][2025-06-19 16:53:08,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:53:08,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.43 | bwd_microstep: 3345.81 | bwd_inner_microstep: 
3345.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 16:53:08,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.43 | bwd: 3345.82 | bwd_inner: 3345.01 | bwd_allreduce: 0.77 | step: 7.04 21%|██▏ | 2145/10000 [3:23:29<11:59:58, 5.50s/it] {'loss': 0.0396, 'grad_norm': 0.689419686794281, 'learning_rate': 3.6534326213618577e-05, 'epoch': 2.15} 21%|██▏ | 2145/10000 [3:23:29<11:59:58, 5.50s/it][2025-06-19 16:53:14,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.74 [2025-06-19 16:53:14,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.09 | bwd_microstep: 3370.82 | bwd_inner_microstep: 3369.96 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.02 [2025-06-19 16:53:14,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.09 | bwd: 3370.84 | bwd_inner: 3369.96 | bwd_allreduce: 0.82 | step: 7.03 21%|██▏ | 2146/10000 [3:23:34<12:01:34, 5.51s/it] {'loss': 0.1311, 'grad_norm': 1.1474287509918213, 'learning_rate': 3.6530680975930035e-05, 'epoch': 2.15} 21%|██▏ | 2146/10000 [3:23:34<12:01:34, 5.51s/it][2025-06-19 16:53:19,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:53:19,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.69 | bwd_microstep: 3340.73 | bwd_inner_microstep: 3339.82 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.50 [2025-06-19 16:53:19,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.69 | bwd: 3340.75 | bwd_inner: 3339.82 | bwd_allreduce: 0.89 | step: 7.50 21%|██▏ | 2147/10000 [3:23:40<12:00:44, 5.51s/it] {'loss': 0.7189, 'grad_norm': 215.1722869873047, 'learning_rate': 3.652703400424978e-05, 'epoch': 2.15} 21%|██▏ | 2147/10000 [3:23:40<12:00:44, 5.51s/it][2025-06-19 16:53:25,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:53:25,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.40 | bwd_microstep: 3337.81 | bwd_inner_microstep: 3337.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 16:53:25,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.40 | bwd: 3337.83 | bwd_inner: 3337.03 | bwd_allreduce: 0.76 | step: 6.58 21%|██▏ | 2148/10000 [3:23:45<12:00:04, 5.50s/it] {'loss': 0.1012, 'grad_norm': 0.9936990737915039, 'learning_rate': 3.652338529896035e-05, 'epoch': 2.15} 21%|██▏ | 2148/10000 [3:23:45<12:00:04, 5.50s/it][2025-06-19 16:53:30,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:53:30,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.28 | bwd_microstep: 3323.99 | bwd_inner_microstep: 3323.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.61 [2025-06-19 16:53:30,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.28 | bwd: 3324.01 | bwd_inner: 3323.21 | bwd_allreduce: 0.76 | step: 6.61 21%|██▏ | 2149/10000 [3:23:51<11:59:14, 5.50s/it] {'loss': 0.133, 'grad_norm': 1.3090829849243164, 'learning_rate': 3.65197348604445e-05, 'epoch': 2.15} 21%|██▏ | 2149/10000 [3:23:51<11:59:14, 5.50s/it][2025-06-19 16:53:36,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 16:53:36,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.40 | bwd_microstep: 3317.15 | bwd_inner_microstep: 3316.11 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.23 [2025-06-19 16:53:36,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.40 | bwd: 3317.17 | bwd_inner: 3316.11 | bwd_allreduce: 1.01 | step: 7.24 22%|██▏ | 2150/10000 [3:23:56<11:58:06, 5.49s/it] {'loss': 0.0992, 
'grad_norm': 1.411327600479126, 'learning_rate': 3.651608268908513e-05, 'epoch': 2.15} 22%|██▏ | 2150/10000 [3:23:56<11:58:06, 5.49s/it][2025-06-19 16:53:41,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.79 [2025-06-19 16:53:41,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.50 | bwd_microstep: 3328.72 | bwd_inner_microstep: 3327.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 16:53:41,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.50 | bwd: 3328.74 | bwd_inner: 3327.91 | bwd_allreduce: 0.78 | step: 7.21 22%|██▏ | 2151/10000 [3:24:02<11:57:36, 5.49s/it] {'loss': 0.0561, 'grad_norm': 0.5352442264556885, 'learning_rate': 3.651242878526534e-05, 'epoch': 2.15} 22%|██▏ | 2151/10000 [3:24:02<11:57:36, 5.49s/it][2025-06-19 16:53:47,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:53:47,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.23 | bwd_microstep: 3327.92 | bwd_inner_microstep: 3326.94 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.26 [2025-06-19 16:53:47,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.23 | bwd: 3327.93 | bwd_inner: 3326.94 | bwd_allreduce: 0.95 | step: 7.27 22%|██▏ | 2152/10000 [3:24:07<11:57:05, 5.48s/it] {'loss': 0.0512, 'grad_norm': 0.5681672692298889, 'learning_rate': 3.650877314936841e-05, 'epoch': 2.15} 22%|██▏ | 2152/10000 [3:24:07<11:57:05, 5.48s/it][2025-06-19 16:53:52,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:53:52,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.44 | bwd_microstep: 3376.27 | bwd_inner_microstep: 3375.17 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.84 [2025-06-19 
16:53:52,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.44 | bwd: 3376.29 | bwd_inner: 3375.17 | bwd_allreduce: 1.07 | step: 7.84 22%|██▏ | 2153/10000 [3:24:13<12:00:13, 5.51s/it] {'loss': 0.1273, 'grad_norm': 1.140465497970581, 'learning_rate': 3.65051157817778e-05, 'epoch': 2.15} 22%|██▏ | 2153/10000 [3:24:13<12:00:13, 5.51s/it][2025-06-19 16:53:58,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:53:58,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.19 | bwd_microstep: 3377.16 | bwd_inner_microstep: 3376.26 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.29 [2025-06-19 16:53:58,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.19 | bwd: 3377.17 | bwd_inner: 3376.26 | bwd_allreduce: 0.86 | step: 7.30 22%|██▏ | 2154/10000 [3:24:19<12:01:48, 5.52s/it] {'loss': 0.26, 'grad_norm': 1.5050348043441772, 'learning_rate': 3.650145668287714e-05, 'epoch': 2.15} 22%|██▏ | 2154/10000 [3:24:19<12:01:48, 5.52s/it][2025-06-19 16:54:03,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:54:03,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.71 | bwd_microstep: 3326.37 | bwd_inner_microstep: 3325.55 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.20 [2025-06-19 16:54:03,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.71 | bwd: 3326.38 | bwd_inner: 3325.55 | bwd_allreduce: 0.78 | step: 7.20 22%|██▏ | 2155/10000 [3:24:24<12:00:03, 5.51s/it] {'loss': 0.0582, 'grad_norm': 0.45051971077919006, 'learning_rate': 3.649779585305026e-05, 'epoch': 2.15} 22%|██▏ | 2155/10000 [3:24:24<12:00:03, 5.51s/it][2025-06-19 16:54:09,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 
16:54:09,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.02 | bwd_microstep: 3320.84 | bwd_inner_microstep: 3319.91 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.10 [2025-06-19 16:54:09,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.02 | bwd: 3320.85 | bwd_inner: 3319.91 | bwd_allreduce: 0.90 | step: 7.10 22%|██▏ | 2156/10000 [3:24:29<11:58:18, 5.49s/it] {'loss': 0.0413, 'grad_norm': 0.3904256224632263, 'learning_rate': 3.649413329268116e-05, 'epoch': 2.16} 22%|██▏ | 2156/10000 [3:24:29<11:58:18, 5.49s/it][2025-06-19 16:54:14,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:54:14,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.07 | bwd_microstep: 3328.49 | bwd_inner_microstep: 3327.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 16:54:14,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.07 | bwd: 3328.50 | bwd_inner: 3327.70 | bwd_allreduce: 0.76 | step: 6.66 22%|██▏ | 2157/10000 [3:24:35<11:57:31, 5.49s/it] {'loss': 0.0675, 'grad_norm': 0.5780043601989746, 'learning_rate': 3.649046900215404e-05, 'epoch': 2.16} 22%|██▏ | 2157/10000 [3:24:35<11:57:31, 5.49s/it][2025-06-19 16:54:20,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:54:20,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.26 | bwd_microstep: 3378.76 | bwd_inner_microstep: 3377.76 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.76 [2025-06-19 16:54:20,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.26 | bwd: 3378.78 | bwd_inner: 3377.76 | bwd_allreduce: 0.97 | step: 7.77 22%|██▏ | 2158/10000 [3:24:40<11:59:40, 5.51s/it] {'loss': 0.048, 'grad_norm': 0.5136224627494812, 'learning_rate': 3.648680298185325e-05, 'epoch': 2.16} 
22%|██▏ | 2158/10000 [3:24:40<11:59:40, 5.51s/it][2025-06-19 16:54:25,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:54:25,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.87 | bwd_microstep: 3401.02 | bwd_inner_microstep: 3400.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 16:54:25,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.87 | bwd: 3401.04 | bwd_inner: 3400.23 | bwd_allreduce: 0.77 | step: 6.71 22%|██▏ | 2159/10000 [3:24:46<12:02:23, 5.53s/it] {'loss': 0.0623, 'grad_norm': 0.9133728742599487, 'learning_rate': 3.648313523216335e-05, 'epoch': 2.16} 22%|██▏ | 2159/10000 [3:24:46<12:02:23, 5.53s/it][2025-06-19 16:54:31,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:54:31,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.24 | bwd_microstep: 3327.64 | bwd_inner_microstep: 3326.84 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.26 [2025-06-19 16:54:31,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.24 | bwd: 3327.66 | bwd_inner: 3326.84 | bwd_allreduce: 0.78 | step: 7.26 22%|██▏ | 2160/10000 [3:24:52<11:59:59, 5.51s/it] {'loss': 0.0824, 'grad_norm': 0.7916721105575562, 'learning_rate': 3.647946575346905e-05, 'epoch': 2.16} 22%|██▏ | 2160/10000 [3:24:52<11:59:59, 5.51s/it][2025-06-19 16:54:36,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:54:36,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.98 | bwd_microstep: 3370.36 | bwd_inner_microstep: 3369.25 | bwd_allreduce_microstep: 1.05 | step_microstep: 6.99 [2025-06-19 16:54:36,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.98 | bwd: 
3370.37 | bwd_inner: 3369.25 | bwd_allreduce: 1.07 | step: 6.99 22%|██▏ | 2161/10000 [3:24:57<12:01:05, 5.52s/it] {'loss': 0.0694, 'grad_norm': 0.6290222406387329, 'learning_rate': 3.647579454615529e-05, 'epoch': 2.16} 22%|██▏ | 2161/10000 [3:24:57<12:01:05, 5.52s/it][2025-06-19 16:54:42,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 16:54:42,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.92 | bwd_microstep: 3371.19 | bwd_inner_microstep: 3370.35 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.29 [2025-06-19 16:54:42,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.92 | bwd: 3371.20 | bwd_inner: 3370.35 | bwd_allreduce: 0.81 | step: 7.29 22%|██▏ | 2162/10000 [3:25:03<12:01:37, 5.52s/it] {'loss': 0.0555, 'grad_norm': 0.46432897448539734, 'learning_rate': 3.647212161060714e-05, 'epoch': 2.16} 22%|██▏ | 2162/10000 [3:25:03<12:01:37, 5.52s/it][2025-06-19 16:54:47,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 16:54:47,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.94 | bwd_microstep: 3370.67 | bwd_inner_microstep: 3369.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 16:54:47,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.94 | bwd: 3370.69 | bwd_inner: 3369.90 | bwd_allreduce: 0.75 | step: 6.56 22%|██▏ | 2163/10000 [3:25:08<12:01:56, 5.53s/it] {'loss': 0.1588, 'grad_norm': 3.7458269596099854, 'learning_rate': 3.646844694720989e-05, 'epoch': 2.16} 22%|██▏ | 2163/10000 [3:25:08<12:01:56, 5.53s/it][2025-06-19 16:54:53,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:54:53,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2129.38 | bwd_microstep: 3376.71 | bwd_inner_microstep: 3375.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 16:54:53,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.38 | bwd: 3376.73 | bwd_inner: 3375.92 | bwd_allreduce: 0.77 | step: 6.74 22%|██▏ | 2164/10000 [3:25:14<12:02:23, 5.53s/it] {'loss': 0.0397, 'grad_norm': 0.49122270941734314, 'learning_rate': 3.646477055634898e-05, 'epoch': 2.16} 22%|██▏ | 2164/10000 [3:25:14<12:02:23, 5.53s/it][2025-06-19 16:54:58,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:54:58,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.49 | bwd_microstep: 3378.85 | bwd_inner_microstep: 3378.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 16:54:58,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.49 | bwd: 3378.87 | bwd_inner: 3378.05 | bwd_allreduce: 0.78 | step: 7.20 22%|██▏ | 2165/10000 [3:25:19<12:02:54, 5.54s/it] {'loss': 0.0564, 'grad_norm': 0.5061359405517578, 'learning_rate': 3.6461092438410064e-05, 'epoch': 2.17} 22%|██▏ | 2165/10000 [3:25:19<12:02:54, 5.54s/it][2025-06-19 16:55:04,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:55:04,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.91 | bwd_microstep: 3363.10 | bwd_inner_microstep: 3362.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 16:55:04,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.91 | bwd: 3363.11 | bwd_inner: 3362.31 | bwd_allreduce: 0.75 | step: 6.67 22%|██▏ | 2166/10000 [3:25:25<12:02:29, 5.53s/it] {'loss': 0.08, 'grad_norm': 0.7095668911933899, 'learning_rate': 3.645741259377894e-05, 'epoch': 2.17} 22%|██▏ | 2166/10000 [3:25:25<12:02:29, 5.53s/it][2025-06-19 16:55:09,915] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:55:09,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.96 | bwd_microstep: 3322.63 | bwd_inner_microstep: 3321.84 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 16:55:09,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.96 | bwd: 3322.65 | bwd_inner: 3321.84 | bwd_allreduce: 0.76 | step: 6.65 22%|██▏ | 2167/10000 [3:25:30<12:00:04, 5.52s/it] {'loss': 0.0858, 'grad_norm': 1.0942569971084595, 'learning_rate': 3.6453731022841624e-05, 'epoch': 2.17} 22%|██▏ | 2167/10000 [3:25:30<12:00:04, 5.52s/it][2025-06-19 16:55:15,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:55:15,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.83 | bwd_microstep: 3367.68 | bwd_inner_microstep: 3366.86 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.03 [2025-06-19 16:55:15,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.83 | bwd: 3367.69 | bwd_inner: 3366.86 | bwd_allreduce: 0.79 | step: 7.04 22%|██▏ | 2168/10000 [3:25:36<12:00:41, 5.52s/it] {'loss': 0.0724, 'grad_norm': 0.48355504870414734, 'learning_rate': 3.645004772598428e-05, 'epoch': 2.17} 22%|██▏ | 2168/10000 [3:25:36<12:00:41, 5.52s/it][2025-06-19 16:55:20,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 16:55:20,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.70 | bwd_microstep: 3380.27 | bwd_inner_microstep: 3379.29 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.41 [2025-06-19 16:55:20,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.70 | bwd: 3380.28 | bwd_inner: 3379.29 | bwd_allreduce: 0.94 | step: 7.41 22%|██▏ | 
2169/10000 [3:25:41<12:01:39, 5.53s/it] {'loss': 0.0579, 'grad_norm': 0.7313299775123596, 'learning_rate': 3.6446362703593284e-05, 'epoch': 2.17} 22%|██▏ | 2169/10000 [3:25:41<12:01:39, 5.53s/it][2025-06-19 16:55:26,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:55:26,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.09 | bwd_microstep: 3322.00 | bwd_inner_microstep: 3321.19 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.30 [2025-06-19 16:55:26,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.09 | bwd: 3322.20 | bwd_inner: 3321.19 | bwd_allreduce: 0.78 | step: 7.30 22%|██▏ | 2170/10000 [3:25:47<11:59:16, 5.51s/it] {'loss': 0.0441, 'grad_norm': 0.529906153678894, 'learning_rate': 3.644267595605516e-05, 'epoch': 2.17} 22%|██▏ | 2170/10000 [3:25:47<11:59:16, 5.51s/it][2025-06-19 16:55:31,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:55:31,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.89 | bwd_microstep: 3313.53 | bwd_inner_microstep: 3312.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 16:55:31,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.89 | bwd: 3313.55 | bwd_inner: 3312.75 | bwd_allreduce: 0.76 | step: 6.64 22%|██▏ | 2171/10000 [3:25:52<11:57:04, 5.50s/it] {'loss': 0.0775, 'grad_norm': 1.2197926044464111, 'learning_rate': 3.643898748375665e-05, 'epoch': 2.17} 22%|██▏ | 2171/10000 [3:25:52<11:57:04, 5.50s/it][2025-06-19 16:55:37,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:55:37,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.72 | bwd_microstep: 3377.53 | bwd_inner_microstep: 3376.74 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 16:55:37,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.72 | bwd: 3377.54 | bwd_inner: 3376.74 | bwd_allreduce: 0.76 | step: 6.81 22%|██▏ | 2172/10000 [3:25:58<11:58:52, 5.51s/it] {'loss': 0.0754, 'grad_norm': 0.75766921043396, 'learning_rate': 3.643529728708465e-05, 'epoch': 2.17} 22%|██▏ | 2172/10000 [3:25:58<11:58:52, 5.51s/it][2025-06-19 16:55:42,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:55:42,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.46 | bwd_microstep: 3316.38 | bwd_inner_microstep: 3315.56 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 16:55:42,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.46 | bwd: 3316.39 | bwd_inner: 3315.56 | bwd_allreduce: 0.78 | step: 7.13 22%|██▏ | 2173/10000 [3:26:03<11:56:58, 5.50s/it] {'loss': 0.0532, 'grad_norm': 0.5691685080528259, 'learning_rate': 3.643160536642624e-05, 'epoch': 2.17} 22%|██▏ | 2173/10000 [3:26:03<11:56:58, 5.50s/it][2025-06-19 16:55:48,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 16:55:48,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.03 | bwd_microstep: 3376.56 | bwd_inner_microstep: 3375.60 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.35 [2025-06-19 16:55:48,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.03 | bwd: 3376.57 | bwd_inner: 3375.60 | bwd_allreduce: 0.92 | step: 7.35 22%|██▏ | 2174/10000 [3:26:09<11:58:51, 5.51s/it] {'loss': 0.0807, 'grad_norm': 1.0542210340499878, 'learning_rate': 3.6427911722168684e-05, 'epoch': 2.17} 22%|██▏ | 2174/10000 [3:26:09<11:58:51, 5.51s/it][2025-06-19 16:55:53,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:55:53,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.27 | bwd_microstep: 3323.24 | bwd_inner_microstep: 3322.40 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.84 [2025-06-19 16:55:53,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.27 | bwd: 3323.26 | bwd_inner: 3322.40 | bwd_allreduce: 0.81 | step: 6.84 22%|██▏ | 2175/10000 [3:26:14<11:57:16, 5.50s/it] {'loss': 0.1136, 'grad_norm': 0.9084038138389587, 'learning_rate': 3.6424216354699444e-05, 'epoch': 2.17} 22%|██▏ | 2175/10000 [3:26:14<11:57:16, 5.50s/it][2025-06-19 16:55:59,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:55:59,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.34 | bwd_microstep: 3317.94 | bwd_inner_microstep: 3317.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 16:55:59,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.34 | bwd: 3317.96 | bwd_inner: 3317.14 | bwd_allreduce: 0.77 | step: 6.95 22%|██▏ | 2176/10000 [3:26:20<11:55:32, 5.49s/it] {'loss': 0.1169, 'grad_norm': 1.2525122165679932, 'learning_rate': 3.6420519264406125e-05, 'epoch': 2.18} 22%|██▏ | 2176/10000 [3:26:20<11:55:32, 5.49s/it][2025-06-19 16:56:04,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:56:04,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.28 | bwd_microstep: 3324.64 | bwd_inner_microstep: 3323.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 16:56:04,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.28 | bwd: 3324.66 | bwd_inner: 3323.85 | bwd_allreduce: 0.77 | step: 6.98 22%|██▏ | 2177/10000 [3:26:25<11:55:00, 5.48s/it] {'loss': 0.0389, 'grad_norm': 
0.29612642526626587, 'learning_rate': 3.641682045167655e-05, 'epoch': 2.18} 22%|██▏ | 2177/10000 [3:26:25<11:55:00, 5.48s/it][2025-06-19 16:56:10,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:56:10,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.25 | bwd_microstep: 3319.65 | bwd_inner_microstep: 3318.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 16:56:10,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.26 | bwd: 3319.67 | bwd_inner: 3318.85 | bwd_allreduce: 0.78 | step: 7.19 22%|██▏ | 2178/10000 [3:26:31<11:54:16, 5.48s/it] {'loss': 0.097, 'grad_norm': 1.161450743675232, 'learning_rate': 3.64131199168987e-05, 'epoch': 2.18} 22%|██▏ | 2178/10000 [3:26:31<11:54:16, 5.48s/it][2025-06-19 16:56:15,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:56:15,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.52 | bwd_microstep: 3370.06 | bwd_inner_microstep: 3369.28 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 16:56:15,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.52 | bwd: 3370.08 | bwd_inner: 3369.28 | bwd_allreduce: 0.76 | step: 6.65 22%|██▏ | 2179/10000 [3:26:36<11:56:34, 5.50s/it] {'loss': 0.1296, 'grad_norm': 1.5322972536087036, 'learning_rate': 3.6409417660460744e-05, 'epoch': 2.18} 22%|██▏ | 2179/10000 [3:26:36<11:56:34, 5.50s/it][2025-06-19 16:56:21,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:56:21,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.89 | bwd_microstep: 3374.96 | bwd_inner_microstep: 3374.09 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.03 [2025-06-19 16:56:21,445] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.89 | bwd: 3374.98 | bwd_inner: 3374.09 | bwd_allreduce: 0.84 | step: 7.03 22%|██▏ | 2180/10000 [3:26:42<11:58:34, 5.51s/it] {'loss': 0.0827, 'grad_norm': 1.0163848400115967, 'learning_rate': 3.6405713682751034e-05, 'epoch': 2.18} 22%|██▏ | 2180/10000 [3:26:42<11:58:34, 5.51s/it][2025-06-19 16:56:26,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:56:26,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.81 | bwd_microstep: 3315.70 | bwd_inner_microstep: 3314.63 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.40 [2025-06-19 16:56:26,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.81 | bwd: 3315.71 | bwd_inner: 3314.63 | bwd_allreduce: 1.04 | step: 7.41 22%|██▏ | 2181/10000 [3:26:47<11:56:43, 5.50s/it] {'loss': 0.2135, 'grad_norm': 1.5059318542480469, 'learning_rate': 3.640200798415811e-05, 'epoch': 2.18} 22%|██▏ | 2181/10000 [3:26:47<11:56:43, 5.50s/it][2025-06-19 16:56:32,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:56:32,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.46 | bwd_microstep: 3307.93 | bwd_inner_microstep: 3307.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 16:56:32,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.46 | bwd: 3307.95 | bwd_inner: 3307.14 | bwd_allreduce: 0.76 | step: 6.69 22%|██▏ | 2182/10000 [3:26:53<11:54:59, 5.49s/it] {'loss': 0.0586, 'grad_norm': 0.49668335914611816, 'learning_rate': 3.639830056507067e-05, 'epoch': 2.18} 22%|██▏ | 2182/10000 [3:26:53<11:54:59, 5.49s/it][2025-06-19 16:56:37,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
16:56:37,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.97 | bwd_microstep: 3313.94 | bwd_inner_microstep: 3313.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 16:56:37,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.97 | bwd: 3313.96 | bwd_inner: 3313.15 | bwd_allreduce: 0.76 | step: 6.71 22%|██▏ | 2183/10000 [3:26:58<11:53:34, 5.48s/it] {'loss': 0.0636, 'grad_norm': 0.5480231046676636, 'learning_rate': 3.6394591425877596e-05, 'epoch': 2.18} 22%|██▏ | 2183/10000 [3:26:58<11:53:34, 5.48s/it][2025-06-19 16:56:43,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:56:43,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.55 | bwd_microstep: 3369.28 | bwd_inner_microstep: 3368.48 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 16:56:43,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.55 | bwd: 3369.29 | bwd_inner: 3368.48 | bwd_allreduce: 0.77 | step: 7.01 22%|██▏ | 2184/10000 [3:27:04<11:55:56, 5.50s/it] {'loss': 0.0591, 'grad_norm': 0.5419064164161682, 'learning_rate': 3.639088056696798e-05, 'epoch': 2.18} 22%|██▏ | 2184/10000 [3:27:04<11:55:56, 5.50s/it][2025-06-19 16:56:48,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:56:48,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.88 | bwd_microstep: 3320.44 | bwd_inner_microstep: 3319.55 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.93 [2025-06-19 16:56:48,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.88 | bwd: 3320.45 | bwd_inner: 3319.55 | bwd_allreduce: 0.86 | step: 6.94 22%|██▏ | 2185/10000 [3:27:09<11:54:40, 5.49s/it] {'loss': 0.0364, 'grad_norm': 0.2663864493370056, 'learning_rate': 3.638716798873106e-05, 'epoch': 2.19} 
22%|██▏ | 2185/10000 [3:27:09<11:54:40, 5.49s/it][2025-06-19 16:56:54,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:56:54,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.69 | bwd_microstep: 3360.00 | bwd_inner_microstep: 3359.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-19 16:56:54,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.69 | bwd: 3360.02 | bwd_inner: 3359.20 | bwd_allreduce: 0.77 | step: 6.77 22%|██▏ | 2186/10000 [3:27:15<11:56:21, 5.50s/it] {'loss': 0.0569, 'grad_norm': 0.8638063073158264, 'learning_rate': 3.638345369155628e-05, 'epoch': 2.19} 22%|██▏ | 2186/10000 [3:27:15<11:56:21, 5.50s/it][2025-06-19 16:56:59,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 16:56:59,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.87 | bwd_microstep: 3324.36 | bwd_inner_microstep: 3323.18 | bwd_allreduce_microstep: 1.10 | step_microstep: 7.99 [2025-06-19 16:56:59,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.87 | bwd: 3324.39 | bwd_inner: 3323.18 | bwd_allreduce: 1.13 | step: 8.00 22%|██▏ | 2187/10000 [3:27:20<11:55:22, 5.49s/it] {'loss': 0.0629, 'grad_norm': 0.6512454748153687, 'learning_rate': 3.637973767583324e-05, 'epoch': 2.19} 22%|██▏ | 2187/10000 [3:27:20<11:55:22, 5.49s/it][2025-06-19 16:57:05,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:57:05,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.41 | bwd_microstep: 3371.48 | bwd_inner_microstep: 3370.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.69 [2025-06-19 16:57:05,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.41 | bwd: 
3371.50 | bwd_inner: 3370.65 | bwd_allreduce: 0.79 | step: 7.70 22%|██▏ | 2188/10000 [3:27:26<11:57:19, 5.51s/it] {'loss': 0.1176, 'grad_norm': 1.1639519929885864, 'learning_rate': 3.637601994195174e-05, 'epoch': 2.19} 22%|██▏ | 2188/10000 [3:27:26<11:57:19, 5.51s/it][2025-06-19 16:57:10,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:57:10,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.01 | bwd_microstep: 3312.62 | bwd_inner_microstep: 3311.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-19 16:57:10,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.01 | bwd: 3312.63 | bwd_inner: 3311.82 | bwd_allreduce: 0.77 | step: 6.93 22%|██▏ | 2189/10000 [3:27:31<11:55:06, 5.49s/it] {'loss': 0.1619, 'grad_norm': 2.368568181991577, 'learning_rate': 3.637230049030175e-05, 'epoch': 2.19} 22%|██▏ | 2189/10000 [3:27:31<11:55:06, 5.49s/it][2025-06-19 16:57:16,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:57:16,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.53 | bwd_microstep: 3371.96 | bwd_inner_microstep: 3371.06 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.19 [2025-06-19 16:57:16,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.53 | bwd: 3371.97 | bwd_inner: 3371.06 | bwd_allreduce: 0.87 | step: 7.19 22%|██▏ | 2190/10000 [3:27:37<11:56:40, 5.51s/it] {'loss': 0.0463, 'grad_norm': 0.7189322710037231, 'learning_rate': 3.636857932127343e-05, 'epoch': 2.19} 22%|██▏ | 2190/10000 [3:27:37<11:56:40, 5.51s/it][2025-06-19 16:57:21,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:57:21,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2108.86 | bwd_microstep: 3318.24 | bwd_inner_microstep: 3317.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 16:57:21,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.86 | bwd: 3318.25 | bwd_inner: 3317.45 | bwd_allreduce: 0.76 | step: 6.62 22%|██▏ | 2191/10000 [3:27:42<11:54:56, 5.49s/it] {'loss': 0.0427, 'grad_norm': 0.43882399797439575, 'learning_rate': 3.636485643525711e-05, 'epoch': 2.19} 22%|██▏ | 2191/10000 [3:27:42<11:54:56, 5.49s/it][2025-06-19 16:57:27,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 16:57:27,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.60 | bwd_microstep: 3364.10 | bwd_inner_microstep: 3363.17 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.98 [2025-06-19 16:57:27,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.60 | bwd: 3364.12 | bwd_inner: 3363.17 | bwd_allreduce: 0.90 | step: 7.98 22%|██▏ | 2192/10000 [3:27:48<11:56:13, 5.50s/it] {'loss': 0.1012, 'grad_norm': 1.2130603790283203, 'learning_rate': 3.636113183264329e-05, 'epoch': 2.19} 22%|██▏ | 2192/10000 [3:27:48<11:56:13, 5.50s/it][2025-06-19 16:57:32,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:57:32,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.41 | bwd_microstep: 3370.53 | bwd_inner_microstep: 3369.70 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.82 [2025-06-19 16:57:32,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.41 | bwd: 3370.55 | bwd_inner: 3369.70 | bwd_allreduce: 0.81 | step: 6.82 22%|██▏ | 2193/10000 [3:27:53<11:57:29, 5.51s/it] {'loss': 0.0628, 'grad_norm': 0.5508891940116882, 'learning_rate': 3.635740551382268e-05, 'epoch': 2.19} 22%|██▏ | 2193/10000 [3:27:53<11:57:29, 5.51s/it][2025-06-19 16:57:38,367] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 16:57:38,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.28 | bwd_microstep: 3315.71 | bwd_inner_microstep: 3314.59 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.14 [2025-06-19 16:57:38,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.28 | bwd: 3315.74 | bwd_inner: 3314.59 | bwd_allreduce: 1.09 | step: 8.14 22%|██▏ | 2194/10000 [3:27:59<11:55:30, 5.50s/it] {'loss': 0.0751, 'grad_norm': 0.692594587802887, 'learning_rate': 3.6353677479186157e-05, 'epoch': 2.19} 22%|██▏ | 2194/10000 [3:27:59<11:55:30, 5.50s/it][2025-06-19 16:57:43,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:57:43,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.55 | bwd_microstep: 3314.80 | bwd_inner_microstep: 3313.99 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.47 [2025-06-19 16:57:43,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.55 | bwd: 3314.82 | bwd_inner: 3313.99 | bwd_allreduce: 0.79 | step: 7.48 22%|██▏ | 2195/10000 [3:28:04<11:54:12, 5.49s/it] {'loss': 0.0269, 'grad_norm': 0.5376111268997192, 'learning_rate': 3.634994772912476e-05, 'epoch': 2.19} 22%|██▏ | 2195/10000 [3:28:04<11:54:12, 5.49s/it][2025-06-19 16:57:49,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:57:49,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.20 | bwd_microstep: 3313.19 | bwd_inner_microstep: 3312.37 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.26 [2025-06-19 16:57:49,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.20 | bwd: 3313.21 | bwd_inner: 3312.37 | bwd_allreduce: 0.79 | step: 7.26 22%|██▏ | 
2196/10000 [3:28:10<11:52:58, 5.48s/it] {'loss': 0.1797, 'grad_norm': 1.7339451313018799, 'learning_rate': 3.634621626402972e-05, 'epoch': 2.2} 22%|██▏ | 2196/10000 [3:28:10<11:52:58, 5.48s/it][2025-06-19 16:57:54,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-19 16:57:54,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.49 | bwd_microstep: 3314.15 | bwd_inner_microstep: 3313.33 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.99 [2025-06-19 16:57:54,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.49 | bwd: 3314.17 | bwd_inner: 3313.33 | bwd_allreduce: 0.79 | step: 6.99 22%|██▏ | 2197/10000 [3:28:15<11:51:54, 5.47s/it] {'loss': 0.1057, 'grad_norm': 1.6666905879974365, 'learning_rate': 3.634248308429247e-05, 'epoch': 2.2} 22%|██▏ | 2197/10000 [3:28:15<11:51:54, 5.47s/it][2025-06-19 16:58:00,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:58:00,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.14 | bwd_microstep: 3364.54 | bwd_inner_microstep: 3363.73 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.32 [2025-06-19 16:58:00,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.14 | bwd: 3364.56 | bwd_inner: 3363.73 | bwd_allreduce: 0.78 | step: 7.33 22%|██▏ | 2198/10000 [3:28:21<11:54:06, 5.49s/it] {'loss': 0.0553, 'grad_norm': 0.659644365310669, 'learning_rate': 3.6338748190304596e-05, 'epoch': 2.2} 22%|██▏ | 2198/10000 [3:28:21<11:54:06, 5.49s/it][2025-06-19 16:58:05,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 16:58:05,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.27 | bwd_microstep: 3361.35 | bwd_inner_microstep: 3360.48 | 
bwd_allreduce_microstep: 0.82 | step_microstep: 7.41 [2025-06-19 16:58:05,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.27 | bwd: 3361.38 | bwd_inner: 3360.48 | bwd_allreduce: 0.84 | step: 7.41 22%|██▏ | 2199/10000 [3:28:26<11:55:51, 5.51s/it] {'loss': 0.0749, 'grad_norm': 0.6865257024765015, 'learning_rate': 3.633501158245786e-05, 'epoch': 2.2} 22%|██▏ | 2199/10000 [3:28:26<11:55:51, 5.51s/it][2025-06-19 16:58:11,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:58:11,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.57 | bwd_microstep: 3317.55 | bwd_inner_microstep: 3316.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 16:58:11,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.57 | bwd: 3317.57 | bwd_inner: 3316.75 | bwd_allreduce: 0.77 | step: 6.90 22%|██▏ | 2200/10000 [3:28:32<11:54:03, 5.49s/it] {'loss': 0.0457, 'grad_norm': 0.8684275150299072, 'learning_rate': 3.633127326114422e-05, 'epoch': 2.2} 22%|██▏ | 2200/10000 [3:28:32<11:54:03, 5.49s/it][2025-06-19 16:58:16,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:58:16,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.43 | bwd_microstep: 3308.51 | bwd_inner_microstep: 3307.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 16:58:16,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.43 | bwd: 3308.53 | bwd_inner: 3307.72 | bwd_allreduce: 0.77 | step: 6.65 22%|██▏ | 2201/10000 [3:28:37<11:52:04, 5.48s/it] {'loss': 0.044, 'grad_norm': 0.5829071998596191, 'learning_rate': 3.632753322675582e-05, 'epoch': 2.2} 22%|██▏ | 2201/10000 [3:28:37<11:52:04, 5.48s/it][2025-06-19 16:58:22,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:58:22,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.22 | bwd_microstep: 3363.77 | bwd_inner_microstep: 3362.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 16:58:22,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.22 | bwd: 3363.79 | bwd_inner: 3362.97 | bwd_allreduce: 0.78 | step: 7.13 22%|██▏ | 2202/10000 [3:28:43<11:53:45, 5.49s/it] {'loss': 0.0457, 'grad_norm': 0.5161507725715637, 'learning_rate': 3.632379147968495e-05, 'epoch': 2.2} 22%|██▏ | 2202/10000 [3:28:43<11:53:45, 5.49s/it][2025-06-19 16:58:27,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:58:27,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.26 | bwd_microstep: 3396.40 | bwd_inner_microstep: 3395.55 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.09 [2025-06-19 16:58:27,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.26 | bwd: 3396.42 | bwd_inner: 3395.55 | bwd_allreduce: 0.81 | step: 7.09 22%|██▏ | 2203/10000 [3:28:48<11:57:13, 5.52s/it] {'loss': 0.1771, 'grad_norm': 1.2167439460754395, 'learning_rate': 3.6320048020324124e-05, 'epoch': 2.2} 22%|██▏ | 2203/10000 [3:28:48<11:57:13, 5.52s/it][2025-06-19 16:58:33,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:58:33,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.85 | bwd_microstep: 3318.13 | bwd_inner_microstep: 3317.27 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.31 [2025-06-19 16:58:33,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.85 | bwd: 3318.16 | bwd_inner: 3317.27 | bwd_allreduce: 0.82 | step: 7.32 22%|██▏ | 2204/10000 [3:28:54<11:54:51, 5.50s/it] {'loss': 0.1135, 'grad_norm': 
2.4675796031951904, 'learning_rate': 3.6316302849066005e-05, 'epoch': 2.2} 22%|██▏ | 2204/10000 [3:28:54<11:54:51, 5.50s/it][2025-06-19 16:58:38,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.77 [2025-06-19 16:58:38,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.65 | bwd_microstep: 3364.98 | bwd_inner_microstep: 3364.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 16:58:38,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.65 | bwd: 3364.99 | bwd_inner: 3364.18 | bwd_allreduce: 0.76 | step: 7.07 22%|██▏ | 2205/10000 [3:28:59<11:55:51, 5.51s/it] {'loss': 0.0552, 'grad_norm': 0.8123115301132202, 'learning_rate': 3.631255596630344e-05, 'epoch': 2.21} 22%|██▏ | 2205/10000 [3:28:59<11:55:51, 5.51s/it][2025-06-19 16:58:44,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:58:44,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.75 | bwd_microstep: 3318.31 | bwd_inner_microstep: 3317.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 16:58:44,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.75 | bwd: 3318.32 | bwd_inner: 3317.52 | bwd_allreduce: 0.77 | step: 6.77 22%|██▏ | 2206/10000 [3:29:05<11:53:33, 5.49s/it] {'loss': 0.0352, 'grad_norm': 0.4597923755645752, 'learning_rate': 3.630880737242946e-05, 'epoch': 2.21} 22%|██▏ | 2206/10000 [3:29:05<11:53:33, 5.49s/it][2025-06-19 16:58:49,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:58:49,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.28 | bwd_microstep: 3365.07 | bwd_inner_microstep: 3364.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.91 [2025-06-19 16:58:49,824] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.28 | bwd: 3365.09 | bwd_inner: 3364.29 | bwd_allreduce: 0.76 | step: 6.91 22%|██▏ | 2207/10000 [3:29:10<11:55:07, 5.51s/it] {'loss': 0.1209, 'grad_norm': 1.2434545755386353, 'learning_rate': 3.6305057067837285e-05, 'epoch': 2.21} 22%|██▏ | 2207/10000 [3:29:10<11:55:07, 5.51s/it][2025-06-19 16:58:55,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 16:58:55,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.06 | bwd_microstep: 3372.88 | bwd_inner_microstep: 3372.03 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.62 [2025-06-19 16:58:55,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.06 | bwd: 3372.90 | bwd_inner: 3372.03 | bwd_allreduce: 0.82 | step: 7.62 22%|██▏ | 2208/10000 [3:29:16<11:56:55, 5.52s/it] {'loss': 0.0707, 'grad_norm': 0.9725509285926819, 'learning_rate': 3.630130505292029e-05, 'epoch': 2.21} 22%|██▏ | 2208/10000 [3:29:16<11:56:55, 5.52s/it][2025-06-19 16:59:00,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:59:00,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.20 | bwd_microstep: 3317.39 | bwd_inner_microstep: 3316.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 16:59:00,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.20 | bwd: 3317.40 | bwd_inner: 3316.61 | bwd_allreduce: 0.76 | step: 6.58 22%|██▏ | 2209/10000 [3:29:21<11:54:37, 5.50s/it] {'loss': 0.0794, 'grad_norm': 0.8621352314949036, 'learning_rate': 3.629755132807206e-05, 'epoch': 2.21} 22%|██▏ | 2209/10000 [3:29:21<11:54:37, 5.50s/it][2025-06-19 16:59:06,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
16:59:06,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.46 | bwd_microstep: 3312.24 | bwd_inner_microstep: 3311.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 16:59:06,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.46 | bwd: 3312.25 | bwd_inner: 3311.46 | bwd_allreduce: 0.75 | step: 6.63 22%|██▏ | 2210/10000 [3:29:27<11:52:41, 5.49s/it] {'loss': 0.0778, 'grad_norm': 0.5437660813331604, 'learning_rate': 3.6293795893686324e-05, 'epoch': 2.21} 22%|██▏ | 2210/10000 [3:29:27<11:52:41, 5.49s/it][2025-06-19 16:59:11,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:59:11,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.06 | bwd_microstep: 3359.97 | bwd_inner_microstep: 3359.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 16:59:11,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.06 | bwd: 3359.99 | bwd_inner: 3359.17 | bwd_allreduce: 0.77 | step: 6.81 22%|██▏ | 2211/10000 [3:29:32<11:53:47, 5.50s/it] {'loss': 0.0909, 'grad_norm': 0.9723973870277405, 'learning_rate': 3.6290038750157034e-05, 'epoch': 2.21} 22%|██▏ | 2211/10000 [3:29:32<11:53:47, 5.50s/it][2025-06-19 16:59:17,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 16:59:17,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.52 | bwd_microstep: 3315.49 | bwd_inner_microstep: 3314.66 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.11 [2025-06-19 16:59:17,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.52 | bwd: 3315.50 | bwd_inner: 3314.66 | bwd_allreduce: 0.80 | step: 7.11 22%|██▏ | 2212/10000 [3:29:38<11:51:55, 5.48s/it] {'loss': 0.0522, 'grad_norm': 0.6070064902305603, 'learning_rate': 3.6286279897878276e-05, 'epoch': 
2.21} 22%|██▏ | 2212/10000 [3:29:38<11:51:55, 5.48s/it][2025-06-19 16:59:22,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 16:59:22,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.47 | bwd_microstep: 3355.38 | bwd_inner_microstep: 3354.40 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.35 [2025-06-19 16:59:22,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.47 | bwd: 3355.40 | bwd_inner: 3354.40 | bwd_allreduce: 0.95 | step: 7.35 22%|██▏ | 2213/10000 [3:29:43<11:53:40, 5.50s/it] {'loss': 0.0568, 'grad_norm': 0.5468306541442871, 'learning_rate': 3.628251933724435e-05, 'epoch': 2.21} 22%|██▏ | 2213/10000 [3:29:43<11:53:40, 5.50s/it][2025-06-19 16:59:28,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:59:28,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.24 | bwd_microstep: 3312.97 | bwd_inner_microstep: 3312.14 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.44 [2025-06-19 16:59:28,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.24 | bwd: 3312.99 | bwd_inner: 3312.14 | bwd_allreduce: 0.80 | step: 7.44 22%|██▏ | 2214/10000 [3:29:49<11:52:16, 5.49s/it] {'loss': 0.0911, 'grad_norm': 0.9111354351043701, 'learning_rate': 3.62787570686497e-05, 'epoch': 2.21} 22%|██▏ | 2214/10000 [3:29:49<11:52:16, 5.49s/it][2025-06-19 16:59:33,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 16:59:33,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.76 | bwd_microstep: 3313.79 | bwd_inner_microstep: 3312.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 16:59:33,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.76 | 
bwd: 3313.80 | bwd_inner: 3312.99 | bwd_allreduce: 0.76 | step: 6.74 22%|██▏ | 2215/10000 [3:29:54<11:50:53, 5.48s/it] {'loss': 0.1543, 'grad_norm': 1.1485017538070679, 'learning_rate': 3.6274993092489e-05, 'epoch': 2.21} 22%|██▏ | 2215/10000 [3:29:54<11:50:53, 5.48s/it][2025-06-19 16:59:39,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:59:39,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.24 | bwd_microstep: 3315.36 | bwd_inner_microstep: 3314.55 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 16:59:39,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.24 | bwd: 3315.37 | bwd_inner: 3314.55 | bwd_allreduce: 0.78 | step: 7.25 22%|██▏ | 2216/10000 [3:29:59<11:50:06, 5.47s/it] {'loss': 0.1038, 'grad_norm': 0.8620693683624268, 'learning_rate': 3.627122740915705e-05, 'epoch': 2.22} 22%|██▏ | 2216/10000 [3:29:59<11:50:06, 5.47s/it][2025-06-19 16:59:44,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 16:59:44,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.09 | bwd_microstep: 3320.92 | bwd_inner_microstep: 3320.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 16:59:44,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.09 | bwd: 3320.93 | bwd_inner: 3320.12 | bwd_allreduce: 0.76 | step: 6.66 22%|██▏ | 2217/10000 [3:30:05<11:49:23, 5.47s/it] {'loss': 0.0733, 'grad_norm': 0.8630375266075134, 'learning_rate': 3.626746001904887e-05, 'epoch': 2.22} 22%|██▏ | 2217/10000 [3:30:05<11:49:23, 5.47s/it][2025-06-19 16:59:50,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 16:59:50,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2140.05 | bwd_microstep: 3405.73 | bwd_inner_microstep: 3404.93 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-19 16:59:50,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.05 | bwd: 3405.75 | bwd_inner: 3404.93 | bwd_allreduce: 0.78 | step: 7.29 22%|██▏ | 2218/10000 [3:30:11<11:53:53, 5.50s/it] {'loss': 0.041, 'grad_norm': 0.6420944333076477, 'learning_rate': 3.6263690922559625e-05, 'epoch': 2.22} 22%|██▏ | 2218/10000 [3:30:11<11:53:53, 5.50s/it][2025-06-19 16:59:55,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 16:59:55,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.68 | bwd_microstep: 3369.59 | bwd_inner_microstep: 3368.75 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.96 [2025-06-19 16:59:55,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.68 | bwd: 3369.61 | bwd_inner: 3368.75 | bwd_allreduce: 0.82 | step: 6.96 22%|██▏ | 2219/10000 [3:30:16<11:55:01, 5.51s/it] {'loss': 0.1195, 'grad_norm': 1.175360083580017, 'learning_rate': 3.625992012008469e-05, 'epoch': 2.22} 22%|██▏ | 2219/10000 [3:30:16<11:55:01, 5.51s/it][2025-06-19 17:00:01,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:00:01,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.86 | bwd_microstep: 3322.95 | bwd_inner_microstep: 3322.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-19 17:00:01,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.86 | bwd: 3322.97 | bwd_inner: 3322.16 | bwd_allreduce: 0.77 | step: 7.14 22%|██▏ | 2220/10000 [3:30:22<11:53:13, 5.50s/it] {'loss': 0.1367, 'grad_norm': 1.0658977031707764, 'learning_rate': 3.62561476120196e-05, 'epoch': 2.22} 22%|██▏ | 2220/10000 [3:30:22<11:53:13, 5.50s/it][2025-06-19 
17:00:06,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:00:06,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.17 | bwd_microstep: 3382.94 | bwd_inner_microstep: 3381.98 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.96 [2025-06-19 17:00:06,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.17 | bwd: 3382.95 | bwd_inner: 3381.98 | bwd_allreduce: 0.93 | step: 7.97 22%|██▏ | 2221/10000 [3:30:27<11:55:30, 5.52s/it] {'loss': 0.1767, 'grad_norm': 1.1061480045318604, 'learning_rate': 3.625237339876007e-05, 'epoch': 2.22} 22%|██▏ | 2221/10000 [3:30:27<11:55:30, 5.52s/it][2025-06-19 17:00:12,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:00:12,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.79 | bwd_microstep: 3322.40 | bwd_inner_microstep: 3321.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-19 17:00:12,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.79 | bwd: 3322.42 | bwd_inner: 3321.60 | bwd_allreduce: 0.77 | step: 6.82 22%|██▏ | 2222/10000 [3:30:33<11:53:37, 5.50s/it] {'loss': 0.0583, 'grad_norm': 0.5357918739318848, 'learning_rate': 3.624859748070201e-05, 'epoch': 2.22} 22%|██▏ | 2222/10000 [3:30:33<11:53:37, 5.50s/it][2025-06-19 17:00:17,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:00:17,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.30 | bwd_microstep: 3370.12 | bwd_inner_microstep: 3369.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.36 [2025-06-19 17:00:17,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.30 | bwd: 3370.13 | bwd_inner: 3369.31 | bwd_allreduce: 0.78 | step: 
7.36 22%|██▏ | 2223/10000 [3:30:38<11:54:38, 5.51s/it] {'loss': 0.1555, 'grad_norm': 1.4350316524505615, 'learning_rate': 3.624481985824147e-05, 'epoch': 2.22} 22%|██▏ | 2223/10000 [3:30:38<11:54:38, 5.51s/it][2025-06-19 17:00:23,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:00:23,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.98 | bwd_microstep: 3368.63 | bwd_inner_microstep: 3367.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 17:00:23,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.98 | bwd: 3368.64 | bwd_inner: 3367.84 | bwd_allreduce: 0.75 | step: 6.65 22%|██▏ | 2224/10000 [3:30:44<11:55:22, 5.52s/it] {'loss': 0.0673, 'grad_norm': 0.6476210951805115, 'learning_rate': 3.624104053177473e-05, 'epoch': 2.22} 22%|██▏ | 2224/10000 [3:30:44<11:55:22, 5.52s/it][2025-06-19 17:00:28,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:00:28,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.69 | bwd_microstep: 3374.47 | bwd_inner_microstep: 3373.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.30 [2025-06-19 17:00:28,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.69 | bwd: 3374.48 | bwd_inner: 3373.67 | bwd_allreduce: 0.77 | step: 7.30 22%|██▏ | 2225/10000 [3:30:49<11:56:14, 5.53s/it] {'loss': 0.1466, 'grad_norm': 0.8475548624992371, 'learning_rate': 3.623725950169821e-05, 'epoch': 2.23} 22%|██▏ | 2225/10000 [3:30:49<11:56:14, 5.53s/it][2025-06-19 17:00:34,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:00:34,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.10 | bwd_microstep: 3381.13 | bwd_inner_microstep: 
3380.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 17:00:34,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.10 | bwd: 3381.14 | bwd_inner: 3380.34 | bwd_allreduce: 0.76 | step: 6.64 22%|██▏ | 2226/10000 [3:30:55<11:57:04, 5.53s/it] {'loss': 0.0816, 'grad_norm': 1.238913893699646, 'learning_rate': 3.6233476768408536e-05, 'epoch': 2.23} 22%|██▏ | 2226/10000 [3:30:55<11:57:04, 5.53s/it][2025-06-19 17:00:39,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:00:39,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.62 | bwd_microstep: 3384.72 | bwd_inner_microstep: 3383.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.35 [2025-06-19 17:00:39,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.62 | bwd: 3384.73 | bwd_inner: 3383.91 | bwd_allreduce: 0.78 | step: 7.36 22%|██▏ | 2227/10000 [3:31:00<11:57:59, 5.54s/it] {'loss': 0.1158, 'grad_norm': 0.8077952265739441, 'learning_rate': 3.6229692332302485e-05, 'epoch': 2.23} 22%|██▏ | 2227/10000 [3:31:00<11:57:59, 5.54s/it][2025-06-19 17:00:45,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:00:45,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.09 | bwd_microstep: 3371.38 | bwd_inner_microstep: 3370.53 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.15 [2025-06-19 17:00:45,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.09 | bwd: 3371.40 | bwd_inner: 3370.53 | bwd_allreduce: 0.82 | step: 7.16 22%|██▏ | 2228/10000 [3:31:06<11:57:46, 5.54s/it] {'loss': 0.0522, 'grad_norm': 0.481742262840271, 'learning_rate': 3.622590619377703e-05, 'epoch': 2.23} 22%|██▏ | 2228/10000 [3:31:06<11:57:46, 5.54s/it][2025-06-19 17:00:51,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:00:51,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.52 | bwd_microstep: 3320.58 | bwd_inner_microstep: 3319.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-19 17:00:51,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.52 | bwd: 3320.59 | bwd_inner: 3319.78 | bwd_allreduce: 0.77 | step: 6.89 22%|██▏ | 2229/10000 [3:31:11<11:54:46, 5.52s/it] {'loss': 0.0595, 'grad_norm': 0.6381138563156128, 'learning_rate': 3.622211835322933e-05, 'epoch': 2.23} 22%|██▏ | 2229/10000 [3:31:11<11:54:46, 5.52s/it][2025-06-19 17:00:56,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:00:56,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.88 | bwd_microstep: 3382.40 | bwd_inner_microstep: 3381.52 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.01 [2025-06-19 17:00:56,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.88 | bwd: 3382.41 | bwd_inner: 3381.52 | bwd_allreduce: 0.85 | step: 7.01 22%|██▏ | 2230/10000 [3:31:17<11:56:00, 5.53s/it] {'loss': 0.1324, 'grad_norm': 1.078382968902588, 'learning_rate': 3.62183288110567e-05, 'epoch': 2.23} 22%|██▏ | 2230/10000 [3:31:17<11:56:00, 5.53s/it][2025-06-19 17:01:02,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:01:02,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.55 | bwd_microstep: 3333.58 | bwd_inner_microstep: 3332.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 17:01:02,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.55 | bwd: 3333.60 | bwd_inner: 3332.78 | bwd_allreduce: 0.77 | step: 6.95 22%|██▏ | 2231/10000 [3:31:22<11:54:14, 5.52s/it] {'loss': 0.0626, 
'grad_norm': 0.67207932472229, 'learning_rate': 3.621453756765665e-05, 'epoch': 2.23} 22%|██▏ | 2231/10000 [3:31:22<11:54:14, 5.52s/it][2025-06-19 17:01:07,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:01:07,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.45 | bwd_microstep: 3326.47 | bwd_inner_microstep: 3325.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 17:01:07,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.45 | bwd: 3326.48 | bwd_inner: 3325.67 | bwd_allreduce: 0.77 | step: 6.98 22%|██▏ | 2232/10000 [3:31:28<11:52:36, 5.50s/it] {'loss': 0.1028, 'grad_norm': 1.1935079097747803, 'learning_rate': 3.621074462342686e-05, 'epoch': 2.23} 22%|██▏ | 2232/10000 [3:31:28<11:52:36, 5.50s/it][2025-06-19 17:01:12,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:01:12,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.37 | bwd_microstep: 3335.02 | bwd_inner_microstep: 3334.12 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.90 [2025-06-19 17:01:12,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.37 | bwd: 3335.04 | bwd_inner: 3334.12 | bwd_allreduce: 0.88 | step: 6.90 22%|██▏ | 2233/10000 [3:31:33<11:51:33, 5.50s/it] {'loss': 0.083, 'grad_norm': 0.617952287197113, 'learning_rate': 3.62069499787652e-05, 'epoch': 2.23} 22%|██▏ | 2233/10000 [3:31:33<11:51:33, 5.50s/it][2025-06-19 17:01:18,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:01:18,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.08 | bwd_microstep: 3375.88 | bwd_inner_microstep: 3375.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 
17:01:18,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.08 | bwd: 3375.90 | bwd_inner: 3375.09 | bwd_allreduce: 0.76 | step: 6.67 22%|██▏ | 2234/10000 [3:31:39<11:53:22, 5.51s/it] {'loss': 0.0753, 'grad_norm': 0.8469122052192688, 'learning_rate': 3.6203153634069705e-05, 'epoch': 2.23} 22%|██▏ | 2234/10000 [3:31:39<11:53:22, 5.51s/it][2025-06-19 17:01:24,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 17:01:24,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.85 | bwd_microstep: 3373.95 | bwd_inner_microstep: 3372.78 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.99 [2025-06-19 17:01:24,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.85 | bwd: 3373.98 | bwd_inner: 3372.78 | bwd_allreduce: 1.13 | step: 8.00 22%|██▏ | 2235/10000 [3:31:44<11:54:35, 5.52s/it] {'loss': 0.08, 'grad_norm': 0.693047046661377, 'learning_rate': 3.619935558973859e-05, 'epoch': 2.23} 22%|██▏ | 2235/10000 [3:31:44<11:54:35, 5.52s/it][2025-06-19 17:01:29,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:01:29,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.86 | bwd_microstep: 3387.16 | bwd_inner_microstep: 3386.12 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.44 [2025-06-19 17:01:29,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.86 | bwd: 3387.20 | bwd_inner: 3386.12 | bwd_allreduce: 1.00 | step: 7.44 22%|██▏ | 2236/10000 [3:31:50<11:56:35, 5.54s/it] {'loss': 0.0952, 'grad_norm': 1.0143190622329712, 'learning_rate': 3.619555584617026e-05, 'epoch': 2.24} 22%|██▏ | 2236/10000 [3:31:50<11:56:35, 5.54s/it][2025-06-19 17:01:35,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 
[2025-06-19 17:01:35,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.48 | bwd_microstep: 3327.85 | bwd_inner_microstep: 3327.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.59 [2025-06-19 17:01:35,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.48 | bwd: 3327.86 | bwd_inner: 3327.06 | bwd_allreduce: 0.76 | step: 6.60 22%|██▏ | 2237/10000 [3:31:55<11:54:10, 5.52s/it] {'loss': 0.049, 'grad_norm': 0.5297645926475525, 'learning_rate': 3.6191754403763295e-05, 'epoch': 2.24} 22%|██▏ | 2237/10000 [3:31:55<11:54:10, 5.52s/it][2025-06-19 17:01:40,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 17:01:40,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.79 | bwd_microstep: 3337.19 | bwd_inner_microstep: 3336.24 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.12 [2025-06-19 17:01:40,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.79 | bwd: 3337.20 | bwd_inner: 3336.24 | bwd_allreduce: 0.91 | step: 7.12 22%|██▏ | 2238/10000 [3:32:01<11:52:45, 5.51s/it] {'loss': 0.0543, 'grad_norm': 0.4415402412414551, 'learning_rate': 3.618795126291643e-05, 'epoch': 2.24} 22%|██▏ | 2238/10000 [3:32:01<11:52:45, 5.51s/it][2025-06-19 17:01:46,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:01:46,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.03 | bwd_microstep: 3369.99 | bwd_inner_microstep: 3369.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 17:01:46,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.03 | bwd: 3370.00 | bwd_inner: 3369.20 | bwd_allreduce: 0.76 | step: 6.78 22%|██▏ | 2239/10000 [3:32:06<11:54:05, 5.52s/it] {'loss': 0.0823, 'grad_norm': 1.2079771757125854, 'learning_rate': 3.6184146424028614e-05, 
'epoch': 2.24} 22%|██▏ | 2239/10000 [3:32:06<11:54:05, 5.52s/it][2025-06-19 17:01:51,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:01:51,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.98 | bwd_microstep: 3326.92 | bwd_inner_microstep: 3326.10 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.91 [2025-06-19 17:01:51,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.98 | bwd: 3326.93 | bwd_inner: 3326.10 | bwd_allreduce: 0.79 | step: 6.92 22%|██▏ | 2240/10000 [3:32:12<11:52:10, 5.51s/it] {'loss': 0.1389, 'grad_norm': 1.4532288312911987, 'learning_rate': 3.6180339887498953e-05, 'epoch': 2.24} 22%|██▏ | 2240/10000 [3:32:12<11:52:10, 5.51s/it][2025-06-19 17:01:57,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 17:01:57,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.37 | bwd_microstep: 3330.39 | bwd_inner_microstep: 3329.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 17:01:57,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.37 | bwd: 3330.41 | bwd_inner: 3329.61 | bwd_allreduce: 0.75 | step: 6.62 22%|██▏ | 2241/10000 [3:32:17<11:51:13, 5.50s/it] {'loss': 0.0453, 'grad_norm': 0.6218274831771851, 'learning_rate': 3.617653165372673e-05, 'epoch': 2.24} 22%|██▏ | 2241/10000 [3:32:17<11:51:13, 5.50s/it][2025-06-19 17:02:02,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:02:02,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.84 | bwd_microstep: 3389.06 | bwd_inner_microstep: 3388.28 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.51 [2025-06-19 17:02:02,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2141.84 | bwd: 3389.08 | bwd_inner: 3388.28 | bwd_allreduce: 0.75 | step: 6.52 22%|██▏ | 2242/10000 [3:32:23<11:53:41, 5.52s/it] {'loss': 0.1271, 'grad_norm': 0.9748104214668274, 'learning_rate': 3.6172721723111415e-05, 'epoch': 2.24} 22%|██▏ | 2242/10000 [3:32:23<11:53:41, 5.52s/it][2025-06-19 17:02:08,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:02:08,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.76 | bwd_microstep: 3321.76 | bwd_inner_microstep: 3320.99 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.52 [2025-06-19 17:02:08,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.76 | bwd: 3321.77 | bwd_inner: 3320.99 | bwd_allreduce: 0.75 | step: 6.52 22%|██▏ | 2243/10000 [3:32:28<11:51:33, 5.50s/it] {'loss': 0.079, 'grad_norm': 1.501226782798767, 'learning_rate': 3.6168910096052655e-05, 'epoch': 2.24} 22%|██▏ | 2243/10000 [3:32:28<11:51:33, 5.50s/it][2025-06-19 17:02:13,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:02:13,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.22 | bwd_microstep: 3419.63 | bwd_inner_microstep: 3418.85 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 17:02:13,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.22 | bwd: 3419.65 | bwd_inner: 3418.85 | bwd_allreduce: 0.75 | step: 6.63 22%|██▏ | 2244/10000 [3:32:34<11:55:09, 5.53s/it] {'loss': 0.0497, 'grad_norm': 0.4779725968837738, 'learning_rate': 3.616509677295026e-05, 'epoch': 2.24} 22%|██▏ | 2244/10000 [3:32:34<11:55:09, 5.53s/it][2025-06-19 17:02:19,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:02:19,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2142.25 | bwd_microstep: 3388.11 | bwd_inner_microstep: 3387.33 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.57 [2025-06-19 17:02:19,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.25 | bwd: 3388.12 | bwd_inner: 3387.33 | bwd_allreduce: 0.75 | step: 6.58 22%|██▏ | 2245/10000 [3:32:40<11:56:33, 5.54s/it] {'loss': 0.0522, 'grad_norm': 0.39487552642822266, 'learning_rate': 3.616128175420424e-05, 'epoch': 2.25} 22%|██▏ | 2245/10000 [3:32:40<11:56:33, 5.54s/it][2025-06-19 17:02:24,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:02:24,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.80 | bwd_microstep: 3331.88 | bwd_inner_microstep: 3331.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 17:02:24,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.80 | bwd: 3331.90 | bwd_inner: 3331.10 | bwd_allreduce: 0.75 | step: 6.60 22%|██▏ | 2246/10000 [3:32:45<11:54:04, 5.53s/it] {'loss': 0.1046, 'grad_norm': 0.8630045652389526, 'learning_rate': 3.615746504021477e-05, 'epoch': 2.25} 22%|██▏ | 2246/10000 [3:32:45<11:54:04, 5.53s/it][2025-06-19 17:02:30,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:02:30,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.21 | bwd_microstep: 3378.42 | bwd_inner_microstep: 3377.64 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 17:02:30,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.21 | bwd: 3378.44 | bwd_inner: 3377.64 | bwd_allreduce: 0.76 | step: 6.77 22%|██▏ | 2247/10000 [3:32:51<11:54:54, 5.53s/it] {'loss': 0.0609, 'grad_norm': 0.5646699666976929, 'learning_rate': 3.615364663138221e-05, 'epoch': 2.25} 22%|██▏ | 2247/10000 [3:32:51<11:54:54, 5.53s/it][2025-06-19 
17:02:35,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:02:35,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.99 | bwd_microstep: 3378.47 | bwd_inner_microstep: 3377.49 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.46 [2025-06-19 17:02:35,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.00 | bwd: 3378.49 | bwd_inner: 3377.49 | bwd_allreduce: 0.95 | step: 7.46 22%|██▏ | 2248/10000 [3:32:56<11:55:50, 5.54s/it] {'loss': 0.061, 'grad_norm': 0.5387133359909058, 'learning_rate': 3.6149826528107094e-05, 'epoch': 2.25} 22%|██▏ | 2248/10000 [3:32:56<11:55:50, 5.54s/it][2025-06-19 17:02:41,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:02:41,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.24 | bwd_microstep: 3328.95 | bwd_inner_microstep: 3328.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 17:02:41,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.24 | bwd: 3328.96 | bwd_inner: 3328.17 | bwd_allreduce: 0.76 | step: 6.66 22%|██▏ | 2249/10000 [3:33:02<11:53:30, 5.52s/it] {'loss': 0.0537, 'grad_norm': 0.5791246294975281, 'learning_rate': 3.614600473079012e-05, 'epoch': 2.25} 22%|██▏ | 2249/10000 [3:33:02<11:53:30, 5.52s/it][2025-06-19 17:02:46,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:02:46,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.41 | bwd_microstep: 3328.59 | bwd_inner_microstep: 3327.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 17:02:46,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.41 | bwd: 3328.61 | bwd_inner: 3327.81 | bwd_allreduce: 0.75 | step: 
6.58 22%|██▎ | 2250/10000 [3:33:07<11:51:44, 5.51s/it] {'loss': 0.1279, 'grad_norm': 0.9200093746185303, 'learning_rate': 3.614218123983219e-05, 'epoch': 2.25} 22%|██▎ | 2250/10000 [3:33:07<11:51:44, 5.51s/it][2025-06-19 17:02:52,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:02:52,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.42 | bwd_microstep: 3331.54 | bwd_inner_microstep: 3330.76 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.56 [2025-06-19 17:02:52,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.42 | bwd: 3331.55 | bwd_inner: 3330.76 | bwd_allreduce: 0.75 | step: 6.57 23%|██▎ | 2251/10000 [3:33:13<11:50:27, 5.50s/it] {'loss': 0.0731, 'grad_norm': 0.5209107995033264, 'learning_rate': 3.613835605563436e-05, 'epoch': 2.25} 23%|██▎ | 2251/10000 [3:33:13<11:50:27, 5.50s/it][2025-06-19 17:02:57,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:02:57,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.33 | bwd_microstep: 3322.15 | bwd_inner_microstep: 3321.37 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 17:02:57,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.33 | bwd: 3322.16 | bwd_inner: 3321.37 | bwd_allreduce: 0.75 | step: 6.54 23%|██▎ | 2252/10000 [3:33:18<11:49:12, 5.49s/it] {'loss': 0.1975, 'grad_norm': 0.7075250744819641, 'learning_rate': 3.613452917859789e-05, 'epoch': 2.25} 23%|██▎ | 2252/10000 [3:33:18<11:49:12, 5.49s/it][2025-06-19 17:03:03,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:03:03,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.53 | bwd_microstep: 3332.17 | bwd_inner_microstep: 
3331.19 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.00 [2025-06-19 17:03:03,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.53 | bwd: 3332.19 | bwd_inner: 3331.19 | bwd_allreduce: 0.94 | step: 7.00 23%|██▎ | 2253/10000 [3:33:24<11:48:43, 5.49s/it] {'loss': 0.0417, 'grad_norm': 0.39121484756469727, 'learning_rate': 3.613070060912419e-05, 'epoch': 2.25} 23%|██▎ | 2253/10000 [3:33:24<11:48:43, 5.49s/it][2025-06-19 17:03:08,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:03:08,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.13 | bwd_microstep: 3337.03 | bwd_inner_microstep: 3336.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 17:03:08,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.13 | bwd: 3337.04 | bwd_inner: 3336.24 | bwd_allreduce: 0.76 | step: 6.77 23%|██▎ | 2254/10000 [3:33:29<11:48:45, 5.49s/it] {'loss': 0.1036, 'grad_norm': 1.0202168226242065, 'learning_rate': 3.612687034761486e-05, 'epoch': 2.25} 23%|██▎ | 2254/10000 [3:33:29<11:48:45, 5.49s/it][2025-06-19 17:03:14,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:03:14,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.79 | bwd_microstep: 3334.41 | bwd_inner_microstep: 3333.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 17:03:14,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.79 | bwd: 3334.42 | bwd_inner: 3333.60 | bwd_allreduce: 0.78 | step: 6.81 23%|██▎ | 2255/10000 [3:33:35<11:48:34, 5.49s/it] {'loss': 0.0614, 'grad_norm': 0.8044315576553345, 'learning_rate': 3.612303839447167e-05, 'epoch': 2.25} 23%|██▎ | 2255/10000 [3:33:35<11:48:34, 5.49s/it][2025-06-19 17:03:19,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:03:19,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.91 | bwd_microstep: 3404.32 | bwd_inner_microstep: 3403.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 17:03:19,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.91 | bwd: 3404.33 | bwd_inner: 3403.54 | bwd_allreduce: 0.75 | step: 6.62 23%|██▎ | 2256/10000 [3:33:40<11:52:24, 5.52s/it] {'loss': 0.1078, 'grad_norm': 0.7663602828979492, 'learning_rate': 3.611920475009659e-05, 'epoch': 2.26} 23%|██▎ | 2256/10000 [3:33:40<11:52:24, 5.52s/it][2025-06-19 17:03:25,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:03:25,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.33 | bwd_microstep: 3367.31 | bwd_inner_microstep: 3366.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 17:03:25,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.33 | bwd: 3367.33 | bwd_inner: 3366.54 | bwd_allreduce: 0.75 | step: 6.56 23%|██▎ | 2257/10000 [3:33:46<11:52:41, 5.52s/it] {'loss': 0.0589, 'grad_norm': 0.5250054001808167, 'learning_rate': 3.611536941489174e-05, 'epoch': 2.26} 23%|██▎ | 2257/10000 [3:33:46<11:52:41, 5.52s/it][2025-06-19 17:03:30,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:03:30,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.57 | bwd_microstep: 3321.94 | bwd_inner_microstep: 3321.05 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.96 [2025-06-19 17:03:30,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.57 | bwd: 3321.96 | bwd_inner: 3321.05 | bwd_allreduce: 0.86 | step: 6.96 23%|██▎ | 2258/10000 [3:33:51<11:50:33, 5.51s/it] {'loss': 0.1507, 
'grad_norm': 1.0095877647399902, 'learning_rate': 3.6111532389259435e-05, 'epoch': 2.26} 23%|██▎ | 2258/10000 [3:33:51<11:50:33, 5.51s/it][2025-06-19 17:03:36,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:03:36,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.51 | bwd_microstep: 3330.81 | bwd_inner_microstep: 3329.95 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.94 [2025-06-19 17:03:36,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.51 | bwd: 3330.84 | bwd_inner: 3329.95 | bwd_allreduce: 0.82 | step: 6.94 23%|██▎ | 2259/10000 [3:33:57<11:49:50, 5.50s/it] {'loss': 0.0874, 'grad_norm': 0.6754019260406494, 'learning_rate': 3.610769367360215e-05, 'epoch': 2.26} 23%|██▎ | 2259/10000 [3:33:57<11:49:50, 5.50s/it][2025-06-19 17:03:41,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.79 [2025-06-19 17:03:41,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.37 | bwd_microstep: 3333.30 | bwd_inner_microstep: 3332.52 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.64 [2025-06-19 17:03:41,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.37 | bwd: 3333.31 | bwd_inner: 3332.52 | bwd_allreduce: 0.75 | step: 6.65 23%|██▎ | 2260/10000 [3:34:02<11:49:06, 5.50s/it] {'loss': 0.0902, 'grad_norm': 1.0770893096923828, 'learning_rate': 3.6103853268322565e-05, 'epoch': 2.26} 23%|██▎ | 2260/10000 [3:34:02<11:49:06, 5.50s/it][2025-06-19 17:03:47,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:03:47,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.74 | bwd_microstep: 3322.47 | bwd_inner_microstep: 3321.56 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.13 [2025-06-19 
17:03:47,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.74 | bwd: 3322.48 | bwd_inner: 3321.56 | bwd_allreduce: 0.88 | step: 7.13 23%|██▎ | 2261/10000 [3:34:08<11:48:11, 5.49s/it] {'loss': 0.0893, 'grad_norm': 0.6091492772102356, 'learning_rate': 3.6100011173823514e-05, 'epoch': 2.26} 23%|██▎ | 2261/10000 [3:34:08<11:48:11, 5.49s/it][2025-06-19 17:03:52,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 17:03:52,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.56 | bwd_microstep: 3368.40 | bwd_inner_microstep: 3367.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 17:03:52,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.56 | bwd: 3368.42 | bwd_inner: 3367.62 | bwd_allreduce: 0.75 | step: 6.69 23%|██▎ | 2262/10000 [3:34:13<11:49:55, 5.50s/it] {'loss': 0.0442, 'grad_norm': 0.44551950693130493, 'learning_rate': 3.609616739050801e-05, 'epoch': 2.26} 23%|██▎ | 2262/10000 [3:34:13<11:49:55, 5.50s/it][2025-06-19 17:03:58,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:03:58,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.01 | bwd_microstep: 3374.16 | bwd_inner_microstep: 3373.22 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.90 [2025-06-19 17:03:58,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.01 | bwd: 3374.17 | bwd_inner: 3373.22 | bwd_allreduce: 0.90 | step: 6.90 23%|██▎ | 2263/10000 [3:34:19<11:51:32, 5.52s/it] {'loss': 0.062, 'grad_norm': 0.46388453245162964, 'learning_rate': 3.609232191877925e-05, 'epoch': 2.26} 23%|██▎ | 2263/10000 [3:34:19<11:51:32, 5.52s/it][2025-06-19 17:04:03,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 
[2025-06-19 17:04:03,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.92 | bwd_microstep: 3369.53 | bwd_inner_microstep: 3368.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 17:04:03,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.92 | bwd: 3369.54 | bwd_inner: 3368.74 | bwd_allreduce: 0.76 | step: 6.77 23%|██▎ | 2264/10000 [3:34:24<11:52:27, 5.53s/it] {'loss': 0.0892, 'grad_norm': 1.2096861600875854, 'learning_rate': 3.608847475904061e-05, 'epoch': 2.26} 23%|██▎ | 2264/10000 [3:34:24<11:52:27, 5.53s/it][2025-06-19 17:04:09,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:04:09,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.54 | bwd_microstep: 3330.82 | bwd_inner_microstep: 3330.02 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.33 [2025-06-19 17:04:09,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.54 | bwd: 3330.84 | bwd_inner: 3330.02 | bwd_allreduce: 0.77 | step: 7.33 23%|██▎ | 2265/10000 [3:34:30<11:50:48, 5.51s/it] {'loss': 0.0439, 'grad_norm': 0.30821266770362854, 'learning_rate': 3.608462591169564e-05, 'epoch': 2.27} 23%|██▎ | 2265/10000 [3:34:30<11:50:48, 5.51s/it][2025-06-19 17:04:14,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:04:14,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.62 | bwd_microstep: 3329.39 | bwd_inner_microstep: 3328.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 17:04:14,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.62 | bwd: 3329.40 | bwd_inner: 3328.61 | bwd_allreduce: 0.75 | step: 6.60 23%|██▎ | 2266/10000 [3:34:35<11:49:20, 5.50s/it] {'loss': 0.0595, 'grad_norm': 0.5653089284896851, 'learning_rate': 3.6080775377148054e-05, 
'epoch': 2.27} 23%|██▎ | 2266/10000 [3:34:35<11:49:20, 5.50s/it][2025-06-19 17:04:20,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:04:20,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.24 | bwd_microstep: 3337.96 | bwd_inner_microstep: 3337.03 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.73 [2025-06-19 17:04:20,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.24 | bwd: 3337.98 | bwd_inner: 3337.03 | bwd_allreduce: 0.89 | step: 7.74 23%|██▎ | 2267/10000 [3:34:41<11:48:34, 5.50s/it] {'loss': 0.0556, 'grad_norm': 0.5233069062232971, 'learning_rate': 3.6076923155801766e-05, 'epoch': 2.27} 23%|██▎ | 2267/10000 [3:34:41<11:48:34, 5.50s/it][2025-06-19 17:04:25,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:04:25,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.59 | bwd_microstep: 3322.21 | bwd_inner_microstep: 3321.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 17:04:25,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.59 | bwd: 3322.22 | bwd_inner: 3321.42 | bwd_allreduce: 0.76 | step: 6.73 23%|██▎ | 2268/10000 [3:34:46<11:47:43, 5.49s/it] {'loss': 0.1103, 'grad_norm': 1.3712899684906006, 'learning_rate': 3.6073069248060856e-05, 'epoch': 2.27} 23%|██▎ | 2268/10000 [3:34:46<11:47:43, 5.49s/it][2025-06-19 17:04:31,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:04:31,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.98 | bwd_microstep: 3321.93 | bwd_inner_microstep: 3321.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 17:04:31,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2115.98 | bwd: 3321.94 | bwd_inner: 3321.15 | bwd_allreduce: 0.75 | step: 6.55 23%|██▎ | 2269/10000 [3:34:52<11:46:56, 5.49s/it] {'loss': 0.0438, 'grad_norm': 0.4884372651576996, 'learning_rate': 3.606921365432959e-05, 'epoch': 2.27} 23%|██▎ | 2269/10000 [3:34:52<11:46:56, 5.49s/it][2025-06-19 17:04:36,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:04:36,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.61 | bwd_microstep: 3376.63 | bwd_inner_microstep: 3375.85 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.51 [2025-06-19 17:04:36,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.61 | bwd: 3376.65 | bwd_inner: 3375.85 | bwd_allreduce: 0.75 | step: 6.51 23%|██▎ | 2270/10000 [3:34:57<11:49:04, 5.50s/it] {'loss': 0.0527, 'grad_norm': 0.7795057892799377, 'learning_rate': 3.6065356375012374e-05, 'epoch': 2.27} 23%|██▎ | 2270/10000 [3:34:57<11:49:04, 5.50s/it][2025-06-19 17:04:42,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:04:42,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.70 | bwd_microstep: 3315.17 | bwd_inner_microstep: 3314.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 17:04:42,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.70 | bwd: 3315.18 | bwd_inner: 3314.38 | bwd_allreduce: 0.76 | step: 6.79 23%|██▎ | 2271/10000 [3:35:03<11:47:19, 5.49s/it] {'loss': 0.0516, 'grad_norm': 0.48841655254364014, 'learning_rate': 3.6061497410513846e-05, 'epoch': 2.27} 23%|██▎ | 2271/10000 [3:35:03<11:47:19, 5.49s/it][2025-06-19 17:04:47,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:04:47,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 2113.69 | bwd_microstep: 3320.60 | bwd_inner_microstep: 3319.47 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.18 [2025-06-19 17:04:47,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.69 | bwd: 3320.62 | bwd_inner: 3319.47 | bwd_allreduce: 1.11 | step: 7.18 23%|██▎ | 2272/10000 [3:35:08<11:46:32, 5.49s/it] {'loss': 0.0578, 'grad_norm': 0.6353466510772705, 'learning_rate': 3.605763676123878e-05, 'epoch': 2.27} 23%|██▎ | 2272/10000 [3:35:08<11:46:32, 5.49s/it][2025-06-19 17:04:53,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:04:53,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.63 | bwd_microstep: 3313.62 | bwd_inner_microstep: 3312.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 17:04:53,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.63 | bwd: 3313.64 | bwd_inner: 3312.83 | bwd_allreduce: 0.76 | step: 6.68 23%|██▎ | 2273/10000 [3:35:14<11:45:18, 5.48s/it] {'loss': 0.0455, 'grad_norm': 0.6745041608810425, 'learning_rate': 3.6053774427592145e-05, 'epoch': 2.27} 23%|██▎ | 2273/10000 [3:35:14<11:45:18, 5.48s/it][2025-06-19 17:04:58,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:04:58,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.46 | bwd_microstep: 3374.16 | bwd_inner_microstep: 3373.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 17:04:58,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.46 | bwd: 3374.18 | bwd_inner: 3373.37 | bwd_allreduce: 0.76 | step: 6.65 23%|██▎ | 2274/10000 [3:35:19<11:47:45, 5.50s/it] {'loss': 0.0756, 'grad_norm': 0.5442528128623962, 'learning_rate': 3.6049910409979074e-05, 'epoch': 2.27} 23%|██▎ | 2274/10000 [3:35:19<11:47:45, 5.50s/it][2025-06-19 
17:05:04,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:05:04,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.12 | bwd_microstep: 3321.98 | bwd_inner_microstep: 3321.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 17:05:04,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.12 | bwd: 3321.99 | bwd_inner: 3321.19 | bwd_allreduce: 0.76 | step: 6.68 23%|██▎ | 2275/10000 [3:35:25<11:46:35, 5.49s/it] {'loss': 0.0619, 'grad_norm': 1.3218846321105957, 'learning_rate': 3.60460447088049e-05, 'epoch': 2.27} 23%|██▎ | 2275/10000 [3:35:25<11:46:35, 5.49s/it][2025-06-19 17:05:09,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:05:09,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.20 | bwd_microstep: 3365.91 | bwd_inner_microstep: 3364.92 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.71 [2025-06-19 17:05:09,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.20 | bwd: 3365.93 | bwd_inner: 3364.92 | bwd_allreduce: 0.96 | step: 7.73 23%|██▎ | 2276/10000 [3:35:30<11:48:20, 5.50s/it] {'loss': 0.089, 'grad_norm': 0.8093723654747009, 'learning_rate': 3.60421773244751e-05, 'epoch': 2.28} 23%|██▎ | 2276/10000 [3:35:30<11:48:20, 5.50s/it][2025-06-19 17:05:15,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:05:15,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.97 | bwd_microstep: 3317.36 | bwd_inner_microstep: 3316.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.86 [2025-06-19 17:05:15,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.97 | bwd: 3317.37 | bwd_inner: 3316.58 | bwd_allreduce: 0.75 | step: 6.86 
23%|██▎ | 2277/10000 [3:35:36<11:46:48, 5.49s/it] {'loss': 0.0689, 'grad_norm': 1.2057193517684937, 'learning_rate': 3.603830825739536e-05, 'epoch': 2.28} 23%|██▎ | 2277/10000 [3:35:36<11:46:48, 5.49s/it][2025-06-19 17:05:20,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:05:20,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.96 | bwd_microstep: 3325.59 | bwd_inner_microstep: 3324.68 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.32 [2025-06-19 17:05:20,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.96 | bwd: 3325.60 | bwd_inner: 3324.68 | bwd_allreduce: 0.87 | step: 7.33 23%|██▎ | 2278/10000 [3:35:41<11:46:26, 5.49s/it] {'loss': 0.0555, 'grad_norm': 0.5353371500968933, 'learning_rate': 3.6034437507971516e-05, 'epoch': 2.28} 23%|██▎ | 2278/10000 [3:35:41<11:46:26, 5.49s/it][2025-06-19 17:05:26,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:05:26,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.30 | bwd_microstep: 3319.14 | bwd_inner_microstep: 3318.33 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-19 17:05:26,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.30 | bwd: 3319.15 | bwd_inner: 3318.33 | bwd_allreduce: 0.77 | step: 6.93 23%|██▎ | 2279/10000 [3:35:47<11:45:34, 5.48s/it] {'loss': 0.0716, 'grad_norm': 0.7353625893592834, 'learning_rate': 3.60305650766096e-05, 'epoch': 2.28} 23%|██▎ | 2279/10000 [3:35:47<11:45:34, 5.48s/it][2025-06-19 17:05:31,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:05:31,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.43 | bwd_microstep: 3322.02 | bwd_inner_microstep: 3321.18 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 7.17 [2025-06-19 17:05:31,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.43 | bwd: 3322.05 | bwd_inner: 3321.18 | bwd_allreduce: 0.81 | step: 7.18 23%|██▎ | 2280/10000 [3:35:52<11:45:01, 5.48s/it] {'loss': 0.0706, 'grad_norm': 0.836399495601654, 'learning_rate': 3.6026690963715806e-05, 'epoch': 2.28} 23%|██▎ | 2280/10000 [3:35:52<11:45:01, 5.48s/it][2025-06-19 17:05:37,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:05:37,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.07 | bwd_microstep: 3368.22 | bwd_inner_microstep: 3367.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 17:05:37,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.07 | bwd: 3368.24 | bwd_inner: 3367.42 | bwd_allreduce: 0.78 | step: 7.01 23%|██▎ | 2281/10000 [3:35:58<11:47:00, 5.50s/it] {'loss': 0.1302, 'grad_norm': 1.3419485092163086, 'learning_rate': 3.602281516969651e-05, 'epoch': 2.28} 23%|██▎ | 2281/10000 [3:35:58<11:47:00, 5.50s/it][2025-06-19 17:05:42,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:05:42,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.77 | bwd_microstep: 3391.95 | bwd_inner_microstep: 3390.95 | bwd_allreduce_microstep: 0.95 | step_microstep: 6.84 [2025-06-19 17:05:42,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.77 | bwd: 3391.97 | bwd_inner: 3390.95 | bwd_allreduce: 0.96 | step: 6.84 23%|██▎ | 2282/10000 [3:36:03<11:49:29, 5.52s/it] {'loss': 0.0607, 'grad_norm': 2.6143572330474854, 'learning_rate': 3.601893769495827e-05, 'epoch': 2.28} 23%|██▎ | 2282/10000 [3:36:03<11:49:29, 5.52s/it][2025-06-19 17:05:48,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:05:48,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.61 | bwd_microstep: 3319.89 | bwd_inner_microstep: 3319.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 17:05:48,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.61 | bwd: 3319.91 | bwd_inner: 3319.08 | bwd_allreduce: 0.78 | step: 7.10 23%|██▎ | 2283/10000 [3:36:09<11:47:04, 5.50s/it] {'loss': 0.0787, 'grad_norm': 1.3670122623443604, 'learning_rate': 3.6015058539907805e-05, 'epoch': 2.28} 23%|██▎ | 2283/10000 [3:36:09<11:47:04, 5.50s/it][2025-06-19 17:05:53,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 17:05:53,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.68 | bwd_microstep: 3313.64 | bwd_inner_microstep: 3312.73 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.20 [2025-06-19 17:05:53,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.68 | bwd: 3313.65 | bwd_inner: 3312.73 | bwd_allreduce: 0.88 | step: 7.20 23%|██▎ | 2284/10000 [3:36:14<11:45:37, 5.49s/it] {'loss': 0.0306, 'grad_norm': 0.5019016861915588, 'learning_rate': 3.6011177704952036e-05, 'epoch': 2.28} 23%|██▎ | 2284/10000 [3:36:14<11:45:37, 5.49s/it][2025-06-19 17:05:59,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:05:59,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.60 | bwd_microstep: 3363.40 | bwd_inner_microstep: 3362.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 17:05:59,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.60 | bwd: 3363.41 | bwd_inner: 3362.60 | bwd_allreduce: 0.76 | step: 6.66 23%|██▎ | 2285/10000 [3:36:20<11:46:50, 5.50s/it] {'loss': 0.329, 'grad_norm': 
1.7754424810409546, 'learning_rate': 3.600729519049803e-05, 'epoch': 2.29} 23%|██▎ | 2285/10000 [3:36:20<11:46:50, 5.50s/it][2025-06-19 17:06:04,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:06:04,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.52 | bwd_microstep: 3362.15 | bwd_inner_microstep: 3361.29 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.16 [2025-06-19 17:06:04,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.52 | bwd: 3362.16 | bwd_inner: 3361.29 | bwd_allreduce: 0.82 | step: 7.16 23%|██▎ | 2286/10000 [3:36:25<11:47:47, 5.51s/it] {'loss': 0.2652, 'grad_norm': 1.855086326599121, 'learning_rate': 3.600341099695305e-05, 'epoch': 2.29} 23%|██▎ | 2286/10000 [3:36:25<11:47:47, 5.51s/it][2025-06-19 17:06:10,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:06:10,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.99 | bwd_microstep: 3319.31 | bwd_inner_microstep: 3318.26 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.67 [2025-06-19 17:06:10,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.99 | bwd: 3319.32 | bwd_inner: 3318.26 | bwd_allreduce: 1.01 | step: 7.68 23%|██▎ | 2287/10000 [3:36:31<11:46:13, 5.49s/it] {'loss': 0.0845, 'grad_norm': 1.742171049118042, 'learning_rate': 3.599952512472453e-05, 'epoch': 2.29} 23%|██▎ | 2287/10000 [3:36:31<11:46:13, 5.49s/it][2025-06-19 17:06:15,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:06:15,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.60 | bwd_microstep: 3319.90 | bwd_inner_microstep: 3319.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 17:06:15,733] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.60 | bwd: 3319.91 | bwd_inner: 3319.11 | bwd_allreduce: 0.76 | step: 6.65 23%|██▎ | 2288/10000 [3:36:36<11:45:02, 5.49s/it] {'loss': 0.1314, 'grad_norm': 1.3047102689743042, 'learning_rate': 3.599563757422009e-05, 'epoch': 2.29} 23%|██▎ | 2288/10000 [3:36:36<11:45:02, 5.49s/it][2025-06-19 17:06:21,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:06:21,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.36 | bwd_microstep: 3318.35 | bwd_inner_microstep: 3317.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-19 17:06:21,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.36 | bwd: 3318.36 | bwd_inner: 3317.55 | bwd_allreduce: 0.76 | step: 7.08 23%|██▎ | 2289/10000 [3:36:41<11:43:48, 5.48s/it] {'loss': 0.08, 'grad_norm': 1.0981918573379517, 'learning_rate': 3.59917483458475e-05, 'epoch': 2.29} 23%|██▎ | 2289/10000 [3:36:41<11:43:48, 5.48s/it][2025-06-19 17:06:26,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:06:26,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.66 | bwd_microstep: 3313.02 | bwd_inner_microstep: 3312.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 17:06:26,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.66 | bwd: 3313.03 | bwd_inner: 3312.23 | bwd_allreduce: 0.76 | step: 6.69 23%|██▎ | 2290/10000 [3:36:47<11:42:50, 5.47s/it] {'loss': 0.0697, 'grad_norm': 0.4964290261268616, 'learning_rate': 3.598785744001472e-05, 'epoch': 2.29} 23%|██▎ | 2290/10000 [3:36:47<11:42:50, 5.47s/it][2025-06-19 17:06:32,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:06:32,096] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.12 | bwd_microstep: 3312.39 | bwd_inner_microstep: 3311.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 17:06:32,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.12 | bwd: 3312.40 | bwd_inner: 3311.59 | bwd_allreduce: 0.77 | step: 6.62 23%|██▎ | 2291/10000 [3:36:52<11:42:08, 5.46s/it] {'loss': 0.0491, 'grad_norm': 0.6417549848556519, 'learning_rate': 3.5983964857129907e-05, 'epoch': 2.29} 23%|██▎ | 2291/10000 [3:36:52<11:42:08, 5.46s/it][2025-06-19 17:06:37,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:06:37,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.48 | bwd_microstep: 3318.08 | bwd_inner_microstep: 3317.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-19 17:06:37,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.49 | bwd: 3318.09 | bwd_inner: 3317.29 | bwd_allreduce: 0.76 | step: 6.92 23%|██▎ | 2292/10000 [3:36:58<11:41:58, 5.46s/it] {'loss': 0.0735, 'grad_norm': 0.8223130702972412, 'learning_rate': 3.5980070597601364e-05, 'epoch': 2.29} 23%|██▎ | 2292/10000 [3:36:58<11:41:58, 5.46s/it][2025-06-19 17:06:43,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:06:43,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.75 | bwd_microstep: 3320.35 | bwd_inner_microstep: 3319.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 17:06:43,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.75 | bwd: 3320.36 | bwd_inner: 3319.57 | bwd_allreduce: 0.75 | step: 6.54 23%|██▎ | 2293/10000 [3:37:03<11:41:43, 5.46s/it] {'loss': 0.0472, 'grad_norm': 0.4124312996864319, 'learning_rate': 3.5976174661837574e-05, 'epoch': 2.29} 23%|██▎ | 
2293/10000 [3:37:03<11:41:43, 5.46s/it][2025-06-19 17:06:48,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:06:48,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.81 | bwd_microstep: 3316.96 | bwd_inner_microstep: 3316.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 17:06:48,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.81 | bwd: 3316.97 | bwd_inner: 3316.17 | bwd_allreduce: 0.75 | step: 6.67 23%|██▎ | 2294/10000 [3:37:09<11:41:21, 5.46s/it] {'loss': 0.1472, 'grad_norm': 1.1455273628234863, 'learning_rate': 3.597227705024722e-05, 'epoch': 2.29} 23%|██▎ | 2294/10000 [3:37:09<11:41:21, 5.46s/it][2025-06-19 17:06:53,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:06:53,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.35 | bwd_microstep: 3310.16 | bwd_inner_microstep: 3309.34 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.78 [2025-06-19 17:06:53,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.35 | bwd: 3310.18 | bwd_inner: 3309.34 | bwd_allreduce: 0.79 | step: 6.78 23%|██▎ | 2295/10000 [3:37:14<11:40:40, 5.46s/it] {'loss': 0.0299, 'grad_norm': 0.3044823110103607, 'learning_rate': 3.5968377763239125e-05, 'epoch': 2.29} 23%|██▎ | 2295/10000 [3:37:14<11:40:40, 5.46s/it][2025-06-19 17:06:59,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:06:59,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.49 | bwd_microstep: 3392.27 | bwd_inner_microstep: 3391.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-19 17:06:59,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.49 | bwd: 3392.28 | 
bwd_inner: 3391.47 | bwd_allreduce: 0.77 | step: 6.92 23%|██▎ | 2296/10000 [3:37:20<11:44:41, 5.49s/it] {'loss': 0.0753, 'grad_norm': 0.7585833072662354, 'learning_rate': 3.596447680122232e-05, 'epoch': 2.3} 23%|██▎ | 2296/10000 [3:37:20<11:44:41, 5.49s/it][2025-06-19 17:07:05,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:07:05,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.73 | bwd_microstep: 3375.94 | bwd_inner_microstep: 3375.10 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.04 [2025-06-19 17:07:05,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.73 | bwd: 3375.95 | bwd_inner: 3375.10 | bwd_allreduce: 0.81 | step: 7.04 23%|██▎ | 2297/10000 [3:37:25<11:46:44, 5.50s/it] {'loss': 0.1393, 'grad_norm': 1.1317819356918335, 'learning_rate': 3.596057416460599e-05, 'epoch': 2.3} 23%|██▎ | 2297/10000 [3:37:25<11:46:44, 5.50s/it][2025-06-19 17:07:10,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:07:10,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.16 | bwd_microstep: 3369.29 | bwd_inner_microstep: 3368.34 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.31 [2025-06-19 17:07:10,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.16 | bwd: 3369.30 | bwd_inner: 3368.34 | bwd_allreduce: 0.91 | step: 7.31 23%|██▎ | 2298/10000 [3:37:31<11:47:58, 5.52s/it] {'loss': 0.0796, 'grad_norm': 0.8859937191009521, 'learning_rate': 3.5956669853799496e-05, 'epoch': 2.3} 23%|██▎ | 2298/10000 [3:37:31<11:47:58, 5.52s/it][2025-06-19 17:07:16,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:07:16,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.30 | 
bwd_microstep: 3364.72 | bwd_inner_microstep: 3363.90 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.11 [2025-06-19 17:07:16,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.30 | bwd: 3364.73 | bwd_inner: 3363.90 | bwd_allreduce: 0.79 | step: 7.11 23%|██▎ | 2299/10000 [3:37:36<11:48:33, 5.52s/it] {'loss': 0.0524, 'grad_norm': 0.6809373497962952, 'learning_rate': 3.59527638692124e-05, 'epoch': 2.3} 23%|██▎ | 2299/10000 [3:37:36<11:48:33, 5.52s/it][2025-06-19 17:07:21,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:07:21,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.28 | bwd_microstep: 3312.40 | bwd_inner_microstep: 3311.42 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.10 [2025-06-19 17:07:21,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.28 | bwd: 3312.42 | bwd_inner: 3311.42 | bwd_allreduce: 0.95 | step: 7.10 23%|██▎ | 2300/10000 [3:37:42<11:45:58, 5.50s/it] {'loss': 0.0334, 'grad_norm': 0.2867967486381531, 'learning_rate': 3.594885621125442e-05, 'epoch': 2.3} 23%|██▎ | 2300/10000 [3:37:42<11:45:58, 5.50s/it][2025-06-19 17:07:27,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:07:27,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.62 | bwd_microstep: 3311.14 | bwd_inner_microstep: 3310.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 17:07:27,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.62 | bwd: 3311.16 | bwd_inner: 3310.34 | bwd_allreduce: 0.78 | step: 7.01 23%|██▎ | 2301/10000 [3:37:47<11:44:00, 5.49s/it] {'loss': 0.0489, 'grad_norm': 0.6252172589302063, 'learning_rate': 3.594494688033543e-05, 'epoch': 2.3} 23%|██▎ | 2301/10000 [3:37:47<11:44:00, 5.49s/it][2025-06-19 17:07:32,527] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:07:32,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.88 | bwd_microstep: 3359.73 | bwd_inner_microstep: 3358.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 17:07:32,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.88 | bwd: 3359.74 | bwd_inner: 3358.93 | bwd_allreduce: 0.77 | step: 6.60 23%|██▎ | 2302/10000 [3:37:53<11:45:11, 5.50s/it] {'loss': 0.0684, 'grad_norm': 1.1440521478652954, 'learning_rate': 3.594103587686553e-05, 'epoch': 2.3} 23%|██▎ | 2302/10000 [3:37:53<11:45:11, 5.50s/it][2025-06-19 17:07:37,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:07:37,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.47 | bwd_microstep: 3312.89 | bwd_inner_microstep: 3312.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 17:07:37,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.47 | bwd: 3312.90 | bwd_inner: 3312.10 | bwd_allreduce: 0.76 | step: 6.70 23%|██▎ | 2303/10000 [3:37:58<11:43:12, 5.48s/it] {'loss': 0.0692, 'grad_norm': 0.7003065943717957, 'learning_rate': 3.593712320125494e-05, 'epoch': 2.3} 23%|██▎ | 2303/10000 [3:37:58<11:43:12, 5.48s/it][2025-06-19 17:07:43,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:07:43,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.65 | bwd_microstep: 3361.75 | bwd_inner_microstep: 3360.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 17:07:43,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.65 | bwd: 3361.76 | bwd_inner: 3360.96 | bwd_allreduce: 0.76 | step: 6.65 23%|██▎ | 2304/10000 
[3:38:04<11:44:46, 5.49s/it] {'loss': 0.0855, 'grad_norm': 0.7650908827781677, 'learning_rate': 3.59332088539141e-05, 'epoch': 2.3} 23%|██▎ | 2304/10000 [3:38:04<11:44:46, 5.49s/it][2025-06-19 17:07:48,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:07:48,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.93 | bwd_microstep: 3321.53 | bwd_inner_microstep: 3320.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 17:07:48,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.93 | bwd: 3321.54 | bwd_inner: 3320.73 | bwd_allreduce: 0.77 | step: 6.86 23%|██▎ | 2305/10000 [3:38:09<11:43:49, 5.49s/it] {'loss': 0.038, 'grad_norm': 0.4337785840034485, 'learning_rate': 3.59292928352536e-05, 'epoch': 2.31} 23%|██▎ | 2305/10000 [3:38:09<11:43:49, 5.49s/it][2025-06-19 17:07:54,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:07:54,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.25 | bwd_microstep: 3325.69 | bwd_inner_microstep: 3324.69 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.17 [2025-06-19 17:07:54,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.25 | bwd: 3325.71 | bwd_inner: 3324.69 | bwd_allreduce: 0.97 | step: 7.18 23%|██▎ | 2306/10000 [3:38:15<11:43:09, 5.48s/it] {'loss': 0.0335, 'grad_norm': 0.45461124181747437, 'learning_rate': 3.5925375145684203e-05, 'epoch': 2.31} 23%|██▎ | 2306/10000 [3:38:15<11:43:09, 5.48s/it][2025-06-19 17:07:59,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:07:59,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.35 | bwd_microstep: 3315.41 | bwd_inner_microstep: 3314.52 | 
bwd_allreduce_microstep: 0.84 | step_microstep: 6.85 [2025-06-19 17:07:59,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.35 | bwd: 3315.42 | bwd_inner: 3314.52 | bwd_allreduce: 0.86 | step: 6.86 23%|██▎ | 2307/10000 [3:38:20<11:42:16, 5.48s/it] {'loss': 0.0764, 'grad_norm': 0.602543294429779, 'learning_rate': 3.5921455785616874e-05, 'epoch': 2.31} 23%|██▎ | 2307/10000 [3:38:20<11:42:16, 5.48s/it][2025-06-19 17:08:05,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:08:05,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.75 | bwd_microstep: 3320.13 | bwd_inner_microstep: 3319.35 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.55 [2025-06-19 17:08:05,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.75 | bwd: 3320.14 | bwd_inner: 3319.35 | bwd_allreduce: 0.75 | step: 6.55 23%|██▎ | 2308/10000 [3:38:26<11:41:46, 5.47s/it] {'loss': 0.0827, 'grad_norm': 1.433497667312622, 'learning_rate': 3.591753475546272e-05, 'epoch': 2.31} 23%|██▎ | 2308/10000 [3:38:26<11:41:46, 5.47s/it][2025-06-19 17:08:10,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:08:10,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.14 | bwd_microstep: 3313.46 | bwd_inner_microstep: 3312.69 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.52 [2025-06-19 17:08:10,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.14 | bwd: 3313.48 | bwd_inner: 3312.69 | bwd_allreduce: 0.75 | step: 6.52 23%|██▎ | 2309/10000 [3:38:31<11:40:49, 5.47s/it] {'loss': 0.0844, 'grad_norm': 0.7600208520889282, 'learning_rate': 3.591361205563304e-05, 'epoch': 2.31} 23%|██▎ | 2309/10000 [3:38:31<11:40:49, 5.47s/it][2025-06-19 17:08:16,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:08:16,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.60 | bwd_microstep: 3390.21 | bwd_inner_microstep: 3389.36 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.37 [2025-06-19 17:08:16,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.60 | bwd: 3390.23 | bwd_inner: 3389.36 | bwd_allreduce: 0.82 | step: 7.37 23%|██▎ | 2310/10000 [3:38:37<11:44:28, 5.50s/it] {'loss': 0.081, 'grad_norm': 0.5262228846549988, 'learning_rate': 3.590968768653933e-05, 'epoch': 2.31} 23%|██▎ | 2310/10000 [3:38:37<11:44:28, 5.50s/it][2025-06-19 17:08:21,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:08:21,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.93 | bwd_microstep: 3357.28 | bwd_inner_microstep: 3356.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 17:08:21,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.93 | bwd: 3357.30 | bwd_inner: 3356.49 | bwd_allreduce: 0.77 | step: 6.75 23%|██▎ | 2311/10000 [3:38:42<11:45:19, 5.50s/it] {'loss': 0.0691, 'grad_norm': 0.8826682567596436, 'learning_rate': 3.590576164859321e-05, 'epoch': 2.31} 23%|██▎ | 2311/10000 [3:38:42<11:45:19, 5.50s/it][2025-06-19 17:08:27,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:08:27,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.11 | bwd_microstep: 3310.88 | bwd_inner_microstep: 3309.73 | bwd_allreduce_microstep: 1.10 | step_microstep: 7.48 [2025-06-19 17:08:27,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.11 | bwd: 3310.90 | bwd_inner: 3309.73 | bwd_allreduce: 1.12 | step: 7.48 23%|██▎ | 2312/10000 [3:38:48<11:43:24, 5.49s/it] {'loss': 0.0537, 'grad_norm': 
0.7016980648040771, 'learning_rate': 3.590183394220652e-05, 'epoch': 2.31} 23%|██▎ | 2312/10000 [3:38:48<11:43:24, 5.49s/it][2025-06-19 17:08:32,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:08:32,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.64 | bwd_microstep: 3364.37 | bwd_inner_microstep: 3363.59 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-19 17:08:32,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.64 | bwd: 3364.39 | bwd_inner: 3363.59 | bwd_allreduce: 0.76 | step: 6.75 23%|██▎ | 2313/10000 [3:38:53<11:44:56, 5.50s/it] {'loss': 0.0431, 'grad_norm': 0.4449796974658966, 'learning_rate': 3.589790456779124e-05, 'epoch': 2.31} 23%|██▎ | 2313/10000 [3:38:53<11:44:56, 5.50s/it][2025-06-19 17:08:38,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:08:38,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.60 | bwd_microstep: 3317.85 | bwd_inner_microstep: 3317.05 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 17:08:38,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.60 | bwd: 3317.86 | bwd_inner: 3317.05 | bwd_allreduce: 0.77 | step: 6.69 23%|██▎ | 2314/10000 [3:38:59<11:43:01, 5.49s/it] {'loss': 0.037, 'grad_norm': 0.538818895816803, 'learning_rate': 3.589397352575957e-05, 'epoch': 2.31} 23%|██▎ | 2314/10000 [3:38:59<11:43:01, 5.49s/it][2025-06-19 17:08:43,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:08:43,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.27 | bwd_microstep: 3313.75 | bwd_inner_microstep: 3312.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 17:08:43,839] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.27 | bwd: 3313.76 | bwd_inner: 3312.94 | bwd_allreduce: 0.78 | step: 7.19 23%|██▎ | 2315/10000 [3:39:04<11:42:48, 5.49s/it] {'loss': 0.0225, 'grad_norm': 0.4014069437980652, 'learning_rate': 3.589004081652384e-05, 'epoch': 2.31} 23%|██▎ | 2315/10000 [3:39:04<11:42:48, 5.49s/it][2025-06-19 17:08:49,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:08:49,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.59 | bwd_microstep: 3317.94 | bwd_inner_microstep: 3316.92 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.65 [2025-06-19 17:08:49,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.59 | bwd: 3317.96 | bwd_inner: 3316.92 | bwd_allreduce: 0.97 | step: 7.66 23%|██▎ | 2316/10000 [3:39:10<11:41:37, 5.48s/it] {'loss': 0.0985, 'grad_norm': 1.2302695512771606, 'learning_rate': 3.588610644049657e-05, 'epoch': 2.32} 23%|██▎ | 2316/10000 [3:39:10<11:41:37, 5.48s/it][2025-06-19 17:08:54,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:08:54,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.00 | bwd_microstep: 3312.39 | bwd_inner_microstep: 3311.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 17:08:54,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.00 | bwd: 3312.40 | bwd_inner: 3311.60 | bwd_allreduce: 0.76 | step: 6.68 23%|██▎ | 2317/10000 [3:39:15<11:40:44, 5.47s/it] {'loss': 0.035, 'grad_norm': 0.4846998453140259, 'learning_rate': 3.588217039809046e-05, 'epoch': 2.32} 23%|██▎ | 2317/10000 [3:39:15<11:40:44, 5.47s/it][2025-06-19 17:09:00,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 
17:09:00,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.12 | bwd_microstep: 3356.81 | bwd_inner_microstep: 3356.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 17:09:00,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.12 | bwd: 3356.83 | bwd_inner: 3356.00 | bwd_allreduce: 0.78 | step: 7.22 23%|██▎ | 2318/10000 [3:39:21<11:42:43, 5.49s/it] {'loss': 0.1756, 'grad_norm': 1.5178322792053223, 'learning_rate': 3.58782326897184e-05, 'epoch': 2.32} 23%|██▎ | 2318/10000 [3:39:21<11:42:43, 5.49s/it][2025-06-19 17:09:05,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:09:05,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.10 | bwd_microstep: 3379.78 | bwd_inner_microstep: 3378.93 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.05 [2025-06-19 17:09:05,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.10 | bwd: 3379.80 | bwd_inner: 3378.93 | bwd_allreduce: 0.82 | step: 7.05 23%|██▎ | 2319/10000 [3:39:26<11:45:24, 5.51s/it] {'loss': 0.0778, 'grad_norm': 0.757371723651886, 'learning_rate': 3.5874293315793416e-05, 'epoch': 2.32} 23%|██▎ | 2319/10000 [3:39:26<11:45:24, 5.51s/it][2025-06-19 17:09:11,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 17:09:11,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.41 | bwd_microstep: 3316.23 | bwd_inner_microstep: 3315.17 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.26 [2025-06-19 17:09:11,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.41 | bwd: 3316.26 | bwd_inner: 3315.17 | bwd_allreduce: 1.02 | step: 8.27 23%|██▎ | 2320/10000 [3:39:32<11:43:43, 5.50s/it] {'loss': 0.0949, 'grad_norm': 0.8050721287727356, 'learning_rate': 3.587035227672874e-05, 'epoch': 2.32} 
23%|██▎ | 2320/10000 [3:39:32<11:43:43, 5.50s/it][2025-06-19 17:09:16,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:09:16,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.52 | bwd_microstep: 3359.17 | bwd_inner_microstep: 3358.22 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.01 [2025-06-19 17:09:16,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.52 | bwd: 3359.18 | bwd_inner: 3358.22 | bwd_allreduce: 0.92 | step: 7.02 23%|██▎ | 2321/10000 [3:39:37<11:45:00, 5.51s/it] {'loss': 0.1336, 'grad_norm': 1.428615689277649, 'learning_rate': 3.5866409572937765e-05, 'epoch': 2.32} 23%|██▎ | 2321/10000 [3:39:37<11:45:00, 5.51s/it][2025-06-19 17:09:22,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:09:22,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.55 | bwd_microstep: 3304.40 | bwd_inner_microstep: 3303.45 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.03 [2025-06-19 17:09:22,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.55 | bwd: 3304.41 | bwd_inner: 3303.45 | bwd_allreduce: 0.92 | step: 7.03 23%|██▎ | 2322/10000 [3:39:43<11:42:44, 5.49s/it] {'loss': 0.2108, 'grad_norm': 1.6551768779754639, 'learning_rate': 3.586246520483407e-05, 'epoch': 2.32} 23%|██▎ | 2322/10000 [3:39:43<11:42:44, 5.49s/it][2025-06-19 17:09:27,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.79 [2025-06-19 17:09:27,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.13 | bwd_microstep: 3310.46 | bwd_inner_microstep: 3309.59 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.47 [2025-06-19 17:09:27,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.13 | bwd: 
3310.48 | bwd_inner: 3309.59 | bwd_allreduce: 0.84 | step: 7.47 23%|██▎ | 2323/10000 [3:39:48<11:41:15, 5.48s/it] {'loss': 0.1027, 'grad_norm': 1.0712388753890991, 'learning_rate': 3.585851917283139e-05, 'epoch': 2.32} 23%|██▎ | 2323/10000 [3:39:48<11:41:15, 5.48s/it][2025-06-19 17:09:33,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:09:33,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.77 | bwd_microstep: 3307.37 | bwd_inner_microstep: 3306.46 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.88 [2025-06-19 17:09:33,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.77 | bwd: 3307.39 | bwd_inner: 3306.46 | bwd_allreduce: 0.88 | step: 6.89 23%|██▎ | 2324/10000 [3:39:54<11:39:58, 5.47s/it] {'loss': 0.0245, 'grad_norm': 0.394464910030365, 'learning_rate': 3.585457147734365e-05, 'epoch': 2.32} 23%|██▎ | 2324/10000 [3:39:54<11:39:58, 5.47s/it][2025-06-19 17:09:38,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 17:09:38,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.72 | bwd_microstep: 3321.44 | bwd_inner_microstep: 3320.43 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.73 [2025-06-19 17:09:38,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.72 | bwd: 3321.46 | bwd_inner: 3320.43 | bwd_allreduce: 0.97 | step: 7.73 23%|██▎ | 2325/10000 [3:39:59<11:39:47, 5.47s/it] {'loss': 0.1233, 'grad_norm': 1.4527686834335327, 'learning_rate': 3.5850622118784945e-05, 'epoch': 2.33} 23%|██▎ | 2325/10000 [3:39:59<11:39:47, 5.47s/it][2025-06-19 17:09:44,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:09:44,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2129.10 | bwd_microstep: 3364.31 | bwd_inner_microstep: 3363.38 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.20 [2025-06-19 17:09:44,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.10 | bwd: 3364.33 | bwd_inner: 3363.38 | bwd_allreduce: 0.89 | step: 7.20 23%|██▎ | 2326/10000 [3:40:05<11:42:10, 5.49s/it] {'loss': 0.1196, 'grad_norm': 1.0307570695877075, 'learning_rate': 3.5846671097569546e-05, 'epoch': 2.33} 23%|██▎ | 2326/10000 [3:40:05<11:42:10, 5.49s/it][2025-06-19 17:09:49,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.76 [2025-06-19 17:09:49,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.49 | bwd_microstep: 3319.26 | bwd_inner_microstep: 3318.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.04 [2025-06-19 17:09:49,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.49 | bwd: 3319.28 | bwd_inner: 3318.46 | bwd_allreduce: 0.77 | step: 7.04 23%|██▎ | 2327/10000 [3:40:10<11:40:56, 5.48s/it] {'loss': 0.0724, 'grad_norm': 0.8201965093612671, 'learning_rate': 3.58427184141119e-05, 'epoch': 2.33} 23%|██▎ | 2327/10000 [3:40:10<11:40:56, 5.48s/it][2025-06-19 17:09:55,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:09:55,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.07 | bwd_microstep: 3304.20 | bwd_inner_microstep: 3303.23 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.42 [2025-06-19 17:09:55,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.07 | bwd: 3304.21 | bwd_inner: 3303.23 | bwd_allreduce: 0.94 | step: 7.42 23%|██▎ | 2328/10000 [3:40:15<11:39:27, 5.47s/it] {'loss': 0.1187, 'grad_norm': 1.3389633893966675, 'learning_rate': 3.583876406882661e-05, 'epoch': 2.33} 23%|██▎ | 2328/10000 [3:40:15<11:39:27, 5.47s/it][2025-06-19 17:10:00,583] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:10:00,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.88 | bwd_microstep: 3318.31 | bwd_inner_microstep: 3317.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 17:10:00,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.88 | bwd: 3318.32 | bwd_inner: 3317.52 | bwd_allreduce: 0.76 | step: 6.89 23%|██▎ | 2329/10000 [3:40:21<11:39:23, 5.47s/it] {'loss': 0.086, 'grad_norm': 1.8370023965835571, 'learning_rate': 3.583480806212849e-05, 'epoch': 2.33} 23%|██▎ | 2329/10000 [3:40:21<11:39:23, 5.47s/it][2025-06-19 17:10:06,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:10:06,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.26 | bwd_microstep: 3322.64 | bwd_inner_microstep: 3321.72 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.49 [2025-06-19 17:10:06,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.26 | bwd: 3322.65 | bwd_inner: 3321.72 | bwd_allreduce: 0.89 | step: 7.50 23%|██▎ | 2330/10000 [3:40:26<11:39:29, 5.47s/it] {'loss': 0.1009, 'grad_norm': 0.9185839891433716, 'learning_rate': 3.583085039443249e-05, 'epoch': 2.33} 23%|██▎ | 2330/10000 [3:40:26<11:39:29, 5.47s/it][2025-06-19 17:10:11,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.83 [2025-06-19 17:10:11,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.65 | bwd_microstep: 3320.40 | bwd_inner_microstep: 3319.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.56 [2025-06-19 17:10:11,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.65 | bwd: 3320.42 | bwd_inner: 3319.60 | bwd_allreduce: 0.78 | step: 7.56 23%|██▎ | 2331/10000 
[3:40:32<11:39:32, 5.47s/it] {'loss': 0.0902, 'grad_norm': 1.6787559986114502, 'learning_rate': 3.582689106615376e-05, 'epoch': 2.33} 23%|██▎ | 2331/10000 [3:40:32<11:39:32, 5.47s/it][2025-06-19 17:10:16,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.73 | optimizer_step: 2.73 [2025-06-19 17:10:16,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.01 | bwd_microstep: 3316.09 | bwd_inner_microstep: 3315.00 | bwd_allreduce_microstep: 1.04 | step_microstep: 8.30 [2025-06-19 17:10:16,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.01 | bwd: 3316.10 | bwd_inner: 3315.00 | bwd_allreduce: 1.06 | step: 8.31 23%|██▎ | 2332/10000 [3:40:37<11:38:59, 5.47s/it] {'loss': 0.0535, 'grad_norm': 0.7019132971763611, 'learning_rate': 3.582293007770761e-05, 'epoch': 2.33} 23%|██▎ | 2332/10000 [3:40:37<11:38:59, 5.47s/it][2025-06-19 17:10:22,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:10:22,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.10 | bwd_microstep: 3322.93 | bwd_inner_microstep: 3322.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-19 17:10:22,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.10 | bwd: 3322.95 | bwd_inner: 3322.13 | bwd_allreduce: 0.77 | step: 6.92 23%|██▎ | 2333/10000 [3:40:43<11:39:23, 5.47s/it] {'loss': 0.0545, 'grad_norm': 0.7568361163139343, 'learning_rate': 3.581896742950953e-05, 'epoch': 2.33} 23%|██▎ | 2333/10000 [3:40:43<11:39:23, 5.47s/it][2025-06-19 17:10:27,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:10:27,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.18 | bwd_microstep: 3322.46 | bwd_inner_microstep: 3321.66 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 17:10:27,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.18 | bwd: 3322.47 | bwd_inner: 3321.66 | bwd_allreduce: 0.77 | step: 6.88 23%|██▎ | 2334/10000 [3:40:48<11:39:24, 5.47s/it] {'loss': 0.0934, 'grad_norm': 0.6554878950119019, 'learning_rate': 3.581500312197519e-05, 'epoch': 2.33} 23%|██▎ | 2334/10000 [3:40:48<11:39:24, 5.47s/it][2025-06-19 17:10:33,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:10:33,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.85 | bwd_microstep: 3320.61 | bwd_inner_microstep: 3319.68 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.99 [2025-06-19 17:10:33,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.85 | bwd: 3320.63 | bwd_inner: 3319.68 | bwd_allreduce: 0.91 | step: 6.99 23%|██▎ | 2335/10000 [3:40:54<11:39:00, 5.47s/it] {'loss': 0.0823, 'grad_norm': 0.9845922589302063, 'learning_rate': 3.5811037155520414e-05, 'epoch': 2.33} 23%|██▎ | 2335/10000 [3:40:54<11:39:00, 5.47s/it][2025-06-19 17:10:38,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:10:38,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.39 | bwd_microstep: 3314.39 | bwd_inner_microstep: 3313.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 17:10:38,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.39 | bwd: 3314.40 | bwd_inner: 3313.61 | bwd_allreduce: 0.76 | step: 6.59 23%|██▎ | 2336/10000 [3:40:59<11:38:28, 5.47s/it] {'loss': 0.0263, 'grad_norm': 0.42141568660736084, 'learning_rate': 3.580706953056123e-05, 'epoch': 2.34} 23%|██▎ | 2336/10000 [3:40:59<11:38:28, 5.47s/it][2025-06-19 17:10:44,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:10:44,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.57 | bwd_microstep: 3311.66 | bwd_inner_microstep: 3310.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 17:10:44,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.57 | bwd: 3311.68 | bwd_inner: 3310.88 | bwd_allreduce: 0.75 | step: 6.59 23%|██▎ | 2337/10000 [3:41:05<11:37:48, 5.46s/it] {'loss': 0.1144, 'grad_norm': 1.7213631868362427, 'learning_rate': 3.580310024751381e-05, 'epoch': 2.34} 23%|██▎ | 2337/10000 [3:41:05<11:37:48, 5.46s/it][2025-06-19 17:10:49,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:10:49,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.87 | bwd_microstep: 3368.22 | bwd_inner_microstep: 3367.40 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.09 [2025-06-19 17:10:49,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.87 | bwd: 3368.23 | bwd_inner: 3367.40 | bwd_allreduce: 0.79 | step: 7.09 23%|██▎ | 2338/10000 [3:41:10<11:40:33, 5.49s/it] {'loss': 0.1671, 'grad_norm': 1.4574922323226929, 'learning_rate': 3.579912930679452e-05, 'epoch': 2.34} 23%|██▎ | 2338/10000 [3:41:10<11:40:33, 5.49s/it][2025-06-19 17:10:55,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:10:55,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.91 | bwd_microstep: 3319.46 | bwd_inner_microstep: 3318.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 17:10:55,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.91 | bwd: 3319.47 | bwd_inner: 3318.68 | bwd_allreduce: 0.75 | step: 6.71 23%|██▎ | 2339/10000 [3:41:16<11:39:56, 5.48s/it] {'loss': 0.1164, 
'grad_norm': 1.8870452642440796, 'learning_rate': 3.579515670881989e-05, 'epoch': 2.34} 23%|██▎ | 2339/10000 [3:41:16<11:39:56, 5.48s/it][2025-06-19 17:11:00,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:11:00,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.40 | bwd_microstep: 3371.64 | bwd_inner_microstep: 3370.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 17:11:00,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.40 | bwd: 3371.65 | bwd_inner: 3370.83 | bwd_allreduce: 0.78 | step: 7.11 23%|██▎ | 2340/10000 [3:41:21<11:41:46, 5.50s/it] {'loss': 0.1357, 'grad_norm': 1.3324915170669556, 'learning_rate': 3.579118245400663e-05, 'epoch': 2.34} 23%|██▎ | 2340/10000 [3:41:21<11:41:46, 5.50s/it][2025-06-19 17:11:06,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:11:06,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.12 | bwd_microstep: 3321.69 | bwd_inner_microstep: 3320.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-19 17:11:06,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.12 | bwd: 3321.70 | bwd_inner: 3320.90 | bwd_allreduce: 0.76 | step: 6.73 23%|██▎ | 2341/10000 [3:41:27<11:40:14, 5.49s/it] {'loss': 0.0688, 'grad_norm': 1.451724648475647, 'learning_rate': 3.578720654277162e-05, 'epoch': 2.34} 23%|██▎ | 2341/10000 [3:41:27<11:40:14, 5.49s/it][2025-06-19 17:11:11,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:11:11,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.05 | bwd_microstep: 3365.79 | bwd_inner_microstep: 3365.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 
17:11:11,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.05 | bwd: 3365.80 | bwd_inner: 3365.00 | bwd_allreduce: 0.76 | step: 6.65 23%|██▎ | 2342/10000 [3:41:32<11:41:47, 5.50s/it] {'loss': 0.0903, 'grad_norm': 0.8763667941093445, 'learning_rate': 3.578322897553192e-05, 'epoch': 2.34} 23%|██▎ | 2342/10000 [3:41:32<11:41:47, 5.50s/it][2025-06-19 17:11:17,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:11:17,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.88 | bwd_microstep: 3328.06 | bwd_inner_microstep: 3327.10 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.65 [2025-06-19 17:11:17,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.88 | bwd: 3328.08 | bwd_inner: 3327.10 | bwd_allreduce: 0.93 | step: 7.65 23%|██▎ | 2343/10000 [3:41:38<11:40:57, 5.49s/it] {'loss': 0.114, 'grad_norm': 0.9950730204582214, 'learning_rate': 3.577924975270474e-05, 'epoch': 2.34} 23%|██▎ | 2343/10000 [3:41:38<11:40:57, 5.49s/it][2025-06-19 17:11:22,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:11:22,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.70 | bwd_microstep: 3319.21 | bwd_inner_microstep: 3318.15 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.42 [2025-06-19 17:11:22,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.70 | bwd: 3319.23 | bwd_inner: 3318.15 | bwd_allreduce: 1.02 | step: 7.43 23%|██▎ | 2344/10000 [3:41:43<11:40:05, 5.49s/it] {'loss': 0.0932, 'grad_norm': 1.2661998271942139, 'learning_rate': 3.577526887470751e-05, 'epoch': 2.34} 23%|██▎ | 2344/10000 [3:41:43<11:40:05, 5.49s/it][2025-06-19 17:11:28,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 
[2025-06-19 17:11:28,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.76 | bwd_microstep: 3327.08 | bwd_inner_microstep: 3326.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 17:11:28,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.76 | bwd: 3327.10 | bwd_inner: 3326.28 | bwd_allreduce: 0.78 | step: 7.15 23%|██▎ | 2345/10000 [3:41:49<11:39:40, 5.48s/it] {'loss': 0.112, 'grad_norm': 1.1139919757843018, 'learning_rate': 3.577128634195778e-05, 'epoch': 2.34} 23%|██▎ | 2345/10000 [3:41:49<11:39:40, 5.48s/it][2025-06-19 17:11:33,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:11:33,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.06 | bwd_microstep: 3365.99 | bwd_inner_microstep: 3365.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 17:11:33,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.06 | bwd: 3366.01 | bwd_inner: 3365.20 | bwd_allreduce: 0.76 | step: 6.74 23%|██▎ | 2346/10000 [3:41:54<11:41:21, 5.50s/it] {'loss': 0.0737, 'grad_norm': 0.6186294555664062, 'learning_rate': 3.576730215487331e-05, 'epoch': 2.35} 23%|██▎ | 2346/10000 [3:41:54<11:41:21, 5.50s/it][2025-06-19 17:11:39,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:11:39,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.02 | bwd_microstep: 3326.16 | bwd_inner_microstep: 3325.28 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.09 [2025-06-19 17:11:39,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.02 | bwd: 3326.19 | bwd_inner: 3325.28 | bwd_allreduce: 0.85 | step: 7.10 23%|██▎ | 2347/10000 [3:42:00<11:40:20, 5.49s/it] {'loss': 0.0593, 'grad_norm': 0.469305157661438, 'learning_rate': 3.5763316313872023e-05, 
'epoch': 2.35} 23%|██▎ | 2347/10000 [3:42:00<11:40:20, 5.49s/it][2025-06-19 17:11:44,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:11:44,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.56 | bwd_microstep: 3383.15 | bwd_inner_microstep: 3382.27 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.34 [2025-06-19 17:11:44,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.56 | bwd: 3383.16 | bwd_inner: 3382.27 | bwd_allreduce: 0.85 | step: 7.34 23%|██▎ | 2348/10000 [3:42:05<11:43:07, 5.51s/it] {'loss': 0.0933, 'grad_norm': 0.9714233875274658, 'learning_rate': 3.5759328819372014e-05, 'epoch': 2.35} 23%|██▎ | 2348/10000 [3:42:05<11:43:07, 5.51s/it][2025-06-19 17:11:50,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.87 [2025-06-19 17:11:50,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.18 | bwd_microstep: 3326.45 | bwd_inner_microstep: 3325.41 | bwd_allreduce_microstep: 1.00 | step_microstep: 8.37 [2025-06-19 17:11:50,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.18 | bwd: 3326.47 | bwd_inner: 3325.41 | bwd_allreduce: 1.01 | step: 8.37 23%|██▎ | 2349/10000 [3:42:11<11:41:34, 5.50s/it] {'loss': 0.0663, 'grad_norm': 0.7449662089347839, 'learning_rate': 3.575533967179156e-05, 'epoch': 2.35} 23%|██▎ | 2349/10000 [3:42:11<11:41:34, 5.50s/it][2025-06-19 17:11:55,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:11:55,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.84 | bwd_microstep: 3320.75 | bwd_inner_microstep: 3319.94 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-19 17:11:55,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2113.84 | bwd: 3320.77 | bwd_inner: 3319.94 | bwd_allreduce: 0.79 | step: 7.22 24%|██▎ | 2350/10000 [3:42:16<11:40:28, 5.49s/it] {'loss': 0.0417, 'grad_norm': 0.4260098338127136, 'learning_rate': 3.575134887154909e-05, 'epoch': 2.35} 24%|██▎ | 2350/10000 [3:42:16<11:40:28, 5.49s/it][2025-06-19 17:12:01,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:12:01,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.19 | bwd_microstep: 3324.49 | bwd_inner_microstep: 3323.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-19 17:12:01,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.19 | bwd: 3324.51 | bwd_inner: 3323.70 | bwd_allreduce: 0.77 | step: 6.87 24%|██▎ | 2351/10000 [3:42:22<11:39:28, 5.49s/it] {'loss': 0.0851, 'grad_norm': 0.7992961406707764, 'learning_rate': 3.574735641906323e-05, 'epoch': 2.35} 24%|██▎ | 2351/10000 [3:42:22<11:39:28, 5.49s/it][2025-06-19 17:12:06,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:12:06,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.79 | bwd_microstep: 3368.83 | bwd_inner_microstep: 3368.02 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 17:12:06,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.79 | bwd: 3368.84 | bwd_inner: 3368.02 | bwd_allreduce: 0.78 | step: 6.97 24%|██▎ | 2352/10000 [3:42:27<11:41:19, 5.50s/it] {'loss': 0.1605, 'grad_norm': 1.3274829387664795, 'learning_rate': 3.5743362314752764e-05, 'epoch': 2.35} 24%|██▎ | 2352/10000 [3:42:27<11:41:19, 5.50s/it][2025-06-19 17:12:12,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:12:12,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2129.97 | bwd_microstep: 3372.02 | bwd_inner_microstep: 3371.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 17:12:12,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.97 | bwd: 3372.04 | bwd_inner: 3371.22 | bwd_allreduce: 0.77 | step: 6.83 24%|██▎ | 2353/10000 [3:42:33<11:42:55, 5.52s/it] {'loss': 0.1106, 'grad_norm': 1.2369974851608276, 'learning_rate': 3.573936655903667e-05, 'epoch': 2.35} 24%|██▎ | 2353/10000 [3:42:33<11:42:55, 5.52s/it][2025-06-19 17:12:17,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.85 [2025-06-19 17:12:17,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.17 | bwd_microstep: 3326.76 | bwd_inner_microstep: 3325.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-19 17:12:17,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.17 | bwd: 3326.77 | bwd_inner: 3325.79 | bwd_allreduce: 0.94 | step: 6.80 24%|██▎ | 2354/10000 [3:42:38<11:41:12, 5.50s/it] {'loss': 0.1572, 'grad_norm': 1.0665677785873413, 'learning_rate': 3.5735369152334056e-05, 'epoch': 2.35} 24%|██▎ | 2354/10000 [3:42:38<11:41:12, 5.50s/it][2025-06-19 17:12:23,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:12:23,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.59 | bwd_microstep: 3322.98 | bwd_inner_microstep: 3322.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 17:12:23,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.59 | bwd: 3322.99 | bwd_inner: 3322.19 | bwd_allreduce: 0.76 | step: 6.70 24%|██▎ | 2355/10000 [3:42:44<11:39:44, 5.49s/it] {'loss': 0.0848, 'grad_norm': 1.0101951360702515, 'learning_rate': 3.573137009506426e-05, 'epoch': 2.35} 24%|██▎ | 2355/10000 [3:42:44<11:39:44, 5.49s/it][2025-06-19 
17:12:28,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:12:28,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.88 | bwd_microstep: 3375.11 | bwd_inner_microstep: 3374.03 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.90 [2025-06-19 17:12:28,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.88 | bwd: 3375.13 | bwd_inner: 3374.03 | bwd_allreduce: 1.04 | step: 7.91 24%|██▎ | 2356/10000 [3:42:49<11:41:50, 5.51s/it] {'loss': 0.1207, 'grad_norm': 0.8740012645721436, 'learning_rate': 3.572736938764675e-05, 'epoch': 2.36} 24%|██▎ | 2356/10000 [3:42:49<11:41:50, 5.51s/it][2025-06-19 17:12:34,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:12:34,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.96 | bwd_microstep: 3410.19 | bwd_inner_microstep: 3409.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.35 [2025-06-19 17:12:34,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.96 | bwd: 3410.20 | bwd_inner: 3409.38 | bwd_allreduce: 0.78 | step: 7.35 24%|██▎ | 2357/10000 [3:42:55<11:45:05, 5.54s/it] {'loss': 0.0475, 'grad_norm': 0.6383072137832642, 'learning_rate': 3.5723367030501176e-05, 'epoch': 2.36} 24%|██▎ | 2357/10000 [3:42:55<11:45:05, 5.54s/it][2025-06-19 17:12:39,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:12:39,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.81 | bwd_microstep: 3336.74 | bwd_inner_microstep: 3335.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 17:12:39,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.81 | bwd: 3336.75 | bwd_inner: 3335.94 | bwd_allreduce: 0.77 | step: 
6.70 24%|██▎ | 2358/10000 [3:43:00<11:43:02, 5.52s/it] {'loss': 0.0606, 'grad_norm': 0.7204890847206116, 'learning_rate': 3.571936302404739e-05, 'epoch': 2.36} 24%|██▎ | 2358/10000 [3:43:00<11:43:02, 5.52s/it][2025-06-19 17:12:45,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:12:45,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.42 | bwd_microstep: 3372.19 | bwd_inner_microstep: 3371.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 17:12:45,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.42 | bwd: 3372.20 | bwd_inner: 3371.40 | bwd_allreduce: 0.76 | step: 6.84 24%|██▎ | 2359/10000 [3:43:06<11:43:42, 5.53s/it] {'loss': 0.068, 'grad_norm': 0.5178171396255493, 'learning_rate': 3.571535736870537e-05, 'epoch': 2.36} 24%|██▎ | 2359/10000 [3:43:06<11:43:42, 5.53s/it][2025-06-19 17:12:51,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:12:51,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.39 | bwd_microstep: 3374.03 | bwd_inner_microstep: 3373.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 17:12:51,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.39 | bwd: 3374.04 | bwd_inner: 3373.24 | bwd_allreduce: 0.76 | step: 6.69 24%|██▎ | 2360/10000 [3:43:11<11:44:26, 5.53s/it] {'loss': 0.0582, 'grad_norm': 0.5484987497329712, 'learning_rate': 3.5711350064895295e-05, 'epoch': 2.36} 24%|██▎ | 2360/10000 [3:43:11<11:44:26, 5.53s/it][2025-06-19 17:12:56,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:12:56,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.96 | bwd_microstep: 3317.32 | bwd_inner_microstep: 
3316.48 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.91 [2025-06-19 17:12:56,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.96 | bwd: 3317.33 | bwd_inner: 3316.48 | bwd_allreduce: 0.80 | step: 6.92 24%|██▎ | 2361/10000 [3:43:17<11:41:43, 5.51s/it] {'loss': 0.0486, 'grad_norm': 0.49869653582572937, 'learning_rate': 3.5707341113037524e-05, 'epoch': 2.36} 24%|██▎ | 2361/10000 [3:43:17<11:41:43, 5.51s/it][2025-06-19 17:13:01,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:13:01,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.66 | bwd_microstep: 3324.70 | bwd_inner_microstep: 3323.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 17:13:01,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.66 | bwd: 3324.72 | bwd_inner: 3323.90 | bwd_allreduce: 0.77 | step: 7.07 24%|██▎ | 2362/10000 [3:43:22<11:40:14, 5.50s/it] {'loss': 0.1096, 'grad_norm': 1.251044511795044, 'learning_rate': 3.570333051355257e-05, 'epoch': 2.36} 24%|██▎ | 2362/10000 [3:43:22<11:40:14, 5.50s/it][2025-06-19 17:13:07,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:13:07,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.36 | bwd_microstep: 3322.82 | bwd_inner_microstep: 3322.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 17:13:07,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.36 | bwd: 3322.83 | bwd_inner: 3322.03 | bwd_allreduce: 0.76 | step: 6.55 24%|██▎ | 2363/10000 [3:43:28<11:38:59, 5.49s/it] {'loss': 0.1336, 'grad_norm': 1.0846761465072632, 'learning_rate': 3.569931826686112e-05, 'epoch': 2.36} 24%|██▎ | 2363/10000 [3:43:28<11:38:59, 5.49s/it][2025-06-19 17:13:12,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:13:12,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.26 | bwd_microstep: 3322.09 | bwd_inner_microstep: 3321.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 17:13:12,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.26 | bwd: 3322.11 | bwd_inner: 3321.31 | bwd_allreduce: 0.75 | step: 6.54 24%|██▎ | 2364/10000 [3:43:33<11:38:06, 5.49s/it] {'loss': 0.0431, 'grad_norm': 0.9913967251777649, 'learning_rate': 3.569530437338405e-05, 'epoch': 2.36} 24%|██▎ | 2364/10000 [3:43:33<11:38:06, 5.49s/it][2025-06-19 17:13:18,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:13:18,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.86 | bwd_microstep: 3326.05 | bwd_inner_microstep: 3325.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 17:13:18,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.87 | bwd: 3326.06 | bwd_inner: 3325.26 | bwd_allreduce: 0.76 | step: 6.56 24%|██▎ | 2365/10000 [3:43:39<11:37:16, 5.48s/it] {'loss': 0.0639, 'grad_norm': 0.554210364818573, 'learning_rate': 3.56912888335424e-05, 'epoch': 2.37} 24%|██▎ | 2365/10000 [3:43:39<11:37:16, 5.48s/it][2025-06-19 17:13:23,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:13:23,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.76 | bwd_microstep: 3318.09 | bwd_inner_microstep: 3317.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 17:13:23,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.76 | bwd: 3318.10 | bwd_inner: 3317.30 | bwd_allreduce: 0.76 | step: 6.57 24%|██▎ | 2366/10000 [3:43:44<11:36:22, 5.47s/it] {'loss': 0.0896, 
'grad_norm': 1.3563646078109741, 'learning_rate': 3.5687271647757374e-05, 'epoch': 2.37} 24%|██▎ | 2366/10000 [3:43:44<11:36:22, 5.47s/it][2025-06-19 17:13:29,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:13:29,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.38 | bwd_microstep: 3374.96 | bwd_inner_microstep: 3374.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 17:13:29,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.38 | bwd: 3374.98 | bwd_inner: 3374.17 | bwd_allreduce: 0.76 | step: 6.95 24%|██▎ | 2367/10000 [3:43:50<11:38:51, 5.49s/it] {'loss': 0.0683, 'grad_norm': 0.8341872096061707, 'learning_rate': 3.5683252816450355e-05, 'epoch': 2.37} 24%|██▎ | 2367/10000 [3:43:50<11:38:51, 5.49s/it][2025-06-19 17:13:34,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:13:34,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.20 | bwd_microstep: 3322.71 | bwd_inner_microstep: 3321.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 17:13:34,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.20 | bwd: 3322.72 | bwd_inner: 3321.93 | bwd_allreduce: 0.75 | step: 6.67 24%|██▎ | 2368/10000 [3:43:55<11:37:57, 5.49s/it] {'loss': 0.1058, 'grad_norm': 2.1291351318359375, 'learning_rate': 3.5679232340042904e-05, 'epoch': 2.37} 24%|██▎ | 2368/10000 [3:43:55<11:37:57, 5.49s/it][2025-06-19 17:13:40,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:13:40,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.44 | bwd_microstep: 3402.42 | bwd_inner_microstep: 3401.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.97 
[2025-06-19 17:13:40,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.44 | bwd: 3402.43 | bwd_inner: 3401.63 | bwd_allreduce: 0.76 | step: 6.97 24%|██▎ | 2369/10000 [3:44:01<11:41:22, 5.51s/it] {'loss': 0.0512, 'grad_norm': 0.45142221450805664, 'learning_rate': 3.567521021895676e-05, 'epoch': 2.37} 24%|██▎ | 2369/10000 [3:44:01<11:41:22, 5.51s/it][2025-06-19 17:13:45,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:13:45,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.54 | bwd_microstep: 3317.44 | bwd_inner_microstep: 3316.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 17:13:45,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.54 | bwd: 3317.45 | bwd_inner: 3316.66 | bwd_allreduce: 0.76 | step: 6.54 24%|██▎ | 2370/10000 [3:44:06<11:39:14, 5.50s/it] {'loss': 0.0763, 'grad_norm': 0.7383897304534912, 'learning_rate': 3.5671186453613804e-05, 'epoch': 2.37} 24%|██▎ | 2370/10000 [3:44:06<11:39:14, 5.50s/it][2025-06-19 17:13:51,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:13:51,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.61 | bwd_microstep: 3375.05 | bwd_inner_microstep: 3374.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 17:13:51,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.61 | bwd: 3375.06 | bwd_inner: 3374.26 | bwd_allreduce: 0.76 | step: 6.62 24%|██▎ | 2371/10000 [3:44:12<11:41:00, 5.51s/it] {'loss': 0.0536, 'grad_norm': 0.5346354842185974, 'learning_rate': 3.5667161044436124e-05, 'epoch': 2.37} 24%|██▎ | 2371/10000 [3:44:12<11:41:00, 5.51s/it][2025-06-19 17:13:56,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.73 [2025-06-19 17:13:56,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.56 | bwd_microstep: 3375.12 | bwd_inner_microstep: 3374.19 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.99 [2025-06-19 17:13:56,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.56 | bwd: 3375.15 | bwd_inner: 3374.19 | bwd_allreduce: 0.90 | step: 7.00 24%|██▎ | 2372/10000 [3:44:17<11:42:11, 5.52s/it] {'loss': 0.0804, 'grad_norm': 0.9664533734321594, 'learning_rate': 3.566313399184597e-05, 'epoch': 2.37} 24%|██▎ | 2372/10000 [3:44:17<11:42:11, 5.52s/it][2025-06-19 17:14:02,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:14:02,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.80 | bwd_microstep: 3374.36 | bwd_inner_microstep: 3373.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 17:14:02,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.80 | bwd: 3374.37 | bwd_inner: 3373.58 | bwd_allreduce: 0.75 | step: 6.64 24%|██▎ | 2373/10000 [3:44:23<11:42:44, 5.53s/it] {'loss': 0.0777, 'grad_norm': 0.8328354358673096, 'learning_rate': 3.5659105296265744e-05, 'epoch': 2.37} 24%|██▎ | 2373/10000 [3:44:23<11:42:44, 5.53s/it][2025-06-19 17:14:07,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:14:07,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.52 | bwd_microstep: 3317.98 | bwd_inner_microstep: 3317.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 17:14:07,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.52 | bwd: 3317.99 | bwd_inner: 3317.18 | bwd_allreduce: 0.77 | step: 7.00 24%|██▎ | 2374/10000 [3:44:28<11:40:07, 5.51s/it] {'loss': 0.0876, 'grad_norm': 1.6178516149520874, 'learning_rate': 
3.565507495811805e-05, 'epoch': 2.37} 24%|██▎ | 2374/10000 [3:44:28<11:40:07, 5.51s/it][2025-06-19 17:14:13,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:14:13,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.04 | bwd_microstep: 3331.60 | bwd_inner_microstep: 3330.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-19 17:14:13,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.04 | bwd: 3331.62 | bwd_inner: 3330.78 | bwd_allreduce: 0.79 | step: 6.87 24%|██▍ | 2375/10000 [3:44:34<11:39:06, 5.50s/it] {'loss': 0.0485, 'grad_norm': 0.5452425479888916, 'learning_rate': 3.5651042977825666e-05, 'epoch': 2.38} 24%|██▍ | 2375/10000 [3:44:34<11:39:06, 5.50s/it][2025-06-19 17:14:18,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:14:18,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.26 | bwd_microstep: 3371.89 | bwd_inner_microstep: 3370.95 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.13 [2025-06-19 17:14:18,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.26 | bwd: 3371.91 | bwd_inner: 3370.95 | bwd_allreduce: 0.91 | step: 7.14 24%|██▍ | 2376/10000 [3:44:39<11:40:27, 5.51s/it] {'loss': 0.1213, 'grad_norm': 0.9988662600517273, 'learning_rate': 3.56470093558115e-05, 'epoch': 2.38} 24%|██▍ | 2376/10000 [3:44:39<11:40:27, 5.51s/it][2025-06-19 17:14:24,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:14:24,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.94 | bwd_microstep: 3325.14 | bwd_inner_microstep: 3324.24 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.45 [2025-06-19 17:14:24,480] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2114.94 | bwd: 3325.16 | bwd_inner: 3324.24 | bwd_allreduce: 0.87 | step: 7.45 24%|██▍ | 2377/10000 [3:44:45<11:39:08, 5.50s/it] {'loss': 0.0741, 'grad_norm': 1.187016487121582, 'learning_rate': 3.564297409249867e-05, 'epoch': 2.38} 24%|██▍ | 2377/10000 [3:44:45<11:39:08, 5.50s/it][2025-06-19 17:14:29,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:14:29,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.17 | bwd_microstep: 3321.00 | bwd_inner_microstep: 3320.16 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.85 [2025-06-19 17:14:29,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.17 | bwd: 3321.02 | bwd_inner: 3320.16 | bwd_allreduce: 0.81 | step: 6.86 24%|██▍ | 2378/10000 [3:44:50<11:37:31, 5.49s/it] {'loss': 0.0528, 'grad_norm': 0.7109774351119995, 'learning_rate': 3.563893718831047e-05, 'epoch': 2.38} 24%|██▍ | 2378/10000 [3:44:50<11:37:31, 5.49s/it][2025-06-19 17:14:35,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:14:35,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.92 | bwd_microstep: 3325.47 | bwd_inner_microstep: 3324.66 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 17:14:35,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.92 | bwd: 3325.48 | bwd_inner: 3324.66 | bwd_allreduce: 0.77 | step: 6.97 24%|██▍ | 2379/10000 [3:44:56<11:36:55, 5.49s/it] {'loss': 0.0765, 'grad_norm': 1.993119478225708, 'learning_rate': 3.563489864367033e-05, 'epoch': 2.38} 24%|██▍ | 2379/10000 [3:44:56<11:36:55, 5.49s/it][2025-06-19 17:14:40,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:14:40,961] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.59 | bwd_microstep: 3369.36 | bwd_inner_microstep: 3368.54 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 17:14:40,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.59 | bwd: 3369.37 | bwd_inner: 3368.54 | bwd_allreduce: 0.79 | step: 7.29 24%|██▍ | 2380/10000 [3:45:01<11:38:54, 5.50s/it] {'loss': 0.0625, 'grad_norm': 0.6963226199150085, 'learning_rate': 3.563085845900189e-05, 'epoch': 2.38} 24%|██▍ | 2380/10000 [3:45:01<11:38:54, 5.50s/it][2025-06-19 17:14:46,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:14:46,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.60 | bwd_microstep: 3376.67 | bwd_inner_microstep: 3375.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 17:14:46,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.60 | bwd: 3376.69 | bwd_inner: 3375.88 | bwd_allreduce: 0.76 | step: 6.71 24%|██▍ | 2381/10000 [3:45:07<11:40:15, 5.51s/it] {'loss': 0.1401, 'grad_norm': 0.8696639537811279, 'learning_rate': 3.562681663472894e-05, 'epoch': 2.38} 24%|██▍ | 2381/10000 [3:45:07<11:40:15, 5.51s/it][2025-06-19 17:14:52,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:14:52,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.96 | bwd_microstep: 3373.40 | bwd_inner_microstep: 3372.47 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.48 [2025-06-19 17:14:52,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.96 | bwd: 3373.42 | bwd_inner: 3372.47 | bwd_allreduce: 0.89 | step: 7.49 24%|██▍ | 2382/10000 [3:45:12<11:41:03, 5.52s/it] {'loss': 0.0589, 'grad_norm': 0.7872399091720581, 'learning_rate': 3.5622773171275456e-05, 'epoch': 2.38} 24%|██▍ | 2382/10000 
[3:45:12<11:41:03, 5.52s/it][2025-06-19 17:14:57,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:14:57,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.92 | bwd_microstep: 3370.90 | bwd_inner_microstep: 3370.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-19 17:14:57,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.92 | bwd: 3370.92 | bwd_inner: 3370.11 | bwd_allreduce: 0.76 | step: 7.02 24%|██▍ | 2383/10000 [3:45:18<11:41:40, 5.53s/it] {'loss': 0.0412, 'grad_norm': 0.52264004945755, 'learning_rate': 3.561872806906558e-05, 'epoch': 2.38} 24%|██▍ | 2383/10000 [3:45:18<11:41:40, 5.53s/it][2025-06-19 17:15:03,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:15:03,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.70 | bwd_microstep: 3371.86 | bwd_inner_microstep: 3371.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 17:15:03,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.71 | bwd: 3371.87 | bwd_inner: 3371.05 | bwd_allreduce: 0.78 | step: 7.06 24%|██▍ | 2384/10000 [3:45:23<11:41:59, 5.53s/it] {'loss': 0.0302, 'grad_norm': 0.394990473985672, 'learning_rate': 3.56146813285236e-05, 'epoch': 2.38} 24%|██▍ | 2384/10000 [3:45:23<11:41:59, 5.53s/it][2025-06-19 17:15:08,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:15:08,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.63 | bwd_microstep: 3324.75 | bwd_inner_microstep: 3323.78 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.20 [2025-06-19 17:15:08,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.63 | bwd: 3324.77 | bwd_inner: 
3323.78 | bwd_allreduce: 0.95 | step: 7.20 24%|██▍ | 2385/10000 [3:45:29<11:39:45, 5.51s/it] {'loss': 0.1138, 'grad_norm': 1.7903587818145752, 'learning_rate': 3.561063295007403e-05, 'epoch': 2.38} 24%|██▍ | 2385/10000 [3:45:29<11:39:45, 5.51s/it][2025-06-19 17:15:14,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:15:14,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.20 | bwd_microstep: 3313.75 | bwd_inner_microstep: 3312.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 17:15:14,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.20 | bwd: 3313.77 | bwd_inner: 3312.96 | bwd_allreduce: 0.76 | step: 6.70 24%|██▍ | 2386/10000 [3:45:34<11:37:31, 5.50s/it] {'loss': 0.0584, 'grad_norm': 0.7133169770240784, 'learning_rate': 3.560658293414152e-05, 'epoch': 2.39} 24%|██▍ | 2386/10000 [3:45:34<11:37:31, 5.50s/it][2025-06-19 17:15:19,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:15:19,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.53 | bwd_microstep: 3314.10 | bwd_inner_microstep: 3313.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 17:15:19,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.53 | bwd: 3314.12 | bwd_inner: 3313.29 | bwd_allreduce: 0.78 | step: 7.24 24%|██▍ | 2387/10000 [3:45:40<11:35:47, 5.48s/it] {'loss': 0.1335, 'grad_norm': 1.2720632553100586, 'learning_rate': 3.5602531281150884e-05, 'epoch': 2.39} 24%|██▍ | 2387/10000 [3:45:40<11:35:47, 5.48s/it][2025-06-19 17:15:24,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:15:24,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.27 | 
bwd_microstep: 3317.50 | bwd_inner_microstep: 3316.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-19 17:15:24,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.27 | bwd: 3317.51 | bwd_inner: 3316.71 | bwd_allreduce: 0.76 | step: 6.81 24%|██▍ | 2388/10000 [3:45:45<11:34:46, 5.48s/it] {'loss': 0.1522, 'grad_norm': 0.936018705368042, 'learning_rate': 3.559847799152714e-05, 'epoch': 2.39} 24%|██▍ | 2388/10000 [3:45:45<11:34:46, 5.48s/it][2025-06-19 17:15:30,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:15:30,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.32 | bwd_microstep: 3372.81 | bwd_inner_microstep: 3371.99 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.67 [2025-06-19 17:15:30,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.32 | bwd: 3372.82 | bwd_inner: 3371.99 | bwd_allreduce: 0.79 | step: 6.68 24%|██▍ | 2389/10000 [3:45:51<11:37:07, 5.50s/it] {'loss': 0.0392, 'grad_norm': 0.5543992519378662, 'learning_rate': 3.559442306569544e-05, 'epoch': 2.39} 24%|██▍ | 2389/10000 [3:45:51<11:37:07, 5.50s/it][2025-06-19 17:15:36,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:15:36,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.60 | bwd_microstep: 3368.31 | bwd_inner_microstep: 3367.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 17:15:36,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.60 | bwd: 3368.32 | bwd_inner: 3367.51 | bwd_allreduce: 0.77 | step: 6.69 24%|██▍ | 2390/10000 [3:45:56<11:38:41, 5.51s/it] {'loss': 0.1621, 'grad_norm': 1.334677815437317, 'learning_rate': 3.5590366504081136e-05, 'epoch': 2.39} 24%|██▍ | 2390/10000 [3:45:56<11:38:41, 5.51s/it][2025-06-19 17:15:41,572] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:15:41,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.74 | bwd_microstep: 3364.00 | bwd_inner_microstep: 3363.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 17:15:41,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.74 | bwd: 3364.02 | bwd_inner: 3363.20 | bwd_allreduce: 0.78 | step: 7.06 24%|██▍ | 2391/10000 [3:46:02<11:39:24, 5.52s/it] {'loss': 0.0357, 'grad_norm': 0.7351389527320862, 'learning_rate': 3.558630830710975e-05, 'epoch': 2.39} 24%|██▍ | 2391/10000 [3:46:02<11:39:24, 5.52s/it][2025-06-19 17:15:47,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 17:15:47,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.69 | bwd_microstep: 3316.84 | bwd_inner_microstep: 3316.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 17:15:47,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.69 | bwd: 3316.85 | bwd_inner: 3316.06 | bwd_allreduce: 0.75 | step: 6.64 24%|██▍ | 2392/10000 [3:46:07<11:37:01, 5.50s/it] {'loss': 0.0324, 'grad_norm': 0.5919739603996277, 'learning_rate': 3.5582248475206955e-05, 'epoch': 2.39} 24%|██▍ | 2392/10000 [3:46:07<11:37:01, 5.50s/it][2025-06-19 17:15:52,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:15:52,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.90 | bwd_microstep: 3367.48 | bwd_inner_microstep: 3366.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 17:15:52,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.90 | bwd: 3367.50 | bwd_inner: 3366.70 | bwd_allreduce: 0.76 | step: 6.61 24%|██▍ | 
2393/10000 [3:46:13<11:38:26, 5.51s/it] {'loss': 0.0246, 'grad_norm': 0.35678672790527344, 'learning_rate': 3.5578187008798614e-05, 'epoch': 2.39} 24%|██▍ | 2393/10000 [3:46:13<11:38:26, 5.51s/it][2025-06-19 17:15:58,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:15:58,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.91 | bwd_microstep: 3375.62 | bwd_inner_microstep: 3374.71 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.90 [2025-06-19 17:15:58,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.91 | bwd: 3375.64 | bwd_inner: 3374.71 | bwd_allreduce: 0.88 | step: 6.90 24%|██▍ | 2394/10000 [3:46:18<11:39:32, 5.52s/it] {'loss': 0.1008, 'grad_norm': 0.9670867323875427, 'learning_rate': 3.557412390831076e-05, 'epoch': 2.39} 24%|██▍ | 2394/10000 [3:46:18<11:39:32, 5.52s/it][2025-06-19 17:16:03,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:16:03,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.93 | bwd_microstep: 3318.17 | bwd_inner_microstep: 3317.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 17:16:03,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.93 | bwd: 3318.18 | bwd_inner: 3317.38 | bwd_allreduce: 0.76 | step: 6.80 24%|██▍ | 2395/10000 [3:46:24<11:37:44, 5.50s/it] {'loss': 0.0587, 'grad_norm': 1.3549472093582153, 'learning_rate': 3.5570059174169586e-05, 'epoch': 2.4} 24%|██▍ | 2395/10000 [3:46:24<11:37:44, 5.50s/it][2025-06-19 17:16:09,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:16:09,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.38 | bwd_microstep: 3382.68 | bwd_inner_microstep: 3381.90 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-19 17:16:09,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.38 | bwd: 3382.70 | bwd_inner: 3381.90 | bwd_allreduce: 0.76 | step: 6.81 24%|██▍ | 2396/10000 [3:46:29<11:39:34, 5.52s/it] {'loss': 0.0869, 'grad_norm': 1.064268708229065, 'learning_rate': 3.5565992806801474e-05, 'epoch': 2.4} 24%|██▍ | 2396/10000 [3:46:29<11:39:34, 5.52s/it][2025-06-19 17:16:14,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:16:14,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.48 | bwd_microstep: 3319.77 | bwd_inner_microstep: 3318.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 17:16:14,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.48 | bwd: 3319.79 | bwd_inner: 3318.99 | bwd_allreduce: 0.75 | step: 6.53 24%|██▍ | 2397/10000 [3:46:35<11:37:12, 5.50s/it] {'loss': 0.0442, 'grad_norm': 0.6866244673728943, 'learning_rate': 3.556192480663295e-05, 'epoch': 2.4} 24%|██▍ | 2397/10000 [3:46:35<11:37:12, 5.50s/it][2025-06-19 17:16:20,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:16:20,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.04 | bwd_microstep: 3328.96 | bwd_inner_microstep: 3328.01 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.01 [2025-06-19 17:16:20,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.04 | bwd: 3328.97 | bwd_inner: 3328.01 | bwd_allreduce: 0.92 | step: 7.01 24%|██▍ | 2398/10000 [3:46:40<11:36:15, 5.50s/it] {'loss': 0.149, 'grad_norm': 1.542911171913147, 'learning_rate': 3.555785517409075e-05, 'epoch': 2.4} 24%|██▍ | 2398/10000 [3:46:40<11:36:15, 5.50s/it][2025-06-19 17:16:25,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:16:25,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.17 | bwd_microstep: 3309.53 | bwd_inner_microstep: 3308.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.58 [2025-06-19 17:16:25,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.17 | bwd: 3309.54 | bwd_inner: 3308.73 | bwd_allreduce: 0.76 | step: 6.58 24%|██▍ | 2399/10000 [3:46:46<11:34:36, 5.48s/it] {'loss': 0.0981, 'grad_norm': 0.8207359313964844, 'learning_rate': 3.555378390960174e-05, 'epoch': 2.4} 24%|██▍ | 2399/10000 [3:46:46<11:34:36, 5.48s/it][2025-06-19 17:16:30,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 17:16:30,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.87 | bwd_microstep: 3314.91 | bwd_inner_microstep: 3313.84 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.64 [2025-06-19 17:16:30,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.87 | bwd: 3314.93 | bwd_inner: 3313.84 | bwd_allreduce: 1.03 | step: 7.64 24%|██▍ | 2400/10000 [3:46:51<11:33:33, 5.48s/it] {'loss': 0.0641, 'grad_norm': 0.9597362875938416, 'learning_rate': 3.5549711013592995e-05, 'epoch': 2.4} 24%|██▍ | 2400/10000 [3:46:51<11:33:33, 5.48s/it][2025-06-19 17:16:36,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:16:36,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.91 | bwd_microstep: 3363.58 | bwd_inner_microstep: 3362.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 17:16:36,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.91 | bwd: 3363.59 | bwd_inner: 3362.77 | bwd_allreduce: 0.78 | step: 6.95 24%|██▍ | 2401/10000 [3:46:57<11:35:36, 5.49s/it] {'loss': 0.0404, 'grad_norm': 
0.9599600434303284, 'learning_rate': 3.554563648649172e-05, 'epoch': 2.4} 24%|██▍ | 2401/10000 [3:46:57<11:35:36, 5.49s/it][2025-06-19 17:16:41,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:16:41,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.03 | bwd_microstep: 3326.34 | bwd_inner_microstep: 3325.50 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.81 [2025-06-19 17:16:41,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.03 | bwd: 3326.35 | bwd_inner: 3325.50 | bwd_allreduce: 0.81 | step: 6.82 24%|██▍ | 2402/10000 [3:47:02<11:34:49, 5.49s/it] {'loss': 0.0893, 'grad_norm': 1.175002098083496, 'learning_rate': 3.554156032872533e-05, 'epoch': 2.4} 24%|██▍ | 2402/10000 [3:47:02<11:34:49, 5.49s/it][2025-06-19 17:16:47,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:16:47,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.28 | bwd_microstep: 3319.41 | bwd_inner_microstep: 3318.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 17:16:47,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.28 | bwd: 3319.42 | bwd_inner: 3318.62 | bwd_allreduce: 0.76 | step: 6.65 24%|██▍ | 2403/10000 [3:47:08<11:33:48, 5.48s/it] {'loss': 0.0808, 'grad_norm': 0.8075504899024963, 'learning_rate': 3.55374825407214e-05, 'epoch': 2.4} 24%|██▍ | 2403/10000 [3:47:08<11:33:48, 5.48s/it][2025-06-19 17:16:52,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:16:52,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.60 | bwd_microstep: 3319.37 | bwd_inner_microstep: 3318.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 17:16:52,917] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.60 | bwd: 3319.38 | bwd_inner: 3318.58 | bwd_allreduce: 0.76 | step: 6.68 24%|██▍ | 2404/10000 [3:47:13<11:33:03, 5.47s/it] {'loss': 0.0419, 'grad_norm': 1.1018303632736206, 'learning_rate': 3.553340312290766e-05, 'epoch': 2.4} 24%|██▍ | 2404/10000 [3:47:13<11:33:03, 5.47s/it][2025-06-19 17:16:58,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:16:58,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.59 | bwd_microstep: 3360.92 | bwd_inner_microstep: 3360.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 17:16:58,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.60 | bwd: 3360.94 | bwd_inner: 3360.13 | bwd_allreduce: 0.76 | step: 6.66 24%|██▍ | 2405/10000 [3:47:19<11:34:55, 5.49s/it] {'loss': 0.0736, 'grad_norm': 1.1109344959259033, 'learning_rate': 3.5529322075712025e-05, 'epoch': 2.41} 24%|██▍ | 2405/10000 [3:47:19<11:34:55, 5.49s/it][2025-06-19 17:17:03,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:17:03,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.92 | bwd_microstep: 3324.90 | bwd_inner_microstep: 3323.99 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.85 [2025-06-19 17:17:03,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.92 | bwd: 3324.92 | bwd_inner: 3323.99 | bwd_allreduce: 0.88 | step: 6.85 24%|██▍ | 2406/10000 [3:47:24<11:33:56, 5.48s/it] {'loss': 0.1139, 'grad_norm': 2.055920362472534, 'learning_rate': 3.552523939956258e-05, 'epoch': 2.41} 24%|██▍ | 2406/10000 [3:47:24<11:33:56, 5.48s/it][2025-06-19 17:17:09,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 
17:17:09,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.62 | bwd_microstep: 3318.12 | bwd_inner_microstep: 3317.26 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.99 [2025-06-19 17:17:09,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.62 | bwd: 3318.15 | bwd_inner: 3317.26 | bwd_allreduce: 0.82 | step: 6.99 24%|██▍ | 2407/10000 [3:47:30<11:33:20, 5.48s/it] {'loss': 0.1039, 'grad_norm': 1.3661813735961914, 'learning_rate': 3.552115509488757e-05, 'epoch': 2.41} 24%|██▍ | 2407/10000 [3:47:30<11:33:20, 5.48s/it][2025-06-19 17:17:14,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:17:14,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.15 | bwd_microstep: 3370.91 | bwd_inner_microstep: 3370.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 17:17:14,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.15 | bwd: 3370.92 | bwd_inner: 3370.11 | bwd_allreduce: 0.77 | step: 7.01 24%|██▍ | 2408/10000 [3:47:35<11:35:34, 5.50s/it] {'loss': 0.0608, 'grad_norm': 0.7754065990447998, 'learning_rate': 3.551706916211544e-05, 'epoch': 2.41} 24%|██▍ | 2408/10000 [3:47:35<11:35:34, 5.50s/it][2025-06-19 17:17:20,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:17:20,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.95 | bwd_microstep: 3357.92 | bwd_inner_microstep: 3357.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 17:17:20,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.95 | bwd: 3357.93 | bwd_inner: 3357.12 | bwd_allreduce: 0.77 | step: 6.93 24%|██▍ | 2409/10000 [3:47:41<11:36:22, 5.50s/it] {'loss': 0.178, 'grad_norm': 1.5879793167114258, 'learning_rate': 3.5512981601674755e-05, 'epoch': 2.41} 
24%|██▍ | 2409/10000 [3:47:41<11:36:22, 5.50s/it][2025-06-19 17:17:25,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:17:25,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.22 | bwd_microstep: 3366.78 | bwd_inner_microstep: 3365.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 17:17:25,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.22 | bwd: 3366.79 | bwd_inner: 3365.99 | bwd_allreduce: 0.76 | step: 6.69 24%|██▍ | 2410/10000 [3:47:46<11:37:24, 5.51s/it] {'loss': 0.073, 'grad_norm': 1.0701053142547607, 'learning_rate': 3.5508892413994305e-05, 'epoch': 2.41} 24%|██▍ | 2410/10000 [3:47:46<11:37:24, 5.51s/it][2025-06-19 17:17:31,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:17:31,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.72 | bwd_microstep: 3321.12 | bwd_inner_microstep: 3320.32 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 17:17:31,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.72 | bwd: 3321.13 | bwd_inner: 3320.32 | bwd_allreduce: 0.77 | step: 6.71 24%|██▍ | 2411/10000 [3:47:52<11:35:14, 5.50s/it] {'loss': 0.0733, 'grad_norm': 0.6856436133384705, 'learning_rate': 3.550480159950303e-05, 'epoch': 2.41} 24%|██▍ | 2411/10000 [3:47:52<11:35:14, 5.50s/it][2025-06-19 17:17:36,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:17:36,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.21 | bwd_microstep: 3319.56 | bwd_inner_microstep: 3318.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 17:17:36,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.21 | bwd: 
3319.57 | bwd_inner: 3318.76 | bwd_allreduce: 0.76 | step: 6.64 24%|██▍ | 2412/10000 [3:47:57<11:33:47, 5.49s/it] {'loss': 0.0615, 'grad_norm': 0.7406362891197205, 'learning_rate': 3.550070915863001e-05, 'epoch': 2.41} 24%|██▍ | 2412/10000 [3:47:57<11:33:47, 5.49s/it][2025-06-19 17:17:42,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:17:42,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.16 | bwd_microstep: 3317.73 | bwd_inner_microstep: 3316.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 17:17:42,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.16 | bwd: 3317.75 | bwd_inner: 3316.94 | bwd_allreduce: 0.76 | step: 6.61 24%|██▍ | 2413/10000 [3:48:03<11:32:35, 5.48s/it] {'loss': 0.0592, 'grad_norm': 0.6528465151786804, 'learning_rate': 3.549661509180455e-05, 'epoch': 2.41} 24%|██▍ | 2413/10000 [3:48:03<11:32:35, 5.48s/it][2025-06-19 17:17:47,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:17:47,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.51 | bwd_microstep: 3315.10 | bwd_inner_microstep: 3314.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 17:17:47,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.51 | bwd: 3315.11 | bwd_inner: 3314.31 | bwd_allreduce: 0.76 | step: 6.69 24%|██▍ | 2414/10000 [3:48:08<11:31:41, 5.47s/it] {'loss': 0.032, 'grad_norm': 0.3508124053478241, 'learning_rate': 3.549251939945608e-05, 'epoch': 2.41} 24%|██▍ | 2414/10000 [3:48:08<11:31:41, 5.47s/it][2025-06-19 17:17:53,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:17:53,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2100.21 | bwd_microstep: 3330.72 | bwd_inner_microstep: 3329.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 17:17:53,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.21 | bwd: 3330.74 | bwd_inner: 3329.92 | bwd_allreduce: 0.78 | step: 6.99 24%|██▍ | 2415/10000 [3:48:14<11:31:37, 5.47s/it] {'loss': 0.0959, 'grad_norm': 1.2249335050582886, 'learning_rate': 3.548842208201424e-05, 'epoch': 2.42} 24%|██▍ | 2415/10000 [3:48:14<11:31:37, 5.47s/it][2025-06-19 17:17:58,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:17:58,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.79 | bwd_microstep: 3329.00 | bwd_inner_microstep: 3328.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 17:17:58,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.79 | bwd: 3329.01 | bwd_inner: 3328.20 | bwd_allreduce: 0.77 | step: 6.96 24%|██▍ | 2416/10000 [3:48:19<11:31:25, 5.47s/it] {'loss': 0.0577, 'grad_norm': 0.7000120878219604, 'learning_rate': 3.54843231399088e-05, 'epoch': 2.42} 24%|██▍ | 2416/10000 [3:48:19<11:31:25, 5.47s/it][2025-06-19 17:18:04,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 17:18:04,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.62 | bwd_microstep: 3320.92 | bwd_inner_microstep: 3320.04 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.79 [2025-06-19 17:18:04,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.62 | bwd: 3320.93 | bwd_inner: 3320.04 | bwd_allreduce: 0.85 | step: 6.80 24%|██▍ | 2417/10000 [3:48:25<11:31:21, 5.47s/it] {'loss': 0.0638, 'grad_norm': 1.2779277563095093, 'learning_rate': 3.548022257356973e-05, 'epoch': 2.42} 24%|██▍ | 2417/10000 [3:48:25<11:31:21, 5.47s/it][2025-06-19 17:18:09,756] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:18:09,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.54 | bwd_microstep: 3374.30 | bwd_inner_microstep: 3373.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 17:18:09,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.54 | bwd: 3374.32 | bwd_inner: 3373.52 | bwd_allreduce: 0.76 | step: 6.62 24%|██▍ | 2418/10000 [3:48:30<11:33:55, 5.49s/it] {'loss': 0.0769, 'grad_norm': 0.8637416362762451, 'learning_rate': 3.547612038342716e-05, 'epoch': 2.42} 24%|██▍ | 2418/10000 [3:48:30<11:33:55, 5.49s/it][2025-06-19 17:18:15,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:18:15,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.31 | bwd_microstep: 3316.03 | bwd_inner_microstep: 3315.17 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.82 [2025-06-19 17:18:15,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.31 | bwd: 3316.05 | bwd_inner: 3315.17 | bwd_allreduce: 0.82 | step: 6.82 24%|██▍ | 2419/10000 [3:48:36<11:32:27, 5.48s/it] {'loss': 0.0317, 'grad_norm': 0.5083671808242798, 'learning_rate': 3.5472016569911386e-05, 'epoch': 2.42} 24%|██▍ | 2419/10000 [3:48:36<11:32:27, 5.48s/it][2025-06-19 17:18:20,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:18:20,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.24 | bwd_microstep: 3367.72 | bwd_inner_microstep: 3366.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 17:18:20,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.24 | bwd: 3367.73 | bwd_inner: 3366.93 | bwd_allreduce: 0.76 | step: 6.70 24%|██▍ | 
2420/10000 [3:48:41<11:34:14, 5.50s/it] {'loss': 0.0798, 'grad_norm': 1.0223597288131714, 'learning_rate': 3.5467911133452876e-05, 'epoch': 2.42} 24%|██▍ | 2420/10000 [3:48:41<11:34:14, 5.50s/it][2025-06-19 17:18:26,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:18:26,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.79 | bwd_microstep: 3315.91 | bwd_inner_microstep: 3315.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 17:18:26,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.79 | bwd: 3315.92 | bwd_inner: 3315.10 | bwd_allreduce: 0.78 | step: 7.24 24%|██▍ | 2421/10000 [3:48:46<11:32:33, 5.48s/it] {'loss': 0.0616, 'grad_norm': 0.9060949683189392, 'learning_rate': 3.5463804074482285e-05, 'epoch': 2.42} 24%|██▍ | 2421/10000 [3:48:46<11:32:33, 5.48s/it][2025-06-19 17:18:31,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:18:31,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.05 | bwd_microstep: 3356.53 | bwd_inner_microstep: 3355.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 17:18:31,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.05 | bwd: 3356.55 | bwd_inner: 3355.74 | bwd_allreduce: 0.77 | step: 6.65 24%|██▍ | 2422/10000 [3:48:52<11:34:02, 5.50s/it] {'loss': 0.0695, 'grad_norm': 1.6088240146636963, 'learning_rate': 3.545969539343042e-05, 'epoch': 2.42} 24%|██▍ | 2422/10000 [3:48:52<11:34:02, 5.50s/it][2025-06-19 17:18:37,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:18:37,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.03 | bwd_microstep: 3360.77 | bwd_inner_microstep: 3359.96 | 
bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 17:18:37,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.03 | bwd: 3360.78 | bwd_inner: 3359.96 | bwd_allreduce: 0.78 | step: 7.23 24%|██▍ | 2423/10000 [3:48:58<11:34:57, 5.50s/it] {'loss': 0.1447, 'grad_norm': 1.8782364130020142, 'learning_rate': 3.5455585090728246e-05, 'epoch': 2.42} 24%|██▍ | 2423/10000 [3:48:58<11:34:57, 5.50s/it][2025-06-19 17:18:42,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:18:42,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.79 | bwd_microstep: 3308.08 | bwd_inner_microstep: 3307.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 17:18:42,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.80 | bwd: 3308.10 | bwd_inner: 3307.30 | bwd_allreduce: 0.76 | step: 6.60 24%|██▍ | 2424/10000 [3:49:03<11:32:34, 5.49s/it] {'loss': 0.0537, 'grad_norm': 1.0627756118774414, 'learning_rate': 3.545147316680694e-05, 'epoch': 2.42} 24%|██▍ | 2424/10000 [3:49:03<11:32:34, 5.49s/it][2025-06-19 17:18:48,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:18:48,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.15 | bwd_microstep: 3364.34 | bwd_inner_microstep: 3363.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 17:18:48,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.15 | bwd: 3364.36 | bwd_inner: 3363.56 | bwd_allreduce: 0.75 | step: 6.60 24%|██▍ | 2425/10000 [3:49:09<11:34:03, 5.50s/it] {'loss': 0.0924, 'grad_norm': 1.1599544286727905, 'learning_rate': 3.544735962209781e-05, 'epoch': 2.42} 24%|██▍ | 2425/10000 [3:49:09<11:34:03, 5.50s/it][2025-06-19 17:18:53,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:18:53,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.76 | bwd_microstep: 3311.32 | bwd_inner_microstep: 3310.39 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.45 [2025-06-19 17:18:53,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.76 | bwd: 3311.34 | bwd_inner: 3310.39 | bwd_allreduce: 0.90 | step: 7.45 24%|██▍ | 2426/10000 [3:49:14<11:32:15, 5.48s/it] {'loss': 0.0862, 'grad_norm': 0.8980948328971863, 'learning_rate': 3.544324445703234e-05, 'epoch': 2.43} 24%|██▍ | 2426/10000 [3:49:14<11:32:15, 5.48s/it][2025-06-19 17:18:59,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:18:59,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.69 | bwd_microstep: 3312.03 | bwd_inner_microstep: 3311.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 17:18:59,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.69 | bwd: 3312.05 | bwd_inner: 3311.24 | bwd_allreduce: 0.77 | step: 6.77 24%|██▍ | 2427/10000 [3:49:19<11:31:09, 5.48s/it] {'loss': 0.0619, 'grad_norm': 0.8688400983810425, 'learning_rate': 3.543912767204221e-05, 'epoch': 2.43} 24%|██▍ | 2427/10000 [3:49:19<11:31:09, 5.48s/it][2025-06-19 17:19:04,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:19:04,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.16 | bwd_microstep: 3362.51 | bwd_inner_microstep: 3361.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 17:19:04,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.16 | bwd: 3362.52 | bwd_inner: 3361.72 | bwd_allreduce: 0.76 | step: 6.62 24%|██▍ | 2428/10000 [3:49:25<11:32:46, 5.49s/it] {'loss': 0.0404, 
'grad_norm': 0.6759039759635925, 'learning_rate': 3.5435009267559236e-05, 'epoch': 2.43} 24%|██▍ | 2428/10000 [3:49:25<11:32:46, 5.49s/it][2025-06-19 17:19:10,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:19:10,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.68 | bwd_microstep: 3392.63 | bwd_inner_microstep: 3391.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 17:19:10,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.68 | bwd: 3392.65 | bwd_inner: 3391.83 | bwd_allreduce: 0.77 | step: 6.76 24%|██▍ | 2429/10000 [3:49:31<11:35:29, 5.51s/it] {'loss': 0.1045, 'grad_norm': 1.0430887937545776, 'learning_rate': 3.543088924401543e-05, 'epoch': 2.43} 24%|██▍ | 2429/10000 [3:49:31<11:35:29, 5.51s/it][2025-06-19 17:19:15,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:19:15,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.16 | bwd_microstep: 3317.33 | bwd_inner_microstep: 3316.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 17:19:15,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.16 | bwd: 3317.34 | bwd_inner: 3316.53 | bwd_allreduce: 0.77 | step: 6.73 24%|██▍ | 2430/10000 [3:49:36<11:33:31, 5.50s/it] {'loss': 0.1157, 'grad_norm': 1.4899530410766602, 'learning_rate': 3.542676760184296e-05, 'epoch': 2.43} 24%|██▍ | 2430/10000 [3:49:36<11:33:31, 5.50s/it][2025-06-19 17:19:21,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:19:21,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.45 | bwd_microstep: 3316.80 | bwd_inner_microstep: 3315.97 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.37 [2025-06-19 
17:19:21,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.45 | bwd: 3316.81 | bwd_inner: 3315.97 | bwd_allreduce: 0.79 | step: 7.37 24%|██▍ | 2431/10000 [3:49:41<11:32:06, 5.49s/it] {'loss': 0.1125, 'grad_norm': 1.6861895322799683, 'learning_rate': 3.542264434147416e-05, 'epoch': 2.43} 24%|██▍ | 2431/10000 [3:49:41<11:32:06, 5.49s/it][2025-06-19 17:19:26,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.82 [2025-06-19 17:19:26,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.27 | bwd_microstep: 3362.85 | bwd_inner_microstep: 3362.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 17:19:26,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.27 | bwd: 3362.87 | bwd_inner: 3362.05 | bwd_allreduce: 0.77 | step: 7.12 24%|██▍ | 2432/10000 [3:49:47<11:33:37, 5.50s/it] {'loss': 0.0342, 'grad_norm': 0.48438528180122375, 'learning_rate': 3.541851946334155e-05, 'epoch': 2.43} 24%|██▍ | 2432/10000 [3:49:47<11:33:37, 5.50s/it][2025-06-19 17:19:32,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:19:32,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.85 | bwd_microstep: 3369.47 | bwd_inner_microstep: 3368.63 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.37 [2025-06-19 17:19:32,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.85 | bwd: 3369.49 | bwd_inner: 3368.63 | bwd_allreduce: 0.81 | step: 7.37 24%|██▍ | 2433/10000 [3:49:52<11:35:01, 5.51s/it] {'loss': 0.0762, 'grad_norm': 0.9670668244361877, 'learning_rate': 3.5414392967877805e-05, 'epoch': 2.43} 24%|██▍ | 2433/10000 [3:49:52<11:35:01, 5.51s/it][2025-06-19 17:19:37,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 
[2025-06-19 17:19:37,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.81 | bwd_microstep: 3310.63 | bwd_inner_microstep: 3309.84 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 17:19:37,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.81 | bwd: 3310.65 | bwd_inner: 3309.84 | bwd_allreduce: 0.76 | step: 6.76 24%|██▍ | 2434/10000 [3:49:58<11:32:56, 5.50s/it] {'loss': 0.018, 'grad_norm': 0.3282663822174072, 'learning_rate': 3.541026485551579e-05, 'epoch': 2.43} 24%|██▍ | 2434/10000 [3:49:58<11:32:56, 5.50s/it][2025-06-19 17:19:43,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:19:43,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.86 | bwd_microstep: 3359.86 | bwd_inner_microstep: 3359.03 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.84 [2025-06-19 17:19:43,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.86 | bwd: 3359.88 | bwd_inner: 3359.03 | bwd_allreduce: 0.79 | step: 6.85 24%|██▍ | 2435/10000 [3:50:03<11:34:09, 5.51s/it] {'loss': 0.0816, 'grad_norm': 0.9407689571380615, 'learning_rate': 3.5406135126688504e-05, 'epoch': 2.44} 24%|██▍ | 2435/10000 [3:50:03<11:34:09, 5.51s/it][2025-06-19 17:19:48,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:19:48,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.43 | bwd_microstep: 3363.08 | bwd_inner_microstep: 3362.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 17:19:48,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.43 | bwd: 3363.10 | bwd_inner: 3362.27 | bwd_allreduce: 0.78 | step: 7.14 24%|██▍ | 2436/10000 [3:50:09<11:34:55, 5.51s/it] {'loss': 0.1537, 'grad_norm': 1.1873458623886108, 'learning_rate': 3.540200378182914e-05, 
'epoch': 2.44} 24%|██▍ | 2436/10000 [3:50:09<11:34:55, 5.51s/it][2025-06-19 17:19:54,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:19:54,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.27 | bwd_microstep: 3307.64 | bwd_inner_microstep: 3306.77 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.27 [2025-06-19 17:19:54,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.27 | bwd: 3307.66 | bwd_inner: 3306.77 | bwd_allreduce: 0.83 | step: 7.27 24%|██▍ | 2437/10000 [3:50:14<11:32:25, 5.49s/it] {'loss': 0.0356, 'grad_norm': 0.48232394456863403, 'learning_rate': 3.5397870821371074e-05, 'epoch': 2.44} 24%|██▍ | 2437/10000 [3:50:14<11:32:25, 5.49s/it][2025-06-19 17:19:59,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:19:59,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.25 | bwd_microstep: 3375.00 | bwd_inner_microstep: 3374.04 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.55 [2025-06-19 17:19:59,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.25 | bwd: 3375.02 | bwd_inner: 3374.04 | bwd_allreduce: 0.93 | step: 7.55 24%|██▍ | 2438/10000 [3:50:20<11:34:17, 5.51s/it] {'loss': 0.0194, 'grad_norm': 0.2990833520889282, 'learning_rate': 3.539373624574782e-05, 'epoch': 2.44} 24%|██▍ | 2438/10000 [3:50:20<11:34:17, 5.51s/it][2025-06-19 17:20:05,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:20:05,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.58 | bwd_microstep: 3372.64 | bwd_inner_microstep: 3371.69 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.91 [2025-06-19 17:20:05,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2131.58 | bwd: 3372.66 | bwd_inner: 3371.69 | bwd_allreduce: 0.92 | step: 6.90 24%|██▍ | 2439/10000 [3:50:26<11:35:29, 5.52s/it] {'loss': 0.0964, 'grad_norm': 2.3448994159698486, 'learning_rate': 3.538960005539307e-05, 'epoch': 2.44} 24%|██▍ | 2439/10000 [3:50:26<11:35:29, 5.52s/it][2025-06-19 17:20:10,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:20:10,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.23 | bwd_microstep: 3365.12 | bwd_inner_microstep: 3364.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 17:20:10,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.23 | bwd: 3365.13 | bwd_inner: 3364.31 | bwd_allreduce: 0.78 | step: 7.19 24%|██▍ | 2440/10000 [3:50:31<11:36:18, 5.53s/it] {'loss': 0.014, 'grad_norm': 0.1957758516073227, 'learning_rate': 3.538546225074071e-05, 'epoch': 2.44} 24%|██▍ | 2440/10000 [3:50:31<11:36:18, 5.53s/it][2025-06-19 17:20:16,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:20:16,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.77 | bwd_microstep: 3377.48 | bwd_inner_microstep: 3376.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 17:20:16,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.77 | bwd: 3377.50 | bwd_inner: 3376.67 | bwd_allreduce: 0.78 | step: 7.01 24%|██▍ | 2441/10000 [3:50:37<11:36:55, 5.53s/it] {'loss': 0.1022, 'grad_norm': 0.7380083203315735, 'learning_rate': 3.538132283222476e-05, 'epoch': 2.44} 24%|██▍ | 2441/10000 [3:50:37<11:36:55, 5.53s/it][2025-06-19 17:20:21,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:20:21,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2101.98 | bwd_microstep: 3320.65 | bwd_inner_microstep: 3319.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 17:20:21,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.98 | bwd: 3320.66 | bwd_inner: 3319.85 | bwd_allreduce: 0.77 | step: 6.94 24%|██▍ | 2442/10000 [3:50:42<11:34:11, 5.51s/it] {'loss': 0.0223, 'grad_norm': 0.5385625958442688, 'learning_rate': 3.537718180027943e-05, 'epoch': 2.44} 24%|██▍ | 2442/10000 [3:50:42<11:34:11, 5.51s/it][2025-06-19 17:20:27,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:20:27,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.71 | bwd_microstep: 3370.48 | bwd_inner_microstep: 3369.60 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.04 [2025-06-19 17:20:27,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.71 | bwd: 3370.49 | bwd_inner: 3369.60 | bwd_allreduce: 0.85 | step: 7.05 24%|██▍ | 2443/10000 [3:50:48<11:35:05, 5.52s/it] {'loss': 0.0942, 'grad_norm': 1.8931013345718384, 'learning_rate': 3.53730391553391e-05, 'epoch': 2.44} 24%|██▍ | 2443/10000 [3:50:48<11:35:05, 5.52s/it][2025-06-19 17:20:32,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:20:32,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.57 | bwd_microstep: 3315.28 | bwd_inner_microstep: 3314.45 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.94 [2025-06-19 17:20:32,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.57 | bwd: 3315.30 | bwd_inner: 3314.45 | bwd_allreduce: 0.81 | step: 6.94 24%|██▍ | 2444/10000 [3:50:53<11:33:31, 5.51s/it] {'loss': 0.0366, 'grad_norm': 0.5455352067947388, 'learning_rate': 3.536889489783831e-05, 'epoch': 2.44} 24%|██▍ | 2444/10000 [3:50:53<11:33:31, 5.51s/it][2025-06-19 
17:20:38,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:20:38,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.43 | bwd_microstep: 3320.35 | bwd_inner_microstep: 3319.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.30 [2025-06-19 17:20:38,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.43 | bwd: 3320.36 | bwd_inner: 3319.53 | bwd_allreduce: 0.78 | step: 7.30 24%|██▍ | 2445/10000 [3:50:59<11:31:56, 5.50s/it] {'loss': 0.1621, 'grad_norm': 1.890005350112915, 'learning_rate': 3.536474902821178e-05, 'epoch': 2.44} 24%|██▍ | 2445/10000 [3:50:59<11:31:56, 5.50s/it][2025-06-19 17:20:43,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:20:43,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.80 | bwd_microstep: 3369.02 | bwd_inner_microstep: 3368.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 17:20:43,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.80 | bwd: 3369.04 | bwd_inner: 3368.24 | bwd_allreduce: 0.76 | step: 6.60 24%|██▍ | 2446/10000 [3:51:04<11:33:43, 5.51s/it] {'loss': 0.0337, 'grad_norm': 0.5787478089332581, 'learning_rate': 3.536060154689438e-05, 'epoch': 2.45} 24%|██▍ | 2446/10000 [3:51:04<11:33:43, 5.51s/it][2025-06-19 17:20:49,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 17:20:49,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.02 | bwd_microstep: 3369.71 | bwd_inner_microstep: 3368.72 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.57 [2025-06-19 17:20:49,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.02 | bwd: 3369.72 | bwd_inner: 3368.72 | bwd_allreduce: 0.95 | step: 7.57 
24%|██▍ | 2447/10000 [3:51:10<11:34:33, 5.52s/it] {'loss': 0.0413, 'grad_norm': 0.437943696975708, 'learning_rate': 3.535645245432117e-05, 'epoch': 2.45} 24%|██▍ | 2447/10000 [3:51:10<11:34:33, 5.52s/it][2025-06-19 17:20:54,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:20:54,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.31 | bwd_microstep: 3332.65 | bwd_inner_microstep: 3331.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 17:20:54,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.31 | bwd: 3332.66 | bwd_inner: 3331.85 | bwd_allreduce: 0.77 | step: 6.91 24%|██▍ | 2448/10000 [3:51:15<11:33:22, 5.51s/it] {'loss': 0.0587, 'grad_norm': 0.7043914198875427, 'learning_rate': 3.5352301750927376e-05, 'epoch': 2.45} 24%|██▍ | 2448/10000 [3:51:15<11:33:22, 5.51s/it][2025-06-19 17:21:00,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:21:00,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.85 | bwd_microstep: 3379.60 | bwd_inner_microstep: 3378.66 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.99 [2025-06-19 17:21:00,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.85 | bwd: 3379.61 | bwd_inner: 3378.66 | bwd_allreduce: 0.90 | step: 6.99 24%|██▍ | 2449/10000 [3:51:21<11:34:57, 5.52s/it] {'loss': 0.0817, 'grad_norm': 0.8624406456947327, 'learning_rate': 3.534814943714837e-05, 'epoch': 2.45} 24%|██▍ | 2449/10000 [3:51:21<11:34:57, 5.52s/it][2025-06-19 17:21:05,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:21:05,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.78 | bwd_microstep: 3319.63 | bwd_inner_microstep: 3318.67 | 
bwd_allreduce_microstep: 0.91 | step_microstep: 7.10 [2025-06-19 17:21:05,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.78 | bwd: 3319.65 | bwd_inner: 3318.67 | bwd_allreduce: 0.93 | step: 7.11 24%|██▍ | 2450/10000 [3:51:26<11:32:39, 5.50s/it] {'loss': 0.0664, 'grad_norm': 0.761021614074707, 'learning_rate': 3.534399551341972e-05, 'epoch': 2.45} 24%|██▍ | 2450/10000 [3:51:26<11:32:39, 5.50s/it][2025-06-19 17:21:11,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:21:11,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.26 | bwd_microstep: 3375.50 | bwd_inner_microstep: 3374.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 17:21:11,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.26 | bwd: 3375.52 | bwd_inner: 3374.69 | bwd_allreduce: 0.78 | step: 7.20 25%|██▍ | 2451/10000 [3:51:32<11:34:01, 5.52s/it] {'loss': 0.0304, 'grad_norm': 0.5383636951446533, 'learning_rate': 3.5339839980177165e-05, 'epoch': 2.45} 25%|██▍ | 2451/10000 [3:51:32<11:34:01, 5.52s/it][2025-06-19 17:21:16,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 17:21:16,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.68 | bwd_microstep: 3333.19 | bwd_inner_microstep: 3332.17 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.31 [2025-06-19 17:21:16,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.68 | bwd: 3333.20 | bwd_inner: 3332.18 | bwd_allreduce: 0.98 | step: 7.32 25%|██▍ | 2452/10000 [3:51:37<11:32:50, 5.51s/it] {'loss': 0.1143, 'grad_norm': 1.182121753692627, 'learning_rate': 3.5335682837856584e-05, 'epoch': 2.45} 25%|██▍ | 2452/10000 [3:51:37<11:32:50, 5.51s/it][2025-06-19 17:21:22,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 17:21:22,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.71 | bwd_microstep: 3383.07 | bwd_inner_microstep: 3382.03 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.67 [2025-06-19 17:21:22,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.71 | bwd: 3383.08 | bwd_inner: 3382.03 | bwd_allreduce: 0.99 | step: 7.68 25%|██▍ | 2453/10000 [3:51:43<11:34:36, 5.52s/it] {'loss': 0.2221, 'grad_norm': 1.3872703313827515, 'learning_rate': 3.5331524086894045e-05, 'epoch': 2.45} 25%|██▍ | 2453/10000 [3:51:43<11:34:36, 5.52s/it][2025-06-19 17:21:28,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:21:28,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.26 | bwd_microstep: 3376.23 | bwd_inner_microstep: 3375.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 17:21:28,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.26 | bwd: 3376.25 | bwd_inner: 3375.43 | bwd_allreduce: 0.77 | step: 6.74 25%|██▍ | 2454/10000 [3:51:48<11:35:38, 5.53s/it] {'loss': 0.0576, 'grad_norm': 0.8992351293563843, 'learning_rate': 3.532736372772579e-05, 'epoch': 2.45} 25%|██▍ | 2454/10000 [3:51:48<11:35:38, 5.53s/it][2025-06-19 17:21:33,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:21:33,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.65 | bwd_microstep: 3370.70 | bwd_inner_microstep: 3369.72 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.62 [2025-06-19 17:21:33,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.65 | bwd: 3370.72 | bwd_inner: 3369.72 | bwd_allreduce: 0.95 | step: 7.63 25%|██▍ | 2455/10000 [3:51:54<11:36:09, 5.54s/it] {'loss': 0.1019, 'grad_norm': 
1.5570530891418457, 'learning_rate': 3.5323201760788226e-05, 'epoch': 2.46} 25%|██▍ | 2455/10000 [3:51:54<11:36:09, 5.54s/it][2025-06-19 17:21:39,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:21:39,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.09 | bwd_microstep: 3320.08 | bwd_inner_microstep: 3319.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 17:21:39,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.09 | bwd: 3320.10 | bwd_inner: 3319.30 | bwd_allreduce: 0.76 | step: 6.65 25%|██▍ | 2456/10000 [3:51:59<11:33:48, 5.52s/it] {'loss': 0.1414, 'grad_norm': 0.9267074465751648, 'learning_rate': 3.53190381865179e-05, 'epoch': 2.46} 25%|██▍ | 2456/10000 [3:51:59<11:33:48, 5.52s/it][2025-06-19 17:21:44,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 17:21:44,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.30 | bwd_microstep: 3322.81 | bwd_inner_microstep: 3321.82 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.11 [2025-06-19 17:21:44,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.30 | bwd: 3322.83 | bwd_inner: 3321.82 | bwd_allreduce: 0.96 | step: 7.11 25%|██▍ | 2457/10000 [3:52:05<11:31:45, 5.50s/it] {'loss': 0.1135, 'grad_norm': 0.8174594044685364, 'learning_rate': 3.531487300535157e-05, 'epoch': 2.46} 25%|██▍ | 2457/10000 [3:52:05<11:31:45, 5.50s/it][2025-06-19 17:21:49,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:21:49,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.69 | bwd_microstep: 3338.11 | bwd_inner_microstep: 3337.06 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.54 [2025-06-19 17:21:49,991] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.69 | bwd: 3338.13 | bwd_inner: 3337.06 | bwd_allreduce: 1.03 | step: 7.55 25%|██▍ | 2458/10000 [3:52:10<11:31:22, 5.50s/it] {'loss': 0.0366, 'grad_norm': 0.7946297526359558, 'learning_rate': 3.5310706217726137e-05, 'epoch': 2.46} 25%|██▍ | 2458/10000 [3:52:10<11:31:22, 5.50s/it][2025-06-19 17:21:55,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:21:55,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.49 | bwd_microstep: 3341.75 | bwd_inner_microstep: 3340.90 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.76 [2025-06-19 17:21:55,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.49 | bwd: 3341.76 | bwd_inner: 3340.90 | bwd_allreduce: 0.82 | step: 6.76 25%|██▍ | 2459/10000 [3:52:16<11:31:06, 5.50s/it] {'loss': 0.0577, 'grad_norm': 0.7913361191749573, 'learning_rate': 3.530653782407869e-05, 'epoch': 2.46} 25%|██▍ | 2459/10000 [3:52:16<11:31:06, 5.50s/it][2025-06-19 17:22:00,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:22:00,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.31 | bwd_microstep: 3338.30 | bwd_inner_microstep: 3337.42 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.25 [2025-06-19 17:22:00,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.31 | bwd: 3338.32 | bwd_inner: 3337.42 | bwd_allreduce: 0.84 | step: 7.25 25%|██▍ | 2460/10000 [3:52:21<11:31:00, 5.50s/it] {'loss': 0.0334, 'grad_norm': 0.4477596879005432, 'learning_rate': 3.5302367824846456e-05, 'epoch': 2.46} 25%|██▍ | 2460/10000 [3:52:21<11:31:00, 5.50s/it][2025-06-19 17:22:06,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 
17:22:06,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.59 | bwd_microstep: 3337.13 | bwd_inner_microstep: 3336.27 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.26 [2025-06-19 17:22:06,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.59 | bwd: 3337.16 | bwd_inner: 3336.27 | bwd_allreduce: 0.82 | step: 7.26 25%|██▍ | 2461/10000 [3:52:27<11:30:29, 5.50s/it] {'loss': 0.0955, 'grad_norm': 1.984968662261963, 'learning_rate': 3.529819622046686e-05, 'epoch': 2.46} 25%|██▍ | 2461/10000 [3:52:27<11:30:29, 5.50s/it][2025-06-19 17:22:12,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:22:12,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.14 | bwd_microstep: 3383.93 | bwd_inner_microstep: 3383.07 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.36 [2025-06-19 17:22:12,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.13 | bwd: 3383.95 | bwd_inner: 3383.07 | bwd_allreduce: 0.83 | step: 7.37 25%|██▍ | 2462/10000 [3:52:32<11:32:54, 5.52s/it] {'loss': 0.0626, 'grad_norm': 0.8896495699882507, 'learning_rate': 3.529402301137749e-05, 'epoch': 2.46} 25%|██▍ | 2462/10000 [3:52:32<11:32:54, 5.52s/it][2025-06-19 17:22:17,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:22:17,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.81 | bwd_microstep: 3373.50 | bwd_inner_microstep: 3372.58 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.94 [2025-06-19 17:22:17,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.81 | bwd: 3373.51 | bwd_inner: 3372.58 | bwd_allreduce: 0.89 | step: 6.94 25%|██▍ | 2463/10000 [3:52:38<11:33:47, 5.52s/it] {'loss': 0.0462, 'grad_norm': 0.784803569316864, 'learning_rate': 3.5289848198016076e-05, 'epoch': 2.46} 
25%|██▍ | 2463/10000 [3:52:38<11:33:47, 5.52s/it][2025-06-19 17:22:23,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:22:23,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.08 | bwd_microstep: 3326.25 | bwd_inner_microstep: 3325.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 17:22:23,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.08 | bwd: 3326.26 | bwd_inner: 3325.45 | bwd_allreduce: 0.76 | step: 6.69 25%|██▍ | 2464/10000 [3:52:43<11:31:48, 5.51s/it] {'loss': 0.0456, 'grad_norm': 0.7302521467208862, 'learning_rate': 3.528567178082056e-05, 'epoch': 2.46} 25%|██▍ | 2464/10000 [3:52:43<11:31:48, 5.51s/it][2025-06-19 17:22:28,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:22:28,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.62 | bwd_microstep: 3326.74 | bwd_inner_microstep: 3325.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 17:22:28,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.62 | bwd: 3326.76 | bwd_inner: 3325.96 | bwd_allreduce: 0.76 | step: 6.69 25%|██▍ | 2465/10000 [3:52:49<11:30:17, 5.50s/it] {'loss': 0.1457, 'grad_norm': 2.2648704051971436, 'learning_rate': 3.528149376022901e-05, 'epoch': 2.46} 25%|██▍ | 2465/10000 [3:52:49<11:30:17, 5.50s/it][2025-06-19 17:22:34,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.88 [2025-06-19 17:22:34,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.08 | bwd_microstep: 3373.83 | bwd_inner_microstep: 3373.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.27 [2025-06-19 17:22:34,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.08 | bwd: 
3373.84 | bwd_inner: 3373.04 | bwd_allreduce: 0.76 | step: 7.28 25%|██▍ | 2466/10000 [3:52:54<11:31:52, 5.51s/it] {'loss': 0.1816, 'grad_norm': 1.4762359857559204, 'learning_rate': 3.527731413667969e-05, 'epoch': 2.47} 25%|██▍ | 2466/10000 [3:52:54<11:31:52, 5.51s/it][2025-06-19 17:22:39,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:22:39,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.02 | bwd_microstep: 3333.69 | bwd_inner_microstep: 3332.72 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.90 [2025-06-19 17:22:39,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.02 | bwd: 3333.71 | bwd_inner: 3332.72 | bwd_allreduce: 0.94 | step: 7.90 25%|██▍ | 2467/10000 [3:53:00<11:31:09, 5.50s/it] {'loss': 0.0252, 'grad_norm': 0.28185585141181946, 'learning_rate': 3.527313291061102e-05, 'epoch': 2.47} 25%|██▍ | 2467/10000 [3:53:00<11:31:09, 5.50s/it][2025-06-19 17:22:45,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:22:45,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2172.10 | bwd_microstep: 3375.40 | bwd_inner_microstep: 3374.44 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.82 [2025-06-19 17:22:45,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2172.10 | bwd: 3375.41 | bwd_inner: 3374.44 | bwd_allreduce: 0.93 | step: 6.83 25%|██▍ | 2468/10000 [3:53:05<11:34:12, 5.53s/it] {'loss': 0.0535, 'grad_norm': 0.4924300014972687, 'learning_rate': 3.526895008246159e-05, 'epoch': 2.47} 25%|██▍ | 2468/10000 [3:53:05<11:34:12, 5.53s/it][2025-06-19 17:22:50,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.88 [2025-06-19 17:22:50,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2118.27 | bwd_microstep: 3326.52 | bwd_inner_microstep: 3325.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 17:22:50,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.27 | bwd: 3326.53 | bwd_inner: 3325.74 | bwd_allreduce: 0.75 | step: 6.78 25%|██▍ | 2469/10000 [3:53:11<11:32:12, 5.51s/it] {'loss': 0.1356, 'grad_norm': 0.9412673115730286, 'learning_rate': 3.5264765652670164e-05, 'epoch': 2.47} 25%|██▍ | 2469/10000 [3:53:11<11:32:12, 5.51s/it][2025-06-19 17:22:56,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:22:56,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.98 | bwd_microstep: 3317.66 | bwd_inner_microstep: 3316.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 17:22:56,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.99 | bwd: 3317.68 | bwd_inner: 3316.87 | bwd_allreduce: 0.76 | step: 6.81 25%|██▍ | 2470/10000 [3:53:16<11:30:18, 5.50s/it] {'loss': 0.0457, 'grad_norm': 0.7555994391441345, 'learning_rate': 3.526057962167567e-05, 'epoch': 2.47} 25%|██▍ | 2470/10000 [3:53:16<11:30:18, 5.50s/it][2025-06-19 17:23:01,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:23:01,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.04 | bwd_microstep: 3326.07 | bwd_inner_microstep: 3325.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 17:23:01,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.04 | bwd: 3326.08 | bwd_inner: 3325.28 | bwd_allreduce: 0.76 | step: 6.78 25%|██▍ | 2471/10000 [3:53:22<11:28:56, 5.49s/it] {'loss': 0.0562, 'grad_norm': 0.7463447451591492, 'learning_rate': 3.525639198991719e-05, 'epoch': 2.47} 25%|██▍ | 2471/10000 [3:53:22<11:28:56, 5.49s/it][2025-06-19 17:23:07,101] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 17:23:07,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.89 | bwd_microstep: 3376.14 | bwd_inner_microstep: 3375.36 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 17:23:07,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.89 | bwd: 3376.15 | bwd_inner: 3375.35 | bwd_allreduce: 0.75 | step: 6.62 25%|██▍ | 2472/10000 [3:53:27<11:30:55, 5.51s/it] {'loss': 0.0379, 'grad_norm': 0.44057217240333557, 'learning_rate': 3.525220275783401e-05, 'epoch': 2.47} 25%|██▍ | 2472/10000 [3:53:27<11:30:55, 5.51s/it][2025-06-19 17:23:12,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:23:12,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.44 | bwd_microstep: 3320.10 | bwd_inner_microstep: 3319.33 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.53 [2025-06-19 17:23:12,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.44 | bwd: 3320.12 | bwd_inner: 3319.33 | bwd_allreduce: 0.75 | step: 6.54 25%|██▍ | 2473/10000 [3:53:33<11:29:00, 5.49s/it] {'loss': 0.098, 'grad_norm': 1.1921567916870117, 'learning_rate': 3.524801192586554e-05, 'epoch': 2.47} 25%|██▍ | 2473/10000 [3:53:33<11:29:00, 5.49s/it][2025-06-19 17:23:18,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:23:18,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.58 | bwd_microstep: 3376.65 | bwd_inner_microstep: 3375.87 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.50 [2025-06-19 17:23:18,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.58 | bwd: 3376.66 | bwd_inner: 3375.87 | bwd_allreduce: 0.75 | step: 6.50 25%|██▍ | 
2474/10000 [3:53:38<11:30:47, 5.51s/it] {'loss': 0.0888, 'grad_norm': 1.511702060699463, 'learning_rate': 3.52438194944514e-05, 'epoch': 2.47} 25%|██▍ | 2474/10000 [3:53:38<11:30:47, 5.51s/it][2025-06-19 17:23:23,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:23:23,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.03 | bwd_microstep: 3390.45 | bwd_inner_microstep: 3389.47 | bwd_allreduce_microstep: 0.93 | step_microstep: 6.67 [2025-06-19 17:23:23,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.03 | bwd: 3390.46 | bwd_inner: 3389.47 | bwd_allreduce: 0.95 | step: 6.67 25%|██▍ | 2475/10000 [3:53:44<11:32:38, 5.52s/it] {'loss': 0.1938, 'grad_norm': 1.7019025087356567, 'learning_rate': 3.523962546403133e-05, 'epoch': 2.48} 25%|██▍ | 2475/10000 [3:53:44<11:32:38, 5.52s/it][2025-06-19 17:23:29,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:23:29,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.64 | bwd_microstep: 3379.65 | bwd_inner_microstep: 3378.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 17:23:29,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.64 | bwd: 3379.67 | bwd_inner: 3378.87 | bwd_allreduce: 0.75 | step: 6.59 25%|██▍ | 2476/10000 [3:53:50<11:33:43, 5.53s/it] {'loss': 0.1323, 'grad_norm': 3.0565221309661865, 'learning_rate': 3.523542983504528e-05, 'epoch': 2.48} 25%|██▍ | 2476/10000 [3:53:50<11:33:43, 5.53s/it][2025-06-19 17:23:34,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:23:34,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.24 | bwd_microstep: 3374.91 | bwd_inner_microstep: 3373.82 | 
bwd_allreduce_microstep: 1.03 | step_microstep: 7.76 [2025-06-19 17:23:34,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.24 | bwd: 3374.93 | bwd_inner: 3373.82 | bwd_allreduce: 1.06 | step: 7.77 25%|██▍ | 2477/10000 [3:53:55<11:34:23, 5.54s/it] {'loss': 0.1632, 'grad_norm': 1.5787110328674316, 'learning_rate': 3.523123260793335e-05, 'epoch': 2.48} 25%|██▍ | 2477/10000 [3:53:55<11:34:23, 5.54s/it][2025-06-19 17:23:40,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:23:40,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.96 | bwd_microstep: 3375.93 | bwd_inner_microstep: 3375.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 17:23:40,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.96 | bwd: 3375.95 | bwd_inner: 3375.15 | bwd_allreduce: 0.75 | step: 6.63 25%|██▍ | 2478/10000 [3:54:01<11:34:46, 5.54s/it] {'loss': 0.0671, 'grad_norm': 1.2229200601577759, 'learning_rate': 3.5227033783135814e-05, 'epoch': 2.48} 25%|██▍ | 2478/10000 [3:54:01<11:34:46, 5.54s/it][2025-06-19 17:23:45,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:23:45,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.63 | bwd_microstep: 3376.05 | bwd_inner_microstep: 3375.22 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.95 [2025-06-19 17:23:45,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.63 | bwd: 3376.07 | bwd_inner: 3375.22 | bwd_allreduce: 0.80 | step: 6.96 25%|██▍ | 2479/10000 [3:54:06<11:35:00, 5.54s/it] {'loss': 0.072, 'grad_norm': 1.0721670389175415, 'learning_rate': 3.52228333610931e-05, 'epoch': 2.48} 25%|██▍ | 2479/10000 [3:54:06<11:35:00, 5.54s/it][2025-06-19 17:23:51,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.52 | optimizer_step: 3.04 [2025-06-19 17:23:51,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.56 | bwd_microstep: 3327.08 | bwd_inner_microstep: 3326.22 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.18 [2025-06-19 17:23:51,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.56 | bwd: 3327.09 | bwd_inner: 3326.22 | bwd_allreduce: 0.83 | step: 7.18 25%|██▍ | 2480/10000 [3:54:12<11:32:22, 5.52s/it] {'loss': 0.0253, 'grad_norm': 0.4218074381351471, 'learning_rate': 3.5218631342245824e-05, 'epoch': 2.48} 25%|██▍ | 2480/10000 [3:54:12<11:32:22, 5.52s/it][2025-06-19 17:23:56,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:23:56,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.98 | bwd_microstep: 3348.78 | bwd_inner_microstep: 3347.90 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.38 [2025-06-19 17:23:56,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.98 | bwd: 3348.80 | bwd_inner: 3347.90 | bwd_allreduce: 0.84 | step: 7.38 25%|██▍ | 2481/10000 [3:54:17<11:32:20, 5.52s/it] {'loss': 0.0674, 'grad_norm': 0.8284409046173096, 'learning_rate': 3.521442772703475e-05, 'epoch': 2.48} 25%|██▍ | 2481/10000 [3:54:17<11:32:20, 5.52s/it][2025-06-19 17:24:02,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:24:02,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2163.26 | bwd_microstep: 3348.77 | bwd_inner_microstep: 3347.92 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.15 [2025-06-19 17:24:02,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2163.26 | bwd: 3348.80 | bwd_inner: 3347.92 | bwd_allreduce: 0.82 | step: 7.16 25%|██▍ | 2482/10000 [3:54:23<11:33:26, 5.53s/it] {'loss': 0.0792, 'grad_norm': 
1.118483304977417, 'learning_rate': 3.5210222515900816e-05, 'epoch': 2.48} 25%|██▍ | 2482/10000 [3:54:23<11:33:26, 5.53s/it][2025-06-19 17:24:07,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.73 [2025-06-19 17:24:07,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.08 | bwd_microstep: 3344.84 | bwd_inner_microstep: 3343.33 | bwd_allreduce_microstep: 1.42 | step_microstep: 10.71 [2025-06-19 17:24:07,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.08 | bwd: 3344.87 | bwd_inner: 3343.33 | bwd_allreduce: 1.46 | step: 10.75 25%|██▍ | 2483/10000 [3:54:28<11:34:07, 5.54s/it] {'loss': 0.0641, 'grad_norm': 0.8007524609565735, 'learning_rate': 3.520601570928514e-05, 'epoch': 2.48} 25%|██▍ | 2483/10000 [3:54:28<11:34:07, 5.54s/it][2025-06-19 17:24:13,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:24:13,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.07 | bwd_microstep: 3376.46 | bwd_inner_microstep: 3375.61 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.05 [2025-06-19 17:24:13,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.07 | bwd: 3376.49 | bwd_inner: 3375.61 | bwd_allreduce: 0.81 | step: 7.05 25%|██▍ | 2484/10000 [3:54:34<11:35:48, 5.55s/it] {'loss': 0.0405, 'grad_norm': 0.4427482783794403, 'learning_rate': 3.520180730762898e-05, 'epoch': 2.48} 25%|██▍ | 2484/10000 [3:54:34<11:35:48, 5.55s/it][2025-06-19 17:24:19,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:24:19,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.49 | bwd_microstep: 3320.67 | bwd_inner_microstep: 3319.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.68 [2025-06-19 
17:24:19,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.49 | bwd: 3320.69 | bwd_inner: 3319.86 | bwd_allreduce: 0.78 | step: 6.69 25%|██▍ | 2485/10000 [3:54:39<11:33:24, 5.54s/it] {'loss': 0.0877, 'grad_norm': 0.8756679892539978, 'learning_rate': 3.51975973113738e-05, 'epoch': 2.48} 25%|██▍ | 2485/10000 [3:54:39<11:33:24, 5.54s/it][2025-06-19 17:24:24,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:24:24,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.17 | bwd_microstep: 3334.79 | bwd_inner_microstep: 3333.79 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.91 [2025-06-19 17:24:24,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.17 | bwd: 3334.81 | bwd_inner: 3333.79 | bwd_allreduce: 0.97 | step: 7.91 25%|██▍ | 2486/10000 [3:54:45<11:31:47, 5.52s/it] {'loss': 0.1152, 'grad_norm': 0.9880447387695312, 'learning_rate': 3.519338572096119e-05, 'epoch': 2.49} 25%|██▍ | 2486/10000 [3:54:45<11:31:47, 5.52s/it][2025-06-19 17:24:30,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:24:30,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.46 | bwd_microstep: 3327.47 | bwd_inner_microstep: 3326.40 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.06 [2025-06-19 17:24:30,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.46 | bwd: 3327.49 | bwd_inner: 3326.40 | bwd_allreduce: 1.04 | step: 7.06 25%|██▍ | 2487/10000 [3:54:50<11:30:13, 5.51s/it] {'loss': 0.05, 'grad_norm': 0.49913015961647034, 'learning_rate': 3.518917253683292e-05, 'epoch': 2.49} 25%|██▍ | 2487/10000 [3:54:50<11:30:13, 5.51s/it][2025-06-19 17:24:35,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 
[2025-06-19 17:24:35,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.58 | bwd_microstep: 3328.42 | bwd_inner_microstep: 3327.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 17:24:35,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.58 | bwd: 3328.43 | bwd_inner: 3327.62 | bwd_allreduce: 0.77 | step: 6.65 25%|██▍ | 2488/10000 [3:54:56<11:29:22, 5.51s/it] {'loss': 0.057, 'grad_norm': 0.5664817094802856, 'learning_rate': 3.518495775943096e-05, 'epoch': 2.49} 25%|██▍ | 2488/10000 [3:54:56<11:29:22, 5.51s/it][2025-06-19 17:24:41,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:24:41,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.27 | bwd_microstep: 3388.22 | bwd_inner_microstep: 3387.32 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.31 [2025-06-19 17:24:41,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.27 | bwd: 3388.24 | bwd_inner: 3387.32 | bwd_allreduce: 0.87 | step: 7.33 25%|██▍ | 2489/10000 [3:55:01<11:31:36, 5.52s/it] {'loss': 0.0548, 'grad_norm': 0.9600605964660645, 'learning_rate': 3.51807413891974e-05, 'epoch': 2.49} 25%|██▍ | 2489/10000 [3:55:01<11:31:36, 5.52s/it][2025-06-19 17:24:46,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:24:46,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.28 | bwd_microstep: 3375.33 | bwd_inner_microstep: 3374.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.47 [2025-06-19 17:24:46,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.28 | bwd: 3375.34 | bwd_inner: 3374.53 | bwd_allreduce: 0.77 | step: 7.48 25%|██▍ | 2490/10000 [3:55:07<11:32:20, 5.53s/it] {'loss': 0.0406, 'grad_norm': 0.7490702271461487, 'learning_rate': 3.517652342657454e-05, 
'epoch': 2.49} 25%|██▍ | 2490/10000 [3:55:07<11:32:20, 5.53s/it][2025-06-19 17:24:52,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:24:52,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.25 | bwd_microstep: 3364.41 | bwd_inner_microstep: 3363.57 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.99 [2025-06-19 17:24:52,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.25 | bwd: 3364.44 | bwd_inner: 3363.57 | bwd_allreduce: 0.80 | step: 7.00 25%|██▍ | 2491/10000 [3:55:12<11:32:29, 5.53s/it] {'loss': 0.0661, 'grad_norm': 0.9423381090164185, 'learning_rate': 3.51723038720048e-05, 'epoch': 2.49} 25%|██▍ | 2491/10000 [3:55:12<11:32:29, 5.53s/it][2025-06-19 17:24:57,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:24:57,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.84 | bwd_microstep: 3333.39 | bwd_inner_microstep: 3332.47 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.99 [2025-06-19 17:24:57,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.84 | bwd: 3333.40 | bwd_inner: 3332.47 | bwd_allreduce: 0.89 | step: 6.99 25%|██▍ | 2492/10000 [3:55:18<11:30:39, 5.52s/it] {'loss': 0.1645, 'grad_norm': 1.7394007444381714, 'learning_rate': 3.5168082725930794e-05, 'epoch': 2.49} 25%|██▍ | 2492/10000 [3:55:18<11:30:39, 5.52s/it][2025-06-19 17:25:03,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:25:03,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.16 | bwd_microstep: 3368.97 | bwd_inner_microstep: 3368.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 17:25:03,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2131.16 | bwd: 3368.99 | bwd_inner: 3368.19 | bwd_allreduce: 0.76 | step: 6.57 25%|██▍ | 2493/10000 [3:55:24<11:31:15, 5.52s/it] {'loss': 0.1521, 'grad_norm': 2.18904972076416, 'learning_rate': 3.516385998879531e-05, 'epoch': 2.49} 25%|██▍ | 2493/10000 [3:55:24<11:31:15, 5.52s/it][2025-06-19 17:25:08,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:25:08,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.02 | bwd_microstep: 3373.62 | bwd_inner_microstep: 3372.72 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.95 [2025-06-19 17:25:08,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.02 | bwd: 3373.64 | bwd_inner: 3372.72 | bwd_allreduce: 0.87 | step: 6.95 25%|██▍ | 2494/10000 [3:55:29<11:31:51, 5.53s/it] {'loss': 0.0657, 'grad_norm': 0.7070091962814331, 'learning_rate': 3.515963566104129e-05, 'epoch': 2.49} 25%|██▍ | 2494/10000 [3:55:29<11:31:51, 5.53s/it][2025-06-19 17:25:14,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:25:14,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.47 | bwd_microstep: 3368.23 | bwd_inner_microstep: 3367.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 17:25:14,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.47 | bwd: 3368.25 | bwd_inner: 3367.45 | bwd_allreduce: 0.75 | step: 6.56 25%|██▍ | 2495/10000 [3:55:35<11:32:17, 5.53s/it] {'loss': 0.047, 'grad_norm': 0.6455617547035217, 'learning_rate': 3.515540974311185e-05, 'epoch': 2.5} 25%|██▍ | 2495/10000 [3:55:35<11:32:17, 5.53s/it][2025-06-19 17:25:19,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 17:25:19,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2108.90 | bwd_microstep: 3328.85 | bwd_inner_microstep: 3327.98 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.51 [2025-06-19 17:25:19,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.90 | bwd: 3328.87 | bwd_inner: 3327.98 | bwd_allreduce: 0.83 | step: 7.52 25%|██▍ | 2496/10000 [3:55:40<11:30:17, 5.52s/it] {'loss': 0.1303, 'grad_norm': 1.447033405303955, 'learning_rate': 3.5151182235450264e-05, 'epoch': 2.5} 25%|██▍ | 2496/10000 [3:55:40<11:30:17, 5.52s/it][2025-06-19 17:25:25,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:25:25,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2161.15 | bwd_microstep: 3380.33 | bwd_inner_microstep: 3379.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 17:25:25,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2161.15 | bwd: 3380.34 | bwd_inner: 3379.54 | bwd_allreduce: 0.77 | step: 7.02 25%|██▍ | 2497/10000 [3:55:46<11:32:34, 5.54s/it] {'loss': 0.096, 'grad_norm': 0.8620128035545349, 'learning_rate': 3.514695313849998e-05, 'epoch': 2.5} 25%|██▍ | 2497/10000 [3:55:46<11:32:34, 5.54s/it][2025-06-19 17:25:30,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:25:30,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.95 | bwd_microstep: 3389.64 | bwd_inner_microstep: 3388.71 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.53 [2025-06-19 17:25:30,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.95 | bwd: 3389.65 | bwd_inner: 3388.71 | bwd_allreduce: 0.89 | step: 7.54 25%|██▍ | 2498/10000 [3:55:51<11:33:46, 5.55s/it] {'loss': 0.1358, 'grad_norm': 1.2630397081375122, 'learning_rate': 3.514272245270461e-05, 'epoch': 2.5} 25%|██▍ | 2498/10000 [3:55:51<11:33:46, 5.55s/it][2025-06-19 
17:25:36,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:25:36,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.54 | bwd_microstep: 3323.98 | bwd_inner_microstep: 3323.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 17:25:36,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.54 | bwd: 3323.99 | bwd_inner: 3323.18 | bwd_allreduce: 0.77 | step: 6.71 25%|██▍ | 2499/10000 [3:55:57<11:36:02, 5.57s/it] {'loss': 0.0666, 'grad_norm': 0.7554882764816284, 'learning_rate': 3.513849017850794e-05, 'epoch': 2.5} 25%|██▍ | 2499/10000 [3:55:57<11:36:02, 5.57s/it][2025-06-19 17:25:42,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:25:42,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.55 | bwd_microstep: 3324.35 | bwd_inner_microstep: 3323.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 17:25:42,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.55 | bwd: 3324.37 | bwd_inner: 3323.57 | bwd_allreduce: 0.76 | step: 6.64 25%|██▌ | 2500/10000 [3:56:02<11:32:33, 5.54s/it] {'loss': 0.1393, 'grad_norm': 0.9184322357177734, 'learning_rate': 3.513425631635391e-05, 'epoch': 2.5} 25%|██▌ | 2500/10000 [3:56:02<11:32:33, 5.54s/it][2025-06-19 17:25:47,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:25:47,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.09 | bwd_microstep: 3323.00 | bwd_inner_microstep: 3322.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 17:25:47,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.09 | bwd: 3323.01 | bwd_inner: 3322.20 | bwd_allreduce: 0.77 | step: 6.68 
25%|██▌ | 2501/10000 [3:56:08<11:29:35, 5.52s/it] {'loss': 0.0375, 'grad_norm': 0.6862058639526367, 'learning_rate': 3.513002086668663e-05, 'epoch': 2.5} 25%|██▌ | 2501/10000 [3:56:08<11:29:35, 5.52s/it][2025-06-19 17:25:53,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:25:53,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.04 | bwd_microstep: 3397.18 | bwd_inner_microstep: 3396.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 17:25:53,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.04 | bwd: 3397.19 | bwd_inner: 3396.40 | bwd_allreduce: 0.75 | step: 6.58 25%|██▌ | 2502/10000 [3:56:13<11:31:47, 5.54s/it] {'loss': 0.1039, 'grad_norm': 1.0441802740097046, 'learning_rate': 3.512578382995038e-05, 'epoch': 2.5} 25%|██▌ | 2502/10000 [3:56:13<11:31:47, 5.54s/it][2025-06-19 17:25:58,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:25:58,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.77 | bwd_microstep: 3383.06 | bwd_inner_microstep: 3381.98 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.45 [2025-06-19 17:25:58,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.77 | bwd: 3383.08 | bwd_inner: 3381.99 | bwd_allreduce: 1.04 | step: 7.45 25%|██▌ | 2503/10000 [3:56:19<11:32:36, 5.54s/it] {'loss': 0.0864, 'grad_norm': 1.040879249572754, 'learning_rate': 3.51215452065896e-05, 'epoch': 2.5} 25%|██▌ | 2503/10000 [3:56:19<11:32:36, 5.54s/it][2025-06-19 17:26:04,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-19 17:26:04,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.21 | bwd_microstep: 3328.88 | bwd_inner_microstep: 3327.88 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 17:26:04,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.21 | bwd: 3328.89 | bwd_inner: 3327.88 | bwd_allreduce: 0.77 | step: 6.87 25%|██▌ | 2504/10000 [3:56:24<11:29:54, 5.52s/it] {'loss': 0.0601, 'grad_norm': 0.8916642069816589, 'learning_rate': 3.511730499704893e-05, 'epoch': 2.5} 25%|██▌ | 2504/10000 [3:56:24<11:29:54, 5.52s/it][2025-06-19 17:26:09,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:26:09,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.68 | bwd_microstep: 3319.97 | bwd_inner_microstep: 3319.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 17:26:09,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.68 | bwd: 3319.99 | bwd_inner: 3319.17 | bwd_allreduce: 0.78 | step: 7.00 25%|██▌ | 2505/10000 [3:56:30<11:27:30, 5.50s/it] {'loss': 0.062, 'grad_norm': 0.7710597515106201, 'learning_rate': 3.511306320177311e-05, 'epoch': 2.5} 25%|██▌ | 2505/10000 [3:56:30<11:27:30, 5.50s/it][2025-06-19 17:26:15,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:26:15,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.42 | bwd_microstep: 3313.65 | bwd_inner_microstep: 3312.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 17:26:15,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.42 | bwd: 3313.66 | bwd_inner: 3312.86 | bwd_allreduce: 0.76 | step: 6.67 25%|██▌ | 2506/10000 [3:56:35<11:25:30, 5.49s/it] {'loss': 0.1196, 'grad_norm': 0.8768095374107361, 'learning_rate': 3.5108819821207106e-05, 'epoch': 2.51} 25%|██▌ | 2506/10000 [3:56:35<11:25:30, 5.49s/it][2025-06-19 17:26:20,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:26:20,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.38 | bwd_microstep: 3336.32 | bwd_inner_microstep: 3335.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 17:26:20,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.38 | bwd: 3336.33 | bwd_inner: 3335.52 | bwd_allreduce: 0.77 | step: 6.71 25%|██▌ | 2507/10000 [3:56:41<11:24:52, 5.48s/it] {'loss': 0.0739, 'grad_norm': 0.8933243751525879, 'learning_rate': 3.510457485579603e-05, 'epoch': 2.51} 25%|██▌ | 2507/10000 [3:56:41<11:24:52, 5.48s/it][2025-06-19 17:26:26,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:26:26,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.74 | bwd_microstep: 3383.63 | bwd_inner_microstep: 3382.76 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.14 [2025-06-19 17:26:26,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.74 | bwd: 3383.65 | bwd_inner: 3382.76 | bwd_allreduce: 0.83 | step: 7.14 25%|██▌ | 2508/10000 [3:56:46<11:27:02, 5.50s/it] {'loss': 0.1442, 'grad_norm': 1.2208141088485718, 'learning_rate': 3.5100328305985146e-05, 'epoch': 2.51} 25%|██▌ | 2508/10000 [3:56:46<11:27:02, 5.50s/it][2025-06-19 17:26:31,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:26:31,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.84 | bwd_microstep: 3350.68 | bwd_inner_microstep: 3349.80 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.25 [2025-06-19 17:26:31,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.84 | bwd: 3350.70 | bwd_inner: 3349.80 | bwd_allreduce: 0.83 | step: 7.25 25%|██▌ | 2509/10000 [3:56:52<11:28:01, 5.51s/it] {'loss': 0.0317, 'grad_norm': 
0.6405633091926575, 'learning_rate': 3.5096080172219914e-05, 'epoch': 2.51} 25%|██▌ | 2509/10000 [3:56:52<11:28:01, 5.51s/it][2025-06-19 17:26:37,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:26:37,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.16 | bwd_microstep: 3348.05 | bwd_inner_microstep: 3347.16 | bwd_allreduce_microstep: 0.82 | step_microstep: 8.02 [2025-06-19 17:26:37,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.16 | bwd: 3348.07 | bwd_inner: 3347.16 | bwd_allreduce: 0.85 | step: 8.02 25%|██▌ | 2510/10000 [3:56:57<11:27:15, 5.51s/it] {'loss': 0.0444, 'grad_norm': 0.7170088291168213, 'learning_rate': 3.509183045494593e-05, 'epoch': 2.51} 25%|██▌ | 2510/10000 [3:56:57<11:27:15, 5.51s/it][2025-06-19 17:26:42,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:26:42,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.26 | bwd_microstep: 3375.99 | bwd_inner_microstep: 3375.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 17:26:42,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.26 | bwd: 3376.00 | bwd_inner: 3375.20 | bwd_allreduce: 0.76 | step: 6.72 25%|██▌ | 2511/10000 [3:57:03<11:29:48, 5.53s/it] {'loss': 0.1004, 'grad_norm': 1.0594576597213745, 'learning_rate': 3.508757915460897e-05, 'epoch': 2.51} 25%|██▌ | 2511/10000 [3:57:03<11:29:48, 5.53s/it][2025-06-19 17:26:48,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:26:48,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.84 | bwd_microstep: 3325.63 | bwd_inner_microstep: 3324.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 17:26:48,100] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.84 | bwd: 3325.64 | bwd_inner: 3324.83 | bwd_allreduce: 0.77 | step: 6.88 25%|██▌ | 2512/10000 [3:57:08<11:27:20, 5.51s/it] {'loss': 0.0444, 'grad_norm': 0.5624334216117859, 'learning_rate': 3.508332627165499e-05, 'epoch': 2.51} 25%|██▌ | 2512/10000 [3:57:08<11:27:20, 5.51s/it][2025-06-19 17:26:53,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:26:53,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.55 | bwd_microstep: 3317.62 | bwd_inner_microstep: 3316.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 17:26:53,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.55 | bwd: 3317.63 | bwd_inner: 3316.83 | bwd_allreduce: 0.76 | step: 6.68 25%|██▌ | 2513/10000 [3:57:14<11:25:22, 5.49s/it] {'loss': 0.1941, 'grad_norm': 1.708354115486145, 'learning_rate': 3.507907180653008e-05, 'epoch': 2.51} 25%|██▌ | 2513/10000 [3:57:14<11:25:22, 5.49s/it][2025-06-19 17:26:59,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:26:59,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.55 | bwd_microstep: 3361.31 | bwd_inner_microstep: 3360.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 8.43 [2025-06-19 17:26:59,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.55 | bwd: 3361.33 | bwd_inner: 3360.51 | bwd_allreduce: 0.77 | step: 8.45 25%|██▌ | 2514/10000 [3:57:19<11:26:30, 5.50s/it] {'loss': 0.0455, 'grad_norm': 0.5025665163993835, 'learning_rate': 3.507481575968053e-05, 'epoch': 2.51} 25%|██▌ | 2514/10000 [3:57:19<11:26:30, 5.50s/it][2025-06-19 17:27:04,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 
17:27:04,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.17 | bwd_microstep: 3317.89 | bwd_inner_microstep: 3317.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 17:27:04,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.17 | bwd: 3317.90 | bwd_inner: 3317.10 | bwd_allreduce: 0.76 | step: 6.66 25%|██▌ | 2515/10000 [3:57:25<11:24:41, 5.49s/it] {'loss': 0.079, 'grad_norm': 0.9085714221000671, 'learning_rate': 3.507055813155276e-05, 'epoch': 2.52} 25%|██▌ | 2515/10000 [3:57:25<11:24:41, 5.49s/it][2025-06-19 17:27:10,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:27:10,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.29 | bwd_microstep: 3366.17 | bwd_inner_microstep: 3365.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 17:27:10,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.29 | bwd: 3366.18 | bwd_inner: 3365.37 | bwd_allreduce: 0.77 | step: 6.94 25%|██▌ | 2516/10000 [3:57:30<11:26:16, 5.50s/it] {'loss': 0.0699, 'grad_norm': 1.2317793369293213, 'learning_rate': 3.50662989225934e-05, 'epoch': 2.52} 25%|██▌ | 2516/10000 [3:57:30<11:26:16, 5.50s/it][2025-06-19 17:27:15,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:27:15,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.52 | bwd_microstep: 3316.00 | bwd_inner_microstep: 3315.11 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.37 [2025-06-19 17:27:15,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.52 | bwd: 3316.02 | bwd_inner: 3315.11 | bwd_allreduce: 0.85 | step: 7.37 25%|██▌ | 2517/10000 [3:57:36<11:24:35, 5.49s/it] {'loss': 0.076, 'grad_norm': 1.1327908039093018, 'learning_rate': 3.50620381332492e-05, 'epoch': 2.52} 
25%|██▌ | 2517/10000 [3:57:36<11:24:35, 5.49s/it][2025-06-19 17:27:21,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:27:21,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.53 | bwd_microstep: 3394.85 | bwd_inner_microstep: 3394.03 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.80 [2025-06-19 17:27:21,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.53 | bwd: 3394.86 | bwd_inner: 3394.03 | bwd_allreduce: 0.78 | step: 6.80 25%|██▌ | 2518/10000 [3:57:41<11:27:58, 5.52s/it] {'loss': 0.1019, 'grad_norm': 0.8332833647727966, 'learning_rate': 3.50577757639671e-05, 'epoch': 2.52} 25%|██▌ | 2518/10000 [3:57:41<11:27:58, 5.52s/it][2025-06-19 17:27:26,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 17:27:26,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.65 | bwd_microstep: 3394.81 | bwd_inner_microstep: 3393.80 | bwd_allreduce_microstep: 0.92 | step_microstep: 8.69 [2025-06-19 17:27:26,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.65 | bwd: 3394.85 | bwd_inner: 3393.80 | bwd_allreduce: 0.96 | step: 8.70 25%|██▌ | 2519/10000 [3:57:47<11:30:46, 5.54s/it] {'loss': 0.1039, 'grad_norm': 0.8919432163238525, 'learning_rate': 3.505351181519422e-05, 'epoch': 2.52} 25%|██▌ | 2519/10000 [3:57:47<11:30:46, 5.54s/it][2025-06-19 17:27:32,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 17:27:32,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.47 | bwd_microstep: 3320.70 | bwd_inner_microstep: 3319.74 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.51 [2025-06-19 17:27:32,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.47 | bwd: 
3320.74 | bwd_inner: 3319.74 | bwd_allreduce: 0.92 | step: 7.51 25%|██▌ | 2520/10000 [3:57:53<11:29:14, 5.53s/it] {'loss': 0.1231, 'grad_norm': 1.285766363143921, 'learning_rate': 3.504924628737781e-05, 'epoch': 2.52} 25%|██▌ | 2520/10000 [3:57:53<11:29:14, 5.53s/it][2025-06-19 17:27:37,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:27:37,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.90 | bwd_microstep: 3315.57 | bwd_inner_microstep: 3314.59 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.76 [2025-06-19 17:27:37,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.90 | bwd: 3315.59 | bwd_inner: 3314.59 | bwd_allreduce: 0.95 | step: 7.76 25%|██▌ | 2521/10000 [3:57:58<11:27:41, 5.52s/it] {'loss': 0.0771, 'grad_norm': 0.8819826245307922, 'learning_rate': 3.5044979180965306e-05, 'epoch': 2.52} 25%|██▌ | 2521/10000 [3:57:58<11:27:41, 5.52s/it][2025-06-19 17:27:43,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 17:27:43,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2168.28 | bwd_microstep: 3377.54 | bwd_inner_microstep: 3376.57 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.74 [2025-06-19 17:27:43,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2168.28 | bwd: 3377.57 | bwd_inner: 3376.57 | bwd_allreduce: 0.92 | step: 7.74 25%|██▌ | 2522/10000 [3:58:04<11:30:28, 5.54s/it] {'loss': 0.0934, 'grad_norm': 1.330863118171692, 'learning_rate': 3.5040710496404315e-05, 'epoch': 2.52} 25%|██▌ | 2522/10000 [3:58:04<11:30:28, 5.54s/it][2025-06-19 17:27:48,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:27:48,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2114.08 | bwd_microstep: 3320.37 | bwd_inner_microstep: 3319.20 | bwd_allreduce_microstep: 1.10 | step_microstep: 7.42 [2025-06-19 17:27:48,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.08 | bwd: 3320.39 | bwd_inner: 3319.20 | bwd_allreduce: 1.13 | step: 7.43 25%|██▌ | 2523/10000 [3:58:09<11:28:02, 5.52s/it] {'loss': 0.0875, 'grad_norm': 0.9941777586936951, 'learning_rate': 3.50364402341426e-05, 'epoch': 2.52} 25%|██▌ | 2523/10000 [3:58:09<11:28:02, 5.52s/it][2025-06-19 17:27:54,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:27:54,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.83 | bwd_microstep: 3360.20 | bwd_inner_microstep: 3359.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 17:27:54,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.83 | bwd: 3360.22 | bwd_inner: 3359.40 | bwd_allreduce: 0.78 | step: 7.01 25%|██▌ | 2524/10000 [3:58:15<11:28:28, 5.53s/it] {'loss': 0.058, 'grad_norm': 0.5673074126243591, 'learning_rate': 3.503216839462809e-05, 'epoch': 2.52} 25%|██▌ | 2524/10000 [3:58:15<11:28:28, 5.53s/it][2025-06-19 17:27:59,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:27:59,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.00 | bwd_microstep: 3320.24 | bwd_inner_microstep: 3319.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.28 [2025-06-19 17:27:59,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.00 | bwd: 3320.25 | bwd_inner: 3319.43 | bwd_allreduce: 0.77 | step: 7.28 25%|██▌ | 2525/10000 [3:58:20<11:26:08, 5.51s/it] {'loss': 0.0936, 'grad_norm': 0.8486220240592957, 'learning_rate': 3.5027894978308886e-05, 'epoch': 2.52} 25%|██▌ | 2525/10000 [3:58:20<11:26:08, 5.51s/it][2025-06-19 17:28:05,351] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:28:05,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.05 | bwd_microstep: 3397.23 | bwd_inner_microstep: 3396.29 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.04 [2025-06-19 17:28:05,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.05 | bwd: 3397.26 | bwd_inner: 3396.29 | bwd_allreduce: 0.89 | step: 7.03 25%|██▌ | 2526/10000 [3:58:26<11:28:45, 5.53s/it] {'loss': 0.0603, 'grad_norm': 1.0108599662780762, 'learning_rate': 3.502361998563324e-05, 'epoch': 2.53} 25%|██▌ | 2526/10000 [3:58:26<11:28:45, 5.53s/it][2025-06-19 17:28:10,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:28:10,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.75 | bwd_microstep: 3312.23 | bwd_inner_microstep: 3311.39 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.05 [2025-06-19 17:28:10,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.75 | bwd: 3312.25 | bwd_inner: 3311.39 | bwd_allreduce: 0.80 | step: 7.05 25%|██▌ | 2527/10000 [3:58:31<11:26:31, 5.51s/it] {'loss': 0.0623, 'grad_norm': 1.186518669128418, 'learning_rate': 3.501934341704958e-05, 'epoch': 2.53} 25%|██▌ | 2527/10000 [3:58:31<11:26:31, 5.51s/it][2025-06-19 17:28:16,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.85 [2025-06-19 17:28:16,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.23 | bwd_microstep: 3372.01 | bwd_inner_microstep: 3371.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 17:28:16,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.22 | bwd: 3372.02 | bwd_inner: 3371.21 | bwd_allreduce: 0.78 | step: 7.00 25%|██▌ | 2528/10000 
[3:58:37<11:27:44, 5.52s/it] {'loss': 0.1199, 'grad_norm': 1.449506402015686, 'learning_rate': 3.501506527300651e-05, 'epoch': 2.53} 25%|██▌ | 2528/10000 [3:58:37<11:27:44, 5.52s/it][2025-06-19 17:28:21,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 2.11 | optimizer_step: 3.21 [2025-06-19 17:28:21,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.53 | bwd_microstep: 3315.05 | bwd_inner_microstep: 3313.46 | bwd_allreduce_microstep: 1.34 | step_microstep: 15.11 [2025-06-19 17:28:21,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.53 | bwd: 3315.16 | bwd_inner: 3313.46 | bwd_allreduce: 1.44 | step: 15.07 25%|██▌ | 2529/10000 [3:58:42<11:26:30, 5.51s/it] {'loss': 0.1158, 'grad_norm': 1.6376245021820068, 'learning_rate': 3.5010785553952784e-05, 'epoch': 2.53} 25%|██▌ | 2529/10000 [3:58:42<11:26:30, 5.51s/it][2025-06-19 17:28:27,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:28:27,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.91 | bwd_microstep: 3326.78 | bwd_inner_microstep: 3325.92 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.14 [2025-06-19 17:28:27,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.91 | bwd: 3326.80 | bwd_inner: 3325.92 | bwd_allreduce: 0.82 | step: 7.14 25%|██▌ | 2530/10000 [3:58:48<11:25:56, 5.51s/it] {'loss': 0.0767, 'grad_norm': 0.7149010300636292, 'learning_rate': 3.500650426033731e-05, 'epoch': 2.53} 25%|██▌ | 2530/10000 [3:58:48<11:25:56, 5.51s/it][2025-06-19 17:28:32,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.71 | optimizer_step: 2.73 [2025-06-19 17:28:32,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.94 | bwd_microstep: 3329.02 | bwd_inner_microstep: 3327.29 | 
bwd_allreduce_microstep: 1.59 | step_microstep: 10.69 [2025-06-19 17:28:32,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.94 | bwd: 3329.07 | bwd_inner: 3327.29 | bwd_allreduce: 1.67 | step: 10.73 25%|██▌ | 2531/10000 [3:58:53<11:26:17, 5.51s/it] {'loss': 0.1226, 'grad_norm': 1.188231348991394, 'learning_rate': 3.5002221392609196e-05, 'epoch': 2.53} 25%|██▌ | 2531/10000 [3:58:53<11:26:17, 5.51s/it][2025-06-19 17:28:38,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 17:28:38,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2177.12 | bwd_microstep: 3383.01 | bwd_inner_microstep: 3381.82 | bwd_allreduce_microstep: 1.11 | step_microstep: 7.82 [2025-06-19 17:28:38,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2177.07 | bwd: 3383.03 | bwd_inner: 3381.82 | bwd_allreduce: 1.14 | step: 7.83 25%|██▌ | 2532/10000 [3:58:59<11:30:29, 5.55s/it] {'loss': 0.1065, 'grad_norm': 1.2443821430206299, 'learning_rate': 3.499793695121768e-05, 'epoch': 2.53} 25%|██▌ | 2532/10000 [3:58:59<11:30:29, 5.55s/it][2025-06-19 17:28:44,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-19 17:28:44,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2184.29 | bwd_microstep: 3406.17 | bwd_inner_microstep: 3404.80 | bwd_allreduce_microstep: 1.28 | step_microstep: 9.28 [2025-06-19 17:28:44,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2184.29 | bwd: 3406.20 | bwd_inner: 3404.80 | bwd_allreduce: 1.32 | step: 9.29 25%|██▌ | 2533/10000 [3:59:04<11:34:10, 5.58s/it] {'loss': 0.0566, 'grad_norm': 1.5048128366470337, 'learning_rate': 3.4993650936612184e-05, 'epoch': 2.53} 25%|██▌ | 2533/10000 [3:59:04<11:34:10, 5.58s/it][2025-06-19 17:28:49,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:28:49,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2160.11 | bwd_microstep: 3322.26 | bwd_inner_microstep: 3321.40 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.15 [2025-06-19 17:28:49,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2160.11 | bwd: 3322.28 | bwd_inner: 3321.40 | bwd_allreduce: 0.82 | step: 7.16 25%|██▌ | 2534/10000 [3:59:10<11:32:26, 5.56s/it] {'loss': 0.154, 'grad_norm': 1.6175700426101685, 'learning_rate': 3.4989363349242295e-05, 'epoch': 2.53} 25%|██▌ | 2534/10000 [3:59:10<11:32:26, 5.56s/it][2025-06-19 17:28:55,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:28:55,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2150.59 | bwd_microstep: 3364.33 | bwd_inner_microstep: 3363.46 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.12 [2025-06-19 17:28:55,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2150.59 | bwd: 3364.35 | bwd_inner: 3363.46 | bwd_allreduce: 0.83 | step: 7.12 25%|██▌ | 2535/10000 [3:59:16<11:32:12, 5.56s/it] {'loss': 0.0524, 'grad_norm': 0.5872381329536438, 'learning_rate': 3.498507418955776e-05, 'epoch': 2.54} 25%|██▌ | 2535/10000 [3:59:16<11:32:12, 5.56s/it][2025-06-19 17:29:00,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:29:00,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.45 | bwd_microstep: 3323.71 | bwd_inner_microstep: 3322.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 17:29:00,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.45 | bwd: 3323.72 | bwd_inner: 3322.92 | bwd_allreduce: 0.76 | step: 6.61 25%|██▌ | 2536/10000 [3:59:21<11:29:31, 5.54s/it] {'loss': 0.0833, 
'grad_norm': 1.0924879312515259, 'learning_rate': 3.4980783458008484e-05, 'epoch': 2.54} 25%|██▌ | 2536/10000 [3:59:21<11:29:31, 5.54s/it][2025-06-19 17:29:06,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:29:06,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.23 | bwd_microstep: 3312.30 | bwd_inner_microstep: 3311.47 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.81 [2025-06-19 17:29:06,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.23 | bwd: 3312.32 | bwd_inner: 3311.47 | bwd_allreduce: 0.80 | step: 6.81 25%|██▌ | 2537/10000 [3:59:27<11:27:14, 5.53s/it] {'loss': 0.1708, 'grad_norm': 1.3960658311843872, 'learning_rate': 3.497649115504456e-05, 'epoch': 2.54} 25%|██▌ | 2537/10000 [3:59:27<11:27:14, 5.53s/it][2025-06-19 17:29:11,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:29:11,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.24 | bwd_microstep: 3322.28 | bwd_inner_microstep: 3321.36 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.94 [2025-06-19 17:29:11,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.25 | bwd: 3322.29 | bwd_inner: 3321.36 | bwd_allreduce: 0.89 | step: 6.95 25%|██▌ | 2538/10000 [3:59:32<11:25:06, 5.51s/it] {'loss': 0.0827, 'grad_norm': 0.616404116153717, 'learning_rate': 3.497219728111621e-05, 'epoch': 2.54} 25%|██▌ | 2538/10000 [3:59:32<11:25:06, 5.51s/it][2025-06-19 17:29:17,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 17:29:17,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.68 | bwd_microstep: 3322.83 | bwd_inner_microstep: 3321.65 | bwd_allreduce_microstep: 1.10 | step_microstep: 9.06 [2025-06-19 
17:29:17,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.68 | bwd: 3322.86 | bwd_inner: 3321.65 | bwd_allreduce: 1.14 | step: 9.07 25%|██▌ | 2539/10000 [3:59:38<11:25:01, 5.51s/it] {'loss': 0.068, 'grad_norm': 0.9014118909835815, 'learning_rate': 3.496790183667387e-05, 'epoch': 2.54} 25%|██▌ | 2539/10000 [3:59:38<11:25:01, 5.51s/it][2025-06-19 17:29:22,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 17:29:22,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.89 | bwd_microstep: 3317.48 | bwd_inner_microstep: 3316.47 | bwd_allreduce_microstep: 0.93 | step_microstep: 9.03 [2025-06-19 17:29:22,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.89 | bwd: 3317.51 | bwd_inner: 3316.47 | bwd_allreduce: 0.97 | step: 9.06 25%|██▌ | 2540/10000 [3:59:43<11:24:23, 5.50s/it] {'loss': 0.0709, 'grad_norm': 0.7493585348129272, 'learning_rate': 3.496360482216808e-05, 'epoch': 2.54} 25%|██▌ | 2540/10000 [3:59:43<11:24:23, 5.50s/it][2025-06-19 17:29:28,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:29:28,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.25 | bwd_microstep: 3331.42 | bwd_inner_microstep: 3330.44 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.26 [2025-06-19 17:29:28,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.25 | bwd: 3331.44 | bwd_inner: 3330.44 | bwd_allreduce: 0.94 | step: 7.26 25%|██▌ | 2541/10000 [3:59:49<11:25:07, 5.51s/it] {'loss': 0.0623, 'grad_norm': 0.9075900316238403, 'learning_rate': 3.49593062380496e-05, 'epoch': 2.54} 25%|██▌ | 2541/10000 [3:59:49<11:25:07, 5.51s/it][2025-06-19 17:29:33,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 
[2025-06-19 17:29:33,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.13 | bwd_microstep: 3318.55 | bwd_inner_microstep: 3317.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 17:29:33,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.13 | bwd: 3318.57 | bwd_inner: 3317.76 | bwd_allreduce: 0.77 | step: 6.69 25%|██▌ | 2542/10000 [3:59:54<11:23:59, 5.50s/it] {'loss': 0.1604, 'grad_norm': 2.1885693073272705, 'learning_rate': 3.495500608476932e-05, 'epoch': 2.54} 25%|██▌ | 2542/10000 [3:59:54<11:23:59, 5.50s/it][2025-06-19 17:29:39,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:29:39,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.30 | bwd_microstep: 3374.47 | bwd_inner_microstep: 3373.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 17:29:39,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.30 | bwd: 3374.48 | bwd_inner: 3373.67 | bwd_allreduce: 0.76 | step: 6.81 25%|██▌ | 2543/10000 [4:00:00<11:25:40, 5.52s/it] {'loss': 0.0483, 'grad_norm': 0.6504806280136108, 'learning_rate': 3.495070436277832e-05, 'epoch': 2.54} 25%|██▌ | 2543/10000 [4:00:00<11:25:40, 5.52s/it][2025-06-19 17:29:44,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:29:44,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.98 | bwd_microstep: 3319.25 | bwd_inner_microstep: 3318.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 17:29:44,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.98 | bwd: 3319.26 | bwd_inner: 3318.44 | bwd_allreduce: 0.78 | step: 7.01 25%|██▌ | 2544/10000 [4:00:05<11:24:21, 5.51s/it] {'loss': 0.0947, 'grad_norm': 0.8900743126869202, 'learning_rate': 3.494640107252781e-05, 
'epoch': 2.54} 25%|██▌ | 2544/10000 [4:00:05<11:24:21, 5.51s/it][2025-06-19 17:29:50,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 17:29:50,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.15 | bwd_microstep: 3312.21 | bwd_inner_microstep: 3311.06 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.88 [2025-06-19 17:29:50,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.15 | bwd: 3312.23 | bwd_inner: 3311.06 | bwd_allreduce: 1.11 | step: 7.89 25%|██▌ | 2545/10000 [4:00:11<11:22:57, 5.50s/it] {'loss': 0.1122, 'grad_norm': 0.9914515018463135, 'learning_rate': 3.49420962144692e-05, 'epoch': 2.54} 25%|██▌ | 2545/10000 [4:00:11<11:22:57, 5.50s/it][2025-06-19 17:29:55,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:29:55,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.88 | bwd_microstep: 3313.48 | bwd_inner_microstep: 3312.45 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.32 [2025-06-19 17:29:55,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.88 | bwd: 3313.50 | bwd_inner: 3312.45 | bwd_allreduce: 1.00 | step: 7.32 25%|██▌ | 2546/10000 [4:00:16<11:21:55, 5.49s/it] {'loss': 0.0577, 'grad_norm': 0.8314306735992432, 'learning_rate': 3.493778978905404e-05, 'epoch': 2.55} 25%|██▌ | 2546/10000 [4:00:16<11:21:55, 5.49s/it][2025-06-19 17:30:01,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:30:01,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.06 | bwd_microstep: 3363.93 | bwd_inner_microstep: 3363.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 17:30:01,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2138.06 | bwd: 3363.94 | bwd_inner: 3363.14 | bwd_allreduce: 0.76 | step: 6.68 25%|██▌ | 2547/10000 [4:00:22<11:23:47, 5.50s/it] {'loss': 0.1113, 'grad_norm': 1.2173678874969482, 'learning_rate': 3.4933481796734066e-05, 'epoch': 2.55} 25%|██▌ | 2547/10000 [4:00:22<11:23:47, 5.50s/it][2025-06-19 17:30:06,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:30:06,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.10 | bwd_microstep: 3323.16 | bwd_inner_microstep: 3322.20 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.58 [2025-06-19 17:30:06,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.10 | bwd: 3323.17 | bwd_inner: 3322.20 | bwd_allreduce: 0.93 | step: 7.60 25%|██▌ | 2548/10000 [4:00:27<11:22:39, 5.50s/it] {'loss': 0.0986, 'grad_norm': 0.926315426826477, 'learning_rate': 3.492917223796116e-05, 'epoch': 2.55} 25%|██▌ | 2548/10000 [4:00:27<11:22:39, 5.50s/it][2025-06-19 17:30:12,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 17:30:12,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.97 | bwd_microstep: 3320.46 | bwd_inner_microstep: 3319.32 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.74 [2025-06-19 17:30:12,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.97 | bwd: 3320.49 | bwd_inner: 3319.32 | bwd_allreduce: 1.09 | step: 7.73 25%|██▌ | 2549/10000 [4:00:33<11:22:40, 5.50s/it] {'loss': 0.0577, 'grad_norm': 0.6709111332893372, 'learning_rate': 3.4924861113187375e-05, 'epoch': 2.55} 25%|██▌ | 2549/10000 [4:00:33<11:22:40, 5.50s/it][2025-06-19 17:30:17,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:30:17,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2141.32 | bwd_microstep: 3367.61 | bwd_inner_microstep: 3366.63 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.33 [2025-06-19 17:30:17,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.32 | bwd: 3367.64 | bwd_inner: 3366.63 | bwd_allreduce: 0.92 | step: 7.33 26%|██▌ | 2550/10000 [4:00:38<11:24:47, 5.52s/it] {'loss': 0.1157, 'grad_norm': 0.9533945322036743, 'learning_rate': 3.4920548422864926e-05, 'epoch': 2.55} 26%|██▌ | 2550/10000 [4:00:38<11:24:47, 5.52s/it][2025-06-19 17:30:23,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 17:30:23,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.86 | bwd_microstep: 3322.42 | bwd_inner_microstep: 3321.52 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.81 [2025-06-19 17:30:23,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.86 | bwd: 3322.45 | bwd_inner: 3321.52 | bwd_allreduce: 0.86 | step: 7.82 26%|██▌ | 2551/10000 [4:00:44<11:23:54, 5.51s/it] {'loss': 0.1503, 'grad_norm': 1.8436359167099, 'learning_rate': 3.491623416744619e-05, 'epoch': 2.55} 26%|██▌ | 2551/10000 [4:00:44<11:23:54, 5.51s/it][2025-06-19 17:30:28,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-19 17:30:28,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.35 | bwd_microstep: 3369.14 | bwd_inner_microstep: 3368.33 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-19 17:30:28,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.35 | bwd: 3369.15 | bwd_inner: 3368.33 | bwd_allreduce: 0.77 | step: 6.78 26%|██▌ | 2552/10000 [4:00:49<11:24:59, 5.52s/it] {'loss': 0.0982, 'grad_norm': 0.7689537405967712, 'learning_rate': 3.4911918347383725e-05, 'epoch': 2.55} 26%|██▌ | 2552/10000 [4:00:49<11:24:59, 5.52s/it][2025-06-19 
17:30:34,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:30:34,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.35 | bwd_microstep: 3313.99 | bwd_inner_microstep: 3313.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 17:30:34,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.35 | bwd: 3314.00 | bwd_inner: 3313.20 | bwd_allreduce: 0.76 | step: 6.72 26%|██▌ | 2553/10000 [4:00:55<11:23:17, 5.51s/it] {'loss': 0.0681, 'grad_norm': 0.7376340627670288, 'learning_rate': 3.4907600963130235e-05, 'epoch': 2.55} 26%|██▌ | 2553/10000 [4:00:55<11:23:17, 5.51s/it][2025-06-19 17:30:39,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:30:39,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.88 | bwd_microstep: 3323.85 | bwd_inner_microstep: 3322.91 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.98 [2025-06-19 17:30:39,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.88 | bwd: 3323.87 | bwd_inner: 3322.91 | bwd_allreduce: 0.91 | step: 6.98 26%|██▌ | 2554/10000 [4:01:00<11:21:52, 5.49s/it] {'loss': 0.0467, 'grad_norm': 0.33172205090522766, 'learning_rate': 3.490328201513859e-05, 'epoch': 2.55} 26%|██▌ | 2554/10000 [4:01:00<11:21:52, 5.49s/it][2025-06-19 17:30:45,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:30:45,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.48 | bwd_microstep: 3322.16 | bwd_inner_microstep: 3321.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-19 17:30:45,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.48 | bwd: 3322.17 | bwd_inner: 3321.35 | bwd_allreduce: 0.77 | step: 
6.93 26%|██▌ | 2555/10000 [4:01:06<11:21:00, 5.49s/it] {'loss': 0.0507, 'grad_norm': 0.9847705364227295, 'learning_rate': 3.489896150386183e-05, 'epoch': 2.56} 26%|██▌ | 2555/10000 [4:01:06<11:21:00, 5.49s/it][2025-06-19 17:30:50,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:30:50,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.66 | bwd_microstep: 3374.20 | bwd_inner_microstep: 3373.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-19 17:30:50,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.66 | bwd: 3374.22 | bwd_inner: 3373.40 | bwd_allreduce: 0.77 | step: 6.87 26%|██▌ | 2556/10000 [4:01:11<11:22:55, 5.50s/it] {'loss': 0.1149, 'grad_norm': 1.145704984664917, 'learning_rate': 3.489463942975316e-05, 'epoch': 2.56} 26%|██▌ | 2556/10000 [4:01:11<11:22:55, 5.50s/it][2025-06-19 17:30:56,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:30:56,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.04 | bwd_microstep: 3326.86 | bwd_inner_microstep: 3326.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 17:30:56,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.04 | bwd: 3326.87 | bwd_inner: 3326.07 | bwd_allreduce: 0.75 | step: 6.66 26%|██▌ | 2557/10000 [4:01:17<11:21:26, 5.49s/it] {'loss': 0.0808, 'grad_norm': 1.1126677989959717, 'learning_rate': 3.489031579326593e-05, 'epoch': 2.56} 26%|██▌ | 2557/10000 [4:01:17<11:21:26, 5.49s/it][2025-06-19 17:31:01,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:31:01,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.42 | bwd_microstep: 3366.95 | bwd_inner_microstep: 
3366.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 17:31:01,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.42 | bwd: 3366.97 | bwd_inner: 3366.16 | bwd_allreduce: 0.76 | step: 6.65 26%|██▌ | 2558/10000 [4:01:22<11:22:42, 5.50s/it] {'loss': 0.0817, 'grad_norm': 1.861857295036316, 'learning_rate': 3.488599059485369e-05, 'epoch': 2.56} 26%|██▌ | 2558/10000 [4:01:22<11:22:42, 5.50s/it][2025-06-19 17:31:07,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:31:07,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.68 | bwd_microstep: 3376.02 | bwd_inner_microstep: 3375.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.10 [2025-06-19 17:31:07,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.68 | bwd: 3376.03 | bwd_inner: 3375.23 | bwd_allreduce: 0.76 | step: 7.11 26%|██▌ | 2559/10000 [4:01:28<11:24:17, 5.52s/it] {'loss': 0.0668, 'grad_norm': 0.5350480675697327, 'learning_rate': 3.488166383497013e-05, 'epoch': 2.56} 26%|██▌ | 2559/10000 [4:01:28<11:24:17, 5.52s/it][2025-06-19 17:31:12,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:31:12,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.27 | bwd_microstep: 3382.99 | bwd_inner_microstep: 3381.89 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.53 [2025-06-19 17:31:12,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.27 | bwd: 3383.01 | bwd_inner: 3381.89 | bwd_allreduce: 1.06 | step: 7.54 26%|██▌ | 2560/10000 [4:01:33<11:25:49, 5.53s/it] {'loss': 0.0544, 'grad_norm': 0.4662291407585144, 'learning_rate': 3.4877335514069095e-05, 'epoch': 2.56} 26%|██▌ | 2560/10000 [4:01:33<11:25:49, 5.53s/it][2025-06-19 17:31:18,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:31:18,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.23 | bwd_microstep: 3377.27 | bwd_inner_microstep: 3376.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 17:31:18,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.23 | bwd: 3377.29 | bwd_inner: 3376.47 | bwd_allreduce: 0.77 | step: 6.90 26%|██▌ | 2561/10000 [4:01:39<11:26:33, 5.54s/it] {'loss': 0.0872, 'grad_norm': 1.0910297632217407, 'learning_rate': 3.487300563260461e-05, 'epoch': 2.56} 26%|██▌ | 2561/10000 [4:01:39<11:26:33, 5.54s/it][2025-06-19 17:31:23,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:31:23,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.99 | bwd_microstep: 3330.58 | bwd_inner_microstep: 3329.76 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.95 [2025-06-19 17:31:23,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.99 | bwd: 3330.59 | bwd_inner: 3329.76 | bwd_allreduce: 0.78 | step: 6.95 26%|██▌ | 2562/10000 [4:01:44<11:24:18, 5.52s/it] {'loss': 0.0991, 'grad_norm': 1.1774455308914185, 'learning_rate': 3.486867419103086e-05, 'epoch': 2.56} 26%|██▌ | 2562/10000 [4:01:44<11:24:18, 5.52s/it][2025-06-19 17:31:29,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:31:29,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.38 | bwd_microstep: 3328.44 | bwd_inner_microstep: 3327.62 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.01 [2025-06-19 17:31:29,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.38 | bwd: 3328.46 | bwd_inner: 3327.62 | bwd_allreduce: 0.79 | step: 7.01 26%|██▌ | 2563/10000 [4:01:50<11:22:40, 5.51s/it] {'loss': 0.0631, 
'grad_norm': 0.6661113500595093, 'learning_rate': 3.4864341189802204e-05, 'epoch': 2.56} 26%|██▌ | 2563/10000 [4:01:50<11:22:40, 5.51s/it][2025-06-19 17:31:34,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:31:34,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.56 | bwd_microstep: 3377.69 | bwd_inner_microstep: 3376.88 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 17:31:34,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.56 | bwd: 3377.71 | bwd_inner: 3376.88 | bwd_allreduce: 0.78 | step: 6.83 26%|██▌ | 2564/10000 [4:01:55<11:24:11, 5.52s/it] {'loss': 0.0883, 'grad_norm': 0.8152656555175781, 'learning_rate': 3.486000662937314e-05, 'epoch': 2.56} 26%|██▌ | 2564/10000 [4:01:55<11:24:11, 5.52s/it][2025-06-19 17:31:40,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:31:40,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.10 | bwd_microstep: 3329.88 | bwd_inner_microstep: 3329.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 17:31:40,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.10 | bwd: 3329.89 | bwd_inner: 3329.08 | bwd_allreduce: 0.77 | step: 7.08 26%|██▌ | 2565/10000 [4:02:01<11:22:24, 5.51s/it] {'loss': 0.0439, 'grad_norm': 0.45261842012405396, 'learning_rate': 3.4855670510198346e-05, 'epoch': 2.56} 26%|██▌ | 2565/10000 [4:02:01<11:22:24, 5.51s/it][2025-06-19 17:31:45,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:31:45,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.12 | bwd_microstep: 3329.79 | bwd_inner_microstep: 3329.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 
[2025-06-19 17:31:45,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.12 | bwd: 3329.81 | bwd_inner: 3329.00 | bwd_allreduce: 0.76 | step: 6.70 26%|██▌ | 2566/10000 [4:02:06<11:21:33, 5.50s/it] {'loss': 0.0547, 'grad_norm': 0.9271538257598877, 'learning_rate': 3.485133283273267e-05, 'epoch': 2.57} 26%|██▌ | 2566/10000 [4:02:06<11:21:33, 5.50s/it][2025-06-19 17:31:51,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 17:31:51,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.32 | bwd_microstep: 3384.07 | bwd_inner_microstep: 3383.02 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.73 [2025-06-19 17:31:51,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.32 | bwd: 3384.09 | bwd_inner: 3383.02 | bwd_allreduce: 1.01 | step: 7.75 26%|██▌ | 2567/10000 [4:02:12<11:23:27, 5.52s/it] {'loss': 0.2336, 'grad_norm': 1.6735960245132446, 'learning_rate': 3.48469935974311e-05, 'epoch': 2.57} 26%|██▌ | 2567/10000 [4:02:12<11:23:27, 5.52s/it][2025-06-19 17:31:57,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:31:57,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.89 | bwd_microstep: 3375.39 | bwd_inner_microstep: 3374.55 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.25 [2025-06-19 17:31:57,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.89 | bwd: 3375.40 | bwd_inner: 3374.55 | bwd_allreduce: 0.81 | step: 7.26 26%|██▌ | 2568/10000 [4:02:17<11:24:44, 5.53s/it] {'loss': 0.1692, 'grad_norm': 1.5244300365447998, 'learning_rate': 3.484265280474881e-05, 'epoch': 2.57} 26%|██▌ | 2568/10000 [4:02:17<11:24:44, 5.53s/it][2025-06-19 17:32:02,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 
2.73 [2025-06-19 17:32:02,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.04 | bwd_microstep: 3328.92 | bwd_inner_microstep: 3328.11 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.78 [2025-06-19 17:32:02,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.04 | bwd: 3328.93 | bwd_inner: 3328.11 | bwd_allreduce: 0.78 | step: 6.78 26%|██▌ | 2569/10000 [4:02:23<11:22:55, 5.51s/it] {'loss': 0.0771, 'grad_norm': 0.9003276228904724, 'learning_rate': 3.483831045514113e-05, 'epoch': 2.57} 26%|██▌ | 2569/10000 [4:02:23<11:22:55, 5.51s/it][2025-06-19 17:32:07,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:32:07,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.15 | bwd_microstep: 3325.75 | bwd_inner_microstep: 3324.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 17:32:07,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.15 | bwd: 3325.76 | bwd_inner: 3324.96 | bwd_allreduce: 0.76 | step: 6.67 26%|██▌ | 2570/10000 [4:02:28<11:21:08, 5.50s/it] {'loss': 0.0499, 'grad_norm': 0.44966503977775574, 'learning_rate': 3.4833966549063554e-05, 'epoch': 2.57} 26%|██▌ | 2570/10000 [4:02:28<11:21:08, 5.50s/it][2025-06-19 17:32:13,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:32:13,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.43 | bwd_microstep: 3377.31 | bwd_inner_microstep: 3376.49 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.19 [2025-06-19 17:32:13,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.43 | bwd: 3377.32 | bwd_inner: 3376.49 | bwd_allreduce: 0.79 | step: 7.20 26%|██▌ | 2571/10000 [4:02:34<11:22:40, 5.51s/it] {'loss': 0.1008, 'grad_norm': 0.9442005753517151, 'learning_rate': 
3.4829621086971724e-05, 'epoch': 2.57} 26%|██▌ | 2571/10000 [4:02:34<11:22:40, 5.51s/it][2025-06-19 17:32:19,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:32:19,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.50 | bwd_microstep: 3372.77 | bwd_inner_microstep: 3371.96 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 17:32:19,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.50 | bwd: 3372.78 | bwd_inner: 3371.96 | bwd_allreduce: 0.78 | step: 7.23 26%|██▌ | 2572/10000 [4:02:39<11:23:56, 5.52s/it] {'loss': 0.0578, 'grad_norm': 0.6926882266998291, 'learning_rate': 3.482527406932147e-05, 'epoch': 2.57} 26%|██▌ | 2572/10000 [4:02:39<11:23:56, 5.52s/it][2025-06-19 17:32:24,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:32:24,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.36 | bwd_microstep: 3330.33 | bwd_inner_microstep: 3329.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.89 [2025-06-19 17:32:24,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.36 | bwd: 3330.34 | bwd_inner: 3329.55 | bwd_allreduce: 0.75 | step: 6.90 26%|██▌ | 2573/10000 [4:02:45<11:22:49, 5.52s/it] {'loss': 0.0765, 'grad_norm': 0.7938640117645264, 'learning_rate': 3.4820925496568775e-05, 'epoch': 2.57} 26%|██▌ | 2573/10000 [4:02:45<11:22:49, 5.52s/it][2025-06-19 17:32:30,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 17:32:30,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.60 | bwd_microstep: 3381.04 | bwd_inner_microstep: 3380.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 17:32:30,105] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2135.60 | bwd: 3381.05 | bwd_inner: 3380.25 | bwd_allreduce: 0.75 | step: 6.53 26%|██▌ | 2574/10000 [4:02:50<11:24:16, 5.53s/it] {'loss': 0.098, 'grad_norm': 0.7980673313140869, 'learning_rate': 3.481657536916978e-05, 'epoch': 2.57} 26%|██▌ | 2574/10000 [4:02:50<11:24:16, 5.53s/it][2025-06-19 17:32:35,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:32:35,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.29 | bwd_microstep: 3376.76 | bwd_inner_microstep: 3375.84 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.01 [2025-06-19 17:32:35,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.29 | bwd: 3376.78 | bwd_inner: 3375.84 | bwd_allreduce: 0.89 | step: 7.01 26%|██▌ | 2575/10000 [4:02:56<11:24:44, 5.53s/it] {'loss': 0.1348, 'grad_norm': 1.2024329900741577, 'learning_rate': 3.48122236875808e-05, 'epoch': 2.58} 26%|██▌ | 2575/10000 [4:02:56<11:24:44, 5.53s/it][2025-06-19 17:32:41,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:32:41,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.74 | bwd_microstep: 3384.15 | bwd_inner_microstep: 3383.25 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.75 [2025-06-19 17:32:41,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.74 | bwd: 3384.18 | bwd_inner: 3383.25 | bwd_allreduce: 0.87 | step: 7.75 26%|██▌ | 2576/10000 [4:03:02<11:25:46, 5.54s/it] {'loss': 0.0412, 'grad_norm': 0.4926794171333313, 'learning_rate': 3.48078704522583e-05, 'epoch': 2.58} 26%|██▌ | 2576/10000 [4:03:02<11:25:46, 5.54s/it][2025-06-19 17:32:46,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 17:32:46,706] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2117.72 | bwd_microstep: 3331.43 | bwd_inner_microstep: 3330.38 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.71 [2025-06-19 17:32:46,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.72 | bwd: 3331.45 | bwd_inner: 3330.38 | bwd_allreduce: 1.01 | step: 7.72 26%|██▌ | 2577/10000 [4:03:07<11:24:05, 5.53s/it] {'loss': 0.0622, 'grad_norm': 0.7113697528839111, 'learning_rate': 3.480351566365891e-05, 'epoch': 2.58} 26%|██▌ | 2577/10000 [4:03:07<11:24:05, 5.53s/it][2025-06-19 17:32:52,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:32:52,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.12 | bwd_microstep: 3388.15 | bwd_inner_microstep: 3387.36 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 17:32:52,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.12 | bwd: 3388.16 | bwd_inner: 3387.36 | bwd_allreduce: 0.76 | step: 6.69 26%|██▌ | 2578/10000 [4:03:13<11:25:25, 5.54s/it] {'loss': 0.1385, 'grad_norm': 1.6031584739685059, 'learning_rate': 3.4799159322239426e-05, 'epoch': 2.58} 26%|██▌ | 2578/10000 [4:03:13<11:25:25, 5.54s/it][2025-06-19 17:32:57,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:32:57,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.60 | bwd_microstep: 3330.58 | bwd_inner_microstep: 3329.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 17:32:57,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.60 | bwd: 3330.60 | bwd_inner: 3329.79 | bwd_allreduce: 0.76 | step: 6.73 26%|██▌ | 2579/10000 [4:03:18<11:23:10, 5.52s/it] {'loss': 0.0517, 'grad_norm': 0.5930078625679016, 'learning_rate': 3.479480142845683e-05, 'epoch': 2.58} 26%|██▌ | 2579/10000 [4:03:18<11:23:10, 
5.52s/it][2025-06-19 17:33:03,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:33:03,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.99 | bwd_microstep: 3381.21 | bwd_inner_microstep: 3380.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.31 [2025-06-19 17:33:03,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.99 | bwd: 3381.23 | bwd_inner: 3380.40 | bwd_allreduce: 0.78 | step: 7.31 26%|██▌ | 2580/10000 [4:03:24<11:24:12, 5.53s/it] {'loss': 0.1046, 'grad_norm': 1.1964540481567383, 'learning_rate': 3.479044198276822e-05, 'epoch': 2.58} 26%|██▌ | 2580/10000 [4:03:24<11:24:12, 5.53s/it][2025-06-19 17:33:08,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:33:08,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.73 | bwd_microstep: 3402.87 | bwd_inner_microstep: 3401.76 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.48 [2025-06-19 17:33:08,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.73 | bwd: 3402.89 | bwd_inner: 3401.76 | bwd_allreduce: 1.07 | step: 7.49 26%|██▌ | 2581/10000 [4:03:29<11:25:57, 5.55s/it] {'loss': 0.0883, 'grad_norm': 0.9421090483665466, 'learning_rate': 3.478608098563089e-05, 'epoch': 2.58} 26%|██▌ | 2581/10000 [4:03:29<11:25:57, 5.55s/it][2025-06-19 17:33:14,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:33:14,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.60 | bwd_microstep: 3333.94 | bwd_inner_microstep: 3333.05 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.81 [2025-06-19 17:33:14,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.60 | bwd: 3333.95 | bwd_inner: 3333.05 | 
bwd_allreduce: 0.86 | step: 6.82 26%|██▌ | 2582/10000 [4:03:35<11:23:27, 5.53s/it] {'loss': 0.0556, 'grad_norm': 0.7046862244606018, 'learning_rate': 3.478171843750229e-05, 'epoch': 2.58} 26%|██▌ | 2582/10000 [4:03:35<11:23:27, 5.53s/it][2025-06-19 17:33:19,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.78 | optimizer_step: 2.73 [2025-06-19 17:33:19,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.81 | bwd_microstep: 3379.10 | bwd_inner_microstep: 3378.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 17:33:19,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.81 | bwd: 3379.11 | bwd_inner: 3378.29 | bwd_allreduce: 0.78 | step: 7.21 26%|██▌ | 2583/10000 [4:03:40<11:24:06, 5.53s/it] {'loss': 0.0972, 'grad_norm': 1.1417051553726196, 'learning_rate': 3.477735433884003e-05, 'epoch': 2.58} 26%|██▌ | 2583/10000 [4:03:40<11:24:06, 5.53s/it][2025-06-19 17:33:25,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:33:25,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.51 | bwd_microstep: 3314.02 | bwd_inner_microstep: 3313.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 17:33:25,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.51 | bwd: 3314.04 | bwd_inner: 3313.25 | bwd_allreduce: 0.75 | step: 6.54 26%|██▌ | 2584/10000 [4:03:46<11:21:21, 5.51s/it] {'loss': 0.044, 'grad_norm': 0.5743748545646667, 'learning_rate': 3.477298869010189e-05, 'epoch': 2.58} 26%|██▌ | 2584/10000 [4:03:46<11:21:21, 5.51s/it][2025-06-19 17:33:30,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 17:33:30,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2188.56 | bwd_microstep: 3318.22 
| bwd_inner_microstep: 3317.08 | bwd_allreduce_microstep: 1.08 | step_microstep: 8.31 [2025-06-19 17:33:30,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2188.56 | bwd: 3318.24 | bwd_inner: 3317.08 | bwd_allreduce: 1.11 | step: 8.33 26%|██▌ | 2585/10000 [4:03:51<11:22:36, 5.52s/it] {'loss': 0.0441, 'grad_norm': 0.42142122983932495, 'learning_rate': 3.476862149174579e-05, 'epoch': 2.58} 26%|██▌ | 2585/10000 [4:03:51<11:22:36, 5.52s/it][2025-06-19 17:33:36,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:33:36,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.73 | bwd_microstep: 3379.22 | bwd_inner_microstep: 3378.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-19 17:33:36,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.73 | bwd: 3379.23 | bwd_inner: 3378.41 | bwd_allreduce: 0.78 | step: 6.83 26%|██▌ | 2586/10000 [4:03:57<11:23:28, 5.53s/it] {'loss': 0.0628, 'grad_norm': 0.750679612159729, 'learning_rate': 3.476425274422984e-05, 'epoch': 2.59} 26%|██▌ | 2586/10000 [4:03:57<11:23:28, 5.53s/it][2025-06-19 17:33:41,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 17:33:41,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.81 | bwd_microstep: 3325.50 | bwd_inner_microstep: 3324.32 | bwd_allreduce_microstep: 1.10 | step_microstep: 8.28 [2025-06-19 17:33:41,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.81 | bwd: 3325.53 | bwd_inner: 3324.33 | bwd_allreduce: 1.14 | step: 8.29 26%|██▌ | 2587/10000 [4:04:02<11:21:24, 5.52s/it] {'loss': 0.0909, 'grad_norm': 1.019659161567688, 'learning_rate': 3.4759882448012304e-05, 'epoch': 2.59} 26%|██▌ | 2587/10000 [4:04:02<11:21:24, 5.52s/it][2025-06-19 17:33:47,523] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:33:47,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.17 | bwd_microstep: 3375.24 | bwd_inner_microstep: 3374.38 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.79 [2025-06-19 17:33:47,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.17 | bwd: 3375.26 | bwd_inner: 3374.38 | bwd_allreduce: 0.82 | step: 7.80 26%|██▌ | 2588/10000 [4:04:08<11:23:03, 5.53s/it] {'loss': 0.1714, 'grad_norm': 1.4622738361358643, 'learning_rate': 3.47555106035516e-05, 'epoch': 2.59} 26%|██▌ | 2588/10000 [4:04:08<11:23:03, 5.53s/it][2025-06-19 17:33:53,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:33:53,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.92 | bwd_microstep: 3378.55 | bwd_inner_microstep: 3377.58 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.26 [2025-06-19 17:33:53,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.92 | bwd: 3378.57 | bwd_inner: 3377.58 | bwd_allreduce: 0.94 | step: 7.26 26%|██▌ | 2589/10000 [4:04:13<11:23:31, 5.53s/it] {'loss': 0.0648, 'grad_norm': 0.6211598515510559, 'learning_rate': 3.475113721130632e-05, 'epoch': 2.59} 26%|██▌ | 2589/10000 [4:04:13<11:23:31, 5.53s/it][2025-06-19 17:33:58,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:33:58,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.26 | bwd_microstep: 3321.69 | bwd_inner_microstep: 3320.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 17:33:58,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.26 | bwd: 3321.70 | bwd_inner: 3320.89 | bwd_allreduce: 0.76 | step: 6.71 26%|██▌ | 2590/10000 [4:04:19<11:21:31, 5.52s/it] 
{'loss': 0.0855, 'grad_norm': 1.0642553567886353, 'learning_rate': 3.474676227173521e-05, 'epoch': 2.59} 26%|██▌ | 2590/10000 [4:04:19<11:21:31, 5.52s/it][2025-06-19 17:34:04,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:34:04,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.93 | bwd_microstep: 3321.52 | bwd_inner_microstep: 3320.71 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-19 17:34:04,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.93 | bwd: 3321.54 | bwd_inner: 3320.71 | bwd_allreduce: 0.78 | step: 7.30 26%|██▌ | 2591/10000 [4:04:24<11:19:29, 5.50s/it] {'loss': 0.0888, 'grad_norm': 2.2016191482543945, 'learning_rate': 3.4742385785297175e-05, 'epoch': 2.59} 26%|██▌ | 2591/10000 [4:04:24<11:19:29, 5.50s/it][2025-06-19 17:34:09,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 17:34:09,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.30 | bwd_microstep: 3317.52 | bwd_inner_microstep: 3316.47 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.60 [2025-06-19 17:34:09,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.30 | bwd: 3317.53 | bwd_inner: 3316.47 | bwd_allreduce: 0.82 | step: 7.60 26%|██▌ | 2592/10000 [4:04:30<11:18:15, 5.49s/it] {'loss': 0.0479, 'grad_norm': 0.7028258442878723, 'learning_rate': 3.473800775245129e-05, 'epoch': 2.59} 26%|██▌ | 2592/10000 [4:04:30<11:18:15, 5.49s/it][2025-06-19 17:34:14,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:34:14,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.04 | bwd_microstep: 3318.30 | bwd_inner_microstep: 3317.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 
6.81 [2025-06-19 17:34:14,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.04 | bwd: 3318.31 | bwd_inner: 3317.50 | bwd_allreduce: 0.76 | step: 6.82 26%|██▌ | 2593/10000 [4:04:35<11:17:01, 5.48s/it] {'loss': 0.074, 'grad_norm': 1.0743649005889893, 'learning_rate': 3.473362817365679e-05, 'epoch': 2.59} 26%|██▌ | 2593/10000 [4:04:35<11:17:01, 5.48s/it][2025-06-19 17:34:20,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:34:20,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.91 | bwd_microstep: 3331.67 | bwd_inner_microstep: 3330.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 17:34:20,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.91 | bwd: 3331.68 | bwd_inner: 3330.87 | bwd_allreduce: 0.77 | step: 7.21 26%|██▌ | 2594/10000 [4:04:41<11:16:47, 5.48s/it] {'loss': 0.0597, 'grad_norm': 1.217820167541504, 'learning_rate': 3.4729247049373084e-05, 'epoch': 2.59} 26%|██▌ | 2594/10000 [4:04:41<11:16:47, 5.48s/it][2025-06-19 17:34:25,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:34:25,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.31 | bwd_microstep: 3330.15 | bwd_inner_microstep: 3329.22 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.19 [2025-06-19 17:34:25,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.31 | bwd: 3330.17 | bwd_inner: 3329.22 | bwd_allreduce: 0.89 | step: 7.19 26%|██▌ | 2595/10000 [4:04:46<11:16:29, 5.48s/it] {'loss': 0.0487, 'grad_norm': 0.5751772522926331, 'learning_rate': 3.4724864380059724e-05, 'epoch': 2.59} 26%|██▌ | 2595/10000 [4:04:46<11:16:29, 5.48s/it][2025-06-19 17:34:31,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-19 17:34:31,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.20 | bwd_microstep: 3376.84 | bwd_inner_microstep: 3376.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 17:34:31,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.20 | bwd: 3376.86 | bwd_inner: 3376.04 | bwd_allreduce: 0.78 | step: 7.07 26%|██▌ | 2596/10000 [4:04:52<11:18:53, 5.50s/it] {'loss': 0.0294, 'grad_norm': 0.38014134764671326, 'learning_rate': 3.4720480166176425e-05, 'epoch': 2.6} 26%|██▌ | 2596/10000 [4:04:52<11:18:53, 5.50s/it][2025-06-19 17:34:37,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:34:37,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.69 | bwd_microstep: 3373.35 | bwd_inner_microstep: 3372.44 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.05 [2025-06-19 17:34:37,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.69 | bwd: 3373.36 | bwd_inner: 3372.44 | bwd_allreduce: 0.88 | step: 7.05 26%|██▌ | 2597/10000 [4:04:57<11:20:22, 5.51s/it] {'loss': 0.0866, 'grad_norm': 0.7924826741218567, 'learning_rate': 3.4716094408183085e-05, 'epoch': 2.6} 26%|██▌ | 2597/10000 [4:04:57<11:20:22, 5.51s/it][2025-06-19 17:34:42,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:34:42,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.31 | bwd_microstep: 3312.04 | bwd_inner_microstep: 3310.81 | bwd_allreduce_microstep: 1.16 | step_microstep: 7.94 [2025-06-19 17:34:42,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.31 | bwd: 3312.06 | bwd_inner: 3310.81 | bwd_allreduce: 1.19 | step: 7.94 26%|██▌ | 2598/10000 [4:05:03<11:18:25, 5.50s/it] {'loss': 0.0724, 'grad_norm': 0.8261373043060303, 'learning_rate': 
3.471170710653973e-05, 'epoch': 2.6} 26%|██▌ | 2598/10000 [4:05:03<11:18:25, 5.50s/it][2025-06-19 17:34:48,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:34:48,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.87 | bwd_microstep: 3399.89 | bwd_inner_microstep: 3398.95 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.27 [2025-06-19 17:34:48,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.87 | bwd: 3399.91 | bwd_inner: 3398.95 | bwd_allreduce: 0.92 | step: 7.27 26%|██▌ | 2599/10000 [4:05:08<11:21:40, 5.53s/it] {'loss': 0.0651, 'grad_norm': 1.4430012702941895, 'learning_rate': 3.470731826170658e-05, 'epoch': 2.6} 26%|██▌ | 2599/10000 [4:05:08<11:21:40, 5.53s/it][2025-06-19 17:34:53,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:34:53,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.72 | bwd_microstep: 3371.04 | bwd_inner_microstep: 3370.02 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.12 [2025-06-19 17:34:53,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.72 | bwd: 3371.06 | bwd_inner: 3370.02 | bwd_allreduce: 0.99 | step: 7.12 26%|██▌ | 2600/10000 [4:05:14<11:22:19, 5.53s/it] {'loss': 0.0633, 'grad_norm': 0.6961753964424133, 'learning_rate': 3.4702927874144015e-05, 'epoch': 2.6} 26%|██▌ | 2600/10000 [4:05:14<11:22:19, 5.53s/it][2025-06-19 17:34:59,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:34:59,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.01 | bwd_microstep: 3318.43 | bwd_inner_microstep: 3317.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-19 17:34:59,078] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2110.01 | bwd: 3318.45 | bwd_inner: 3317.64 | bwd_allreduce: 0.77 | step: 7.01 26%|██▌ | 2601/10000 [4:05:19<11:19:55, 5.51s/it] {'loss': 0.0471, 'grad_norm': 0.5823310613632202, 'learning_rate': 3.469853594431254e-05, 'epoch': 2.6} 26%|██▌ | 2601/10000 [4:05:19<11:19:55, 5.51s/it][2025-06-19 17:35:04,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:35:04,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.41 | bwd_microstep: 3324.76 | bwd_inner_microstep: 3323.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 17:35:04,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.41 | bwd: 3324.78 | bwd_inner: 3323.98 | bwd_allreduce: 0.76 | step: 6.58 26%|██▌ | 2602/10000 [4:05:25<11:18:01, 5.50s/it] {'loss': 0.0182, 'grad_norm': 0.19554543495178223, 'learning_rate': 3.469414247267287e-05, 'epoch': 2.6} 26%|██▌ | 2602/10000 [4:05:25<11:18:01, 5.50s/it][2025-06-19 17:35:10,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:35:10,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.49 | bwd_microstep: 3315.46 | bwd_inner_microstep: 3314.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-19 17:35:10,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.49 | bwd: 3315.47 | bwd_inner: 3314.68 | bwd_allreduce: 0.75 | step: 6.73 26%|██▌ | 2603/10000 [4:05:30<11:16:27, 5.49s/it] {'loss': 0.0624, 'grad_norm': 0.634865403175354, 'learning_rate': 3.4689747459685854e-05, 'epoch': 2.6} 26%|██▌ | 2603/10000 [4:05:30<11:16:27, 5.49s/it][2025-06-19 17:35:15,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:35:15,477] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2107.20 | bwd_microstep: 3326.31 | bwd_inner_microstep: 3325.43 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.24 [2025-06-19 17:35:15,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.20 | bwd: 3326.33 | bwd_inner: 3325.43 | bwd_allreduce: 0.84 | step: 7.24 26%|██▌ | 2604/10000 [4:05:36<11:16:01, 5.48s/it] {'loss': 0.1625, 'grad_norm': 1.4431259632110596, 'learning_rate': 3.46853509058125e-05, 'epoch': 2.6} 26%|██▌ | 2604/10000 [4:05:36<11:16:01, 5.48s/it][2025-06-19 17:35:20,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:35:20,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.27 | bwd_microstep: 3333.08 | bwd_inner_microstep: 3332.16 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.95 [2025-06-19 17:35:20,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.27 | bwd: 3333.10 | bwd_inner: 3332.16 | bwd_allreduce: 0.89 | step: 6.95 26%|██▌ | 2605/10000 [4:05:41<11:15:52, 5.48s/it] {'loss': 0.0656, 'grad_norm': 1.363511562347412, 'learning_rate': 3.4680952811514e-05, 'epoch': 2.6} 26%|██▌ | 2605/10000 [4:05:41<11:15:52, 5.48s/it][2025-06-19 17:35:26,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:35:26,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.33 | bwd_microstep: 3333.74 | bwd_inner_microstep: 3332.85 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.85 [2025-06-19 17:35:26,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.33 | bwd: 3333.76 | bwd_inner: 3332.85 | bwd_allreduce: 0.86 | step: 6.85 26%|██▌ | 2606/10000 [4:05:47<11:15:34, 5.48s/it] {'loss': 0.0736, 'grad_norm': 1.104615569114685, 'learning_rate': 3.467655317725168e-05, 'epoch': 2.61} 26%|██▌ | 2606/10000 [4:05:47<11:15:34, 
5.48s/it][2025-06-19 17:35:31,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:35:31,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.21 | bwd_microstep: 3366.98 | bwd_inner_microstep: 3366.05 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.63 [2025-06-19 17:35:31,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.21 | bwd: 3366.99 | bwd_inner: 3366.05 | bwd_allreduce: 0.90 | step: 7.65 26%|██▌ | 2607/10000 [4:05:52<11:17:22, 5.50s/it] {'loss': 0.1194, 'grad_norm': 1.4428709745407104, 'learning_rate': 3.467215200348705e-05, 'epoch': 2.61} 26%|██▌ | 2607/10000 [4:05:52<11:17:22, 5.50s/it][2025-06-19 17:35:37,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:35:37,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.12 | bwd_microstep: 3317.95 | bwd_inner_microstep: 3316.88 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.31 [2025-06-19 17:35:37,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.12 | bwd: 3317.97 | bwd_inner: 3316.88 | bwd_allreduce: 1.04 | step: 7.32 26%|██▌ | 2608/10000 [4:05:58<11:16:17, 5.49s/it] {'loss': 0.0359, 'grad_norm': 0.424300879240036, 'learning_rate': 3.466774929068177e-05, 'epoch': 2.61} 26%|██▌ | 2608/10000 [4:05:58<11:16:17, 5.49s/it][2025-06-19 17:35:42,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:35:42,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.30 | bwd_microstep: 3316.58 | bwd_inner_microstep: 3315.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 17:35:42,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.30 | bwd: 3316.60 | bwd_inner: 3315.80 | 
bwd_allreduce: 0.76 | step: 6.67 26%|██▌ | 2609/10000 [4:06:03<11:15:27, 5.48s/it] {'loss': 0.0501, 'grad_norm': 0.7585856914520264, 'learning_rate': 3.466334503929767e-05, 'epoch': 2.61} 26%|██▌ | 2609/10000 [4:06:03<11:15:27, 5.48s/it][2025-06-19 17:35:48,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:35:48,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.28 | bwd_microstep: 3320.76 | bwd_inner_microstep: 3319.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 17:35:48,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.28 | bwd: 3320.77 | bwd_inner: 3319.95 | bwd_allreduce: 0.78 | step: 7.06 26%|██▌ | 2610/10000 [4:06:09<11:14:29, 5.48s/it] {'loss': 0.0927, 'grad_norm': 1.113942265510559, 'learning_rate': 3.465893924979672e-05, 'epoch': 2.61} 26%|██▌ | 2610/10000 [4:06:09<11:14:29, 5.48s/it][2025-06-19 17:35:53,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:35:53,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.66 | bwd_microstep: 3309.35 | bwd_inner_microstep: 3308.29 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.52 [2025-06-19 17:35:53,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.66 | bwd: 3309.37 | bwd_inner: 3308.29 | bwd_allreduce: 1.02 | step: 7.53 26%|██▌ | 2611/10000 [4:06:14<11:13:40, 5.47s/it] {'loss': 0.0588, 'grad_norm': 0.7673001289367676, 'learning_rate': 3.465453192264108e-05, 'epoch': 2.61} 26%|██▌ | 2611/10000 [4:06:14<11:13:40, 5.47s/it][2025-06-19 17:35:59,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:35:59,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.60 | bwd_microstep: 3315.13 
| bwd_inner_microstep: 3314.32 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.83 [2025-06-19 17:35:59,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.60 | bwd: 3315.14 | bwd_inner: 3314.32 | bwd_allreduce: 0.78 | step: 6.84 26%|██▌ | 2612/10000 [4:06:20<11:13:25, 5.47s/it] {'loss': 0.1762, 'grad_norm': 1.5470184087753296, 'learning_rate': 3.4650123058293057e-05, 'epoch': 2.61} 26%|██▌ | 2612/10000 [4:06:20<11:13:25, 5.47s/it][2025-06-19 17:36:04,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:36:04,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.88 | bwd_microstep: 3317.73 | bwd_inner_microstep: 3316.92 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.26 [2025-06-19 17:36:04,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.88 | bwd: 3317.75 | bwd_inner: 3316.92 | bwd_allreduce: 0.78 | step: 7.26 26%|██▌ | 2613/10000 [4:06:25<11:13:12, 5.47s/it] {'loss': 0.0379, 'grad_norm': 0.46087583899497986, 'learning_rate': 3.464571265721512e-05, 'epoch': 2.61} 26%|██▌ | 2613/10000 [4:06:25<11:13:12, 5.47s/it][2025-06-19 17:36:10,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:36:10,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.41 | bwd_microstep: 3377.49 | bwd_inner_microstep: 3376.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 17:36:10,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.41 | bwd: 3377.51 | bwd_inner: 3376.71 | bwd_allreduce: 0.76 | step: 6.66 26%|██▌ | 2614/10000 [4:06:31<11:16:02, 5.49s/it] {'loss': 0.06, 'grad_norm': 0.7730385065078735, 'learning_rate': 3.46413007198699e-05, 'epoch': 2.61} 26%|██▌ | 2614/10000 [4:06:31<11:16:02, 5.49s/it][2025-06-19 17:36:15,771] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:36:15,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.14 | bwd_microstep: 3316.32 | bwd_inner_microstep: 3315.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 17:36:15,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.14 | bwd: 3316.34 | bwd_inner: 3315.51 | bwd_allreduce: 0.78 | step: 7.11 26%|██▌ | 2615/10000 [4:06:36<11:14:55, 5.48s/it] {'loss': 0.1336, 'grad_norm': 1.329726219177246, 'learning_rate': 3.4636887246720184e-05, 'epoch': 2.62} 26%|██▌ | 2615/10000 [4:06:36<11:14:55, 5.48s/it][2025-06-19 17:36:21,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:36:21,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.53 | bwd_microstep: 3318.71 | bwd_inner_microstep: 3317.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 17:36:21,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.53 | bwd: 3318.72 | bwd_inner: 3317.92 | bwd_allreduce: 0.76 | step: 6.77 26%|██▌ | 2616/10000 [4:06:42<11:13:58, 5.48s/it] {'loss': 0.1001, 'grad_norm': 1.36361563205719, 'learning_rate': 3.463247223822893e-05, 'epoch': 2.62} 26%|██▌ | 2616/10000 [4:06:42<11:13:58, 5.48s/it][2025-06-19 17:36:26,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:36:26,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.33 | bwd_microstep: 3313.71 | bwd_inner_microstep: 3312.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.11 [2025-06-19 17:36:26,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.33 | bwd: 3313.72 | bwd_inner: 3312.91 | bwd_allreduce: 0.77 | step: 7.11 26%|██▌ | 2617/10000 [4:06:47<11:13:09, 5.47s/it] 
{'loss': 0.0637, 'grad_norm': 0.9157941937446594, 'learning_rate': 3.462805569485924e-05, 'epoch': 2.62} 26%|██▌ | 2617/10000 [4:06:47<11:13:09, 5.47s/it][2025-06-19 17:36:32,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:36:32,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.60 | bwd_microstep: 3365.48 | bwd_inner_microstep: 3364.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 17:36:32,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.60 | bwd: 3365.50 | bwd_inner: 3364.67 | bwd_allreduce: 0.78 | step: 6.95 26%|██▌ | 2618/10000 [4:06:53<11:15:39, 5.49s/it] {'loss': 0.0912, 'grad_norm': 0.8935137391090393, 'learning_rate': 3.462363761707441e-05, 'epoch': 2.62} 26%|██▌ | 2618/10000 [4:06:53<11:15:39, 5.49s/it][2025-06-19 17:36:37,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:36:37,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.21 | bwd_microstep: 3312.63 | bwd_inner_microstep: 3311.70 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.27 [2025-06-19 17:36:37,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.21 | bwd: 3312.65 | bwd_inner: 3311.70 | bwd_allreduce: 0.90 | step: 7.27 26%|██▌ | 2619/10000 [4:06:58<11:13:59, 5.48s/it] {'loss': 0.0564, 'grad_norm': 0.9001161456108093, 'learning_rate': 3.4619218005337864e-05, 'epoch': 2.62} 26%|██▌ | 2619/10000 [4:06:58<11:13:59, 5.48s/it][2025-06-19 17:36:43,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:36:43,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.13 | bwd_microstep: 3319.44 | bwd_inner_microstep: 3318.58 | bwd_allreduce_microstep: 0.80 | step_microstep: 
7.49 [2025-06-19 17:36:43,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.13 | bwd: 3319.46 | bwd_inner: 3318.58 | bwd_allreduce: 0.83 | step: 7.49 26%|██▌ | 2620/10000 [4:07:03<11:13:31, 5.48s/it] {'loss': 0.1524, 'grad_norm': 1.0964585542678833, 'learning_rate': 3.461479686011319e-05, 'epoch': 2.62} 26%|██▌ | 2620/10000 [4:07:03<11:13:31, 5.48s/it][2025-06-19 17:36:48,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:36:48,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.16 | bwd_microstep: 3322.75 | bwd_inner_microstep: 3321.69 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.37 [2025-06-19 17:36:48,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.16 | bwd: 3322.77 | bwd_inner: 3321.69 | bwd_allreduce: 1.02 | step: 7.37 26%|██▌ | 2621/10000 [4:07:09<11:13:07, 5.47s/it] {'loss': 0.1146, 'grad_norm': 1.3525280952453613, 'learning_rate': 3.4610374181864166e-05, 'epoch': 2.62} 26%|██▌ | 2621/10000 [4:07:09<11:13:07, 5.47s/it][2025-06-19 17:36:54,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:36:54,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.31 | bwd_microstep: 3320.84 | bwd_inner_microstep: 3319.80 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.19 [2025-06-19 17:36:54,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.31 | bwd: 3320.86 | bwd_inner: 3319.80 | bwd_allreduce: 1.01 | step: 7.20 26%|██▌ | 2622/10000 [4:07:14<11:12:39, 5.47s/it] {'loss': 0.0646, 'grad_norm': 0.7146269083023071, 'learning_rate': 3.460594997105469e-05, 'epoch': 2.62} 26%|██▌ | 2622/10000 [4:07:14<11:12:39, 5.47s/it][2025-06-19 17:36:59,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | 
optimizer_step: 2.73 [2025-06-19 17:36:59,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.73 | bwd_microstep: 3311.67 | bwd_inner_microstep: 3310.48 | bwd_allreduce_microstep: 1.12 | step_microstep: 8.15 [2025-06-19 17:36:59,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.73 | bwd: 3311.69 | bwd_inner: 3310.48 | bwd_allreduce: 1.15 | step: 8.15 26%|██▌ | 2623/10000 [4:07:20<11:12:31, 5.47s/it] {'loss': 0.0898, 'grad_norm': 1.4826542139053345, 'learning_rate': 3.4601524228148856e-05, 'epoch': 2.62} 26%|██▌ | 2623/10000 [4:07:20<11:12:31, 5.47s/it][2025-06-19 17:37:05,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:37:05,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.88 | bwd_microstep: 3312.65 | bwd_inner_microstep: 3311.83 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.94 [2025-06-19 17:37:05,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.88 | bwd: 3312.67 | bwd_inner: 3311.83 | bwd_allreduce: 0.80 | step: 6.94 26%|██▌ | 2624/10000 [4:07:25<11:12:11, 5.47s/it] {'loss': 0.0887, 'grad_norm': 1.205973744392395, 'learning_rate': 3.459709695361089e-05, 'epoch': 2.62} 26%|██▌ | 2624/10000 [4:07:25<11:12:11, 5.47s/it][2025-06-19 17:37:10,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:37:10,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.90 | bwd_microstep: 3311.63 | bwd_inner_microstep: 3310.82 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.59 [2025-06-19 17:37:10,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.90 | bwd: 3311.65 | bwd_inner: 3310.82 | bwd_allreduce: 0.78 | step: 7.59 26%|██▋ | 2625/10000 [4:07:31<11:12:04, 5.47s/it] {'loss': 0.0688, 'grad_norm': 0.9146414399147034, 'learning_rate': 
3.459266814790521e-05, 'epoch': 2.62} 26%|██▋ | 2625/10000 [4:07:31<11:12:04, 5.47s/it][2025-06-19 17:37:16,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:37:16,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.48 | bwd_microstep: 3373.15 | bwd_inner_microstep: 3372.10 | bwd_allreduce_microstep: 1.00 | step_microstep: 6.93 [2025-06-19 17:37:16,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.48 | bwd: 3373.17 | bwd_inner: 3372.10 | bwd_allreduce: 1.02 | step: 6.93 26%|██▋ | 2626/10000 [4:07:36<11:14:39, 5.49s/it] {'loss': 0.131, 'grad_norm': 1.487491250038147, 'learning_rate': 3.458823781149637e-05, 'epoch': 2.63} 26%|██▋ | 2626/10000 [4:07:36<11:14:39, 5.49s/it][2025-06-19 17:37:21,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:37:21,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.65 | bwd_microstep: 3312.91 | bwd_inner_microstep: 3311.91 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.55 [2025-06-19 17:37:21,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.65 | bwd: 3312.93 | bwd_inner: 3311.91 | bwd_allreduce: 0.97 | step: 7.55 26%|██▋ | 2627/10000 [4:07:42<11:13:29, 5.48s/it] {'loss': 0.1715, 'grad_norm': 1.2064518928527832, 'learning_rate': 3.458380594484908e-05, 'epoch': 2.63} 26%|██▋ | 2627/10000 [4:07:42<11:13:29, 5.48s/it][2025-06-19 17:37:26,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:37:26,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.29 | bwd_microstep: 3322.10 | bwd_inner_microstep: 3321.15 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.31 [2025-06-19 17:37:26,954] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2113.29 | bwd: 3322.11 | bwd_inner: 3321.15 | bwd_allreduce: 0.91 | step: 7.32 26%|██▋ | 2628/10000 [4:07:47<11:13:18, 5.48s/it] {'loss': 0.0585, 'grad_norm': 0.7165259122848511, 'learning_rate': 3.457937254842823e-05, 'epoch': 2.63} 26%|██▋ | 2628/10000 [4:07:47<11:13:18, 5.48s/it][2025-06-19 17:37:32,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:37:32,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.07 | bwd_microstep: 3317.12 | bwd_inner_microstep: 3316.32 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.21 [2025-06-19 17:37:32,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.07 | bwd: 3317.13 | bwd_inner: 3316.32 | bwd_allreduce: 0.77 | step: 7.22 26%|██▋ | 2629/10000 [4:07:53<11:12:27, 5.47s/it] {'loss': 0.0725, 'grad_norm': 0.7253003120422363, 'learning_rate': 3.4574937622698874e-05, 'epoch': 2.63} 26%|██▋ | 2629/10000 [4:07:53<11:12:27, 5.47s/it][2025-06-19 17:37:37,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:37:37,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.55 | bwd_microstep: 3370.88 | bwd_inner_microstep: 3370.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 17:37:37,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.55 | bwd: 3370.89 | bwd_inner: 3370.08 | bwd_allreduce: 0.77 | step: 6.71 26%|██▋ | 2630/10000 [4:07:58<11:14:53, 5.49s/it] {'loss': 0.179, 'grad_norm': 1.4470140933990479, 'learning_rate': 3.45705011681262e-05, 'epoch': 2.63} 26%|██▋ | 2630/10000 [4:07:58<11:14:53, 5.49s/it][2025-06-19 17:37:43,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:37:43,497] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2130.61 | bwd_microstep: 3369.54 | bwd_inner_microstep: 3368.61 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.16 [2025-06-19 17:37:43,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.61 | bwd: 3369.55 | bwd_inner: 3368.61 | bwd_allreduce: 0.89 | step: 7.17 26%|██▋ | 2631/10000 [4:08:04<11:16:27, 5.51s/it] {'loss': 0.067, 'grad_norm': 0.6631638407707214, 'learning_rate': 3.4566063185175585e-05, 'epoch': 2.63} 26%|██▋ | 2631/10000 [4:08:04<11:16:27, 5.51s/it][2025-06-19 17:37:48,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:37:48,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.83 | bwd_microstep: 3323.59 | bwd_inner_microstep: 3322.44 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.74 [2025-06-19 17:37:48,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.83 | bwd: 3323.61 | bwd_inner: 3322.44 | bwd_allreduce: 1.11 | step: 7.75 26%|██▋ | 2632/10000 [4:08:09<11:15:17, 5.50s/it] {'loss': 0.0522, 'grad_norm': 0.5637523531913757, 'learning_rate': 3.4561623674312535e-05, 'epoch': 2.63} 26%|██▋ | 2632/10000 [4:08:09<11:15:17, 5.50s/it][2025-06-19 17:37:54,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:37:54,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.51 | bwd_microstep: 3373.94 | bwd_inner_microstep: 3373.00 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.37 [2025-06-19 17:37:54,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.51 | bwd: 3373.95 | bwd_inner: 3373.00 | bwd_allreduce: 0.91 | step: 7.37 26%|██▋ | 2633/10000 [4:08:15<11:16:56, 5.51s/it] {'loss': 0.065, 'grad_norm': 0.8103379607200623, 'learning_rate': 3.455718263600275e-05, 'epoch': 2.63} 26%|██▋ | 2633/10000 [4:08:15<11:16:56, 
5.51s/it][2025-06-19 17:37:59,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:37:59,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.28 | bwd_microstep: 3320.84 | bwd_inner_microstep: 3320.01 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.03 [2025-06-19 17:37:59,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.28 | bwd: 3320.86 | bwd_inner: 3320.01 | bwd_allreduce: 0.81 | step: 7.03 26%|██▋ | 2634/10000 [4:08:20<11:15:21, 5.50s/it] {'loss': 0.048, 'grad_norm': 0.4943788945674896, 'learning_rate': 3.4552740070712074e-05, 'epoch': 2.63} 26%|██▋ | 2634/10000 [4:08:20<11:15:21, 5.50s/it][2025-06-19 17:38:05,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 17:38:05,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.61 | bwd_microstep: 3320.80 | bwd_inner_microstep: 3319.81 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.01 [2025-06-19 17:38:05,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.61 | bwd: 3320.81 | bwd_inner: 3319.81 | bwd_allreduce: 0.95 | step: 7.02 26%|██▋ | 2635/10000 [4:08:26<11:14:21, 5.49s/it] {'loss': 0.1537, 'grad_norm': 1.4371272325515747, 'learning_rate': 3.454829597890649e-05, 'epoch': 2.63} 26%|██▋ | 2635/10000 [4:08:26<11:14:21, 5.49s/it][2025-06-19 17:38:10,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:38:10,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.36 | bwd_microstep: 3317.52 | bwd_inner_microstep: 3316.67 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.39 [2025-06-19 17:38:10,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.36 | bwd: 3317.54 | bwd_inner: 3316.67 | 
bwd_allreduce: 0.82 | step: 7.40 26%|██▋ | 2636/10000 [4:08:31<11:13:23, 5.49s/it] {'loss': 0.0897, 'grad_norm': 0.8271963596343994, 'learning_rate': 3.454385036105219e-05, 'epoch': 2.64} 26%|██▋ | 2636/10000 [4:08:31<11:13:23, 5.49s/it][2025-06-19 17:38:16,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:38:16,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.82 | bwd_microstep: 3318.80 | bwd_inner_microstep: 3317.73 | bwd_allreduce_microstep: 1.02 | step_microstep: 6.94 [2025-06-19 17:38:16,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.82 | bwd: 3318.82 | bwd_inner: 3317.73 | bwd_allreduce: 1.04 | step: 6.94 26%|██▋ | 2637/10000 [4:08:37<11:12:36, 5.48s/it] {'loss': 0.0871, 'grad_norm': 1.1669297218322754, 'learning_rate': 3.453940321761549e-05, 'epoch': 2.64} 26%|██▋ | 2637/10000 [4:08:37<11:12:36, 5.48s/it][2025-06-19 17:38:21,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.87 [2025-06-19 17:38:21,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.64 | bwd_microstep: 3363.12 | bwd_inner_microstep: 3362.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.33 [2025-06-19 17:38:21,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.64 | bwd: 3363.13 | bwd_inner: 3362.31 | bwd_allreduce: 0.78 | step: 7.33 26%|██▋ | 2638/10000 [4:08:42<11:14:13, 5.49s/it] {'loss': 0.0386, 'grad_norm': 0.5423636436462402, 'learning_rate': 3.4534954549062865e-05, 'epoch': 2.64} 26%|██▋ | 2638/10000 [4:08:42<11:14:13, 5.49s/it][2025-06-19 17:38:27,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:38:27,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.36 | bwd_microstep: 
3310.52 | bwd_inner_microstep: 3309.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 17:38:27,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.36 | bwd: 3310.53 | bwd_inner: 3309.73 | bwd_allreduce: 0.76 | step: 6.64 26%|██▋ | 2639/10000 [4:08:48<11:12:43, 5.48s/it] {'loss': 0.0706, 'grad_norm': 0.8861117362976074, 'learning_rate': 3.4530504355860966e-05, 'epoch': 2.64} 26%|██▋ | 2639/10000 [4:08:48<11:12:43, 5.48s/it][2025-06-19 17:38:32,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:38:32,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.17 | bwd_microstep: 3370.60 | bwd_inner_microstep: 3369.80 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 17:38:32,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.17 | bwd: 3370.62 | bwd_inner: 3369.80 | bwd_allreduce: 0.78 | step: 7.10 26%|██▋ | 2640/10000 [4:08:53<11:14:33, 5.50s/it] {'loss': 0.0925, 'grad_norm': 0.8904737830162048, 'learning_rate': 3.45260526384766e-05, 'epoch': 2.64} 26%|██▋ | 2640/10000 [4:08:53<11:14:33, 5.50s/it][2025-06-19 17:38:38,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:38:38,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.56 | bwd_microstep: 3307.33 | bwd_inner_microstep: 3306.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-19 17:38:38,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.56 | bwd: 3307.34 | bwd_inner: 3306.54 | bwd_allreduce: 0.76 | step: 6.73 26%|██▋ | 2641/10000 [4:08:59<11:12:44, 5.49s/it] {'loss': 0.0518, 'grad_norm': 0.6759771108627319, 'learning_rate': 3.4521599397376734e-05, 'epoch': 2.64} 26%|██▋ | 2641/10000 [4:08:59<11:12:44, 5.49s/it][2025-06-19 17:38:43,847] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:38:43,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.60 | bwd_microstep: 3321.16 | bwd_inner_microstep: 3320.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 17:38:43,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.60 | bwd: 3321.18 | bwd_inner: 3320.36 | bwd_allreduce: 0.77 | step: 6.94 26%|██▋ | 2642/10000 [4:09:04<11:11:55, 5.48s/it] {'loss': 0.0385, 'grad_norm': 0.30662280321121216, 'learning_rate': 3.451714463302848e-05, 'epoch': 2.64} 26%|██▋ | 2642/10000 [4:09:04<11:11:55, 5.48s/it][2025-06-19 17:38:49,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:38:49,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.63 | bwd_microstep: 3321.54 | bwd_inner_microstep: 3320.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 17:38:49,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.63 | bwd: 3321.55 | bwd_inner: 3320.73 | bwd_allreduce: 0.78 | step: 7.09 26%|██▋ | 2643/10000 [4:09:10<11:11:04, 5.47s/it] {'loss': 0.034, 'grad_norm': 0.5793211460113525, 'learning_rate': 3.451268834589914e-05, 'epoch': 2.64} 26%|██▋ | 2643/10000 [4:09:10<11:11:04, 5.47s/it][2025-06-19 17:38:54,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:38:54,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.38 | bwd_microstep: 3308.01 | bwd_inner_microstep: 3307.08 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.05 [2025-06-19 17:38:54,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.38 | bwd: 3308.02 | bwd_inner: 3307.08 | bwd_allreduce: 0.90 | step: 7.06 26%|██▋ | 2644/10000 [4:09:15<11:10:17, 
5.47s/it] {'loss': 0.0378, 'grad_norm': 0.6358691453933716, 'learning_rate': 3.4508230536456145e-05, 'epoch': 2.64} 26%|██▋ | 2644/10000 [4:09:15<11:10:17, 5.47s/it][2025-06-19 17:39:00,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.88 [2025-06-19 17:39:00,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.76 | bwd_microstep: 3314.83 | bwd_inner_microstep: 3313.83 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.07 [2025-06-19 17:39:00,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.76 | bwd: 3314.85 | bwd_inner: 3313.83 | bwd_allreduce: 0.97 | step: 7.08 26%|██▋ | 2645/10000 [4:09:21<11:09:53, 5.46s/it] {'loss': 0.2039, 'grad_norm': 2.462287187576294, 'learning_rate': 3.4503771205167095e-05, 'epoch': 2.65} 26%|██▋ | 2645/10000 [4:09:21<11:09:53, 5.46s/it][2025-06-19 17:39:05,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.79 [2025-06-19 17:39:05,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.91 | bwd_microstep: 3309.24 | bwd_inner_microstep: 3308.25 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.51 [2025-06-19 17:39:05,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.91 | bwd: 3309.26 | bwd_inner: 3308.25 | bwd_allreduce: 0.96 | step: 7.52 26%|██▋ | 2646/10000 [4:09:26<11:09:35, 5.46s/it] {'loss': 0.074, 'grad_norm': 0.95374995470047, 'learning_rate': 3.4499310352499765e-05, 'epoch': 2.65} 26%|██▋ | 2646/10000 [4:09:26<11:09:35, 5.46s/it][2025-06-19 17:39:11,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:39:11,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.35 | bwd_microstep: 3312.36 | bwd_inner_microstep: 3311.54 | bwd_allreduce_microstep: 0.77 | 
step_microstep: 7.31 [2025-06-19 17:39:11,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.35 | bwd: 3312.38 | bwd_inner: 3311.54 | bwd_allreduce: 0.79 | step: 7.31 26%|██▋ | 2647/10000 [4:09:31<11:09:15, 5.46s/it] {'loss': 0.058, 'grad_norm': 1.252198338508606, 'learning_rate': 3.449484797892207e-05, 'epoch': 2.65} 26%|██▋ | 2647/10000 [4:09:31<11:09:15, 5.46s/it][2025-06-19 17:39:16,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:39:16,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.72 | bwd_microstep: 3362.00 | bwd_inner_microstep: 3361.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 17:39:16,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.72 | bwd: 3362.01 | bwd_inner: 3361.21 | bwd_allreduce: 0.76 | step: 6.67 26%|██▋ | 2648/10000 [4:09:37<11:11:27, 5.48s/it] {'loss': 0.2029, 'grad_norm': 1.3051226139068604, 'learning_rate': 3.44903840849021e-05, 'epoch': 2.65} 26%|██▋ | 2648/10000 [4:09:37<11:11:27, 5.48s/it][2025-06-19 17:39:22,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:39:22,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.89 | bwd_microstep: 3310.80 | bwd_inner_microstep: 3309.84 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.57 [2025-06-19 17:39:22,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.89 | bwd: 3310.82 | bwd_inner: 3309.84 | bwd_allreduce: 0.92 | step: 7.57 26%|██▋ | 2649/10000 [4:09:42<11:10:14, 5.47s/it] {'loss': 0.0765, 'grad_norm': 0.7431747913360596, 'learning_rate': 3.448591867090808e-05, 'epoch': 2.65} 26%|██▋ | 2649/10000 [4:09:42<11:10:14, 5.47s/it][2025-06-19 17:39:27,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-19 17:39:27,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.04 | bwd_microstep: 3321.99 | bwd_inner_microstep: 3321.19 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 17:39:27,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.04 | bwd: 3322.00 | bwd_inner: 3321.19 | bwd_allreduce: 0.77 | step: 6.83 26%|██▋ | 2650/10000 [4:09:48<11:10:02, 5.47s/it] {'loss': 0.215, 'grad_norm': 1.6384199857711792, 'learning_rate': 3.4481451737408437e-05, 'epoch': 2.65} 26%|██▋ | 2650/10000 [4:09:48<11:10:02, 5.47s/it][2025-06-19 17:39:33,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:39:33,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.41 | bwd_microstep: 3313.27 | bwd_inner_microstep: 3312.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 17:39:33,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.41 | bwd: 3313.29 | bwd_inner: 3312.47 | bwd_allreduce: 0.77 | step: 6.82 27%|██▋ | 2651/10000 [4:09:53<11:09:19, 5.46s/it] {'loss': 0.0775, 'grad_norm': 0.8236333727836609, 'learning_rate': 3.447698328487171e-05, 'epoch': 2.65} 27%|██▋ | 2651/10000 [4:09:53<11:09:19, 5.46s/it][2025-06-19 17:39:38,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:39:38,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.52 | bwd_microstep: 3323.21 | bwd_inner_microstep: 3322.37 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.23 [2025-06-19 17:39:38,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.52 | bwd: 3323.22 | bwd_inner: 3322.37 | bwd_allreduce: 0.81 | step: 7.23 27%|██▋ | 2652/10000 [4:09:59<11:09:12, 5.46s/it] {'loss': 0.1328, 'grad_norm': 1.4473464488983154, 'learning_rate': 
3.447251331376662e-05, 'epoch': 2.65} 27%|██▋ | 2652/10000 [4:09:59<11:09:12, 5.46s/it][2025-06-19 17:39:44,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:39:44,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.78 | bwd_microstep: 3387.10 | bwd_inner_microstep: 3386.21 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.80 [2025-06-19 17:39:44,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.78 | bwd: 3387.11 | bwd_inner: 3386.21 | bwd_allreduce: 0.86 | step: 6.81 27%|██▋ | 2653/10000 [4:10:04<11:12:39, 5.49s/it] {'loss': 0.048, 'grad_norm': 0.4637841582298279, 'learning_rate': 3.446804182456206e-05, 'epoch': 2.65} 27%|██▋ | 2653/10000 [4:10:04<11:12:39, 5.49s/it][2025-06-19 17:39:49,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:39:49,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.25 | bwd_microstep: 3313.17 | bwd_inner_microstep: 3312.02 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.95 [2025-06-19 17:39:49,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.25 | bwd: 3313.19 | bwd_inner: 3312.02 | bwd_allreduce: 1.12 | step: 7.96 27%|██▋ | 2654/10000 [4:10:10<11:11:31, 5.48s/it] {'loss': 0.0474, 'grad_norm': 0.7539172768592834, 'learning_rate': 3.446356881772706e-05, 'epoch': 2.65} 27%|██▋ | 2654/10000 [4:10:10<11:11:31, 5.48s/it][2025-06-19 17:39:55,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:39:55,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.43 | bwd_microstep: 3368.27 | bwd_inner_microstep: 3367.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 17:39:55,051] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2120.43 | bwd: 3368.28 | bwd_inner: 3367.48 | bwd_allreduce: 0.76 | step: 6.68 27%|██▋ | 2655/10000 [4:10:15<11:13:11, 5.50s/it] {'loss': 0.0414, 'grad_norm': 0.5956442356109619, 'learning_rate': 3.445909429373082e-05, 'epoch': 2.66} 27%|██▋ | 2655/10000 [4:10:15<11:13:11, 5.50s/it][2025-06-19 17:40:00,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:40:00,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.92 | bwd_microstep: 3315.39 | bwd_inner_microstep: 3314.58 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 17:40:00,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.92 | bwd: 3315.41 | bwd_inner: 3314.58 | bwd_allreduce: 0.78 | step: 7.10 27%|██▋ | 2656/10000 [4:10:21<11:11:38, 5.49s/it] {'loss': 0.0375, 'grad_norm': 0.5114007592201233, 'learning_rate': 3.4454618253042693e-05, 'epoch': 2.66} 27%|██▋ | 2656/10000 [4:10:21<11:11:38, 5.49s/it][2025-06-19 17:40:06,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:40:06,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.91 | bwd_microstep: 3371.05 | bwd_inner_microstep: 3370.08 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.20 [2025-06-19 17:40:06,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.91 | bwd: 3371.07 | bwd_inner: 3370.08 | bwd_allreduce: 0.94 | step: 7.20 27%|██▋ | 2657/10000 [4:10:26<11:13:14, 5.50s/it] {'loss': 0.0707, 'grad_norm': 0.5974304676055908, 'learning_rate': 3.44501406961322e-05, 'epoch': 2.66} 27%|██▋ | 2657/10000 [4:10:26<11:13:14, 5.50s/it][2025-06-19 17:40:11,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:40:11,514] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2108.99 | bwd_microstep: 3321.97 | bwd_inner_microstep: 3321.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 17:40:11,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.99 | bwd: 3321.98 | bwd_inner: 3321.18 | bwd_allreduce: 0.76 | step: 6.66 27%|██▋ | 2658/10000 [4:10:32<11:12:03, 5.49s/it] {'loss': 0.057, 'grad_norm': 0.9707270860671997, 'learning_rate': 3.4445661623469004e-05, 'epoch': 2.66} 27%|██▋ | 2658/10000 [4:10:32<11:12:03, 5.49s/it][2025-06-19 17:40:16,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:40:16,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.04 | bwd_microstep: 3321.03 | bwd_inner_microstep: 3320.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 17:40:16,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.04 | bwd: 3321.04 | bwd_inner: 3320.24 | bwd_allreduce: 0.77 | step: 7.03 27%|██▋ | 2659/10000 [4:10:37<11:10:44, 5.48s/it] {'loss': 0.21, 'grad_norm': 1.9401229619979858, 'learning_rate': 3.444118103552296e-05, 'epoch': 2.66} 27%|██▋ | 2659/10000 [4:10:37<11:10:44, 5.48s/it][2025-06-19 17:40:22,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:40:22,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.41 | bwd_microstep: 3316.33 | bwd_inner_microstep: 3315.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 17:40:22,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.41 | bwd: 3316.34 | bwd_inner: 3315.55 | bwd_allreduce: 0.75 | step: 6.64 27%|██▋ | 2660/10000 [4:10:43<11:09:45, 5.47s/it] {'loss': 0.0505, 'grad_norm': 0.6956577897071838, 'learning_rate': 3.443669893276405e-05, 'epoch': 2.66} 27%|██▋ | 2660/10000 [4:10:43<11:09:45, 
5.47s/it][2025-06-19 17:40:27,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:40:27,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.43 | bwd_microstep: 3318.34 | bwd_inner_microstep: 3317.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 17:40:27,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.43 | bwd: 3318.35 | bwd_inner: 3317.54 | bwd_allreduce: 0.76 | step: 6.71 27%|██▋ | 2661/10000 [4:10:48<11:09:02, 5.47s/it] {'loss': 0.1344, 'grad_norm': 0.8913388252258301, 'learning_rate': 3.443221531566241e-05, 'epoch': 2.66} 27%|██▋ | 2661/10000 [4:10:48<11:09:02, 5.47s/it][2025-06-19 17:40:33,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:40:33,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.59 | bwd_microstep: 3373.19 | bwd_inner_microstep: 3372.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 17:40:33,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.59 | bwd: 3373.20 | bwd_inner: 3372.39 | bwd_allreduce: 0.77 | step: 6.72 27%|██▋ | 2662/10000 [4:10:54<11:11:39, 5.49s/it] {'loss': 0.1072, 'grad_norm': 1.5927729606628418, 'learning_rate': 3.4427730184688376e-05, 'epoch': 2.66} 27%|██▋ | 2662/10000 [4:10:54<11:11:39, 5.49s/it][2025-06-19 17:40:38,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:40:38,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.91 | bwd_microstep: 3320.29 | bwd_inner_microstep: 3319.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 17:40:38,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.91 | bwd: 3320.30 | bwd_inner: 3319.50 | 
bwd_allreduce: 0.76 | step: 6.56 27%|██▋ | 2663/10000 [4:10:59<11:11:02, 5.49s/it] {'loss': 0.0511, 'grad_norm': 0.4839211106300354, 'learning_rate': 3.44232435403124e-05, 'epoch': 2.66} 27%|██▋ | 2663/10000 [4:10:59<11:11:02, 5.49s/it][2025-06-19 17:40:44,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:40:44,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.30 | bwd_microstep: 3368.76 | bwd_inner_microstep: 3367.97 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 17:40:44,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.30 | bwd: 3368.77 | bwd_inner: 3367.97 | bwd_allreduce: 0.76 | step: 6.62 27%|██▋ | 2664/10000 [4:11:05<11:12:30, 5.50s/it] {'loss': 0.0577, 'grad_norm': 1.0092145204544067, 'learning_rate': 3.441875538300513e-05, 'epoch': 2.66} 27%|██▋ | 2664/10000 [4:11:05<11:12:30, 5.50s/it][2025-06-19 17:40:49,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:40:49,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.86 | bwd_microstep: 3321.76 | bwd_inner_microstep: 3320.96 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 17:40:49,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.86 | bwd: 3321.78 | bwd_inner: 3320.96 | bwd_allreduce: 0.77 | step: 7.04 27%|██▋ | 2665/10000 [4:11:10<11:11:02, 5.49s/it] {'loss': 0.0551, 'grad_norm': 0.5841197371482849, 'learning_rate': 3.4414265713237335e-05, 'epoch': 2.67} 27%|██▋ | 2665/10000 [4:11:10<11:11:02, 5.49s/it][2025-06-19 17:40:55,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:40:55,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.36 | bwd_microstep: 3320.22 
| bwd_inner_microstep: 3319.39 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.84 [2025-06-19 17:40:55,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.36 | bwd: 3320.23 | bwd_inner: 3319.39 | bwd_allreduce: 0.79 | step: 6.85 27%|██▋ | 2666/10000 [4:11:16<11:10:09, 5.48s/it] {'loss': 0.0337, 'grad_norm': 0.33794650435447693, 'learning_rate': 3.4409774531479956e-05, 'epoch': 2.67} 27%|██▋ | 2666/10000 [4:11:16<11:10:09, 5.48s/it][2025-06-19 17:41:00,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.89 [2025-06-19 17:41:00,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.52 | bwd_microstep: 3326.92 | bwd_inner_microstep: 3326.12 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-19 17:41:00,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.52 | bwd: 3326.94 | bwd_inner: 3326.12 | bwd_allreduce: 0.77 | step: 6.88 27%|██▋ | 2667/10000 [4:11:21<11:10:01, 5.48s/it] {'loss': 0.0987, 'grad_norm': 1.3207751512527466, 'learning_rate': 3.4405281838204115e-05, 'epoch': 2.67} 27%|██▋ | 2667/10000 [4:11:21<11:10:01, 5.48s/it][2025-06-19 17:41:06,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:41:06,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.79 | bwd_microstep: 3327.92 | bwd_inner_microstep: 3326.91 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.49 [2025-06-19 17:41:06,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.79 | bwd: 3327.94 | bwd_inner: 3326.91 | bwd_allreduce: 0.98 | step: 7.49 27%|██▋ | 2668/10000 [4:11:27<11:10:01, 5.48s/it] {'loss': 0.0719, 'grad_norm': 0.9968460202217102, 'learning_rate': 3.440078763388106e-05, 'epoch': 2.67} 27%|██▋ | 2668/10000 [4:11:27<11:10:01, 5.48s/it][2025-06-19 17:41:11,882] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 17:41:11,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.44 | bwd_microstep: 3368.11 | bwd_inner_microstep: 3367.33 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.52 [2025-06-19 17:41:11,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.44 | bwd: 3368.12 | bwd_inner: 3367.33 | bwd_allreduce: 0.75 | step: 6.52 27%|██▋ | 2669/10000 [4:11:32<11:12:11, 5.50s/it] {'loss': 0.0801, 'grad_norm': 1.5135444402694702, 'learning_rate': 3.4396291918982226e-05, 'epoch': 2.67} 27%|██▋ | 2669/10000 [4:11:32<11:12:11, 5.50s/it][2025-06-19 17:41:17,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:41:17,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.67 | bwd_microstep: 3331.68 | bwd_inner_microstep: 3330.72 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.24 [2025-06-19 17:41:17,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.67 | bwd: 3331.69 | bwd_inner: 3330.72 | bwd_allreduce: 0.93 | step: 7.24 27%|██▋ | 2670/10000 [4:11:38<11:10:59, 5.49s/it] {'loss': 0.0695, 'grad_norm': 1.6350492238998413, 'learning_rate': 3.439179469397918e-05, 'epoch': 2.67} 27%|██▋ | 2670/10000 [4:11:38<11:10:59, 5.49s/it][2025-06-19 17:41:22,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:41:22,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.50 | bwd_microstep: 3322.88 | bwd_inner_microstep: 3322.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 17:41:22,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.50 | bwd: 3322.90 | bwd_inner: 3322.09 | bwd_allreduce: 0.76 | step: 6.64 27%|██▋ | 2671/10000 [4:11:43<11:10:10, 5.49s/it] 
{'loss': 0.0777, 'grad_norm': 1.0831340551376343, 'learning_rate': 3.438729595934366e-05, 'epoch': 2.67} 27%|██▋ | 2671/10000 [4:11:43<11:10:10, 5.49s/it][2025-06-19 17:41:28,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:41:28,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.70 | bwd_microstep: 3329.09 | bwd_inner_microstep: 3328.27 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-19 17:41:28,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.70 | bwd: 3329.11 | bwd_inner: 3328.27 | bwd_allreduce: 0.79 | step: 7.23 27%|██▋ | 2672/10000 [4:11:49<11:09:50, 5.48s/it] {'loss': 0.0893, 'grad_norm': 1.1073458194732666, 'learning_rate': 3.4382795715547574e-05, 'epoch': 2.67} 27%|██▋ | 2672/10000 [4:11:49<11:09:50, 5.48s/it][2025-06-19 17:41:33,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:41:33,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.45 | bwd_microstep: 3320.34 | bwd_inner_microstep: 3319.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 17:41:33,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.45 | bwd: 3320.36 | bwd_inner: 3319.55 | bwd_allreduce: 0.76 | step: 6.72 27%|██▋ | 2673/10000 [4:11:54<11:09:08, 5.48s/it] {'loss': 0.0447, 'grad_norm': 0.8326358199119568, 'learning_rate': 3.437829396306297e-05, 'epoch': 2.67} 27%|██▋ | 2673/10000 [4:11:54<11:09:08, 5.48s/it][2025-06-19 17:41:39,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:41:39,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.23 | bwd_microstep: 3313.73 | bwd_inner_microstep: 3312.90 | bwd_allreduce_microstep: 0.78 | step_microstep: 
6.88 [2025-06-19 17:41:39,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.23 | bwd: 3313.74 | bwd_inner: 3312.90 | bwd_allreduce: 0.80 | step: 6.89 27%|██▋ | 2674/10000 [4:12:00<11:08:19, 5.47s/it] {'loss': 0.0269, 'grad_norm': 0.5567437410354614, 'learning_rate': 3.437379070236206e-05, 'epoch': 2.67} 27%|██▋ | 2674/10000 [4:12:00<11:08:19, 5.47s/it][2025-06-19 17:41:44,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:41:44,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.07 | bwd_microstep: 3315.15 | bwd_inner_microstep: 3314.36 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 17:41:44,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.07 | bwd: 3315.16 | bwd_inner: 3314.36 | bwd_allreduce: 0.76 | step: 6.67 27%|██▋ | 2675/10000 [4:12:05<11:07:40, 5.47s/it] {'loss': 0.0436, 'grad_norm': 0.7477120161056519, 'learning_rate': 3.436928593391722e-05, 'epoch': 2.67} 27%|██▋ | 2675/10000 [4:12:05<11:07:40, 5.47s/it][2025-06-19 17:41:50,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:41:50,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.19 | bwd_microstep: 3374.39 | bwd_inner_microstep: 3373.59 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 17:41:50,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.19 | bwd: 3374.41 | bwd_inner: 3373.59 | bwd_allreduce: 0.78 | step: 7.24 27%|██▋ | 2676/10000 [4:12:11<11:10:21, 5.49s/it] {'loss': 0.0877, 'grad_norm': 1.2348458766937256, 'learning_rate': 3.436477965820097e-05, 'epoch': 2.68} 27%|██▋ | 2676/10000 [4:12:11<11:10:21, 5.49s/it][2025-06-19 17:41:55,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | 
optimizer_step: 2.73 [2025-06-19 17:41:55,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.87 | bwd_microstep: 3328.16 | bwd_inner_microstep: 3327.07 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.28 [2025-06-19 17:41:55,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.87 | bwd: 3328.18 | bwd_inner: 3327.07 | bwd_allreduce: 1.05 | step: 7.28 27%|██▋ | 2677/10000 [4:12:16<11:09:36, 5.49s/it] {'loss': 0.0478, 'grad_norm': 0.642593264579773, 'learning_rate': 3.436027187568601e-05, 'epoch': 2.68} 27%|██▋ | 2677/10000 [4:12:16<11:09:36, 5.49s/it][2025-06-19 17:42:01,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:42:01,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.71 | bwd_microstep: 3376.28 | bwd_inner_microstep: 3375.47 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 17:42:01,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.71 | bwd: 3376.30 | bwd_inner: 3375.47 | bwd_allreduce: 0.78 | step: 7.14 27%|██▋ | 2678/10000 [4:12:22<11:12:07, 5.51s/it] {'loss': 0.0781, 'grad_norm': 1.330012321472168, 'learning_rate': 3.435576258684517e-05, 'epoch': 2.68} 27%|██▋ | 2678/10000 [4:12:22<11:12:07, 5.51s/it][2025-06-19 17:42:06,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:42:06,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.06 | bwd_microstep: 3325.08 | bwd_inner_microstep: 3324.08 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.22 [2025-06-19 17:42:06,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.06 | bwd: 3325.09 | bwd_inner: 3324.08 | bwd_allreduce: 0.97 | step: 7.22 27%|██▋ | 2679/10000 [4:12:27<11:10:42, 5.50s/it] {'loss': 0.1568, 'grad_norm': 2.1460111141204834, 'learning_rate': 
3.4351251792151464e-05, 'epoch': 2.68} 27%|██▋ | 2679/10000 [4:12:27<11:10:42, 5.50s/it][2025-06-19 17:42:12,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:42:12,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.92 | bwd_microstep: 3381.52 | bwd_inner_microstep: 3380.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 17:42:12,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.92 | bwd: 3381.54 | bwd_inner: 3380.73 | bwd_allreduce: 0.76 | step: 6.68 27%|██▋ | 2680/10000 [4:12:33<11:12:50, 5.52s/it] {'loss': 0.0817, 'grad_norm': 1.0427401065826416, 'learning_rate': 3.434673949207805e-05, 'epoch': 2.68} 27%|██▋ | 2680/10000 [4:12:33<11:12:50, 5.52s/it][2025-06-19 17:42:17,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:42:17,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.08 | bwd_microstep: 3324.87 | bwd_inner_microstep: 3323.91 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.29 [2025-06-19 17:42:17,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.08 | bwd: 3324.89 | bwd_inner: 3323.91 | bwd_allreduce: 0.93 | step: 7.29 27%|██▋ | 2681/10000 [4:12:38<11:11:19, 5.50s/it] {'loss': 0.0743, 'grad_norm': 1.0139843225479126, 'learning_rate': 3.434222568709825e-05, 'epoch': 2.68} 27%|██▋ | 2681/10000 [4:12:38<11:11:19, 5.50s/it][2025-06-19 17:42:23,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:42:23,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.35 | bwd_microstep: 3329.16 | bwd_inner_microstep: 3328.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-19 17:42:23,257] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2113.35 | bwd: 3329.18 | bwd_inner: 3328.35 | bwd_allreduce: 0.78 | step: 7.29 27%|██▋ | 2682/10000 [4:12:44<11:10:34, 5.50s/it] {'loss': 0.0802, 'grad_norm': 1.1617306470870972, 'learning_rate': 3.433771037768554e-05, 'epoch': 2.68} 27%|██▋ | 2682/10000 [4:12:44<11:10:34, 5.50s/it][2025-06-19 17:42:28,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:42:28,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.93 | bwd_microstep: 3330.10 | bwd_inner_microstep: 3329.16 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.03 [2025-06-19 17:42:28,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.93 | bwd: 3330.11 | bwd_inner: 3329.16 | bwd_allreduce: 0.90 | step: 7.04 27%|██▋ | 2683/10000 [4:12:49<11:09:58, 5.49s/it] {'loss': 0.0888, 'grad_norm': 1.2421629428863525, 'learning_rate': 3.4333193564313556e-05, 'epoch': 2.68} 27%|██▋ | 2683/10000 [4:12:49<11:09:58, 5.49s/it][2025-06-19 17:42:34,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:42:34,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.32 | bwd_microstep: 3327.54 | bwd_inner_microstep: 3326.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 17:42:34,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.32 | bwd: 3327.55 | bwd_inner: 3326.74 | bwd_allreduce: 0.77 | step: 6.68 27%|██▋ | 2684/10000 [4:12:55<11:09:17, 5.49s/it] {'loss': 0.0402, 'grad_norm': 0.5027452111244202, 'learning_rate': 3.432867524745609e-05, 'epoch': 2.68} 27%|██▋ | 2684/10000 [4:12:55<11:09:17, 5.49s/it][2025-06-19 17:42:39,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:42:39,762] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.33 | bwd_microstep: 3373.41 | bwd_inner_microstep: 3372.34 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.15 [2025-06-19 17:42:39,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.33 | bwd: 3373.43 | bwd_inner: 3372.34 | bwd_allreduce: 1.04 | step: 7.16 27%|██▋ | 2685/10000 [4:13:00<11:11:09, 5.50s/it] {'loss': 0.117, 'grad_norm': 1.5419117212295532, 'learning_rate': 3.432415542758709e-05, 'epoch': 2.69} 27%|██▋ | 2685/10000 [4:13:00<11:11:09, 5.50s/it][2025-06-19 17:42:45,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:42:45,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.86 | bwd_microstep: 3325.72 | bwd_inner_microstep: 3324.90 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-19 17:42:45,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.86 | bwd: 3325.74 | bwd_inner: 3324.90 | bwd_allreduce: 0.79 | step: 7.32 27%|██▋ | 2686/10000 [4:13:06<11:10:22, 5.50s/it] {'loss': 0.0775, 'grad_norm': 0.8692333102226257, 'learning_rate': 3.4319634105180664e-05, 'epoch': 2.69} 27%|██▋ | 2686/10000 [4:13:06<11:10:22, 5.50s/it][2025-06-19 17:42:50,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 17:42:50,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.44 | bwd_microstep: 3322.20 | bwd_inner_microstep: 3321.27 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.06 [2025-06-19 17:42:50,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.44 | bwd: 3322.22 | bwd_inner: 3321.27 | bwd_allreduce: 0.90 | step: 7.07 27%|██▋ | 2687/10000 [4:13:11<11:09:28, 5.49s/it] {'loss': 0.0495, 'grad_norm': 0.9392088651657104, 'learning_rate': 3.431511128071109e-05, 'epoch': 2.69} 27%|██▋ | 2687/10000 
[4:13:11<11:09:28, 5.49s/it][2025-06-19 17:42:56,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:42:56,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.07 | bwd_microstep: 3376.81 | bwd_inner_microstep: 3376.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 17:42:56,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.07 | bwd: 3376.82 | bwd_inner: 3376.01 | bwd_allreduce: 0.77 | step: 6.69 27%|██▋ | 2688/10000 [4:13:17<11:11:45, 5.51s/it] {'loss': 0.1027, 'grad_norm': 0.7654172778129578, 'learning_rate': 3.431058695465277e-05, 'epoch': 2.69} 27%|██▋ | 2688/10000 [4:13:17<11:11:45, 5.51s/it][2025-06-19 17:43:01,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:43:01,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.98 | bwd_microstep: 3376.46 | bwd_inner_microstep: 3375.64 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 17:43:01,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.98 | bwd: 3376.47 | bwd_inner: 3375.64 | bwd_allreduce: 0.79 | step: 7.12 27%|██▋ | 2689/10000 [4:13:22<11:12:53, 5.52s/it] {'loss': 0.0436, 'grad_norm': 0.632519006729126, 'learning_rate': 3.43060611274803e-05, 'epoch': 2.69} 27%|██▋ | 2689/10000 [4:13:22<11:12:53, 5.52s/it][2025-06-19 17:43:07,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:43:07,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.30 | bwd_microstep: 3325.36 | bwd_inner_microstep: 3324.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 17:43:07,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.30 | bwd: 3325.37 | bwd_inner: 
3324.57 | bwd_allreduce: 0.76 | step: 6.78 27%|██▋ | 2690/10000 [4:13:28<11:11:10, 5.51s/it] {'loss': 0.0323, 'grad_norm': 0.8272361159324646, 'learning_rate': 3.430153379966841e-05, 'epoch': 2.69} 27%|██▋ | 2690/10000 [4:13:28<11:11:10, 5.51s/it][2025-06-19 17:43:12,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:43:12,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.30 | bwd_microstep: 3320.27 | bwd_inner_microstep: 3319.46 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-19 17:43:12,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.30 | bwd: 3320.29 | bwd_inner: 3319.46 | bwd_allreduce: 0.78 | step: 6.82 27%|██▋ | 2691/10000 [4:13:33<11:09:30, 5.50s/it] {'loss': 0.1458, 'grad_norm': 1.3194891214370728, 'learning_rate': 3.429700497169201e-05, 'epoch': 2.69} 27%|██▋ | 2691/10000 [4:13:33<11:09:30, 5.50s/it][2025-06-19 17:43:18,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:43:18,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.82 | bwd_microstep: 3327.00 | bwd_inner_microstep: 3326.22 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 17:43:18,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.82 | bwd: 3327.01 | bwd_inner: 3326.22 | bwd_allreduce: 0.75 | step: 6.54 27%|██▋ | 2692/10000 [4:13:39<11:08:34, 5.49s/it] {'loss': 0.1277, 'grad_norm': 1.5830693244934082, 'learning_rate': 3.429247464402614e-05, 'epoch': 2.69} 27%|██▋ | 2692/10000 [4:13:39<11:08:34, 5.49s/it][2025-06-19 17:43:23,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:43:23,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.11 | 
bwd_microstep: 3323.88 | bwd_inner_microstep: 3323.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 17:43:23,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.11 | bwd: 3323.90 | bwd_inner: 3323.09 | bwd_allreduce: 0.76 | step: 6.67 27%|██▋ | 2693/10000 [4:13:44<11:07:45, 5.48s/it] {'loss': 0.0476, 'grad_norm': 0.7180774807929993, 'learning_rate': 3.4287942817146e-05, 'epoch': 2.69} 27%|██▋ | 2693/10000 [4:13:44<11:07:45, 5.48s/it][2025-06-19 17:43:29,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:43:29,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.16 | bwd_microstep: 3377.51 | bwd_inner_microstep: 3376.69 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 17:43:29,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.16 | bwd: 3377.52 | bwd_inner: 3376.69 | bwd_allreduce: 0.79 | step: 7.29 27%|██▋ | 2694/10000 [4:13:50<11:10:02, 5.50s/it] {'loss': 0.0415, 'grad_norm': 0.8941817283630371, 'learning_rate': 3.428340949152699e-05, 'epoch': 2.69} 27%|██▋ | 2694/10000 [4:13:50<11:10:02, 5.50s/it][2025-06-19 17:43:34,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 17:43:34,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.38 | bwd_microstep: 3323.24 | bwd_inner_microstep: 3322.46 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.57 [2025-06-19 17:43:34,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.38 | bwd: 3323.25 | bwd_inner: 3322.46 | bwd_allreduce: 0.75 | step: 6.57 27%|██▋ | 2695/10000 [4:13:55<11:08:40, 5.49s/it] {'loss': 0.0779, 'grad_norm': 1.1860601902008057, 'learning_rate': 3.427887466764461e-05, 'epoch': 2.69} 27%|██▋ | 2695/10000 [4:13:55<11:08:40, 5.49s/it][2025-06-19 17:43:40,192] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:43:40,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.08 | bwd_microstep: 3315.14 | bwd_inner_microstep: 3314.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 17:43:40,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.08 | bwd: 3315.15 | bwd_inner: 3314.35 | bwd_allreduce: 0.76 | step: 6.89 27%|██▋ | 2696/10000 [4:14:00<11:07:28, 5.48s/it] {'loss': 0.1204, 'grad_norm': 1.8717583417892456, 'learning_rate': 3.427433834597454e-05, 'epoch': 2.7} 27%|██▋ | 2696/10000 [4:14:00<11:07:28, 5.48s/it][2025-06-19 17:43:45,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 17:43:45,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.27 | bwd_microstep: 3323.53 | bwd_inner_microstep: 3322.44 | bwd_allreduce_microstep: 1.02 | step_microstep: 8.28 [2025-06-19 17:43:45,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.27 | bwd: 3323.55 | bwd_inner: 3322.44 | bwd_allreduce: 1.05 | step: 8.29 27%|██▋ | 2697/10000 [4:14:06<11:07:28, 5.48s/it] {'loss': 0.0646, 'grad_norm': 1.4062551259994507, 'learning_rate': 3.426980052699262e-05, 'epoch': 2.7} 27%|██▋ | 2697/10000 [4:14:06<11:07:28, 5.48s/it][2025-06-19 17:43:51,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:43:51,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.23 | bwd_microstep: 3327.34 | bwd_inner_microstep: 3326.37 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.56 [2025-06-19 17:43:51,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.23 | bwd: 3327.36 | bwd_inner: 3326.37 | bwd_allreduce: 0.94 | step: 7.56 27%|██▋ | 2698/10000 
[4:14:11<11:07:15, 5.48s/it] {'loss': 0.0546, 'grad_norm': 1.7550221681594849, 'learning_rate': 3.426526121117487e-05, 'epoch': 2.7} 27%|██▋ | 2698/10000 [4:14:11<11:07:15, 5.48s/it][2025-06-19 17:43:56,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:43:56,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.98 | bwd_microstep: 3323.20 | bwd_inner_microstep: 3322.26 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.12 [2025-06-19 17:43:56,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.98 | bwd: 3323.22 | bwd_inner: 3322.26 | bwd_allreduce: 0.91 | step: 7.12 27%|██▋ | 2699/10000 [4:14:17<11:06:56, 5.48s/it] {'loss': 0.0566, 'grad_norm': 0.8374243378639221, 'learning_rate': 3.426072039899742e-05, 'epoch': 2.7} 27%|██▋ | 2699/10000 [4:14:17<11:06:56, 5.48s/it][2025-06-19 17:44:02,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:44:02,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.03 | bwd_microstep: 3367.61 | bwd_inner_microstep: 3366.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 17:44:02,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.03 | bwd: 3367.63 | bwd_inner: 3366.81 | bwd_allreduce: 0.77 | step: 6.83 27%|██▋ | 2700/10000 [4:14:22<11:08:54, 5.50s/it] {'loss': 0.0634, 'grad_norm': 0.9214553236961365, 'learning_rate': 3.425617809093659e-05, 'epoch': 2.7} 27%|██▋ | 2700/10000 [4:14:22<11:08:54, 5.50s/it][2025-06-19 17:44:07,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:44:07,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.05 | bwd_microstep: 3327.38 | bwd_inner_microstep: 3326.41 | 
bwd_allreduce_microstep: 0.92 | step_microstep: 7.78 [2025-06-19 17:44:07,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.05 | bwd: 3327.40 | bwd_inner: 3326.41 | bwd_allreduce: 0.94 | step: 7.78 27%|██▋ | 2701/10000 [4:14:28<11:07:57, 5.49s/it] {'loss': 0.0428, 'grad_norm': 0.5820180177688599, 'learning_rate': 3.425163428746884e-05, 'epoch': 2.7} 27%|██▋ | 2701/10000 [4:14:28<11:07:57, 5.49s/it][2025-06-19 17:44:13,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:44:13,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.77 | bwd_microstep: 3314.71 | bwd_inner_microstep: 3313.77 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.13 [2025-06-19 17:44:13,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.77 | bwd: 3314.73 | bwd_inner: 3313.77 | bwd_allreduce: 0.91 | step: 7.14 27%|██▋ | 2702/10000 [4:14:33<11:06:48, 5.48s/it] {'loss': 0.0402, 'grad_norm': 0.4488537907600403, 'learning_rate': 3.42470889890708e-05, 'epoch': 2.7} 27%|██▋ | 2702/10000 [4:14:33<11:06:48, 5.48s/it][2025-06-19 17:44:18,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:44:18,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.06 | bwd_microstep: 3324.32 | bwd_inner_microstep: 3323.40 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.95 [2025-06-19 17:44:18,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.06 | bwd: 3324.34 | bwd_inner: 3323.40 | bwd_allreduce: 0.89 | step: 6.95 27%|██▋ | 2703/10000 [4:14:39<11:06:12, 5.48s/it] {'loss': 0.055, 'grad_norm': 0.9167932868003845, 'learning_rate': 3.424254219621924e-05, 'epoch': 2.7} 27%|██▋ | 2703/10000 [4:14:39<11:06:12, 5.48s/it][2025-06-19 17:44:24,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:44:24,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.28 | bwd_microstep: 3376.43 | bwd_inner_microstep: 3375.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 17:44:24,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.28 | bwd: 3376.45 | bwd_inner: 3375.65 | bwd_allreduce: 0.76 | step: 6.65 27%|██▋ | 2704/10000 [4:14:44<11:08:40, 5.50s/it] {'loss': 0.0425, 'grad_norm': 0.7403500080108643, 'learning_rate': 3.423799390939111e-05, 'epoch': 2.7} 27%|██▋ | 2704/10000 [4:14:44<11:08:40, 5.50s/it][2025-06-19 17:44:29,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:44:29,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.03 | bwd_microstep: 3312.79 | bwd_inner_microstep: 3312.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 17:44:29,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.03 | bwd: 3312.81 | bwd_inner: 3312.00 | bwd_allreduce: 0.76 | step: 6.78 27%|██▋ | 2705/10000 [4:14:50<11:06:54, 5.49s/it] {'loss': 0.0189, 'grad_norm': 0.23155876994132996, 'learning_rate': 3.42334441290635e-05, 'epoch': 2.71} 27%|██▋ | 2705/10000 [4:14:50<11:06:54, 5.49s/it][2025-06-19 17:44:35,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:44:35,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.40 | bwd_microstep: 3366.32 | bwd_inner_microstep: 3365.49 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.87 [2025-06-19 17:44:35,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.40 | bwd: 3366.33 | bwd_inner: 3365.49 | bwd_allreduce: 0.79 | step: 6.87 27%|██▋ | 2706/10000 [4:14:55<11:08:37, 5.50s/it] {'loss': 0.078, 'grad_norm': 
0.8348245620727539, 'learning_rate': 3.422889285571366e-05, 'epoch': 2.71} 27%|██▋ | 2706/10000 [4:14:55<11:08:37, 5.50s/it][2025-06-19 17:44:40,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:44:40,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.55 | bwd_microstep: 3369.94 | bwd_inner_microstep: 3369.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.81 [2025-06-19 17:44:40,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.55 | bwd: 3369.96 | bwd_inner: 3369.13 | bwd_allreduce: 0.78 | step: 7.81 27%|██▋ | 2707/10000 [4:15:01<11:09:59, 5.51s/it] {'loss': 0.025, 'grad_norm': 0.2708546817302704, 'learning_rate': 3.4224340089818994e-05, 'epoch': 2.71} 27%|██▋ | 2707/10000 [4:15:01<11:09:59, 5.51s/it][2025-06-19 17:44:46,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:44:46,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.78 | bwd_microstep: 3321.68 | bwd_inner_microstep: 3320.72 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.10 [2025-06-19 17:44:46,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.78 | bwd: 3321.69 | bwd_inner: 3320.72 | bwd_allreduce: 0.92 | step: 7.10 27%|██▋ | 2708/10000 [4:15:06<11:08:19, 5.50s/it] {'loss': 0.0219, 'grad_norm': 0.3113284111022949, 'learning_rate': 3.421978583185707e-05, 'epoch': 2.71} 27%|██▋ | 2708/10000 [4:15:06<11:08:19, 5.50s/it][2025-06-19 17:44:51,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:44:51,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.23 | bwd_microstep: 3370.32 | bwd_inner_microstep: 3369.36 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.24 [2025-06-19 17:44:51,658] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.23 | bwd: 3370.33 | bwd_inner: 3369.36 | bwd_allreduce: 0.93 | step: 7.24 27%|██▋ | 2709/10000 [4:15:12<11:09:36, 5.51s/it] {'loss': 0.0919, 'grad_norm': 0.9058119058609009, 'learning_rate': 3.4215230082305615e-05, 'epoch': 2.71} 27%|██▋ | 2709/10000 [4:15:12<11:09:36, 5.51s/it][2025-06-19 17:44:57,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:44:57,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.37 | bwd_microstep: 3367.62 | bwd_inner_microstep: 3366.70 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.06 [2025-06-19 17:44:57,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.37 | bwd: 3367.64 | bwd_inner: 3366.70 | bwd_allreduce: 0.88 | step: 7.06 27%|██▋ | 2710/10000 [4:15:17<11:10:39, 5.52s/it] {'loss': 0.1381, 'grad_norm': 1.8775663375854492, 'learning_rate': 3.4210672841642495e-05, 'epoch': 2.71} 27%|██▋ | 2710/10000 [4:15:18<11:10:39, 5.52s/it][2025-06-19 17:45:02,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:45:02,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.18 | bwd_microstep: 3365.54 | bwd_inner_microstep: 3364.67 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.88 [2025-06-19 17:45:02,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.18 | bwd: 3365.56 | bwd_inner: 3364.67 | bwd_allreduce: 0.85 | step: 6.88 27%|██▋ | 2711/10000 [4:15:23<11:10:59, 5.52s/it] {'loss': 0.0369, 'grad_norm': 0.6237083673477173, 'learning_rate': 3.420611411034575e-05, 'epoch': 2.71} 27%|██▋ | 2711/10000 [4:15:23<11:10:59, 5.52s/it][2025-06-19 17:45:08,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 
17:45:08,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.77 | bwd_microstep: 3370.92 | bwd_inner_microstep: 3370.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 17:45:08,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.77 | bwd: 3370.94 | bwd_inner: 3370.11 | bwd_allreduce: 0.78 | step: 7.01 27%|██▋ | 2712/10000 [4:15:29<11:11:21, 5.53s/it] {'loss': 0.0267, 'grad_norm': 0.461090624332428, 'learning_rate': 3.4201553888893566e-05, 'epoch': 2.71} 27%|██▋ | 2712/10000 [4:15:29<11:11:21, 5.53s/it][2025-06-19 17:45:13,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:45:13,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.71 | bwd_microstep: 3368.80 | bwd_inner_microstep: 3367.83 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.01 [2025-06-19 17:45:13,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.71 | bwd: 3368.83 | bwd_inner: 3367.83 | bwd_allreduce: 0.94 | step: 7.01 27%|██▋ | 2713/10000 [4:15:34<11:11:43, 5.53s/it] {'loss': 0.0956, 'grad_norm': 1.030530333518982, 'learning_rate': 3.419699217776429e-05, 'epoch': 2.71} 27%|██▋ | 2713/10000 [4:15:34<11:11:43, 5.53s/it][2025-06-19 17:45:19,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:45:19,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.94 | bwd_microstep: 3310.54 | bwd_inner_microstep: 3309.66 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.08 [2025-06-19 17:45:19,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.94 | bwd: 3310.56 | bwd_inner: 3309.66 | bwd_allreduce: 0.86 | step: 7.08 27%|██▋ | 2714/10000 [4:15:40<11:08:45, 5.51s/it] {'loss': 0.0424, 'grad_norm': 0.8369131684303284, 'learning_rate': 3.419242897743642e-05, 'epoch': 2.71} 
27%|██▋ | 2714/10000 [4:15:40<11:08:45, 5.51s/it][2025-06-19 17:45:24,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:45:24,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.47 | bwd_microstep: 3319.14 | bwd_inner_microstep: 3318.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 17:45:24,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.47 | bwd: 3319.15 | bwd_inner: 3318.35 | bwd_allreduce: 0.76 | step: 6.56 27%|██▋ | 2715/10000 [4:15:45<11:07:13, 5.50s/it] {'loss': 0.2876, 'grad_norm': 2.0805885791778564, 'learning_rate': 3.418786428838862e-05, 'epoch': 2.71} 27%|██▋ | 2715/10000 [4:15:45<11:07:13, 5.50s/it][2025-06-19 17:45:30,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:45:30,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.43 | bwd_microstep: 3384.22 | bwd_inner_microstep: 3383.27 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.03 [2025-06-19 17:45:30,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.43 | bwd: 3384.23 | bwd_inner: 3383.27 | bwd_allreduce: 0.92 | step: 7.03 27%|██▋ | 2716/10000 [4:15:51<11:09:12, 5.51s/it] {'loss': 0.0624, 'grad_norm': 0.9468430876731873, 'learning_rate': 3.418329811109972e-05, 'epoch': 2.72} 27%|██▋ | 2716/10000 [4:15:51<11:09:12, 5.51s/it][2025-06-19 17:45:35,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:45:35,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.40 | bwd_microstep: 3367.04 | bwd_inner_microstep: 3366.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 17:45:35,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.40 | bwd: 
3367.05 | bwd_inner: 3366.22 | bwd_allreduce: 0.78 | step: 7.11 27%|██▋ | 2717/10000 [4:15:56<11:10:11, 5.52s/it] {'loss': 0.036, 'grad_norm': 0.786655604839325, 'learning_rate': 3.417873044604866e-05, 'epoch': 2.72} 27%|██▋ | 2717/10000 [4:15:56<11:10:11, 5.52s/it][2025-06-19 17:45:41,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:45:41,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.75 | bwd_microstep: 3323.82 | bwd_inner_microstep: 3322.98 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.82 [2025-06-19 17:45:41,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.75 | bwd: 3323.83 | bwd_inner: 3322.98 | bwd_allreduce: 0.80 | step: 6.82 27%|██▋ | 2718/10000 [4:16:02<11:08:19, 5.51s/it] {'loss': 0.0341, 'grad_norm': 0.6122233271598816, 'learning_rate': 3.4174161293714586e-05, 'epoch': 2.72} 27%|██▋ | 2718/10000 [4:16:02<11:08:19, 5.51s/it][2025-06-19 17:45:46,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:45:46,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.30 | bwd_microstep: 3319.68 | bwd_inner_microstep: 3318.81 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.77 [2025-06-19 17:45:46,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.30 | bwd: 3319.69 | bwd_inner: 3318.81 | bwd_allreduce: 0.84 | step: 6.78 27%|██▋ | 2719/10000 [4:16:07<11:06:38, 5.49s/it] {'loss': 0.0301, 'grad_norm': 0.5047602653503418, 'learning_rate': 3.4169590654576773e-05, 'epoch': 2.72} 27%|██▋ | 2719/10000 [4:16:07<11:06:38, 5.49s/it][2025-06-19 17:45:52,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:45:52,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2129.05 | bwd_microstep: 3365.16 | bwd_inner_microstep: 3364.14 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.10 [2025-06-19 17:45:52,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.05 | bwd: 3365.18 | bwd_inner: 3364.14 | bwd_allreduce: 0.98 | step: 7.10 27%|██▋ | 2720/10000 [4:16:13<11:08:05, 5.51s/it] {'loss': 0.0541, 'grad_norm': 0.9244663119316101, 'learning_rate': 3.416501852911467e-05, 'epoch': 2.72} 27%|██▋ | 2720/10000 [4:16:13<11:08:05, 5.51s/it][2025-06-19 17:45:57,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:45:57,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.39 | bwd_microstep: 3370.18 | bwd_inner_microstep: 3369.00 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.34 [2025-06-19 17:45:57,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.39 | bwd: 3370.21 | bwd_inner: 3369.00 | bwd_allreduce: 1.15 | step: 7.33 27%|██▋ | 2721/10000 [4:16:18<11:09:14, 5.52s/it] {'loss': 0.102, 'grad_norm': 1.246044635772705, 'learning_rate': 3.4160444917807855e-05, 'epoch': 2.72} 27%|██▋ | 2721/10000 [4:16:18<11:09:14, 5.52s/it][2025-06-19 17:46:03,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:46:03,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.20 | bwd_microstep: 3369.76 | bwd_inner_microstep: 3368.73 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.76 [2025-06-19 17:46:03,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.20 | bwd: 3369.78 | bwd_inner: 3368.73 | bwd_allreduce: 1.00 | step: 7.77 27%|██▋ | 2722/10000 [4:16:24<11:10:43, 5.53s/it] {'loss': 0.0569, 'grad_norm': 0.8801455497741699, 'learning_rate': 3.41558698211361e-05, 'epoch': 2.72} 27%|██▋ | 2722/10000 [4:16:24<11:10:43, 5.53s/it][2025-06-19 17:46:08,856] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 17:46:08,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.10 | bwd_microstep: 3315.03 | bwd_inner_microstep: 3314.07 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.18 [2025-06-19 17:46:08,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.10 | bwd: 3315.06 | bwd_inner: 3314.07 | bwd_allreduce: 0.92 | step: 7.18 27%|██▋ | 2723/10000 [4:16:29<11:08:15, 5.51s/it] {'loss': 0.0431, 'grad_norm': 1.4373582601547241, 'learning_rate': 3.415129323957929e-05, 'epoch': 2.72} 27%|██▋ | 2723/10000 [4:16:29<11:08:15, 5.51s/it][2025-06-19 17:46:14,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:46:14,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.62 | bwd_microstep: 3320.47 | bwd_inner_microstep: 3319.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 17:46:14,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.62 | bwd: 3320.49 | bwd_inner: 3319.68 | bwd_allreduce: 0.76 | step: 6.68 27%|██▋ | 2724/10000 [4:16:35<11:06:25, 5.50s/it] {'loss': 0.1777, 'grad_norm': 1.4733259677886963, 'learning_rate': 3.4146715173617506e-05, 'epoch': 2.72} 27%|██▋ | 2724/10000 [4:16:35<11:06:25, 5.50s/it][2025-06-19 17:46:19,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 17:46:19,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.22 | bwd_microstep: 3318.83 | bwd_inner_microstep: 3317.82 | bwd_allreduce_microstep: 0.96 | step_microstep: 8.01 [2025-06-19 17:46:19,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.22 | bwd: 3318.85 | bwd_inner: 3317.82 | bwd_allreduce: 0.98 | step: 8.02 27%|██▋ | 
2725/10000 [4:16:40<11:05:06, 5.49s/it] {'loss': 0.0284, 'grad_norm': 0.8609981536865234, 'learning_rate': 3.4142135623730954e-05, 'epoch': 2.73} 27%|██▋ | 2725/10000 [4:16:40<11:05:06, 5.49s/it][2025-06-19 17:46:25,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:46:25,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.49 | bwd_microstep: 3359.99 | bwd_inner_microstep: 3359.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 17:46:25,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.49 | bwd: 3360.01 | bwd_inner: 3359.19 | bwd_allreduce: 0.77 | step: 6.81 27%|██▋ | 2726/10000 [4:16:46<11:06:48, 5.50s/it] {'loss': 0.1504, 'grad_norm': 2.0109262466430664, 'learning_rate': 3.4137554590400014e-05, 'epoch': 2.73} 27%|██▋ | 2726/10000 [4:16:46<11:06:48, 5.50s/it][2025-06-19 17:46:30,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.77 [2025-06-19 17:46:30,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.60 | bwd_microstep: 3317.56 | bwd_inner_microstep: 3316.55 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.26 [2025-06-19 17:46:30,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.60 | bwd: 3317.58 | bwd_inner: 3316.55 | bwd_allreduce: 0.97 | step: 7.26 27%|██▋ | 2727/10000 [4:16:51<11:05:24, 5.49s/it] {'loss': 0.0846, 'grad_norm': 1.3307448625564575, 'learning_rate': 3.41329720741052e-05, 'epoch': 2.73} 27%|██▋ | 2727/10000 [4:16:51<11:05:24, 5.49s/it][2025-06-19 17:46:36,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 17:46:36,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.69 | bwd_microstep: 3311.33 | bwd_inner_microstep: 3310.10 | 
bwd_allreduce_microstep: 1.16 | step_microstep: 7.71 [2025-06-19 17:46:36,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.69 | bwd: 3311.36 | bwd_inner: 3310.10 | bwd_allreduce: 1.19 | step: 7.72 27%|██▋ | 2728/10000 [4:16:57<11:04:08, 5.48s/it] {'loss': 0.0551, 'grad_norm': 0.7751511335372925, 'learning_rate': 3.412838807532722e-05, 'epoch': 2.73} 27%|██▋ | 2728/10000 [4:16:57<11:04:08, 5.48s/it][2025-06-19 17:46:41,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 17:46:41,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.85 | bwd_microstep: 3379.48 | bwd_inner_microstep: 3378.42 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.50 [2025-06-19 17:46:41,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.85 | bwd: 3379.50 | bwd_inner: 3378.42 | bwd_allreduce: 1.02 | step: 7.50 27%|██▋ | 2729/10000 [4:17:02<11:06:47, 5.50s/it] {'loss': 0.0397, 'grad_norm': 0.6609041094779968, 'learning_rate': 3.41238025945469e-05, 'epoch': 2.73} 27%|██▋ | 2729/10000 [4:17:02<11:06:47, 5.50s/it][2025-06-19 17:46:47,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 17:46:47,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.58 | bwd_microstep: 3366.87 | bwd_inner_microstep: 3365.77 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.43 [2025-06-19 17:46:47,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.58 | bwd: 3366.89 | bwd_inner: 3365.77 | bwd_allreduce: 1.06 | step: 7.44 27%|██▋ | 2730/10000 [4:17:08<11:07:46, 5.51s/it] {'loss': 0.0892, 'grad_norm': 0.9199958443641663, 'learning_rate': 3.411921563224524e-05, 'epoch': 2.73} 27%|██▋ | 2730/10000 [4:17:08<11:07:46, 5.51s/it][2025-06-19 17:46:52,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:46:52,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.05 | bwd_microstep: 3314.19 | bwd_inner_microstep: 3313.26 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.38 [2025-06-19 17:46:52,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.05 | bwd: 3314.21 | bwd_inner: 3313.26 | bwd_allreduce: 0.90 | step: 7.40 27%|██▋ | 2731/10000 [4:17:13<11:06:08, 5.50s/it] {'loss': 0.079, 'grad_norm': 1.3926771879196167, 'learning_rate': 3.411462718890339e-05, 'epoch': 2.73} 27%|██▋ | 2731/10000 [4:17:13<11:06:08, 5.50s/it][2025-06-19 17:46:58,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:46:58,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.47 | bwd_microstep: 3316.04 | bwd_inner_microstep: 3315.09 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.16 [2025-06-19 17:46:58,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.47 | bwd: 3316.06 | bwd_inner: 3315.09 | bwd_allreduce: 0.92 | step: 7.17 27%|██▋ | 2732/10000 [4:17:19<11:04:40, 5.49s/it] {'loss': 0.0551, 'grad_norm': 0.9398183822631836, 'learning_rate': 3.411003726500265e-05, 'epoch': 2.73} 27%|██▋ | 2732/10000 [4:17:19<11:04:40, 5.49s/it][2025-06-19 17:47:03,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:47:03,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.43 | bwd_microstep: 3315.10 | bwd_inner_microstep: 3314.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 17:47:03,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.43 | bwd: 3315.11 | bwd_inner: 3314.32 | bwd_allreduce: 0.76 | step: 6.65 27%|██▋ | 2733/10000 [4:17:24<11:03:29, 5.48s/it] {'loss': 0.1155, 'grad_norm': 
2.4442138671875, 'learning_rate': 3.410544586102449e-05, 'epoch': 2.73} 27%|██▋ | 2733/10000 [4:17:24<11:03:29, 5.48s/it][2025-06-19 17:47:09,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:47:09,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.43 | bwd_microstep: 3366.76 | bwd_inner_microstep: 3365.92 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.32 [2025-06-19 17:47:09,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.43 | bwd: 3366.77 | bwd_inner: 3365.92 | bwd_allreduce: 0.81 | step: 7.32 27%|██▋ | 2734/10000 [4:17:30<11:05:42, 5.50s/it] {'loss': 0.0989, 'grad_norm': 1.1509180068969727, 'learning_rate': 3.410085297745053e-05, 'epoch': 2.73} 27%|██▋ | 2734/10000 [4:17:30<11:05:42, 5.50s/it][2025-06-19 17:47:14,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:47:14,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.46 | bwd_microstep: 3319.11 | bwd_inner_microstep: 3318.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 17:47:14,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.46 | bwd: 3319.12 | bwd_inner: 3318.31 | bwd_allreduce: 0.78 | step: 6.80 27%|██▋ | 2735/10000 [4:17:35<11:04:36, 5.49s/it] {'loss': 0.0705, 'grad_norm': 1.364245891571045, 'learning_rate': 3.409625861476253e-05, 'epoch': 2.73} 27%|██▋ | 2735/10000 [4:17:35<11:04:36, 5.49s/it][2025-06-19 17:47:20,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:47:20,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.14 | bwd_microstep: 3371.68 | bwd_inner_microstep: 3370.89 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 17:47:20,258] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.14 | bwd: 3371.69 | bwd_inner: 3370.89 | bwd_allreduce: 0.76 | step: 6.52 27%|██▋ | 2736/10000 [4:17:41<11:06:11, 5.50s/it] {'loss': 0.0975, 'grad_norm': 1.1517276763916016, 'learning_rate': 3.409166277344243e-05, 'epoch': 2.74} 27%|██▋ | 2736/10000 [4:17:41<11:06:11, 5.50s/it][2025-06-19 17:47:25,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:47:25,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.72 | bwd_microstep: 3316.99 | bwd_inner_microstep: 3316.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 17:47:25,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.72 | bwd: 3317.01 | bwd_inner: 3316.20 | bwd_allreduce: 0.76 | step: 7.07 27%|██▋ | 2737/10000 [4:17:46<11:04:38, 5.49s/it] {'loss': 0.0743, 'grad_norm': 0.8594849705696106, 'learning_rate': 3.40870654539723e-05, 'epoch': 2.74} 27%|██▋ | 2737/10000 [4:17:46<11:04:38, 5.49s/it][2025-06-19 17:47:31,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:47:31,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.54 | bwd_microstep: 3318.56 | bwd_inner_microstep: 3317.73 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.75 [2025-06-19 17:47:31,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.54 | bwd: 3318.58 | bwd_inner: 3317.73 | bwd_allreduce: 0.80 | step: 6.75 27%|██▋ | 2738/10000 [4:17:51<11:03:15, 5.48s/it] {'loss': 0.0437, 'grad_norm': 0.46550485491752625, 'learning_rate': 3.408246665683439e-05, 'epoch': 2.74} 27%|██▋ | 2738/10000 [4:17:51<11:03:15, 5.48s/it][2025-06-19 17:47:36,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 
17:47:36,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.26 | bwd_microstep: 3323.13 | bwd_inner_microstep: 3322.04 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.27 [2025-06-19 17:47:36,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.26 | bwd: 3323.15 | bwd_inner: 3322.04 | bwd_allreduce: 1.06 | step: 7.27 27%|██▋ | 2739/10000 [4:17:57<11:02:42, 5.48s/it] {'loss': 0.0339, 'grad_norm': 0.6423228979110718, 'learning_rate': 3.407786638251108e-05, 'epoch': 2.74} 27%|██▋ | 2739/10000 [4:17:57<11:02:42, 5.48s/it][2025-06-19 17:47:42,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:47:42,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.23 | bwd_microstep: 3310.48 | bwd_inner_microstep: 3309.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 17:47:42,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.23 | bwd: 3310.49 | bwd_inner: 3309.70 | bwd_allreduce: 0.75 | step: 6.58 27%|██▋ | 2740/10000 [4:18:02<11:02:14, 5.47s/it] {'loss': 0.1335, 'grad_norm': 0.8782171010971069, 'learning_rate': 3.407326463148493e-05, 'epoch': 2.74} 27%|██▋ | 2740/10000 [4:18:02<11:02:14, 5.47s/it][2025-06-19 17:47:47,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:47:47,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.57 | bwd_microstep: 3315.50 | bwd_inner_microstep: 3314.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 17:47:47,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.58 | bwd: 3315.51 | bwd_inner: 3314.72 | bwd_allreduce: 0.75 | step: 6.55 27%|██▋ | 2741/10000 [4:18:08<11:01:51, 5.47s/it] {'loss': 0.1447, 'grad_norm': 1.5669158697128296, 'learning_rate': 3.406866140423863e-05, 'epoch': 2.74} 
27%|██▋ | 2741/10000 [4:18:08<11:01:51, 5.47s/it][2025-06-19 17:47:53,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:47:53,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.52 | bwd_microstep: 3361.79 | bwd_inner_microstep: 3361.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-19 17:47:53,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.52 | bwd: 3361.80 | bwd_inner: 3361.00 | bwd_allreduce: 0.76 | step: 6.79 27%|██▋ | 2742/10000 [4:18:13<11:03:43, 5.49s/it] {'loss': 0.1265, 'grad_norm': 1.167015790939331, 'learning_rate': 3.406405670125505e-05, 'epoch': 2.74} 27%|██▋ | 2742/10000 [4:18:13<11:03:43, 5.49s/it][2025-06-19 17:47:58,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 17:47:58,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.61 | bwd_microstep: 3368.82 | bwd_inner_microstep: 3367.77 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.58 [2025-06-19 17:47:58,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.61 | bwd: 3368.83 | bwd_inner: 3367.77 | bwd_allreduce: 1.02 | step: 7.58 27%|██▋ | 2743/10000 [4:18:19<11:05:33, 5.50s/it] {'loss': 0.174, 'grad_norm': 1.5693811178207397, 'learning_rate': 3.405945052301719e-05, 'epoch': 2.74} 27%|██▋ | 2743/10000 [4:18:19<11:05:33, 5.50s/it][2025-06-19 17:48:04,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:48:04,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.84 | bwd_microstep: 3314.54 | bwd_inner_microstep: 3313.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 17:48:04,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.84 | bwd: 
3314.55 | bwd_inner: 3313.75 | bwd_allreduce: 0.76 | step: 6.59 27%|██▋ | 2744/10000 [4:18:24<11:03:55, 5.49s/it] {'loss': 0.0594, 'grad_norm': 0.8394474387168884, 'learning_rate': 3.405484287000823e-05, 'epoch': 2.74} 27%|██▋ | 2744/10000 [4:18:24<11:03:55, 5.49s/it][2025-06-19 17:48:09,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:48:09,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.48 | bwd_microstep: 3308.81 | bwd_inner_microstep: 3308.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 17:48:09,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.48 | bwd: 3308.82 | bwd_inner: 3308.02 | bwd_allreduce: 0.76 | step: 6.78 27%|██▋ | 2745/10000 [4:18:30<11:02:25, 5.48s/it] {'loss': 0.0764, 'grad_norm': 1.4620256423950195, 'learning_rate': 3.405023374271147e-05, 'epoch': 2.75} 27%|██▋ | 2745/10000 [4:18:30<11:02:25, 5.48s/it][2025-06-19 17:48:15,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:48:15,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2095.60 | bwd_microstep: 3316.01 | bwd_inner_microstep: 3314.95 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.28 [2025-06-19 17:48:15,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2095.60 | bwd: 3316.03 | bwd_inner: 3314.95 | bwd_allreduce: 1.02 | step: 7.28 27%|██▋ | 2746/10000 [4:18:35<11:01:28, 5.47s/it] {'loss': 0.1224, 'grad_norm': 1.6355775594711304, 'learning_rate': 3.404562314161041e-05, 'epoch': 2.75} 27%|██▋ | 2746/10000 [4:18:35<11:01:28, 5.47s/it][2025-06-19 17:48:20,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:48:20,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2105.60 | bwd_microstep: 3314.84 | bwd_inner_microstep: 3314.02 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.40 [2025-06-19 17:48:20,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.60 | bwd: 3314.85 | bwd_inner: 3314.02 | bwd_allreduce: 0.78 | step: 7.40 27%|██▋ | 2747/10000 [4:18:41<11:01:05, 5.47s/it] {'loss': 0.0899, 'grad_norm': 1.1108055114746094, 'learning_rate': 3.4041011067188664e-05, 'epoch': 2.75} 27%|██▋ | 2747/10000 [4:18:41<11:01:05, 5.47s/it][2025-06-19 17:48:25,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:48:25,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.41 | bwd_microstep: 3312.34 | bwd_inner_microstep: 3311.36 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.20 [2025-06-19 17:48:25,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.41 | bwd: 3312.36 | bwd_inner: 3311.36 | bwd_allreduce: 0.96 | step: 7.20 27%|██▋ | 2748/10000 [4:18:46<11:00:32, 5.47s/it] {'loss': 0.0474, 'grad_norm': 0.39906296133995056, 'learning_rate': 3.4036397519930036e-05, 'epoch': 2.75} 27%|██▋ | 2748/10000 [4:18:46<11:00:32, 5.47s/it][2025-06-19 17:48:31,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 17:48:31,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.93 | bwd_microstep: 3317.82 | bwd_inner_microstep: 3316.86 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.27 [2025-06-19 17:48:31,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.93 | bwd: 3317.84 | bwd_inner: 3316.87 | bwd_allreduce: 0.92 | step: 7.28 27%|██▋ | 2749/10000 [4:18:52<11:00:28, 5.47s/it] {'loss': 0.0633, 'grad_norm': 0.572074294090271, 'learning_rate': 3.403178250031844e-05, 'epoch': 2.75} 27%|██▋ | 2749/10000 [4:18:52<11:00:28, 5.47s/it][2025-06-19 17:48:36,852] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:48:36,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.80 | bwd_microstep: 3317.93 | bwd_inner_microstep: 3317.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-19 17:48:36,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.80 | bwd: 3317.94 | bwd_inner: 3317.14 | bwd_allreduce: 0.76 | step: 6.86 28%|██▊ | 2750/10000 [4:18:57<11:00:19, 5.46s/it] {'loss': 0.0494, 'grad_norm': 0.6340100765228271, 'learning_rate': 3.402716600883799e-05, 'epoch': 2.75} 28%|██▊ | 2750/10000 [4:18:57<11:00:19, 5.46s/it][2025-06-19 17:48:42,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:48:42,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.97 | bwd_microstep: 3317.38 | bwd_inner_microstep: 3316.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 17:48:42,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.97 | bwd: 3317.40 | bwd_inner: 3316.60 | bwd_allreduce: 0.76 | step: 6.60 28%|██▊ | 2751/10000 [4:19:03<10:59:46, 5.46s/it] {'loss': 0.1013, 'grad_norm': 1.3284634351730347, 'learning_rate': 3.402254804597292e-05, 'epoch': 2.75} 28%|██▊ | 2751/10000 [4:19:03<10:59:46, 5.46s/it][2025-06-19 17:48:47,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:48:47,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.92 | bwd_microstep: 3367.15 | bwd_inner_microstep: 3366.18 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.17 [2025-06-19 17:48:47,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.92 | bwd: 3367.17 | bwd_inner: 3366.18 | bwd_allreduce: 0.94 | step: 7.17 28%|██▊ | 
2752/10000 [4:19:08<11:02:25, 5.48s/it] {'loss': 0.0549, 'grad_norm': 0.6922467350959778, 'learning_rate': 3.401792861220765e-05, 'epoch': 2.75} 28%|██▊ | 2752/10000 [4:19:08<11:02:25, 5.48s/it][2025-06-19 17:48:53,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 17:48:53,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.17 | bwd_microstep: 3319.40 | bwd_inner_microstep: 3318.50 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.93 [2025-06-19 17:48:53,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.17 | bwd: 3319.41 | bwd_inner: 3318.50 | bwd_allreduce: 0.87 | step: 6.93 28%|██▊ | 2753/10000 [4:19:14<11:01:40, 5.48s/it] {'loss': 0.0584, 'grad_norm': 0.7174385786056519, 'learning_rate': 3.4013307708026724e-05, 'epoch': 2.75} 28%|██▊ | 2753/10000 [4:19:14<11:01:40, 5.48s/it][2025-06-19 17:48:58,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:48:58,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.78 | bwd_microstep: 3314.90 | bwd_inner_microstep: 3313.92 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.56 [2025-06-19 17:48:58,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.78 | bwd: 3314.92 | bwd_inner: 3313.92 | bwd_allreduce: 0.95 | step: 7.56 28%|██▊ | 2754/10000 [4:19:19<11:01:09, 5.47s/it] {'loss': 0.2158, 'grad_norm': 1.6142874956130981, 'learning_rate': 3.400868533391485e-05, 'epoch': 2.75} 28%|██▊ | 2754/10000 [4:19:19<11:01:09, 5.47s/it][2025-06-19 17:49:04,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:49:04,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.98 | bwd_microstep: 3321.48 | bwd_inner_microstep: 3320.43 | 
bwd_allreduce_microstep: 0.99 | step_microstep: 7.46 [2025-06-19 17:49:04,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.98 | bwd: 3321.50 | bwd_inner: 3320.43 | bwd_allreduce: 1.01 | step: 7.46 28%|██▊ | 2755/10000 [4:19:25<11:01:05, 5.47s/it] {'loss': 0.1172, 'grad_norm': 1.2463182210922241, 'learning_rate': 3.400406149035691e-05, 'epoch': 2.75} 28%|██▊ | 2755/10000 [4:19:25<11:01:05, 5.47s/it][2025-06-19 17:49:09,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:49:09,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.03 | bwd_microstep: 3312.35 | bwd_inner_microstep: 3311.38 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.03 [2025-06-19 17:49:09,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.03 | bwd: 3312.36 | bwd_inner: 3311.38 | bwd_allreduce: 0.94 | step: 7.03 28%|██▊ | 2756/10000 [4:19:30<11:00:22, 5.47s/it] {'loss': 0.1088, 'grad_norm': 1.2321388721466064, 'learning_rate': 3.399943617783791e-05, 'epoch': 2.76} 28%|██▊ | 2756/10000 [4:19:30<11:00:22, 5.47s/it][2025-06-19 17:49:15,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:49:15,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.49 | bwd_microstep: 3309.64 | bwd_inner_microstep: 3308.68 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.35 [2025-06-19 17:49:15,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.49 | bwd: 3309.66 | bwd_inner: 3308.68 | bwd_allreduce: 0.93 | step: 7.35 28%|██▊ | 2757/10000 [4:19:35<10:59:51, 5.47s/it] {'loss': 0.1084, 'grad_norm': 1.592449426651001, 'learning_rate': 3.399480939684303e-05, 'epoch': 2.76} 28%|██▊ | 2757/10000 [4:19:35<10:59:51, 5.47s/it][2025-06-19 17:49:20,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:49:20,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.03 | bwd_microstep: 3314.87 | bwd_inner_microstep: 3314.01 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.25 [2025-06-19 17:49:20,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.03 | bwd: 3314.89 | bwd_inner: 3314.01 | bwd_allreduce: 0.83 | step: 7.25 28%|██▊ | 2758/10000 [4:19:41<11:00:01, 5.47s/it] {'loss': 0.1227, 'grad_norm': 1.2124226093292236, 'learning_rate': 3.39901811478576e-05, 'epoch': 2.76} 28%|██▊ | 2758/10000 [4:19:41<11:00:01, 5.47s/it][2025-06-19 17:49:26,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:49:26,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.07 | bwd_microstep: 3392.67 | bwd_inner_microstep: 3391.88 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 17:49:26,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.07 | bwd: 3392.68 | bwd_inner: 3391.88 | bwd_allreduce: 0.76 | step: 6.68 28%|██▊ | 2759/10000 [4:19:47<11:03:32, 5.50s/it] {'loss': 0.0706, 'grad_norm': 0.6857085824012756, 'learning_rate': 3.398555143136709e-05, 'epoch': 2.76} 28%|██▊ | 2759/10000 [4:19:47<11:03:32, 5.50s/it][2025-06-19 17:49:31,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:49:31,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.66 | bwd_microstep: 3316.84 | bwd_inner_microstep: 3315.85 | bwd_allreduce_microstep: 0.94 | step_microstep: 6.90 [2025-06-19 17:49:31,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.66 | bwd: 3316.86 | bwd_inner: 3315.85 | bwd_allreduce: 0.96 | step: 6.90 28%|██▊ | 2760/10000 [4:19:52<11:01:55, 5.49s/it] {'loss': 0.0598, 'grad_norm': 
0.4864434599876404, 'learning_rate': 3.3980920247857155e-05, 'epoch': 2.76} 28%|██▊ | 2760/10000 [4:19:52<11:01:55, 5.49s/it][2025-06-19 17:49:37,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:49:37,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.75 | bwd_microstep: 3317.43 | bwd_inner_microstep: 3316.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 17:49:37,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.75 | bwd: 3317.45 | bwd_inner: 3316.64 | bwd_allreduce: 0.77 | step: 6.94 28%|██▊ | 2761/10000 [4:19:57<11:00:55, 5.48s/it] {'loss': 0.088, 'grad_norm': 0.9261415004730225, 'learning_rate': 3.397628759781357e-05, 'epoch': 2.76} 28%|██▊ | 2761/10000 [4:19:57<11:00:55, 5.48s/it][2025-06-19 17:49:42,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:49:42,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.03 | bwd_microstep: 3319.73 | bwd_inner_microstep: 3318.79 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.03 [2025-06-19 17:49:42,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.03 | bwd: 3319.74 | bwd_inner: 3318.79 | bwd_allreduce: 0.90 | step: 7.03 28%|██▊ | 2762/10000 [4:20:03<11:00:25, 5.47s/it] {'loss': 0.0889, 'grad_norm': 0.852999210357666, 'learning_rate': 3.397165348172228e-05, 'epoch': 2.76} 28%|██▊ | 2762/10000 [4:20:03<11:00:25, 5.47s/it][2025-06-19 17:49:48,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:49:48,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.59 | bwd_microstep: 3313.69 | bwd_inner_microstep: 3312.73 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.50 [2025-06-19 17:49:48,047] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.59 | bwd: 3313.70 | bwd_inner: 3312.73 | bwd_allreduce: 0.93 | step: 7.50 28%|██▊ | 2763/10000 [4:20:08<10:59:49, 5.47s/it] {'loss': 0.0693, 'grad_norm': 0.9150974750518799, 'learning_rate': 3.3967017900069375e-05, 'epoch': 2.76} 28%|██▊ | 2763/10000 [4:20:08<10:59:49, 5.47s/it][2025-06-19 17:49:53,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:49:53,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.79 | bwd_microstep: 3362.57 | bwd_inner_microstep: 3361.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 17:49:53,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.79 | bwd: 3362.59 | bwd_inner: 3361.78 | bwd_allreduce: 0.76 | step: 6.83 28%|██▊ | 2764/10000 [4:20:14<11:01:56, 5.49s/it] {'loss': 0.0749, 'grad_norm': 0.7260497808456421, 'learning_rate': 3.396238085334112e-05, 'epoch': 2.76} 28%|██▊ | 2764/10000 [4:20:14<11:01:56, 5.49s/it][2025-06-19 17:49:59,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:49:59,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.97 | bwd_microstep: 3366.33 | bwd_inner_microstep: 3365.19 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.39 [2025-06-19 17:49:59,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.97 | bwd: 3366.35 | bwd_inner: 3365.19 | bwd_allreduce: 1.09 | step: 7.40 28%|██▊ | 2765/10000 [4:20:19<11:03:47, 5.50s/it] {'loss': 0.0562, 'grad_norm': 0.6410726308822632, 'learning_rate': 3.3957742342023925e-05, 'epoch': 2.77} 28%|██▊ | 2765/10000 [4:20:19<11:03:47, 5.50s/it][2025-06-19 17:50:04,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
17:50:04,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.94 | bwd_microstep: 3383.62 | bwd_inner_microstep: 3382.80 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-19 17:50:04,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.94 | bwd: 3383.63 | bwd_inner: 3382.80 | bwd_allreduce: 0.78 | step: 7.15 28%|██▊ | 2766/10000 [4:20:25<11:06:09, 5.53s/it] {'loss': 0.1007, 'grad_norm': 0.6754788160324097, 'learning_rate': 3.395310236660433e-05, 'epoch': 2.77} 28%|██▊ | 2766/10000 [4:20:25<11:06:09, 5.53s/it][2025-06-19 17:50:10,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 17:50:10,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.99 | bwd_microstep: 3320.44 | bwd_inner_microstep: 3319.35 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.27 [2025-06-19 17:50:10,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.99 | bwd: 3320.46 | bwd_inner: 3319.35 | bwd_allreduce: 1.06 | step: 7.27 28%|██▊ | 2767/10000 [4:20:30<11:03:43, 5.51s/it] {'loss': 0.0716, 'grad_norm': 0.8331437706947327, 'learning_rate': 3.3948460927569056e-05, 'epoch': 2.77} 28%|██▊ | 2767/10000 [4:20:30<11:03:43, 5.51s/it][2025-06-19 17:50:15,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:50:15,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.46 | bwd_microstep: 3378.52 | bwd_inner_microstep: 3377.68 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.91 [2025-06-19 17:50:15,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.46 | bwd: 3378.53 | bwd_inner: 3377.68 | bwd_allreduce: 0.81 | step: 6.91 28%|██▊ | 2768/10000 [4:20:36<11:05:29, 5.52s/it] {'loss': 0.063, 'grad_norm': 0.6575868129730225, 'learning_rate': 3.3943818025404965e-05, 'epoch': 2.77} 
28%|██▊ | 2768/10000 [4:20:36<11:05:29, 5.52s/it][2025-06-19 17:50:21,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:50:21,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.77 | bwd_microstep: 3326.27 | bwd_inner_microstep: 3325.42 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.40 [2025-06-19 17:50:21,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.77 | bwd: 3326.29 | bwd_inner: 3325.42 | bwd_allreduce: 0.82 | step: 7.41 28%|██▊ | 2769/10000 [4:20:41<11:03:49, 5.51s/it] {'loss': 0.0979, 'grad_norm': 0.8983533978462219, 'learning_rate': 3.393917366059908e-05, 'epoch': 2.77} 28%|██▊ | 2769/10000 [4:20:41<11:03:49, 5.51s/it][2025-06-19 17:50:26,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:50:26,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.26 | bwd_microstep: 3340.28 | bwd_inner_microstep: 3339.47 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-19 17:50:26,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.26 | bwd: 3340.29 | bwd_inner: 3339.47 | bwd_allreduce: 0.78 | step: 6.82 28%|██▊ | 2770/10000 [4:20:47<11:02:55, 5.50s/it] {'loss': 0.0476, 'grad_norm': 0.4302448332309723, 'learning_rate': 3.393452783363858e-05, 'epoch': 2.77} 28%|██▊ | 2770/10000 [4:20:47<11:02:55, 5.50s/it][2025-06-19 17:50:32,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:50:32,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.91 | bwd_microstep: 3329.08 | bwd_inner_microstep: 3328.08 | bwd_allreduce_microstep: 0.95 | step_microstep: 8.00 [2025-06-19 17:50:32,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.91 | bwd: 
3329.10 | bwd_inner: 3328.08 | bwd_allreduce: 0.97 | step: 8.01 28%|██▊ | 2771/10000 [4:20:52<11:02:16, 5.50s/it] {'loss': 0.0546, 'grad_norm': 0.5141285061836243, 'learning_rate': 3.3929880545010774e-05, 'epoch': 2.77} 28%|██▊ | 2771/10000 [4:20:52<11:02:16, 5.50s/it][2025-06-19 17:50:37,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 3.16 [2025-06-19 17:50:37,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.62 | bwd_microstep: 3337.33 | bwd_inner_microstep: 3336.51 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.55 [2025-06-19 17:50:37,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.62 | bwd: 3337.34 | bwd_inner: 3336.52 | bwd_allreduce: 0.78 | step: 7.56 28%|██▊ | 2772/10000 [4:20:58<11:02:10, 5.50s/it] {'loss': 0.1194, 'grad_norm': 1.0508005619049072, 'learning_rate': 3.392523179520315e-05, 'epoch': 2.77} 28%|██▊ | 2772/10000 [4:20:58<11:02:10, 5.50s/it][2025-06-19 17:50:43,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:50:43,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.91 | bwd_microstep: 3379.83 | bwd_inner_microstep: 3379.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 17:50:43,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.91 | bwd: 3379.85 | bwd_inner: 3379.05 | bwd_allreduce: 0.76 | step: 6.64 28%|██▊ | 2773/10000 [4:21:04<11:04:10, 5.51s/it] {'loss': 0.0662, 'grad_norm': 0.6685059666633606, 'learning_rate': 3.3920581584703337e-05, 'epoch': 2.77} 28%|██▊ | 2773/10000 [4:21:04<11:04:10, 5.51s/it][2025-06-19 17:50:48,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:50:48,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2125.77 | bwd_microstep: 3335.73 | bwd_inner_microstep: 3334.79 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.64 [2025-06-19 17:50:48,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.77 | bwd: 3335.75 | bwd_inner: 3334.79 | bwd_allreduce: 0.90 | step: 7.64 28%|██▊ | 2774/10000 [4:21:09<11:03:31, 5.51s/it] {'loss': 0.1074, 'grad_norm': 1.4901760816574097, 'learning_rate': 3.3915929913999126e-05, 'epoch': 2.77} 28%|██▊ | 2774/10000 [4:21:09<11:03:31, 5.51s/it][2025-06-19 17:50:54,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:50:54,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.31 | bwd_microstep: 3327.20 | bwd_inner_microstep: 3326.42 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.57 [2025-06-19 17:50:54,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.31 | bwd: 3327.21 | bwd_inner: 3326.42 | bwd_allreduce: 0.75 | step: 6.57 28%|██▊ | 2775/10000 [4:21:14<11:02:10, 5.50s/it] {'loss': 0.1266, 'grad_norm': 0.892826497554779, 'learning_rate': 3.391127678357846e-05, 'epoch': 2.77} 28%|██▊ | 2775/10000 [4:21:14<11:02:10, 5.50s/it][2025-06-19 17:50:59,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:50:59,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.42 | bwd_microstep: 3331.65 | bwd_inner_microstep: 3330.70 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.07 [2025-06-19 17:50:59,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.42 | bwd: 3331.67 | bwd_inner: 3330.70 | bwd_allreduce: 0.93 | step: 7.07 28%|██▊ | 2776/10000 [4:21:20<11:01:23, 5.49s/it] {'loss': 0.0804, 'grad_norm': 0.9235659837722778, 'learning_rate': 3.390662219392941e-05, 'epoch': 2.78} 28%|██▊ | 2776/10000 [4:21:20<11:01:23, 5.49s/it][2025-06-19 17:51:05,214] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:51:05,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.80 | bwd_microstep: 3373.86 | bwd_inner_microstep: 3373.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.46 [2025-06-19 17:51:05,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.80 | bwd: 3373.88 | bwd_inner: 3373.05 | bwd_allreduce: 0.78 | step: 7.46 28%|██▊ | 2777/10000 [4:21:26<11:03:18, 5.51s/it] {'loss': 0.0678, 'grad_norm': 0.7952314019203186, 'learning_rate': 3.3901966145540246e-05, 'epoch': 2.78} 28%|██▊ | 2777/10000 [4:21:26<11:03:18, 5.51s/it][2025-06-19 17:51:10,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 17:51:10,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.11 | bwd_microstep: 3327.26 | bwd_inner_microstep: 3326.16 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.27 [2025-06-19 17:51:10,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.11 | bwd: 3327.27 | bwd_inner: 3326.16 | bwd_allreduce: 1.06 | step: 7.28 28%|██▊ | 2778/10000 [4:21:31<11:01:52, 5.50s/it] {'loss': 0.0959, 'grad_norm': 1.2742128372192383, 'learning_rate': 3.389730863889935e-05, 'epoch': 2.78} 28%|██▊ | 2778/10000 [4:21:31<11:01:52, 5.50s/it][2025-06-19 17:51:16,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 17:51:16,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.78 | bwd_microstep: 3328.11 | bwd_inner_microstep: 3327.13 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.97 [2025-06-19 17:51:16,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.77 | bwd: 3328.13 | bwd_inner: 3327.13 | bwd_allreduce: 0.95 | step: 7.98 28%|██▊ | 
2779/10000 [4:21:36<11:01:08, 5.49s/it] {'loss': 0.2439, 'grad_norm': 0.8473408818244934, 'learning_rate': 3.389264967449527e-05, 'epoch': 2.78} 28%|██▊ | 2779/10000 [4:21:36<11:01:08, 5.49s/it][2025-06-19 17:51:21,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:51:21,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.97 | bwd_microstep: 3330.91 | bwd_inner_microstep: 3330.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 17:51:21,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.97 | bwd: 3330.92 | bwd_inner: 3330.12 | bwd_allreduce: 0.76 | step: 6.63 28%|██▊ | 2780/10000 [4:21:42<11:00:43, 5.49s/it] {'loss': 0.0706, 'grad_norm': 0.5233252048492432, 'learning_rate': 3.388798925281673e-05, 'epoch': 2.78} 28%|██▊ | 2780/10000 [4:21:42<11:00:43, 5.49s/it][2025-06-19 17:51:27,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:51:27,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.92 | bwd_microstep: 3379.32 | bwd_inner_microstep: 3378.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 17:51:27,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.92 | bwd: 3379.34 | bwd_inner: 3378.51 | bwd_allreduce: 0.78 | step: 7.09 28%|██▊ | 2781/10000 [4:21:48<11:02:53, 5.51s/it] {'loss': 0.0466, 'grad_norm': 0.49745914340019226, 'learning_rate': 3.388332737435256e-05, 'epoch': 2.78} 28%|██▊ | 2781/10000 [4:21:48<11:02:53, 5.51s/it][2025-06-19 17:51:32,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:51:32,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.17 | bwd_microstep: 3372.10 | bwd_inner_microstep: 3371.23 | 
bwd_allreduce_microstep: 0.80 | step_microstep: 6.98 [2025-06-19 17:51:32,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.17 | bwd: 3372.12 | bwd_inner: 3371.23 | bwd_allreduce: 0.83 | step: 6.99 28%|██▊ | 2782/10000 [4:21:53<11:04:06, 5.52s/it] {'loss': 0.0502, 'grad_norm': 0.49699509143829346, 'learning_rate': 3.38786640395918e-05, 'epoch': 2.78} 28%|██▊ | 2782/10000 [4:21:53<11:04:06, 5.52s/it][2025-06-19 17:51:38,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:51:38,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.29 | bwd_microstep: 3330.54 | bwd_inner_microstep: 3329.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 17:51:38,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.29 | bwd: 3330.55 | bwd_inner: 3329.74 | bwd_allreduce: 0.77 | step: 6.88 28%|██▊ | 2783/10000 [4:21:59<11:02:23, 5.51s/it] {'loss': 0.0465, 'grad_norm': 0.5582541227340698, 'learning_rate': 3.3873999249023585e-05, 'epoch': 2.78} 28%|██▊ | 2783/10000 [4:21:59<11:02:23, 5.51s/it][2025-06-19 17:51:43,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:51:43,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.82 | bwd_microstep: 3332.44 | bwd_inner_microstep: 3331.64 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 17:51:43,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.82 | bwd: 3332.46 | bwd_inner: 3331.64 | bwd_allreduce: 0.77 | step: 7.19 28%|██▊ | 2784/10000 [4:22:04<11:01:31, 5.50s/it] {'loss': 0.0817, 'grad_norm': 1.0518126487731934, 'learning_rate': 3.3869333003137235e-05, 'epoch': 2.78} 28%|██▊ | 2784/10000 [4:22:04<11:01:31, 5.50s/it][2025-06-19 17:51:49,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:51:49,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.56 | bwd_microstep: 3318.05 | bwd_inner_microstep: 3317.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 17:51:49,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.56 | bwd: 3318.06 | bwd_inner: 3317.27 | bwd_allreduce: 0.75 | step: 6.67 28%|██▊ | 2785/10000 [4:22:09<11:00:16, 5.49s/it] {'loss': 0.1199, 'grad_norm': 1.2410483360290527, 'learning_rate': 3.386466530242223e-05, 'epoch': 2.79} 28%|██▊ | 2785/10000 [4:22:09<11:00:16, 5.49s/it][2025-06-19 17:51:54,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:51:54,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.15 | bwd_microstep: 3342.48 | bwd_inner_microstep: 3341.52 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.01 [2025-06-19 17:51:54,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.16 | bwd: 3342.50 | bwd_inner: 3341.52 | bwd_allreduce: 0.93 | step: 7.01 28%|██▊ | 2786/10000 [4:22:15<11:00:06, 5.49s/it] {'loss': 0.0442, 'grad_norm': 0.412885457277298, 'learning_rate': 3.385999614736818e-05, 'epoch': 2.79} 28%|██▊ | 2786/10000 [4:22:15<11:00:06, 5.49s/it][2025-06-19 17:52:00,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:52:00,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.18 | bwd_microstep: 3377.57 | bwd_inner_microstep: 3376.75 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.21 [2025-06-19 17:52:00,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.18 | bwd: 3377.58 | bwd_inner: 3376.75 | bwd_allreduce: 0.79 | step: 7.21 28%|██▊ | 2787/10000 [4:22:21<11:02:10, 5.51s/it] {'loss': 0.1395, 
'grad_norm': 1.2930724620819092, 'learning_rate': 3.385532553846486e-05, 'epoch': 2.79} 28%|██▊ | 2787/10000 [4:22:21<11:02:10, 5.51s/it][2025-06-19 17:52:05,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:52:05,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.35 | bwd_microstep: 3323.69 | bwd_inner_microstep: 3322.87 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.16 [2025-06-19 17:52:05,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.35 | bwd: 3323.70 | bwd_inner: 3322.87 | bwd_allreduce: 0.78 | step: 7.16 28%|██▊ | 2788/10000 [4:22:26<11:00:58, 5.50s/it] {'loss': 0.0796, 'grad_norm': 0.7454897165298462, 'learning_rate': 3.38506534762022e-05, 'epoch': 2.79} 28%|██▊ | 2788/10000 [4:22:26<11:00:58, 5.50s/it][2025-06-19 17:52:11,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:52:11,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.36 | bwd_microstep: 3323.58 | bwd_inner_microstep: 3322.76 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.88 [2025-06-19 17:52:11,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.36 | bwd: 3323.59 | bwd_inner: 3322.76 | bwd_allreduce: 0.79 | step: 6.88 28%|██▊ | 2789/10000 [4:22:31<10:59:58, 5.49s/it] {'loss': 0.0598, 'grad_norm': 0.6515461206436157, 'learning_rate': 3.384597996107027e-05, 'epoch': 2.79} 28%|██▊ | 2789/10000 [4:22:31<10:59:58, 5.49s/it][2025-06-19 17:52:16,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:52:16,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.45 | bwd_microstep: 3374.01 | bwd_inner_microstep: 3373.19 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.16 [2025-06-19 
17:52:16,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.45 | bwd: 3374.02 | bwd_inner: 3373.19 | bwd_allreduce: 0.78 | step: 7.16 28%|██▊ | 2790/10000 [4:22:37<11:01:50, 5.51s/it] {'loss': 0.0688, 'grad_norm': 0.81672203540802, 'learning_rate': 3.38413049935593e-05, 'epoch': 2.79} 28%|██▊ | 2790/10000 [4:22:37<11:01:50, 5.51s/it][2025-06-19 17:52:22,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:52:22,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.34 | bwd_microstep: 3380.94 | bwd_inner_microstep: 3380.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 17:52:22,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.34 | bwd: 3380.95 | bwd_inner: 3380.14 | bwd_allreduce: 0.77 | step: 6.79 28%|██▊ | 2791/10000 [4:22:43<11:03:16, 5.52s/it] {'loss': 0.1136, 'grad_norm': 1.2443867921829224, 'learning_rate': 3.383662857415968e-05, 'epoch': 2.79} 28%|██▊ | 2791/10000 [4:22:43<11:03:16, 5.52s/it][2025-06-19 17:52:27,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:52:27,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.11 | bwd_microstep: 3323.75 | bwd_inner_microstep: 3322.49 | bwd_allreduce_microstep: 1.17 | step_microstep: 7.79 [2025-06-19 17:52:27,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.11 | bwd: 3323.78 | bwd_inner: 3322.49 | bwd_allreduce: 1.21 | step: 7.78 28%|██▊ | 2792/10000 [4:22:48<11:01:40, 5.51s/it] {'loss': 0.0626, 'grad_norm': 0.9576929211616516, 'learning_rate': 3.3831950703361937e-05, 'epoch': 2.79} 28%|██▊ | 2792/10000 [4:22:48<11:01:40, 5.51s/it][2025-06-19 17:52:33,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 
[2025-06-19 17:52:33,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.28 | bwd_microstep: 3331.26 | bwd_inner_microstep: 3330.32 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.09 [2025-06-19 17:52:33,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.28 | bwd: 3331.28 | bwd_inner: 3330.32 | bwd_allreduce: 0.91 | step: 7.10 28%|██▊ | 2793/10000 [4:22:54<11:00:45, 5.50s/it] {'loss': 0.1124, 'grad_norm': 0.5970611572265625, 'learning_rate': 3.3827271381656764e-05, 'epoch': 2.79} 28%|██▊ | 2793/10000 [4:22:54<11:00:45, 5.50s/it][2025-06-19 17:52:38,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 17:52:38,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.02 | bwd_microstep: 3323.62 | bwd_inner_microstep: 3322.68 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.46 [2025-06-19 17:52:38,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.02 | bwd: 3323.63 | bwd_inner: 3322.68 | bwd_allreduce: 0.91 | step: 7.47 28%|██▊ | 2794/10000 [4:22:59<10:59:34, 5.49s/it] {'loss': 0.0778, 'grad_norm': 0.6948049068450928, 'learning_rate': 3.3822590609534995e-05, 'epoch': 2.79} 28%|██▊ | 2794/10000 [4:22:59<10:59:34, 5.49s/it][2025-06-19 17:52:44,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:52:44,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.91 | bwd_microstep: 3324.50 | bwd_inner_microstep: 3323.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 17:52:44,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.91 | bwd: 3324.51 | bwd_inner: 3323.71 | bwd_allreduce: 0.76 | step: 6.66 28%|██▊ | 2795/10000 [4:23:04<10:58:36, 5.48s/it] {'loss': 0.0532, 'grad_norm': 0.5763687491416931, 'learning_rate': 3.381790838748763e-05, 
'epoch': 2.79} 28%|██▊ | 2795/10000 [4:23:04<10:58:36, 5.48s/it][2025-06-19 17:52:49,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:52:49,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.13 | bwd_microstep: 3386.18 | bwd_inner_microstep: 3385.20 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.53 [2025-06-19 17:52:49,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.13 | bwd: 3386.20 | bwd_inner: 3385.20 | bwd_allreduce: 0.95 | step: 7.54 28%|██▊ | 2796/10000 [4:23:10<11:01:39, 5.51s/it] {'loss': 0.0743, 'grad_norm': 1.1011260747909546, 'learning_rate': 3.38132247160058e-05, 'epoch': 2.8} 28%|██▊ | 2796/10000 [4:23:10<11:01:39, 5.51s/it][2025-06-19 17:52:55,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:52:55,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.80 | bwd_microstep: 3327.32 | bwd_inner_microstep: 3326.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 17:52:55,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.80 | bwd: 3327.34 | bwd_inner: 3326.54 | bwd_allreduce: 0.75 | step: 6.65 28%|██▊ | 2797/10000 [4:23:16<11:00:21, 5.50s/it] {'loss': 0.0401, 'grad_norm': 0.5480641722679138, 'learning_rate': 3.380853959558081e-05, 'epoch': 2.8} 28%|██▊ | 2797/10000 [4:23:16<11:00:21, 5.50s/it][2025-06-19 17:53:00,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:53:00,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.52 | bwd_microstep: 3325.20 | bwd_inner_microstep: 3324.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 17:53:00,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2106.52 | bwd: 3325.22 | bwd_inner: 3324.42 | bwd_allreduce: 0.76 | step: 6.62 28%|██▊ | 2798/10000 [4:23:21<10:59:05, 5.49s/it] {'loss': 0.0431, 'grad_norm': 0.6218523383140564, 'learning_rate': 3.3803853026704104e-05, 'epoch': 2.8} 28%|██▊ | 2798/10000 [4:23:21<10:59:05, 5.49s/it][2025-06-19 17:53:06,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:53:06,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.77 | bwd_microstep: 3378.04 | bwd_inner_microstep: 3377.23 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.20 [2025-06-19 17:53:06,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.77 | bwd: 3378.06 | bwd_inner: 3377.23 | bwd_allreduce: 0.78 | step: 7.21 28%|██▊ | 2799/10000 [4:23:27<11:01:03, 5.51s/it] {'loss': 0.052, 'grad_norm': 0.6857895851135254, 'learning_rate': 3.3799165009867274e-05, 'epoch': 2.8} 28%|██▊ | 2799/10000 [4:23:27<11:01:03, 5.51s/it][2025-06-19 17:53:11,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:53:11,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.66 | bwd_microstep: 3340.87 | bwd_inner_microstep: 3339.92 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.08 [2025-06-19 17:53:11,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.66 | bwd: 3340.88 | bwd_inner: 3339.92 | bwd_allreduce: 0.92 | step: 7.08 28%|██▊ | 2800/10000 [4:23:32<11:00:19, 5.50s/it] {'loss': 0.0402, 'grad_norm': 0.5155930519104004, 'learning_rate': 3.379447554556209e-05, 'epoch': 2.8} 28%|██▊ | 2800/10000 [4:23:32<11:00:19, 5.50s/it][2025-06-19 17:53:17,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:53:17,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2127.54 | bwd_microstep: 3373.83 | bwd_inner_microstep: 3373.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 17:53:17,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.54 | bwd: 3373.85 | bwd_inner: 3373.05 | bwd_allreduce: 0.76 | step: 6.63 28%|██▊ | 2801/10000 [4:23:38<11:01:27, 5.51s/it] {'loss': 0.0484, 'grad_norm': 0.5574751496315002, 'learning_rate': 3.378978463428043e-05, 'epoch': 2.8} 28%|██▊ | 2801/10000 [4:23:38<11:01:27, 5.51s/it][2025-06-19 17:53:22,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 17:53:22,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.98 | bwd_microstep: 3328.70 | bwd_inner_microstep: 3327.73 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.80 [2025-06-19 17:53:22,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.98 | bwd: 3328.71 | bwd_inner: 3327.73 | bwd_allreduce: 0.94 | step: 7.81 28%|██▊ | 2802/10000 [4:23:43<11:00:36, 5.51s/it] {'loss': 0.0897, 'grad_norm': 0.8187443017959595, 'learning_rate': 3.3785092276514374e-05, 'epoch': 2.8} 28%|██▊ | 2802/10000 [4:23:43<11:00:36, 5.51s/it][2025-06-19 17:53:28,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:53:28,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.73 | bwd_microstep: 3383.54 | bwd_inner_microstep: 3382.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 17:53:28,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.73 | bwd: 3383.56 | bwd_inner: 3382.75 | bwd_allreduce: 0.76 | step: 6.71 28%|██▊ | 2803/10000 [4:23:49<11:02:13, 5.52s/it] {'loss': 0.059, 'grad_norm': 1.1461602449417114, 'learning_rate': 3.378039847275611e-05, 'epoch': 2.8} 28%|██▊ | 2803/10000 [4:23:49<11:02:13, 5.52s/it][2025-06-19 
17:53:33,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:53:33,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.29 | bwd_microstep: 3340.14 | bwd_inner_microstep: 3339.27 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.96 [2025-06-19 17:53:33,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.29 | bwd: 3340.16 | bwd_inner: 3339.27 | bwd_allreduce: 0.84 | step: 6.96 28%|██▊ | 2804/10000 [4:23:54<11:01:03, 5.51s/it] {'loss': 0.0957, 'grad_norm': 1.6074657440185547, 'learning_rate': 3.377570322349801e-05, 'epoch': 2.8} 28%|██▊ | 2804/10000 [4:23:54<11:01:03, 5.51s/it][2025-06-19 17:53:39,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:53:39,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.56 | bwd_microstep: 3324.16 | bwd_inner_microstep: 3323.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 17:53:39,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.56 | bwd: 3324.18 | bwd_inner: 3323.38 | bwd_allreduce: 0.76 | step: 6.72 28%|██▊ | 2805/10000 [4:24:00<10:59:22, 5.50s/it] {'loss': 0.2823, 'grad_norm': 2.5864036083221436, 'learning_rate': 3.3771006529232564e-05, 'epoch': 2.81} 28%|██▊ | 2805/10000 [4:24:00<10:59:22, 5.50s/it][2025-06-19 17:53:44,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:53:44,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.05 | bwd_microstep: 3342.17 | bwd_inner_microstep: 3341.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 17:53:44,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.05 | bwd: 3342.18 | bwd_inner: 3341.38 | bwd_allreduce: 0.76 | step: 
6.68 28%|██▊ | 2806/10000 [4:24:05<10:58:54, 5.50s/it] {'loss': 0.0655, 'grad_norm': 1.6000584363937378, 'learning_rate': 3.376630839045246e-05, 'epoch': 2.81} 28%|██▊ | 2806/10000 [4:24:05<10:58:54, 5.50s/it][2025-06-19 17:53:50,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:53:50,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.04 | bwd_microstep: 3370.44 | bwd_inner_microstep: 3369.64 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 17:53:50,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.04 | bwd: 3370.46 | bwd_inner: 3369.64 | bwd_allreduce: 0.78 | step: 6.74 28%|██▊ | 2807/10000 [4:24:11<11:00:07, 5.51s/it] {'loss': 0.1329, 'grad_norm': 0.9993259310722351, 'learning_rate': 3.376160880765049e-05, 'epoch': 2.81} 28%|██▊ | 2807/10000 [4:24:11<11:00:07, 5.51s/it][2025-06-19 17:53:55,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:53:55,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.12 | bwd_microstep: 3370.42 | bwd_inner_microstep: 3369.61 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 17:53:55,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.12 | bwd: 3370.43 | bwd_inner: 3369.61 | bwd_allreduce: 0.78 | step: 7.08 28%|██▊ | 2808/10000 [4:24:16<11:01:11, 5.52s/it] {'loss': 0.0473, 'grad_norm': 0.463863730430603, 'learning_rate': 3.375690778131963e-05, 'epoch': 2.81} 28%|██▊ | 2808/10000 [4:24:16<11:01:11, 5.52s/it][2025-06-19 17:54:01,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:54:01,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.35 | bwd_microstep: 3327.76 | bwd_inner_microstep: 
3326.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 17:54:01,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.35 | bwd: 3327.77 | bwd_inner: 3326.95 | bwd_allreduce: 0.77 | step: 6.77 28%|██▊ | 2809/10000 [4:24:22<10:59:31, 5.50s/it] {'loss': 0.0559, 'grad_norm': 0.6411008238792419, 'learning_rate': 3.3752205311952984e-05, 'epoch': 2.81} 28%|██▊ | 2809/10000 [4:24:22<10:59:31, 5.50s/it][2025-06-19 17:54:06,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:54:06,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.69 | bwd_microstep: 3328.10 | bwd_inner_microstep: 3327.15 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.32 [2025-06-19 17:54:06,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.69 | bwd: 3328.11 | bwd_inner: 3327.15 | bwd_allreduce: 0.92 | step: 7.33 28%|██▊ | 2810/10000 [4:24:27<10:58:13, 5.49s/it] {'loss': 0.1723, 'grad_norm': 1.36210298538208, 'learning_rate': 3.374750140004383e-05, 'epoch': 2.81} 28%|██▊ | 2810/10000 [4:24:27<10:58:13, 5.49s/it][2025-06-19 17:54:12,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:54:12,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.35 | bwd_microstep: 3379.88 | bwd_inner_microstep: 3378.92 | bwd_allreduce_microstep: 0.91 | step_microstep: 6.79 [2025-06-19 17:54:12,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.35 | bwd: 3379.89 | bwd_inner: 3378.92 | bwd_allreduce: 0.93 | step: 6.79 28%|██▊ | 2811/10000 [4:24:33<11:00:34, 5.51s/it] {'loss': 0.0975, 'grad_norm': 0.8224037885665894, 'learning_rate': 3.3742796046085586e-05, 'epoch': 2.81} 28%|██▊ | 2811/10000 [4:24:33<11:00:34, 5.51s/it][2025-06-19 17:54:17,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:54:17,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.46 | bwd_microstep: 3323.04 | bwd_inner_microstep: 3322.09 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.10 [2025-06-19 17:54:17,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.46 | bwd: 3323.06 | bwd_inner: 3322.09 | bwd_allreduce: 0.92 | step: 7.11 28%|██▊ | 2812/10000 [4:24:38<10:59:03, 5.50s/it] {'loss': 0.0524, 'grad_norm': 0.5053207874298096, 'learning_rate': 3.373808925057182e-05, 'epoch': 2.81} 28%|██▊ | 2812/10000 [4:24:38<10:59:03, 5.50s/it][2025-06-19 17:54:23,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:54:23,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.44 | bwd_microstep: 3397.44 | bwd_inner_microstep: 3396.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 17:54:23,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.44 | bwd: 3397.45 | bwd_inner: 3396.66 | bwd_allreduce: 0.76 | step: 6.65 28%|██▊ | 2813/10000 [4:24:44<11:01:37, 5.52s/it] {'loss': 0.1208, 'grad_norm': 1.0614031553268433, 'learning_rate': 3.373338101399625e-05, 'epoch': 2.81} 28%|██▊ | 2813/10000 [4:24:44<11:01:37, 5.52s/it][2025-06-19 17:54:28,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:54:28,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.18 | bwd_microstep: 3324.45 | bwd_inner_microstep: 3323.62 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.62 [2025-06-19 17:54:28,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.18 | bwd: 3324.46 | bwd_inner: 3323.62 | bwd_allreduce: 0.80 | step: 6.62 28%|██▊ | 2814/10000 [4:24:49<10:59:31, 5.51s/it] {'loss': 0.0542, 
'grad_norm': 0.6143859028816223, 'learning_rate': 3.372867133685275e-05, 'epoch': 2.81} 28%|██▊ | 2814/10000 [4:24:49<10:59:31, 5.51s/it][2025-06-19 17:54:34,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:54:34,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.42 | bwd_microstep: 3317.94 | bwd_inner_microstep: 3317.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-19 17:54:34,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.42 | bwd: 3317.96 | bwd_inner: 3317.14 | bwd_allreduce: 0.78 | step: 7.15 28%|██▊ | 2815/10000 [4:24:55<10:57:55, 5.49s/it] {'loss': 0.0691, 'grad_norm': 0.7280918955802917, 'learning_rate': 3.372396021963534e-05, 'epoch': 2.81} 28%|██▊ | 2815/10000 [4:24:55<10:57:55, 5.49s/it][2025-06-19 17:54:39,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:54:39,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.39 | bwd_microstep: 3368.08 | bwd_inner_microstep: 3367.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-19 17:54:39,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.39 | bwd: 3368.09 | bwd_inner: 3367.29 | bwd_allreduce: 0.76 | step: 6.79 28%|██▊ | 2816/10000 [4:25:00<10:59:27, 5.51s/it] {'loss': 0.0423, 'grad_norm': 0.8039044737815857, 'learning_rate': 3.37192476628382e-05, 'epoch': 2.82} 28%|██▊ | 2816/10000 [4:25:00<10:59:27, 5.51s/it][2025-06-19 17:54:45,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:54:45,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.00 | bwd_microstep: 3312.14 | bwd_inner_microstep: 3311.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 
17:54:45,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.00 | bwd: 3312.15 | bwd_inner: 3311.35 | bwd_allreduce: 0.76 | step: 6.54 28%|██▊ | 2817/10000 [4:25:06<10:57:25, 5.49s/it] {'loss': 0.0521, 'grad_norm': 0.7281683683395386, 'learning_rate': 3.3714533666955654e-05, 'epoch': 2.82} 28%|██▊ | 2817/10000 [4:25:06<10:57:25, 5.49s/it][2025-06-19 17:54:50,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:54:50,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.93 | bwd_microstep: 3316.56 | bwd_inner_microstep: 3315.78 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.57 [2025-06-19 17:54:50,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.94 | bwd: 3316.57 | bwd_inner: 3315.78 | bwd_allreduce: 0.75 | step: 6.57 28%|██▊ | 2818/10000 [4:25:11<10:55:57, 5.48s/it] {'loss': 0.0896, 'grad_norm': 1.3056426048278809, 'learning_rate': 3.370981823248218e-05, 'epoch': 2.82} 28%|██▊ | 2818/10000 [4:25:11<10:55:57, 5.48s/it][2025-06-19 17:54:56,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:54:56,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.81 | bwd_microstep: 3375.77 | bwd_inner_microstep: 3374.80 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.98 [2025-06-19 17:54:56,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.81 | bwd: 3375.79 | bwd_inner: 3374.80 | bwd_allreduce: 0.94 | step: 6.98 28%|██▊ | 2819/10000 [4:25:17<10:58:08, 5.50s/it] {'loss': 0.0877, 'grad_norm': 1.0843595266342163, 'learning_rate': 3.37051013599124e-05, 'epoch': 2.82} 28%|██▊ | 2819/10000 [4:25:17<10:58:08, 5.50s/it][2025-06-19 17:55:01,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 
[2025-06-19 17:55:01,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.59 | bwd_microstep: 3377.84 | bwd_inner_microstep: 3376.97 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.42 [2025-06-19 17:55:01,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.59 | bwd: 3377.86 | bwd_inner: 3376.97 | bwd_allreduce: 0.83 | step: 7.42 28%|██▊ | 2820/10000 [4:25:22<11:00:21, 5.52s/it] {'loss': 0.0444, 'grad_norm': 0.41668450832366943, 'learning_rate': 3.3700383049741095e-05, 'epoch': 2.82} 28%|██▊ | 2820/10000 [4:25:22<11:00:21, 5.52s/it][2025-06-19 17:55:07,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:55:07,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.56 | bwd_microstep: 3329.15 | bwd_inner_microstep: 3328.31 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.90 [2025-06-19 17:55:07,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.56 | bwd: 3329.17 | bwd_inner: 3328.31 | bwd_allreduce: 0.80 | step: 6.90 28%|██▊ | 2821/10000 [4:25:28<10:59:01, 5.51s/it] {'loss': 0.0537, 'grad_norm': 0.5978298783302307, 'learning_rate': 3.3695663302463196e-05, 'epoch': 2.82} 28%|██▊ | 2821/10000 [4:25:28<10:59:01, 5.51s/it][2025-06-19 17:55:12,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:55:12,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.58 | bwd_microstep: 3325.16 | bwd_inner_microstep: 3324.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 17:55:12,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.58 | bwd: 3325.17 | bwd_inner: 3324.36 | bwd_allreduce: 0.77 | step: 6.69 28%|██▊ | 2822/10000 [4:25:33<10:57:34, 5.50s/it] {'loss': 0.1351, 'grad_norm': 1.094325065612793, 'learning_rate': 3.369094211857378e-05, 
'epoch': 2.82} 28%|██▊ | 2822/10000 [4:25:33<10:57:34, 5.50s/it][2025-06-19 17:55:18,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:55:18,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.99 | bwd_microstep: 3368.23 | bwd_inner_microstep: 3367.39 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.84 [2025-06-19 17:55:18,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.99 | bwd: 3368.25 | bwd_inner: 3367.39 | bwd_allreduce: 0.81 | step: 6.85 28%|██▊ | 2823/10000 [4:25:39<10:58:54, 5.51s/it] {'loss': 0.1304, 'grad_norm': 0.9698296189308167, 'learning_rate': 3.368621949856807e-05, 'epoch': 2.82} 28%|██▊ | 2823/10000 [4:25:39<10:58:54, 5.51s/it][2025-06-19 17:55:23,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:55:23,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.91 | bwd_microstep: 3376.47 | bwd_inner_microstep: 3375.50 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.70 [2025-06-19 17:55:23,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.91 | bwd: 3376.48 | bwd_inner: 3375.50 | bwd_allreduce: 0.94 | step: 7.71 28%|██▊ | 2824/10000 [4:25:44<11:00:07, 5.52s/it] {'loss': 0.0636, 'grad_norm': 0.844477117061615, 'learning_rate': 3.3681495442941466e-05, 'epoch': 2.82} 28%|██▊ | 2824/10000 [4:25:44<11:00:07, 5.52s/it][2025-06-19 17:55:29,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:55:29,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.51 | bwd_microstep: 3313.17 | bwd_inner_microstep: 3312.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 17:55:29,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2114.51 | bwd: 3313.19 | bwd_inner: 3312.39 | bwd_allreduce: 0.75 | step: 6.68 28%|██▊ | 2825/10000 [4:25:50<10:58:10, 5.50s/it] {'loss': 0.1002, 'grad_norm': 1.0165863037109375, 'learning_rate': 3.3676769952189476e-05, 'epoch': 2.83} 28%|██▊ | 2825/10000 [4:25:50<10:58:10, 5.50s/it][2025-06-19 17:55:34,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:55:34,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.89 | bwd_microstep: 3316.05 | bwd_inner_microstep: 3315.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 17:55:34,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.89 | bwd: 3316.06 | bwd_inner: 3315.26 | bwd_allreduce: 0.76 | step: 6.66 28%|██▊ | 2826/10000 [4:25:55<10:56:26, 5.49s/it] {'loss': 0.1474, 'grad_norm': 1.2418863773345947, 'learning_rate': 3.36720430268078e-05, 'epoch': 2.83} 28%|██▊ | 2826/10000 [4:25:55<10:56:26, 5.49s/it][2025-06-19 17:55:40,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:55:40,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.98 | bwd_microstep: 3312.63 | bwd_inner_microstep: 3311.73 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.00 [2025-06-19 17:55:40,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.98 | bwd: 3312.64 | bwd_inner: 3311.73 | bwd_allreduce: 0.87 | step: 7.00 28%|██▊ | 2827/10000 [4:26:01<10:55:22, 5.48s/it] {'loss': 0.0576, 'grad_norm': 0.5519735217094421, 'learning_rate': 3.3667314667292265e-05, 'epoch': 2.83} 28%|██▊ | 2827/10000 [4:26:01<10:55:22, 5.48s/it][2025-06-19 17:55:45,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:55:45,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2099.83 | bwd_microstep: 3318.35 | bwd_inner_microstep: 3317.22 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.08 [2025-06-19 17:55:45,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.83 | bwd: 3318.37 | bwd_inner: 3317.22 | bwd_allreduce: 1.10 | step: 7.08 28%|██▊ | 2828/10000 [4:26:06<10:54:29, 5.48s/it] {'loss': 0.0454, 'grad_norm': 0.40861377120018005, 'learning_rate': 3.366258487413885e-05, 'epoch': 2.83} 28%|██▊ | 2828/10000 [4:26:06<10:54:29, 5.48s/it][2025-06-19 17:55:51,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:55:51,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.78 | bwd_microstep: 3373.22 | bwd_inner_microstep: 3372.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 17:55:51,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.78 | bwd: 3373.24 | bwd_inner: 3372.42 | bwd_allreduce: 0.78 | step: 7.07 28%|██▊ | 2829/10000 [4:26:12<10:56:54, 5.50s/it] {'loss': 0.172, 'grad_norm': 1.9108960628509521, 'learning_rate': 3.365785364784369e-05, 'epoch': 2.83} 28%|██▊ | 2829/10000 [4:26:12<10:56:54, 5.50s/it][2025-06-19 17:55:56,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:55:56,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.70 | bwd_microstep: 3319.58 | bwd_inner_microstep: 3318.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 17:55:56,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.70 | bwd: 3319.59 | bwd_inner: 3318.79 | bwd_allreduce: 0.76 | step: 6.63 28%|██▊ | 2830/10000 [4:26:17<10:56:06, 5.49s/it] {'loss': 0.0627, 'grad_norm': 0.7569171190261841, 'learning_rate': 3.365312098890308e-05, 'epoch': 2.83} 28%|██▊ | 2830/10000 [4:26:17<10:56:06, 5.49s/it][2025-06-19 
17:56:02,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:56:02,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.27 | bwd_microstep: 3365.60 | bwd_inner_microstep: 3364.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 17:56:02,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.27 | bwd: 3365.61 | bwd_inner: 3364.81 | bwd_allreduce: 0.76 | step: 6.62 28%|██▊ | 2831/10000 [4:26:23<10:57:29, 5.50s/it] {'loss': 0.0384, 'grad_norm': 0.3516467213630676, 'learning_rate': 3.3648386897813436e-05, 'epoch': 2.83} 28%|██▊ | 2831/10000 [4:26:23<10:57:29, 5.50s/it][2025-06-19 17:56:07,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:56:07,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.69 | bwd_microstep: 3307.80 | bwd_inner_microstep: 3307.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-19 17:56:07,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.69 | bwd: 3307.81 | bwd_inner: 3306.99 | bwd_allreduce: 0.77 | step: 6.88 28%|██▊ | 2832/10000 [4:26:28<10:55:36, 5.49s/it] {'loss': 0.0609, 'grad_norm': 0.6318345665931702, 'learning_rate': 3.364365137507135e-05, 'epoch': 2.83} 28%|██▊ | 2832/10000 [4:26:28<10:55:36, 5.49s/it][2025-06-19 17:56:13,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:56:13,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.94 | bwd_microstep: 3317.41 | bwd_inner_microstep: 3316.36 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.08 [2025-06-19 17:56:13,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.94 | bwd: 3317.43 | bwd_inner: 3316.36 | bwd_allreduce: 1.02 | step: 
7.08 28%|██▊ | 2833/10000 [4:26:34<10:54:25, 5.48s/it] {'loss': 0.0514, 'grad_norm': 0.548976480960846, 'learning_rate': 3.363891442117356e-05, 'epoch': 2.83} 28%|██▊ | 2833/10000 [4:26:34<10:54:25, 5.48s/it][2025-06-19 17:56:18,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:56:18,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.83 | bwd_microstep: 3366.10 | bwd_inner_microstep: 3365.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.90 [2025-06-19 17:56:18,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.83 | bwd: 3366.11 | bwd_inner: 3365.31 | bwd_allreduce: 0.76 | step: 6.90 28%|██▊ | 2834/10000 [4:26:39<10:56:28, 5.50s/it] {'loss': 0.089, 'grad_norm': 1.2126882076263428, 'learning_rate': 3.3634176036616955e-05, 'epoch': 2.83} 28%|██▊ | 2834/10000 [4:26:39<10:56:28, 5.50s/it][2025-06-19 17:56:24,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:56:24,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.78 | bwd_microstep: 3362.76 | bwd_inner_microstep: 3361.87 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.89 [2025-06-19 17:56:24,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.78 | bwd: 3362.77 | bwd_inner: 3361.87 | bwd_allreduce: 0.87 | step: 6.89 28%|██▊ | 2835/10000 [4:26:45<10:57:27, 5.51s/it] {'loss': 0.2321, 'grad_norm': 1.5738955736160278, 'learning_rate': 3.362943622189855e-05, 'epoch': 2.83} 28%|██▊ | 2835/10000 [4:26:45<10:57:27, 5.51s/it][2025-06-19 17:56:29,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 17:56:29,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.99 | bwd_microstep: 3323.33 | bwd_inner_microstep: 
3322.51 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.88 [2025-06-19 17:56:29,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.99 | bwd: 3323.34 | bwd_inner: 3322.51 | bwd_allreduce: 0.79 | step: 6.89 28%|██▊ | 2836/10000 [4:26:50<10:56:05, 5.49s/it] {'loss': 0.0974, 'grad_norm': 1.7087554931640625, 'learning_rate': 3.362469497751555e-05, 'epoch': 2.84} 28%|██▊ | 2836/10000 [4:26:50<10:56:05, 5.49s/it][2025-06-19 17:56:35,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:56:35,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.88 | bwd_microstep: 3308.63 | bwd_inner_microstep: 3307.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 17:56:35,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.88 | bwd: 3308.65 | bwd_inner: 3307.83 | bwd_allreduce: 0.77 | step: 7.11 28%|██▊ | 2837/10000 [4:26:55<10:54:23, 5.48s/it] {'loss': 0.0466, 'grad_norm': 0.5770370960235596, 'learning_rate': 3.361995230396527e-05, 'epoch': 2.84} 28%|██▊ | 2837/10000 [4:26:55<10:54:23, 5.48s/it][2025-06-19 17:56:40,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 17:56:40,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.69 | bwd_microstep: 3312.90 | bwd_inner_microstep: 3312.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 17:56:40,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.69 | bwd: 3312.91 | bwd_inner: 3312.12 | bwd_allreduce: 0.75 | step: 6.63 28%|██▊ | 2838/10000 [4:27:01<10:53:13, 5.47s/it] {'loss': 0.1252, 'grad_norm': 1.2331701517105103, 'learning_rate': 3.361520820174521e-05, 'epoch': 2.84} 28%|██▊ | 2838/10000 [4:27:01<10:53:13, 5.47s/it][2025-06-19 17:56:46,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:56:46,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.03 | bwd_microstep: 3317.30 | bwd_inner_microstep: 3316.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 17:56:46,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.03 | bwd: 3317.31 | bwd_inner: 3316.50 | bwd_allreduce: 0.77 | step: 6.63 28%|██▊ | 2839/10000 [4:27:06<10:52:47, 5.47s/it] {'loss': 0.0883, 'grad_norm': 1.3428890705108643, 'learning_rate': 3.361046267135301e-05, 'epoch': 2.84} 28%|██▊ | 2839/10000 [4:27:06<10:52:47, 5.47s/it][2025-06-19 17:56:51,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:56:51,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.41 | bwd_microstep: 3363.08 | bwd_inner_microstep: 3362.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 17:56:51,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.41 | bwd: 3363.10 | bwd_inner: 3362.30 | bwd_allreduce: 0.75 | step: 6.57 28%|██▊ | 2840/10000 [4:27:12<10:54:32, 5.49s/it] {'loss': 0.0257, 'grad_norm': 0.19972604513168335, 'learning_rate': 3.360571571328643e-05, 'epoch': 2.84} 28%|██▊ | 2840/10000 [4:27:12<10:54:32, 5.49s/it][2025-06-19 17:56:57,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:56:57,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.36 | bwd_microstep: 3317.68 | bwd_inner_microstep: 3316.73 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.10 [2025-06-19 17:56:57,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.36 | bwd: 3317.70 | bwd_inner: 3316.73 | bwd_allreduce: 0.92 | step: 7.11 28%|██▊ | 2841/10000 [4:27:17<10:53:18, 5.48s/it] {'loss': 
0.0646, 'grad_norm': 0.5689498782157898, 'learning_rate': 3.360096732804342e-05, 'epoch': 2.84} 28%|██▊ | 2841/10000 [4:27:17<10:53:18, 5.48s/it][2025-06-19 17:57:02,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:57:02,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.42 | bwd_microstep: 3322.38 | bwd_inner_microstep: 3321.44 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.21 [2025-06-19 17:57:02,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.42 | bwd: 3322.39 | bwd_inner: 3321.44 | bwd_allreduce: 0.91 | step: 7.22 28%|██▊ | 2842/10000 [4:27:23<10:52:53, 5.47s/it] {'loss': 0.1378, 'grad_norm': 1.3595596551895142, 'learning_rate': 3.359621751612207e-05, 'epoch': 2.84} 28%|██▊ | 2842/10000 [4:27:23<10:52:53, 5.47s/it][2025-06-19 17:57:08,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:57:08,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.95 | bwd_microstep: 3316.00 | bwd_inner_microstep: 3315.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.03 [2025-06-19 17:57:08,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.95 | bwd: 3316.02 | bwd_inner: 3315.21 | bwd_allreduce: 0.77 | step: 7.03 28%|██▊ | 2843/10000 [4:27:28<10:52:21, 5.47s/it] {'loss': 0.0539, 'grad_norm': 0.6897534132003784, 'learning_rate': 3.35914662780206e-05, 'epoch': 2.84} 28%|██▊ | 2843/10000 [4:27:28<10:52:21, 5.47s/it][2025-06-19 17:57:13,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:57:13,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.95 | bwd_microstep: 3315.70 | bwd_inner_microstep: 3314.92 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.53 
[2025-06-19 17:57:13,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.95 | bwd: 3315.71 | bwd_inner: 3314.92 | bwd_allreduce: 0.75 | step: 6.53 28%|██▊ | 2844/10000 [4:27:34<10:52:11, 5.47s/it] {'loss': 0.0768, 'grad_norm': 0.8717577457427979, 'learning_rate': 3.358671361423739e-05, 'epoch': 2.84} 28%|██▊ | 2844/10000 [4:27:34<10:52:11, 5.47s/it][2025-06-19 17:57:18,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:57:18,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.75 | bwd_microstep: 3314.64 | bwd_inner_microstep: 3313.69 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.41 [2025-06-19 17:57:18,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.75 | bwd: 3314.66 | bwd_inner: 3313.69 | bwd_allreduce: 0.93 | step: 7.41 28%|██▊ | 2845/10000 [4:27:39<10:51:41, 5.46s/it] {'loss': 0.1177, 'grad_norm': 1.4378886222839355, 'learning_rate': 3.3581959525270995e-05, 'epoch': 2.84} 28%|██▊ | 2845/10000 [4:27:39<10:51:41, 5.46s/it][2025-06-19 17:57:24,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:57:24,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.23 | bwd_microstep: 3314.34 | bwd_inner_microstep: 3313.36 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.08 [2025-06-19 17:57:24,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.23 | bwd: 3314.36 | bwd_inner: 3313.36 | bwd_allreduce: 0.94 | step: 7.08 28%|██▊ | 2846/10000 [4:27:45<10:51:31, 5.46s/it] {'loss': 0.1136, 'grad_norm': 1.0355253219604492, 'learning_rate': 3.357720401162007e-05, 'epoch': 2.85} 28%|██▊ | 2846/10000 [4:27:45<10:51:31, 5.46s/it][2025-06-19 17:57:29,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 
2.72 [2025-06-19 17:57:29,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.98 | bwd_microstep: 3365.21 | bwd_inner_microstep: 3364.32 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.11 [2025-06-19 17:57:29,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.98 | bwd: 3365.23 | bwd_inner: 3364.32 | bwd_allreduce: 0.84 | step: 7.11 28%|██▊ | 2847/10000 [4:27:50<10:53:56, 5.49s/it] {'loss': 0.0416, 'grad_norm': 0.5321719646453857, 'learning_rate': 3.357244707378346e-05, 'epoch': 2.85} 28%|██▊ | 2847/10000 [4:27:50<10:53:56, 5.49s/it][2025-06-19 17:57:35,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:57:35,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.63 | bwd_microstep: 3314.82 | bwd_inner_microstep: 3313.83 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.70 [2025-06-19 17:57:35,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.64 | bwd: 3314.83 | bwd_inner: 3313.83 | bwd_allreduce: 0.95 | step: 7.70 28%|██▊ | 2848/10000 [4:27:56<10:53:03, 5.48s/it] {'loss': 0.0716, 'grad_norm': 0.8087490797042847, 'learning_rate': 3.356768871226015e-05, 'epoch': 2.85} 28%|██▊ | 2848/10000 [4:27:56<10:53:03, 5.48s/it][2025-06-19 17:57:40,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:57:40,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.11 | bwd_microstep: 3316.42 | bwd_inner_microstep: 3315.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 17:57:40,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.11 | bwd: 3316.44 | bwd_inner: 3315.63 | bwd_allreduce: 0.76 | step: 6.74 28%|██▊ | 2849/10000 [4:28:01<10:52:26, 5.47s/it] {'loss': 0.1104, 'grad_norm': 1.8354004621505737, 'learning_rate': 
3.3562928927549257e-05, 'epoch': 2.85} 28%|██▊ | 2849/10000 [4:28:01<10:52:26, 5.47s/it][2025-06-19 17:57:46,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:57:46,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.78 | bwd_microstep: 3325.63 | bwd_inner_microstep: 3324.68 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.22 [2025-06-19 17:57:46,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.78 | bwd: 3325.65 | bwd_inner: 3324.68 | bwd_allreduce: 0.92 | step: 7.23 28%|██▊ | 2850/10000 [4:28:07<10:52:29, 5.48s/it] {'loss': 0.0944, 'grad_norm': 1.8194440603256226, 'learning_rate': 3.3558167720150064e-05, 'epoch': 2.85} 28%|██▊ | 2850/10000 [4:28:07<10:52:29, 5.48s/it][2025-06-19 17:57:51,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 17:57:51,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.14 | bwd_microstep: 3359.90 | bwd_inner_microstep: 3358.95 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.07 [2025-06-19 17:57:51,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.14 | bwd: 3359.92 | bwd_inner: 3358.95 | bwd_allreduce: 0.92 | step: 7.08 29%|██▊ | 2851/10000 [4:28:12<10:54:22, 5.49s/it] {'loss': 0.0707, 'grad_norm': 0.7158206701278687, 'learning_rate': 3.355340509056201e-05, 'epoch': 2.85} 29%|██▊ | 2851/10000 [4:28:12<10:54:22, 5.49s/it][2025-06-19 17:57:57,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:57:57,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.22 | bwd_microstep: 3314.37 | bwd_inner_microstep: 3313.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 17:57:57,327] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2107.22 | bwd: 3314.38 | bwd_inner: 3313.58 | bwd_allreduce: 0.76 | step: 6.69 29%|██▊ | 2852/10000 [4:28:18<10:53:10, 5.48s/it] {'loss': 0.0618, 'grad_norm': 1.504309058189392, 'learning_rate': 3.354864103928466e-05, 'epoch': 2.85} 29%|██▊ | 2852/10000 [4:28:18<10:53:10, 5.48s/it][2025-06-19 17:58:02,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:58:02,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.98 | bwd_microstep: 3312.27 | bwd_inner_microstep: 3311.42 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.91 [2025-06-19 17:58:02,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.98 | bwd: 3312.29 | bwd_inner: 3311.42 | bwd_allreduce: 0.81 | step: 6.92 29%|██▊ | 2853/10000 [4:28:23<10:52:09, 5.47s/it] {'loss': 0.0504, 'grad_norm': 0.6842585206031799, 'learning_rate': 3.3543875566817746e-05, 'epoch': 2.85} 29%|██▊ | 2853/10000 [4:28:23<10:52:09, 5.47s/it][2025-06-19 17:58:08,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:58:08,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.06 | bwd_microstep: 3316.04 | bwd_inner_microstep: 3315.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 17:58:08,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.06 | bwd: 3316.05 | bwd_inner: 3315.24 | bwd_allreduce: 0.78 | step: 7.06 29%|██▊ | 2854/10000 [4:28:29<10:51:26, 5.47s/it] {'loss': 0.2337, 'grad_norm': 1.8310211896896362, 'learning_rate': 3.353910867366115e-05, 'epoch': 2.85} 29%|██▊ | 2854/10000 [4:28:29<10:51:26, 5.47s/it][2025-06-19 17:58:13,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 17:58:13,709] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.87 | bwd_microstep: 3315.81 | bwd_inner_microstep: 3314.88 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.13 [2025-06-19 17:58:13,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.87 | bwd: 3315.83 | bwd_inner: 3314.88 | bwd_allreduce: 0.90 | step: 7.13 29%|██▊ | 2855/10000 [4:28:34<10:51:15, 5.47s/it] {'loss': 0.1069, 'grad_norm': 1.1596863269805908, 'learning_rate': 3.3534340360314884e-05, 'epoch': 2.85} 29%|██▊ | 2855/10000 [4:28:34<10:51:15, 5.47s/it][2025-06-19 17:58:19,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 17:58:19,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.05 | bwd_microstep: 3374.98 | bwd_inner_microstep: 3374.12 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.87 [2025-06-19 17:58:19,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.05 | bwd: 3374.99 | bwd_inner: 3374.12 | bwd_allreduce: 0.83 | step: 6.87 29%|██▊ | 2856/10000 [4:28:40<10:53:50, 5.49s/it] {'loss': 0.0832, 'grad_norm': 1.4499366283416748, 'learning_rate': 3.3529570627279136e-05, 'epoch': 2.86} 29%|██▊ | 2856/10000 [4:28:40<10:53:50, 5.49s/it][2025-06-19 17:58:24,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:58:24,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.11 | bwd_microstep: 3320.60 | bwd_inner_microstep: 3319.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 17:58:24,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.11 | bwd: 3320.62 | bwd_inner: 3319.82 | bwd_allreduce: 0.76 | step: 6.60 29%|██▊ | 2857/10000 [4:28:45<10:52:58, 5.48s/it] {'loss': 0.0824, 'grad_norm': 0.93221116065979, 'learning_rate': 3.352479947505422e-05, 'epoch': 2.86} 29%|██▊ | 2857/10000 
[4:28:45<10:52:58, 5.48s/it][2025-06-19 17:58:30,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:58:30,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.84 | bwd_microstep: 3368.22 | bwd_inner_microstep: 3367.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 17:58:30,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.84 | bwd: 3368.23 | bwd_inner: 3367.43 | bwd_allreduce: 0.76 | step: 6.59 29%|██▊ | 2858/10000 [4:28:51<10:54:21, 5.50s/it] {'loss': 0.0679, 'grad_norm': 0.8143379092216492, 'learning_rate': 3.352002690414061e-05, 'epoch': 2.86} 29%|██▊ | 2858/10000 [4:28:51<10:54:21, 5.50s/it][2025-06-19 17:58:35,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:58:35,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.41 | bwd_microstep: 3314.35 | bwd_inner_microstep: 3313.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 17:58:35,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.41 | bwd: 3314.37 | bwd_inner: 3313.57 | bwd_allreduce: 0.76 | step: 6.62 29%|██▊ | 2859/10000 [4:28:56<10:52:45, 5.48s/it] {'loss': 0.0437, 'grad_norm': 0.8200605511665344, 'learning_rate': 3.351525291503892e-05, 'epoch': 2.86} 29%|██▊ | 2859/10000 [4:28:56<10:52:45, 5.48s/it][2025-06-19 17:58:41,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:58:41,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.22 | bwd_microstep: 3377.18 | bwd_inner_microstep: 3376.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 17:58:41,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.22 | bwd: 3377.20 | bwd_inner: 
3376.38 | bwd_allreduce: 0.77 | step: 6.83 29%|██▊ | 2860/10000 [4:29:02<10:54:43, 5.50s/it] {'loss': 0.131, 'grad_norm': 2.26263427734375, 'learning_rate': 3.351047750824993e-05, 'epoch': 2.86} 29%|██▊ | 2860/10000 [4:29:02<10:54:43, 5.50s/it][2025-06-19 17:58:46,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:58:46,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.35 | bwd_microstep: 3317.82 | bwd_inner_microstep: 3317.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 17:58:46,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.35 | bwd: 3317.83 | bwd_inner: 3317.01 | bwd_allreduce: 0.78 | step: 7.07 29%|██▊ | 2861/10000 [4:29:07<10:53:22, 5.49s/it] {'loss': 0.0568, 'grad_norm': 0.9529992938041687, 'learning_rate': 3.350570068427456e-05, 'epoch': 2.86} 29%|██▊ | 2861/10000 [4:29:07<10:53:22, 5.49s/it][2025-06-19 17:58:52,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 17:58:52,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.26 | bwd_microstep: 3314.78 | bwd_inner_microstep: 3313.95 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.86 [2025-06-19 17:58:52,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.26 | bwd: 3314.79 | bwd_inner: 3313.95 | bwd_allreduce: 0.79 | step: 6.86 29%|██▊ | 2862/10000 [4:29:12<10:52:21, 5.48s/it] {'loss': 0.0707, 'grad_norm': 1.148967981338501, 'learning_rate': 3.350092244361386e-05, 'epoch': 2.86} 29%|██▊ | 2862/10000 [4:29:12<10:52:21, 5.48s/it][2025-06-19 17:58:57,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.95 [2025-06-19 17:58:57,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.37 | bwd_microstep: 
3324.21 | bwd_inner_microstep: 3323.12 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.22 [2025-06-19 17:58:57,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.37 | bwd: 3324.23 | bwd_inner: 3323.12 | bwd_allreduce: 1.07 | step: 7.23 29%|██▊ | 2863/10000 [4:29:18<10:51:32, 5.48s/it] {'loss': 0.1033, 'grad_norm': 1.0559228658676147, 'learning_rate': 3.3496142786769054e-05, 'epoch': 2.86} 29%|██▊ | 2863/10000 [4:29:18<10:51:32, 5.48s/it][2025-06-19 17:59:03,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 17:59:03,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.14 | bwd_microstep: 3321.25 | bwd_inner_microstep: 3320.41 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.81 [2025-06-19 17:59:03,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.14 | bwd: 3321.26 | bwd_inner: 3320.41 | bwd_allreduce: 0.81 | step: 6.81 29%|██▊ | 2864/10000 [4:29:23<10:51:00, 5.47s/it] {'loss': 0.0718, 'grad_norm': 1.0100892782211304, 'learning_rate': 3.3491361714241514e-05, 'epoch': 2.86} 29%|██▊ | 2864/10000 [4:29:23<10:51:00, 5.47s/it][2025-06-19 17:59:08,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:59:08,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.54 | bwd_microstep: 3326.33 | bwd_inner_microstep: 3325.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 17:59:08,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.54 | bwd: 3326.34 | bwd_inner: 3325.55 | bwd_allreduce: 0.75 | step: 6.57 29%|██▊ | 2865/10000 [4:29:29<10:50:56, 5.47s/it] {'loss': 0.0985, 'grad_norm': 0.7712397575378418, 'learning_rate': 3.348657922653274e-05, 'epoch': 2.87} 29%|██▊ | 2865/10000 [4:29:29<10:50:56, 5.47s/it][2025-06-19 17:59:14,119] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 17:59:14,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.71 | bwd_microstep: 3369.97 | bwd_inner_microstep: 3369.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 17:59:14,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.71 | bwd: 3369.98 | bwd_inner: 3369.18 | bwd_allreduce: 0.75 | step: 6.61 29%|██▊ | 2866/10000 [4:29:34<10:53:08, 5.49s/it] {'loss': 0.078, 'grad_norm': 1.2051405906677246, 'learning_rate': 3.3481795324144406e-05, 'epoch': 2.87} 29%|██▊ | 2866/10000 [4:29:34<10:53:08, 5.49s/it][2025-06-19 17:59:19,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 17:59:19,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.05 | bwd_microstep: 3324.39 | bwd_inner_microstep: 3323.29 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.26 [2025-06-19 17:59:19,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.05 | bwd: 3324.42 | bwd_inner: 3323.29 | bwd_allreduce: 1.05 | step: 7.27 29%|██▊ | 2867/10000 [4:29:40<10:52:28, 5.49s/it] {'loss': 0.0599, 'grad_norm': 0.6395843625068665, 'learning_rate': 3.3477010007578315e-05, 'epoch': 2.87} 29%|██▊ | 2867/10000 [4:29:40<10:52:28, 5.49s/it][2025-06-19 17:59:25,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:59:25,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.34 | bwd_microstep: 3375.70 | bwd_inner_microstep: 3374.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 17:59:25,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.34 | bwd: 3375.71 | bwd_inner: 3374.90 | bwd_allreduce: 0.76 | step: 6.82 29%|██▊ | 
2868/10000 [4:29:45<10:54:57, 5.51s/it] {'loss': 0.1557, 'grad_norm': 1.5187076330184937, 'learning_rate': 3.3472223277336406e-05, 'epoch': 2.87} 29%|██▊ | 2868/10000 [4:29:45<10:54:57, 5.51s/it][2025-06-19 17:59:30,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 17:59:30,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.35 | bwd_microstep: 3326.90 | bwd_inner_microstep: 3326.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.84 [2025-06-19 17:59:30,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.35 | bwd: 3326.91 | bwd_inner: 3326.11 | bwd_allreduce: 0.76 | step: 6.85 29%|██▊ | 2869/10000 [4:29:51<10:53:50, 5.50s/it] {'loss': 0.0413, 'grad_norm': 0.5509110689163208, 'learning_rate': 3.346743513392082e-05, 'epoch': 2.87} 29%|██▊ | 2869/10000 [4:29:51<10:53:50, 5.50s/it][2025-06-19 17:59:36,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 17:59:36,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.60 | bwd_microstep: 3380.36 | bwd_inner_microstep: 3379.45 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.23 [2025-06-19 17:59:36,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.60 | bwd: 3380.37 | bwd_inner: 3379.45 | bwd_allreduce: 0.88 | step: 7.24 29%|██▊ | 2870/10000 [4:29:56<10:55:44, 5.52s/it] {'loss': 0.0468, 'grad_norm': 0.6435752511024475, 'learning_rate': 3.3462645577833785e-05, 'epoch': 2.87} 29%|██▊ | 2870/10000 [4:29:56<10:55:44, 5.52s/it][2025-06-19 17:59:41,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:59:41,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.71 | bwd_microstep: 3376.31 | bwd_inner_microstep: 3375.53 | 
bwd_allreduce_microstep: 0.73 | step_microstep: 6.55 [2025-06-19 17:59:41,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.71 | bwd: 3376.32 | bwd_inner: 3375.53 | bwd_allreduce: 0.75 | step: 6.56 29%|██▊ | 2871/10000 [4:30:02<10:56:40, 5.53s/it] {'loss': 0.0368, 'grad_norm': 0.5756291747093201, 'learning_rate': 3.345785460957771e-05, 'epoch': 2.87} 29%|██▊ | 2871/10000 [4:30:02<10:56:40, 5.53s/it][2025-06-19 17:59:47,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 17:59:47,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.85 | bwd_microstep: 3373.57 | bwd_inner_microstep: 3372.58 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.46 [2025-06-19 17:59:47,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.85 | bwd: 3373.59 | bwd_inner: 3372.58 | bwd_allreduce: 0.97 | step: 7.46 29%|██▊ | 2872/10000 [4:30:08<10:57:18, 5.53s/it] {'loss': 0.0798, 'grad_norm': 0.773941159248352, 'learning_rate': 3.3453062229655144e-05, 'epoch': 2.87} 29%|██▊ | 2872/10000 [4:30:08<10:57:18, 5.53s/it][2025-06-19 17:59:52,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 17:59:52,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.80 | bwd_microstep: 3331.77 | bwd_inner_microstep: 3330.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 17:59:52,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.80 | bwd: 3331.78 | bwd_inner: 3330.99 | bwd_allreduce: 0.75 | step: 6.65 29%|██▊ | 2873/10000 [4:30:13<10:55:30, 5.52s/it] {'loss': 0.1593, 'grad_norm': 1.5281177759170532, 'learning_rate': 3.344826843856879e-05, 'epoch': 2.87} 29%|██▊ | 2873/10000 [4:30:13<10:55:30, 5.52s/it][2025-06-19 17:59:58,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 17:59:58,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.41 | bwd_microstep: 3329.46 | bwd_inner_microstep: 3328.66 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-19 17:59:58,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.41 | bwd: 3329.47 | bwd_inner: 3328.66 | bwd_allreduce: 0.77 | step: 6.85 29%|██▊ | 2874/10000 [4:30:19<10:53:57, 5.51s/it] {'loss': 0.1669, 'grad_norm': 1.1681859493255615, 'learning_rate': 3.34434732368215e-05, 'epoch': 2.87} 29%|██▊ | 2874/10000 [4:30:19<10:53:57, 5.51s/it][2025-06-19 18:00:03,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:00:03,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.40 | bwd_microstep: 3328.71 | bwd_inner_microstep: 3327.86 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.20 [2025-06-19 18:00:03,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.40 | bwd: 3328.73 | bwd_inner: 3327.86 | bwd_allreduce: 0.81 | step: 7.20 29%|██▉ | 2875/10000 [4:30:24<10:53:02, 5.50s/it] {'loss': 0.1578, 'grad_norm': 1.5187526941299438, 'learning_rate': 3.3438676624916246e-05, 'epoch': 2.88} 29%|██▉ | 2875/10000 [4:30:24<10:53:02, 5.50s/it][2025-06-19 18:00:09,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:00:09,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.50 | bwd_microstep: 3330.47 | bwd_inner_microstep: 3329.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 18:00:09,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.50 | bwd: 3330.49 | bwd_inner: 3329.67 | bwd_allreduce: 0.78 | step: 7.13 29%|██▉ | 2876/10000 [4:30:30<10:52:31, 5.50s/it] {'loss': 0.0775, 'grad_norm': 
1.0149104595184326, 'learning_rate': 3.3433878603356197e-05, 'epoch': 2.88} 29%|██▉ | 2876/10000 [4:30:30<10:52:31, 5.50s/it][2025-06-19 18:00:14,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 18:00:14,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.29 | bwd_microstep: 3319.38 | bwd_inner_microstep: 3318.27 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.80 [2025-06-19 18:00:14,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.29 | bwd: 3319.39 | bwd_inner: 3318.27 | bwd_allreduce: 1.07 | step: 7.80 29%|██▉ | 2877/10000 [4:30:35<10:51:21, 5.49s/it] {'loss': 0.0603, 'grad_norm': 0.9908583164215088, 'learning_rate': 3.342907917264463e-05, 'epoch': 2.88} 29%|██▉ | 2877/10000 [4:30:35<10:51:21, 5.49s/it][2025-06-19 18:00:20,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:00:20,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.81 | bwd_microstep: 3324.29 | bwd_inner_microstep: 3323.30 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.20 [2025-06-19 18:00:20,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.81 | bwd: 3324.31 | bwd_inner: 3323.30 | bwd_allreduce: 0.96 | step: 7.20 29%|██▉ | 2878/10000 [4:30:40<10:50:51, 5.48s/it] {'loss': 0.079, 'grad_norm': 1.3855036497116089, 'learning_rate': 3.342427833328498e-05, 'epoch': 2.88} 29%|██▉ | 2878/10000 [4:30:40<10:50:51, 5.48s/it][2025-06-19 18:00:25,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.91 [2025-06-19 18:00:25,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.15 | bwd_microstep: 3376.17 | bwd_inner_microstep: 3375.34 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.84 [2025-06-19 18:00:25,723] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.15 | bwd: 3376.18 | bwd_inner: 3375.34 | bwd_allreduce: 0.80 | step: 7.85 29%|██▉ | 2879/10000 [4:30:46<10:53:33, 5.51s/it] {'loss': 0.0823, 'grad_norm': 0.7091799974441528, 'learning_rate': 3.341947608578084e-05, 'epoch': 2.88} 29%|██▉ | 2879/10000 [4:30:46<10:53:33, 5.51s/it][2025-06-19 18:00:31,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:00:31,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.97 | bwd_microstep: 3336.22 | bwd_inner_microstep: 3335.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 18:00:31,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.97 | bwd: 3336.23 | bwd_inner: 3335.43 | bwd_allreduce: 0.76 | step: 6.72 29%|██▉ | 2880/10000 [4:30:52<10:52:59, 5.50s/it] {'loss': 0.1098, 'grad_norm': 1.8644660711288452, 'learning_rate': 3.3414672430635935e-05, 'epoch': 2.88} 29%|██▉ | 2880/10000 [4:30:52<10:52:59, 5.50s/it][2025-06-19 18:00:36,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:00:36,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.31 | bwd_microstep: 3329.76 | bwd_inner_microstep: 3328.69 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.77 [2025-06-19 18:00:36,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.31 | bwd: 3329.78 | bwd_inner: 3328.69 | bwd_allreduce: 1.03 | step: 7.77 29%|██▉ | 2881/10000 [4:30:57<10:52:16, 5.50s/it] {'loss': 0.0895, 'grad_norm': 1.068560004234314, 'learning_rate': 3.3409867368354154e-05, 'epoch': 2.88} 29%|██▉ | 2881/10000 [4:30:57<10:52:16, 5.50s/it][2025-06-19 18:00:42,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
18:00:42,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.70 | bwd_microstep: 3384.74 | bwd_inner_microstep: 3383.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-19 18:00:42,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.70 | bwd: 3384.76 | bwd_inner: 3383.94 | bwd_allreduce: 0.77 | step: 6.92 29%|██▉ | 2882/10000 [4:31:03<10:54:23, 5.52s/it] {'loss': 0.1006, 'grad_norm': 1.190238356590271, 'learning_rate': 3.3405060899439527e-05, 'epoch': 2.88} 29%|██▉ | 2882/10000 [4:31:03<10:54:23, 5.52s/it][2025-06-19 18:00:47,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:00:47,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.32 | bwd_microstep: 3322.46 | bwd_inner_microstep: 3321.56 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.22 [2025-06-19 18:00:47,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.32 | bwd: 3322.49 | bwd_inner: 3321.56 | bwd_allreduce: 0.87 | step: 7.22 29%|██▉ | 2883/10000 [4:31:08<10:53:17, 5.51s/it] {'loss': 0.0628, 'grad_norm': 0.6716764569282532, 'learning_rate': 3.340025302439622e-05, 'epoch': 2.88} 29%|██▉ | 2883/10000 [4:31:08<10:53:17, 5.51s/it][2025-06-19 18:00:53,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:00:53,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.13 | bwd_microstep: 3376.29 | bwd_inner_microstep: 3375.35 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.51 [2025-06-19 18:00:53,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.13 | bwd: 3376.30 | bwd_inner: 3375.35 | bwd_allreduce: 0.91 | step: 7.51 29%|██▉ | 2884/10000 [4:31:14<10:55:09, 5.52s/it] {'loss': 0.0508, 'grad_norm': 0.7623066902160645, 'learning_rate': 3.339544374372857e-05, 'epoch': 2.88} 
29%|██▉ | 2884/10000 [4:31:14<10:55:09, 5.52s/it][2025-06-19 18:00:58,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:00:58,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.21 | bwd_microstep: 3383.45 | bwd_inner_microstep: 3382.63 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.92 [2025-06-19 18:00:58,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.21 | bwd: 3383.47 | bwd_inner: 3382.63 | bwd_allreduce: 0.79 | step: 6.92 29%|██▉ | 2885/10000 [4:31:19<10:56:46, 5.54s/it] {'loss': 0.1677, 'grad_norm': 1.4962033033370972, 'learning_rate': 3.339063305794103e-05, 'epoch': 2.88} 29%|██▉ | 2885/10000 [4:31:19<10:56:46, 5.54s/it][2025-06-19 18:01:04,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:01:04,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.74 | bwd_microstep: 3329.94 | bwd_inner_microstep: 3328.85 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.81 [2025-06-19 18:01:04,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.74 | bwd: 3329.96 | bwd_inner: 3328.85 | bwd_allreduce: 1.06 | step: 7.81 29%|██▉ | 2886/10000 [4:31:25<10:54:44, 5.52s/it] {'loss': 0.0603, 'grad_norm': 1.0602161884307861, 'learning_rate': 3.338582096753825e-05, 'epoch': 2.89} 29%|██▉ | 2886/10000 [4:31:25<10:54:44, 5.52s/it][2025-06-19 18:01:09,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:01:09,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.25 | bwd_microstep: 3330.85 | bwd_inner_microstep: 3330.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 18:01:09,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.25 | bwd: 
3330.87 | bwd_inner: 3330.06 | bwd_allreduce: 0.77 | step: 6.75 29%|██▉ | 2887/10000 [4:31:30<10:53:14, 5.51s/it] {'loss': 0.0815, 'grad_norm': 1.287452220916748, 'learning_rate': 3.338100747302496e-05, 'epoch': 2.89} 29%|██▉ | 2887/10000 [4:31:30<10:53:14, 5.51s/it][2025-06-19 18:01:15,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:01:15,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.92 | bwd_microstep: 3334.27 | bwd_inner_microstep: 3333.42 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.00 [2025-06-19 18:01:15,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.92 | bwd: 3334.28 | bwd_inner: 3333.42 | bwd_allreduce: 0.82 | step: 7.00 29%|██▉ | 2888/10000 [4:31:36<10:52:16, 5.50s/it] {'loss': 0.0663, 'grad_norm': 0.8352635502815247, 'learning_rate': 3.3376192574906095e-05, 'epoch': 2.89} 29%|██▉ | 2888/10000 [4:31:36<10:52:16, 5.50s/it][2025-06-19 18:01:20,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:01:20,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.71 | bwd_microstep: 3343.73 | bwd_inner_microstep: 3342.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-19 18:01:20,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.71 | bwd: 3343.74 | bwd_inner: 3342.94 | bwd_allreduce: 0.76 | step: 6.86 29%|██▉ | 2889/10000 [4:31:41<10:52:22, 5.50s/it] {'loss': 0.0547, 'grad_norm': 0.47571125626564026, 'learning_rate': 3.337137627368671e-05, 'epoch': 2.89} 29%|██▉ | 2889/10000 [4:31:41<10:52:22, 5.50s/it][2025-06-19 18:01:26,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:01:26,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2114.69 | bwd_microstep: 3326.16 | bwd_inner_microstep: 3325.32 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.34 [2025-06-19 18:01:26,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.69 | bwd: 3326.18 | bwd_inner: 3325.32 | bwd_allreduce: 0.80 | step: 7.34 29%|██▉ | 2890/10000 [4:31:47<10:51:28, 5.50s/it] {'loss': 0.0386, 'grad_norm': 0.5673298239707947, 'learning_rate': 3.336655856987201e-05, 'epoch': 2.89} 29%|██▉ | 2890/10000 [4:31:47<10:51:28, 5.50s/it][2025-06-19 18:01:31,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:01:31,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.55 | bwd_microstep: 3327.73 | bwd_inner_microstep: 3326.91 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.15 [2025-06-19 18:01:31,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.55 | bwd: 3327.74 | bwd_inner: 3326.91 | bwd_allreduce: 0.79 | step: 7.16 29%|██▉ | 2891/10000 [4:31:52<10:50:30, 5.49s/it] {'loss': 0.1153, 'grad_norm': 1.2515524625778198, 'learning_rate': 3.336173946396735e-05, 'epoch': 2.89} 29%|██▉ | 2891/10000 [4:31:52<10:50:30, 5.49s/it][2025-06-19 18:01:37,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:01:37,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.24 | bwd_microstep: 3374.71 | bwd_inner_microstep: 3373.76 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.18 [2025-06-19 18:01:37,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.24 | bwd: 3374.73 | bwd_inner: 3373.76 | bwd_allreduce: 0.92 | step: 7.19 29%|██▉ | 2892/10000 [4:31:58<10:52:34, 5.51s/it] {'loss': 0.0442, 'grad_norm': 0.7627000212669373, 'learning_rate': 3.335691895647823e-05, 'epoch': 2.89} 29%|██▉ | 2892/10000 [4:31:58<10:52:34, 5.51s/it][2025-06-19 18:01:42,856] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:01:42,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.38 | bwd_microstep: 3339.85 | bwd_inner_microstep: 3338.94 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.79 [2025-06-19 18:01:42,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.38 | bwd: 3339.87 | bwd_inner: 3338.94 | bwd_allreduce: 0.88 | step: 7.79 29%|██▉ | 2893/10000 [4:32:03<10:52:30, 5.51s/it] {'loss': 0.0804, 'grad_norm': 0.9888818264007568, 'learning_rate': 3.335209704791031e-05, 'epoch': 2.89} 29%|██▉ | 2893/10000 [4:32:03<10:52:30, 5.51s/it][2025-06-19 18:01:48,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:01:48,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.83 | bwd_microstep: 3376.26 | bwd_inner_microstep: 3375.47 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-19 18:01:48,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.83 | bwd: 3376.28 | bwd_inner: 3375.47 | bwd_allreduce: 0.77 | step: 6.84 29%|██▉ | 2894/10000 [4:32:09<10:54:12, 5.52s/it] {'loss': 0.0767, 'grad_norm': 1.0340653657913208, 'learning_rate': 3.334727373876938e-05, 'epoch': 2.89} 29%|██▉ | 2894/10000 [4:32:09<10:54:12, 5.52s/it][2025-06-19 18:01:53,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:01:53,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.93 | bwd_microstep: 3375.50 | bwd_inner_microstep: 3374.57 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.03 [2025-06-19 18:01:53,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.93 | bwd: 3375.51 | bwd_inner: 3374.57 | bwd_allreduce: 0.89 | step: 7.03 29%|██▉ | 
2895/10000 [4:32:14<10:54:56, 5.53s/it] {'loss': 0.0959, 'grad_norm': 1.012245774269104, 'learning_rate': 3.3342449029561384e-05, 'epoch': 2.9} 29%|██▉ | 2895/10000 [4:32:14<10:54:56, 5.53s/it][2025-06-19 18:01:59,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:01:59,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.05 | bwd_microstep: 3324.31 | bwd_inner_microstep: 3323.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 18:01:59,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.05 | bwd: 3324.32 | bwd_inner: 3323.50 | bwd_allreduce: 0.77 | step: 7.04 29%|██▉ | 2896/10000 [4:32:20<10:52:44, 5.51s/it] {'loss': 0.0807, 'grad_norm': 1.204182744026184, 'learning_rate': 3.3337622920792404e-05, 'epoch': 2.9} 29%|██▉ | 2896/10000 [4:32:20<10:52:44, 5.51s/it][2025-06-19 18:02:04,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:02:04,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.10 | bwd_microstep: 3381.25 | bwd_inner_microstep: 3380.37 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.92 [2025-06-19 18:02:04,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.10 | bwd: 3381.27 | bwd_inner: 3380.37 | bwd_allreduce: 0.86 | step: 6.92 29%|██▉ | 2897/10000 [4:32:25<10:54:02, 5.52s/it] {'loss': 0.0626, 'grad_norm': 0.7899775505065918, 'learning_rate': 3.3332795412968684e-05, 'epoch': 2.9} 29%|██▉ | 2897/10000 [4:32:25<10:54:02, 5.52s/it][2025-06-19 18:02:10,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:02:10,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.81 | bwd_microstep: 3384.46 | bwd_inner_microstep: 3383.65 | 
bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 18:02:10,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.81 | bwd: 3384.47 | bwd_inner: 3383.65 | bwd_allreduce: 0.78 | step: 7.13 29%|██▉ | 2898/10000 [4:32:31<10:55:15, 5.54s/it] {'loss': 0.0435, 'grad_norm': 0.903350293636322, 'learning_rate': 3.33279665065966e-05, 'epoch': 2.9} 29%|██▉ | 2898/10000 [4:32:31<10:55:15, 5.54s/it][2025-06-19 18:02:16,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:02:16,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.25 | bwd_microstep: 3379.26 | bwd_inner_microstep: 3378.46 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.69 [2025-06-19 18:02:16,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.25 | bwd: 3379.27 | bwd_inner: 3378.46 | bwd_allreduce: 0.77 | step: 6.69 29%|██▉ | 2899/10000 [4:32:36<10:55:47, 5.54s/it] {'loss': 0.0989, 'grad_norm': 2.2064712047576904, 'learning_rate': 3.332313620218269e-05, 'epoch': 2.9} 29%|██▉ | 2899/10000 [4:32:36<10:55:47, 5.54s/it][2025-06-19 18:02:21,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:02:21,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.92 | bwd_microstep: 3343.55 | bwd_inner_microstep: 3342.74 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 18:02:21,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.92 | bwd: 3343.56 | bwd_inner: 3342.74 | bwd_allreduce: 0.78 | step: 7.06 29%|██▉ | 2900/10000 [4:32:42<10:54:12, 5.53s/it] {'loss': 0.0653, 'grad_norm': 1.0380902290344238, 'learning_rate': 3.331830450023362e-05, 'epoch': 2.9} 29%|██▉ | 2900/10000 [4:32:42<10:54:12, 5.53s/it][2025-06-19 18:02:27,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:02:27,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.76 | bwd_microstep: 3324.51 | bwd_inner_microstep: 3323.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 18:02:27,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.76 | bwd: 3324.52 | bwd_inner: 3323.72 | bwd_allreduce: 0.76 | step: 6.71 29%|██▉ | 2901/10000 [4:32:47<10:52:16, 5.51s/it] {'loss': 0.1235, 'grad_norm': 1.3192474842071533, 'learning_rate': 3.331347140125623e-05, 'epoch': 2.9} 29%|██▉ | 2901/10000 [4:32:47<10:52:16, 5.51s/it][2025-06-19 18:02:32,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:02:32,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.93 | bwd_microstep: 3330.35 | bwd_inner_microstep: 3329.45 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.88 [2025-06-19 18:02:32,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.93 | bwd: 3330.36 | bwd_inner: 3329.45 | bwd_allreduce: 0.87 | step: 6.88 29%|██▉ | 2902/10000 [4:32:53<10:51:05, 5.50s/it] {'loss': 0.0611, 'grad_norm': 1.009300708770752, 'learning_rate': 3.330863690575748e-05, 'epoch': 2.9} 29%|██▉ | 2902/10000 [4:32:53<10:51:05, 5.50s/it][2025-06-19 18:02:38,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:02:38,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.00 | bwd_microstep: 3326.63 | bwd_inner_microstep: 3325.70 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.91 [2025-06-19 18:02:38,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.00 | bwd: 3326.65 | bwd_inner: 3325.70 | bwd_allreduce: 0.90 | step: 6.91 29%|██▉ | 2903/10000 [4:32:58<10:50:07, 5.50s/it] {'loss': 0.1213, 'grad_norm': 
1.3218456506729126, 'learning_rate': 3.330380101424448e-05, 'epoch': 2.9} 29%|██▉ | 2903/10000 [4:32:58<10:50:07, 5.50s/it][2025-06-19 18:02:43,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:02:43,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.08 | bwd_microstep: 3386.93 | bwd_inner_microstep: 3386.07 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.23 [2025-06-19 18:02:43,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.08 | bwd: 3386.94 | bwd_inner: 3386.07 | bwd_allreduce: 0.82 | step: 7.23 29%|██▉ | 2904/10000 [4:33:04<10:52:33, 5.52s/it] {'loss': 0.1332, 'grad_norm': 1.6015836000442505, 'learning_rate': 3.32989637272245e-05, 'epoch': 2.9} 29%|██▉ | 2904/10000 [4:33:04<10:52:33, 5.52s/it][2025-06-19 18:02:49,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:02:49,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.70 | bwd_microstep: 3377.05 | bwd_inner_microstep: 3376.13 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.09 [2025-06-19 18:02:49,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.70 | bwd: 3377.06 | bwd_inner: 3376.13 | bwd_allreduce: 0.89 | step: 7.10 29%|██▉ | 2905/10000 [4:33:09<10:53:36, 5.53s/it] {'loss': 0.0616, 'grad_norm': 1.0301777124404907, 'learning_rate': 3.329412504520495e-05, 'epoch': 2.91} 29%|██▉ | 2905/10000 [4:33:09<10:53:36, 5.53s/it][2025-06-19 18:02:54,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:02:54,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.93 | bwd_microstep: 3322.71 | bwd_inner_microstep: 3321.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 18:02:54,635] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.93 | bwd: 3322.73 | bwd_inner: 3321.90 | bwd_allreduce: 0.78 | step: 7.00 29%|██▉ | 2906/10000 [4:33:15<10:51:40, 5.51s/it] {'loss': 0.1335, 'grad_norm': 1.3915318250656128, 'learning_rate': 3.3289284968693377e-05, 'epoch': 2.91} 29%|██▉ | 2906/10000 [4:33:15<10:51:40, 5.51s/it][2025-06-19 18:03:00,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:03:00,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.58 | bwd_microstep: 3317.27 | bwd_inner_microstep: 3316.42 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.05 [2025-06-19 18:03:00,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.58 | bwd: 3317.29 | bwd_inner: 3316.42 | bwd_allreduce: 0.81 | step: 7.05 29%|██▉ | 2907/10000 [4:33:20<10:49:52, 5.50s/it] {'loss': 0.1048, 'grad_norm': 1.8119821548461914, 'learning_rate': 3.32844434981975e-05, 'epoch': 2.91} 29%|██▉ | 2907/10000 [4:33:20<10:49:52, 5.50s/it][2025-06-19 18:03:05,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:03:05,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.35 | bwd_microstep: 3324.08 | bwd_inner_microstep: 3323.02 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.82 [2025-06-19 18:03:05,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.35 | bwd: 3324.10 | bwd_inner: 3323.02 | bwd_allreduce: 1.03 | step: 7.82 29%|██▉ | 2908/10000 [4:33:26<10:49:10, 5.49s/it] {'loss': 0.0436, 'grad_norm': 0.5305957198143005, 'learning_rate': 3.327960063422514e-05, 'epoch': 2.91} 29%|██▉ | 2908/10000 [4:33:26<10:49:10, 5.49s/it][2025-06-19 18:03:11,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 
18:03:11,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.83 | bwd_microstep: 3377.07 | bwd_inner_microstep: 3375.87 | bwd_allreduce_microstep: 1.14 | step_microstep: 8.14 [2025-06-19 18:03:11,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.83 | bwd: 3377.08 | bwd_inner: 3375.87 | bwd_allreduce: 1.16 | step: 8.15 29%|██▉ | 2909/10000 [4:33:31<10:51:31, 5.51s/it] {'loss': 0.1577, 'grad_norm': 2.015157461166382, 'learning_rate': 3.327475637728431e-05, 'epoch': 2.91} 29%|██▉ | 2909/10000 [4:33:31<10:51:31, 5.51s/it][2025-06-19 18:03:16,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:03:16,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.79 | bwd_microstep: 3368.30 | bwd_inner_microstep: 3367.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 18:03:16,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.79 | bwd: 3368.32 | bwd_inner: 3367.50 | bwd_allreduce: 0.78 | step: 6.84 29%|██▉ | 2910/10000 [4:33:37<10:52:23, 5.52s/it] {'loss': 0.1031, 'grad_norm': 0.8588460087776184, 'learning_rate': 3.3269910727883146e-05, 'epoch': 2.91} 29%|██▉ | 2910/10000 [4:33:37<10:52:23, 5.52s/it][2025-06-19 18:03:22,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:03:22,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.95 | bwd_microstep: 3371.08 | bwd_inner_microstep: 3370.24 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.74 [2025-06-19 18:03:22,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.95 | bwd: 3371.09 | bwd_inner: 3370.24 | bwd_allreduce: 0.80 | step: 6.74 29%|██▉ | 2911/10000 [4:33:43<10:52:52, 5.53s/it] {'loss': 0.0508, 'grad_norm': 0.4134731888771057, 'learning_rate': 3.326506368652994e-05, 'epoch': 2.91} 
29%|██▉ | 2911/10000 [4:33:43<10:52:52, 5.53s/it][2025-06-19 18:03:27,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.70 | optimizer_step: 2.72 [2025-06-19 18:03:27,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.50 | bwd_microstep: 3323.83 | bwd_inner_microstep: 3323.02 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.35 [2025-06-19 18:03:27,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.50 | bwd: 3323.85 | bwd_inner: 3323.02 | bwd_allreduce: 0.78 | step: 7.35 29%|██▉ | 2912/10000 [4:33:48<10:50:56, 5.51s/it] {'loss': 0.3312, 'grad_norm': 2.578463554382324, 'learning_rate': 3.3260215253733115e-05, 'epoch': 2.91} 29%|██▉ | 2912/10000 [4:33:48<10:50:56, 5.51s/it][2025-06-19 18:03:33,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:03:33,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.41 | bwd_microstep: 3371.06 | bwd_inner_microstep: 3370.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.56 [2025-06-19 18:03:33,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.41 | bwd: 3371.07 | bwd_inner: 3370.27 | bwd_allreduce: 0.76 | step: 6.56 29%|██▉ | 2913/10000 [4:33:54<10:51:59, 5.52s/it] {'loss': 0.0399, 'grad_norm': 0.45318731665611267, 'learning_rate': 3.3255365430001254e-05, 'epoch': 2.91} 29%|██▉ | 2913/10000 [4:33:54<10:51:59, 5.52s/it][2025-06-19 18:03:38,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:03:38,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.15 | bwd_microstep: 3372.03 | bwd_inner_microstep: 3371.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 18:03:38,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.15 | bwd: 
3372.04 | bwd_inner: 3371.23 | bwd_allreduce: 0.76 | step: 6.74 29%|██▉ | 2914/10000 [4:33:59<10:52:41, 5.53s/it] {'loss': 0.0658, 'grad_norm': 0.9117196798324585, 'learning_rate': 3.325051421584308e-05, 'epoch': 2.91} 29%|██▉ | 2914/10000 [4:33:59<10:52:41, 5.53s/it][2025-06-19 18:03:44,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:03:44,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.86 | bwd_microstep: 3373.29 | bwd_inner_microstep: 3372.48 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-19 18:03:44,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.86 | bwd: 3373.30 | bwd_inner: 3372.48 | bwd_allreduce: 0.78 | step: 6.77 29%|██▉ | 2915/10000 [4:34:05<10:53:10, 5.53s/it] {'loss': 0.0765, 'grad_norm': 1.1849061250686646, 'learning_rate': 3.324566161176746e-05, 'epoch': 2.92} 29%|██▉ | 2915/10000 [4:34:05<10:53:10, 5.53s/it][2025-06-19 18:03:49,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:03:49,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.65 | bwd_microstep: 3371.62 | bwd_inner_microstep: 3370.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 18:03:49,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.65 | bwd: 3371.63 | bwd_inner: 3370.81 | bwd_allreduce: 0.78 | step: 7.18 29%|██▉ | 2916/10000 [4:34:10<10:53:29, 5.53s/it] {'loss': 0.0649, 'grad_norm': 1.2856334447860718, 'learning_rate': 3.3240807618283414e-05, 'epoch': 2.92} 29%|██▉ | 2916/10000 [4:34:10<10:53:29, 5.53s/it][2025-06-19 18:03:55,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:03:55,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2105.37 | bwd_microstep: 3327.21 | bwd_inner_microstep: 3326.24 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.23 [2025-06-19 18:03:55,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.37 | bwd: 3327.23 | bwd_inner: 3326.24 | bwd_allreduce: 0.93 | step: 7.24 29%|██▉ | 2917/10000 [4:34:16<10:51:14, 5.52s/it] {'loss': 0.0463, 'grad_norm': 0.623371422290802, 'learning_rate': 3.323595223590009e-05, 'epoch': 2.92} 29%|██▉ | 2917/10000 [4:34:16<10:51:14, 5.52s/it][2025-06-19 18:04:00,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:04:00,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.20 | bwd_microstep: 3374.82 | bwd_inner_microstep: 3374.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 18:04:00,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.20 | bwd: 3374.83 | bwd_inner: 3374.01 | bwd_allreduce: 0.78 | step: 6.95 29%|██▉ | 2918/10000 [4:34:21<10:52:04, 5.52s/it] {'loss': 0.061, 'grad_norm': 0.6124993562698364, 'learning_rate': 3.323109546512682e-05, 'epoch': 2.92} 29%|██▉ | 2918/10000 [4:34:21<10:52:04, 5.52s/it][2025-06-19 18:04:06,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 18:04:06,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.33 | bwd_microstep: 3365.13 | bwd_inner_microstep: 3364.19 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.78 [2025-06-19 18:04:06,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.33 | bwd: 3365.15 | bwd_inner: 3364.19 | bwd_allreduce: 0.91 | step: 7.78 29%|██▉ | 2919/10000 [4:34:27<10:52:12, 5.53s/it] {'loss': 0.0801, 'grad_norm': 0.9140844941139221, 'learning_rate': 3.322623730647304e-05, 'epoch': 2.92} 29%|██▉ | 2919/10000 [4:34:27<10:52:12, 5.53s/it][2025-06-19 18:04:11,872] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:04:11,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.54 | bwd_microstep: 3317.08 | bwd_inner_microstep: 3316.19 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.14 [2025-06-19 18:04:11,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.54 | bwd: 3317.10 | bwd_inner: 3316.19 | bwd_allreduce: 0.86 | step: 7.15 29%|██▉ | 2920/10000 [4:34:32<10:49:56, 5.51s/it] {'loss': 0.1128, 'grad_norm': 1.919082760810852, 'learning_rate': 3.3221377760448356e-05, 'epoch': 2.92} 29%|██▉ | 2920/10000 [4:34:32<10:49:56, 5.51s/it][2025-06-19 18:04:17,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:04:17,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.33 | bwd_microstep: 3320.38 | bwd_inner_microstep: 3319.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-19 18:04:17,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.33 | bwd: 3320.40 | bwd_inner: 3319.58 | bwd_allreduce: 0.77 | step: 7.02 29%|██▉ | 2921/10000 [4:34:38<10:48:16, 5.49s/it] {'loss': 0.0905, 'grad_norm': 1.402141809463501, 'learning_rate': 3.32165168275625e-05, 'epoch': 2.92} 29%|██▉ | 2921/10000 [4:34:38<10:48:16, 5.49s/it][2025-06-19 18:04:22,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:04:22,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.07 | bwd_microstep: 3320.02 | bwd_inner_microstep: 3319.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 18:04:22,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.07 | bwd: 3320.22 | bwd_inner: 3319.22 | bwd_allreduce: 0.77 | step: 6.85 29%|██▉ | 2922/10000 
[4:34:43<10:47:14, 5.49s/it] {'loss': 0.1044, 'grad_norm': 1.143973469734192, 'learning_rate': 3.321165450832536e-05, 'epoch': 2.92} 29%|██▉ | 2922/10000 [4:34:43<10:47:14, 5.49s/it][2025-06-19 18:04:28,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:04:28,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.26 | bwd_microstep: 3316.18 | bwd_inner_microstep: 3315.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 18:04:28,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.26 | bwd: 3316.19 | bwd_inner: 3315.37 | bwd_allreduce: 0.78 | step: 7.10 29%|██▉ | 2923/10000 [4:34:49<10:46:00, 5.48s/it] {'loss': 0.1954, 'grad_norm': 1.3547252416610718, 'learning_rate': 3.3206790803246997e-05, 'epoch': 2.92} 29%|██▉ | 2923/10000 [4:34:49<10:46:00, 5.48s/it][2025-06-19 18:04:33,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:04:33,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.00 | bwd_microstep: 3323.33 | bwd_inner_microstep: 3322.52 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.73 [2025-06-19 18:04:33,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.00 | bwd: 3323.34 | bwd_inner: 3322.52 | bwd_allreduce: 0.78 | step: 6.73 29%|██▉ | 2924/10000 [4:34:54<10:45:29, 5.47s/it] {'loss': 0.0922, 'grad_norm': 0.9098964333534241, 'learning_rate': 3.3201925712837565e-05, 'epoch': 2.92} 29%|██▉ | 2924/10000 [4:34:54<10:45:29, 5.47s/it][2025-06-19 18:04:39,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:04:39,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.83 | bwd_microstep: 3398.29 | bwd_inner_microstep: 3397.49 | 
bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 18:04:39,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.83 | bwd: 3398.30 | bwd_inner: 3397.49 | bwd_allreduce: 0.77 | step: 7.05 29%|██▉ | 2925/10000 [4:35:00<10:48:59, 5.50s/it] {'loss': 0.0736, 'grad_norm': 0.9205750226974487, 'learning_rate': 3.31970592376074e-05, 'epoch': 2.92} 29%|██▉ | 2925/10000 [4:35:00<10:48:59, 5.50s/it][2025-06-19 18:04:44,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:04:44,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.34 | bwd_microstep: 3366.36 | bwd_inner_microstep: 3365.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 18:04:44,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.34 | bwd: 3366.37 | bwd_inner: 3365.58 | bwd_allreduce: 0.75 | step: 6.65 29%|██▉ | 2926/10000 [4:35:05<10:49:41, 5.51s/it] {'loss': 0.0553, 'grad_norm': 0.6095800399780273, 'learning_rate': 3.319219137806697e-05, 'epoch': 2.93} 29%|██▉ | 2926/10000 [4:35:05<10:49:41, 5.51s/it][2025-06-19 18:04:50,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:04:50,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.68 | bwd_microstep: 3363.66 | bwd_inner_microstep: 3362.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 18:04:50,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.68 | bwd: 3363.67 | bwd_inner: 3362.87 | bwd_allreduce: 0.75 | step: 6.58 29%|██▉ | 2927/10000 [4:35:11<10:50:00, 5.51s/it] {'loss': 0.0775, 'grad_norm': 1.099732518196106, 'learning_rate': 3.318732213472689e-05, 'epoch': 2.93} 29%|██▉ | 2927/10000 [4:35:11<10:50:00, 5.51s/it][2025-06-19 18:04:55,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:04:55,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.21 | bwd_microstep: 3363.31 | bwd_inner_microstep: 3362.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 18:04:55,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.21 | bwd: 3363.32 | bwd_inner: 3362.51 | bwd_allreduce: 0.77 | step: 6.86 29%|██▉ | 2928/10000 [4:35:16<10:50:27, 5.52s/it] {'loss': 0.0669, 'grad_norm': 1.0308266878128052, 'learning_rate': 3.3182451508097927e-05, 'epoch': 2.93} 29%|██▉ | 2928/10000 [4:35:16<10:50:27, 5.52s/it][2025-06-19 18:05:01,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:05:01,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.59 | bwd_microstep: 3314.06 | bwd_inner_microstep: 3313.25 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.93 [2025-06-19 18:05:01,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.59 | bwd: 3314.07 | bwd_inner: 3313.25 | bwd_allreduce: 0.78 | step: 6.93 29%|██▉ | 2929/10000 [4:35:22<10:48:31, 5.50s/it] {'loss': 0.065, 'grad_norm': 0.7739723920822144, 'learning_rate': 3.3177579498690974e-05, 'epoch': 2.93} 29%|██▉ | 2929/10000 [4:35:22<10:48:31, 5.50s/it][2025-06-19 18:05:06,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:05:06,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.76 | bwd_microstep: 3364.28 | bwd_inner_microstep: 3363.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 18:05:06,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.76 | bwd: 3364.29 | bwd_inner: 3363.48 | bwd_allreduce: 0.77 | step: 6.64 29%|██▉ | 2930/10000 [4:35:27<10:49:13, 5.51s/it] {'loss': 0.0747, 'grad_norm': 
0.7675967812538147, 'learning_rate': 3.3172706107017095e-05, 'epoch': 2.93} 29%|██▉ | 2930/10000 [4:35:27<10:49:13, 5.51s/it][2025-06-19 18:05:12,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:05:12,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.68 | bwd_microstep: 3359.45 | bwd_inner_microstep: 3358.48 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.12 [2025-06-19 18:05:12,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.68 | bwd: 3359.46 | bwd_inner: 3358.48 | bwd_allreduce: 0.94 | step: 7.12 29%|██▉ | 2931/10000 [4:35:33<10:49:34, 5.51s/it] {'loss': 0.0592, 'grad_norm': 0.6468053460121155, 'learning_rate': 3.316783133358748e-05, 'epoch': 2.93} 29%|██▉ | 2931/10000 [4:35:33<10:49:34, 5.51s/it][2025-06-19 18:05:17,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:05:17,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.55 | bwd_microstep: 3399.86 | bwd_inner_microstep: 3399.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 18:05:17,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.55 | bwd: 3399.88 | bwd_inner: 3399.07 | bwd_allreduce: 0.76 | step: 6.65 29%|██▉ | 2932/10000 [4:35:38<10:51:38, 5.53s/it] {'loss': 0.0513, 'grad_norm': 0.5335564017295837, 'learning_rate': 3.3162955178913475e-05, 'epoch': 2.93} 29%|██▉ | 2932/10000 [4:35:38<10:51:38, 5.53s/it][2025-06-19 18:05:23,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:05:23,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.87 | bwd_microstep: 3369.54 | bwd_inner_microstep: 3368.73 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.35 [2025-06-19 
18:05:23,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.88 | bwd: 3369.56 | bwd_inner: 3368.73 | bwd_allreduce: 0.79 | step: 7.36 29%|██▉ | 2933/10000 [4:35:44<10:51:38, 5.53s/it] {'loss': 0.0878, 'grad_norm': 1.2613036632537842, 'learning_rate': 3.3158077643506564e-05, 'epoch': 2.93} 29%|██▉ | 2933/10000 [4:35:44<10:51:38, 5.53s/it][2025-06-19 18:05:28,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:05:28,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.15 | bwd_microstep: 3319.77 | bwd_inner_microstep: 3318.77 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.16 [2025-06-19 18:05:28,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.15 | bwd: 3319.79 | bwd_inner: 3318.77 | bwd_allreduce: 0.97 | step: 7.16 29%|██▉ | 2934/10000 [4:35:49<10:49:09, 5.51s/it] {'loss': 0.051, 'grad_norm': 0.6574454307556152, 'learning_rate': 3.315319872787837e-05, 'epoch': 2.93} 29%|██▉ | 2934/10000 [4:35:49<10:49:09, 5.51s/it][2025-06-19 18:05:34,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:05:34,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.70 | bwd_microstep: 3324.59 | bwd_inner_microstep: 3323.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 18:05:34,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.70 | bwd: 3324.60 | bwd_inner: 3323.79 | bwd_allreduce: 0.77 | step: 6.71 29%|██▉ | 2935/10000 [4:35:55<10:47:23, 5.50s/it] {'loss': 0.0579, 'grad_norm': 0.38330522179603577, 'learning_rate': 3.314831843254068e-05, 'epoch': 2.94} 29%|██▉ | 2935/10000 [4:35:55<10:47:23, 5.50s/it][2025-06-19 18:05:39,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 
[2025-06-19 18:05:39,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.39 | bwd_microstep: 3320.28 | bwd_inner_microstep: 3319.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 18:05:39,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.39 | bwd: 3320.29 | bwd_inner: 3319.49 | bwd_allreduce: 0.76 | step: 6.65 29%|██▉ | 2936/10000 [4:36:00<10:46:23, 5.49s/it] {'loss': 0.0754, 'grad_norm': 1.0166512727737427, 'learning_rate': 3.314343675800541e-05, 'epoch': 2.94} 29%|██▉ | 2936/10000 [4:36:00<10:46:23, 5.49s/it][2025-06-19 18:05:45,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:05:45,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.91 | bwd_microstep: 3326.12 | bwd_inner_microstep: 3325.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 18:05:45,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.91 | bwd: 3326.13 | bwd_inner: 3325.33 | bwd_allreduce: 0.76 | step: 6.61 29%|██▉ | 2937/10000 [4:36:06<10:45:19, 5.48s/it] {'loss': 0.0605, 'grad_norm': 1.266669511795044, 'learning_rate': 3.313855370478462e-05, 'epoch': 2.94} 29%|██▉ | 2937/10000 [4:36:06<10:45:19, 5.48s/it][2025-06-19 18:05:50,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:05:50,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.24 | bwd_microstep: 3314.64 | bwd_inner_microstep: 3313.85 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 18:05:50,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.24 | bwd: 3314.65 | bwd_inner: 3313.85 | bwd_allreduce: 0.76 | step: 6.66 29%|██▉ | 2938/10000 [4:36:11<10:44:32, 5.48s/it] {'loss': 0.0606, 'grad_norm': 0.8489739894866943, 'learning_rate': 3.313366927339053e-05, 
'epoch': 2.94} 29%|██▉ | 2938/10000 [4:36:11<10:44:32, 5.48s/it][2025-06-19 18:05:56,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:05:56,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.74 | bwd_microstep: 3316.22 | bwd_inner_microstep: 3315.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 8.39 [2025-06-19 18:05:56,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.74 | bwd: 3316.23 | bwd_inner: 3315.43 | bwd_allreduce: 0.76 | step: 8.40 29%|██▉ | 2939/10000 [4:36:17<10:43:42, 5.47s/it] {'loss': 0.1393, 'grad_norm': 1.082297682762146, 'learning_rate': 3.3128783464335487e-05, 'epoch': 2.94} 29%|██▉ | 2939/10000 [4:36:17<10:43:42, 5.47s/it][2025-06-19 18:06:01,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:06:01,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.29 | bwd_microstep: 3320.00 | bwd_inner_microstep: 3319.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-19 18:06:01,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.29 | bwd: 3320.02 | bwd_inner: 3319.20 | bwd_allreduce: 0.77 | step: 6.89 29%|██▉ | 2940/10000 [4:36:22<10:43:08, 5.47s/it] {'loss': 0.0418, 'grad_norm': 0.4252167344093323, 'learning_rate': 3.3123896278131995e-05, 'epoch': 2.94} 29%|██▉ | 2940/10000 [4:36:22<10:43:08, 5.47s/it][2025-06-19 18:06:07,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:06:07,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.78 | bwd_microstep: 3377.31 | bwd_inner_microstep: 3376.28 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.42 [2025-06-19 18:06:07,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2125.78 | bwd: 3377.34 | bwd_inner: 3376.28 | bwd_allreduce: 0.99 | step: 7.42 29%|██▉ | 2941/10000 [4:36:28<10:45:43, 5.49s/it] {'loss': 0.0821, 'grad_norm': 1.2991055250167847, 'learning_rate': 3.3119007715292685e-05, 'epoch': 2.94} 29%|██▉ | 2941/10000 [4:36:28<10:45:43, 5.49s/it][2025-06-19 18:06:12,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:06:12,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.71 | bwd_microstep: 3318.39 | bwd_inner_microstep: 3317.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 18:06:12,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.71 | bwd: 3318.40 | bwd_inner: 3317.59 | bwd_allreduce: 0.77 | step: 6.65 29%|██▉ | 2942/10000 [4:36:33<10:44:52, 5.48s/it] {'loss': 0.0855, 'grad_norm': 0.8374187350273132, 'learning_rate': 3.311411777633036e-05, 'epoch': 2.94} 29%|██▉ | 2942/10000 [4:36:33<10:44:52, 5.48s/it][2025-06-19 18:06:18,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:06:18,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.80 | bwd_microstep: 3353.16 | bwd_inner_microstep: 3352.37 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 18:06:18,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.80 | bwd: 3353.18 | bwd_inner: 3352.37 | bwd_allreduce: 0.76 | step: 6.67 29%|██▉ | 2943/10000 [4:36:39<10:45:50, 5.49s/it] {'loss': 0.0524, 'grad_norm': 0.7358115911483765, 'learning_rate': 3.310922646175794e-05, 'epoch': 2.94} 29%|██▉ | 2943/10000 [4:36:39<10:45:50, 5.49s/it][2025-06-19 18:06:23,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:06:23,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2127.99 | bwd_microstep: 3371.95 | bwd_inner_microstep: 3371.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 18:06:23,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.99 | bwd: 3371.97 | bwd_inner: 3371.17 | bwd_allreduce: 0.75 | step: 6.63 29%|██▉ | 2944/10000 [4:36:44<10:47:16, 5.50s/it] {'loss': 0.0346, 'grad_norm': 0.5145301222801208, 'learning_rate': 3.3104333772088507e-05, 'epoch': 2.94} 29%|██▉ | 2944/10000 [4:36:44<10:47:16, 5.50s/it][2025-06-19 18:06:29,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:06:29,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.41 | bwd_microstep: 3328.78 | bwd_inner_microstep: 3327.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 18:06:29,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.41 | bwd: 3328.80 | bwd_inner: 3327.98 | bwd_allreduce: 0.77 | step: 6.88 29%|██▉ | 2945/10000 [4:36:50<10:46:12, 5.50s/it] {'loss': 0.1439, 'grad_norm': 1.2120639085769653, 'learning_rate': 3.309943970783528e-05, 'epoch': 2.94} 29%|██▉ | 2945/10000 [4:36:50<10:46:12, 5.50s/it][2025-06-19 18:06:34,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:06:34,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.53 | bwd_microstep: 3369.72 | bwd_inner_microstep: 3368.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 18:06:34,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.53 | bwd: 3369.73 | bwd_inner: 3368.92 | bwd_allreduce: 0.77 | step: 6.88 29%|██▉ | 2946/10000 [4:36:55<10:47:26, 5.51s/it] {'loss': 0.0787, 'grad_norm': 1.1669425964355469, 'learning_rate': 3.3094544269511627e-05, 'epoch': 2.95} 29%|██▉ | 2946/10000 [4:36:55<10:47:26, 5.51s/it][2025-06-19 
18:06:40,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:06:40,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.68 | bwd_microstep: 3315.88 | bwd_inner_microstep: 3315.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 18:06:40,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.68 | bwd: 3315.89 | bwd_inner: 3315.08 | bwd_allreduce: 0.77 | step: 6.72 29%|██▉ | 2947/10000 [4:37:01<10:45:26, 5.49s/it] {'loss': 0.1101, 'grad_norm': 1.1309458017349243, 'learning_rate': 3.3089647457631055e-05, 'epoch': 2.95} 29%|██▉ | 2947/10000 [4:37:01<10:45:26, 5.49s/it][2025-06-19 18:06:45,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:06:45,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.25 | bwd_microstep: 3314.89 | bwd_inner_microstep: 3314.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-19 18:06:45,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.25 | bwd: 3314.91 | bwd_inner: 3314.11 | bwd_allreduce: 0.76 | step: 6.81 29%|██▉ | 2948/10000 [4:37:06<10:43:55, 5.48s/it] {'loss': 0.0336, 'grad_norm': 0.6715387105941772, 'learning_rate': 3.308474927270721e-05, 'epoch': 2.95} 29%|██▉ | 2948/10000 [4:37:06<10:43:55, 5.48s/it][2025-06-19 18:06:51,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:06:51,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.64 | bwd_microstep: 3317.09 | bwd_inner_microstep: 3316.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 18:06:51,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.64 | bwd: 3317.10 | bwd_inner: 3316.30 | bwd_allreduce: 0.76 | step: 
6.63 29%|██▉ | 2949/10000 [4:37:11<10:42:52, 5.47s/it] {'loss': 0.0376, 'grad_norm': 0.4361356198787689, 'learning_rate': 3.3079849715253894e-05, 'epoch': 2.95} 29%|██▉ | 2949/10000 [4:37:11<10:42:52, 5.47s/it][2025-06-19 18:06:56,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:06:56,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.91 | bwd_microstep: 3362.36 | bwd_inner_microstep: 3361.41 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.05 [2025-06-19 18:06:56,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.91 | bwd: 3362.37 | bwd_inner: 3361.41 | bwd_allreduce: 0.92 | step: 7.06 30%|██▉ | 2950/10000 [4:37:17<10:44:46, 5.49s/it] {'loss': 0.1802, 'grad_norm': 1.1941131353378296, 'learning_rate': 3.3074948785785054e-05, 'epoch': 2.95} 30%|██▉ | 2950/10000 [4:37:17<10:44:46, 5.49s/it][2025-06-19 18:07:02,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:07:02,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.88 | bwd_microstep: 3309.46 | bwd_inner_microstep: 3308.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 18:07:02,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.88 | bwd: 3309.47 | bwd_inner: 3308.65 | bwd_allreduce: 0.78 | step: 7.08 30%|██▉ | 2951/10000 [4:37:22<10:43:24, 5.48s/it] {'loss': 0.0604, 'grad_norm': 0.6891881227493286, 'learning_rate': 3.307004648481477e-05, 'epoch': 2.95} 30%|██▉ | 2951/10000 [4:37:22<10:43:24, 5.48s/it][2025-06-19 18:07:07,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.91 [2025-06-19 18:07:07,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.47 | bwd_microstep: 3322.76 | bwd_inner_microstep: 
3321.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-19 18:07:07,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.47 | bwd: 3322.78 | bwd_inner: 3321.97 | bwd_allreduce: 0.76 | step: 7.13 30%|██▉ | 2952/10000 [4:37:28<10:42:51, 5.47s/it] {'loss': 0.0382, 'grad_norm': 0.4936199188232422, 'learning_rate': 3.306514281285726e-05, 'epoch': 2.95} 30%|██▉ | 2952/10000 [4:37:28<10:42:51, 5.47s/it][2025-06-19 18:07:13,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:07:13,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.74 | bwd_microstep: 3313.48 | bwd_inner_microstep: 3312.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 18:07:13,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.74 | bwd: 3313.50 | bwd_inner: 3312.70 | bwd_allreduce: 0.76 | step: 6.70 30%|██▉ | 2953/10000 [4:37:33<10:41:55, 5.47s/it] {'loss': 0.0646, 'grad_norm': 0.9067991971969604, 'learning_rate': 3.3060237770426915e-05, 'epoch': 2.95} 30%|██▉ | 2953/10000 [4:37:33<10:41:55, 5.47s/it][2025-06-19 18:07:18,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:07:18,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.31 | bwd_microstep: 3367.60 | bwd_inner_microstep: 3366.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 18:07:18,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.31 | bwd: 3367.62 | bwd_inner: 3366.82 | bwd_allreduce: 0.76 | step: 6.59 30%|██▉ | 2954/10000 [4:37:39<10:43:58, 5.48s/it] {'loss': 0.0307, 'grad_norm': 0.6844479441642761, 'learning_rate': 3.305533135803824e-05, 'epoch': 2.95} 30%|██▉ | 2954/10000 [4:37:39<10:43:58, 5.48s/it][2025-06-19 18:07:24,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:07:24,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.40 | bwd_microstep: 3377.81 | bwd_inner_microstep: 3376.98 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.92 [2025-06-19 18:07:24,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.40 | bwd: 3377.82 | bwd_inner: 3376.98 | bwd_allreduce: 0.79 | step: 6.92 30%|██▉ | 2955/10000 [4:37:44<10:45:50, 5.50s/it] {'loss': 0.1019, 'grad_norm': 1.217779278755188, 'learning_rate': 3.305042357620589e-05, 'epoch': 2.96} 30%|██▉ | 2955/10000 [4:37:44<10:45:50, 5.50s/it][2025-06-19 18:07:29,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:07:29,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.22 | bwd_microstep: 3314.72 | bwd_inner_microstep: 3313.79 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.15 [2025-06-19 18:07:29,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.22 | bwd: 3314.73 | bwd_inner: 3313.79 | bwd_allreduce: 0.90 | step: 7.15 30%|██▉ | 2956/10000 [4:37:50<10:44:06, 5.49s/it] {'loss': 0.0403, 'grad_norm': 0.497857004404068, 'learning_rate': 3.304551442544469e-05, 'epoch': 2.96} 30%|██▉ | 2956/10000 [4:37:50<10:44:06, 5.49s/it][2025-06-19 18:07:35,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:07:35,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.90 | bwd_microstep: 3318.55 | bwd_inner_microstep: 3317.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.61 [2025-06-19 18:07:35,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.90 | bwd: 3318.56 | bwd_inner: 3317.75 | bwd_allreduce: 0.76 | step: 6.61 30%|██▉ | 2957/10000 [4:37:55<10:42:51, 5.48s/it] {'loss': 0.046, 
'grad_norm': 0.7950946092605591, 'learning_rate': 3.3040603906269564e-05, 'epoch': 2.96} 30%|██▉ | 2957/10000 [4:37:55<10:42:51, 5.48s/it][2025-06-19 18:07:40,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:07:40,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.75 | bwd_microstep: 3374.06 | bwd_inner_microstep: 3373.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 18:07:40,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.75 | bwd: 3374.07 | bwd_inner: 3373.27 | bwd_allreduce: 0.76 | step: 6.64 30%|██▉ | 2958/10000 [4:38:01<10:44:43, 5.49s/it] {'loss': 0.1249, 'grad_norm': 2.3587558269500732, 'learning_rate': 3.303569201919561e-05, 'epoch': 2.96} 30%|██▉ | 2958/10000 [4:38:01<10:44:43, 5.49s/it][2025-06-19 18:07:46,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:07:46,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.53 | bwd_microstep: 3391.51 | bwd_inner_microstep: 3390.68 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.97 [2025-06-19 18:07:46,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.53 | bwd: 3391.53 | bwd_inner: 3390.68 | bwd_allreduce: 0.79 | step: 6.97 30%|██▉ | 2959/10000 [4:38:06<10:47:22, 5.52s/it] {'loss': 0.0379, 'grad_norm': 0.5986440181732178, 'learning_rate': 3.303077876473807e-05, 'epoch': 2.96} 30%|██▉ | 2959/10000 [4:38:06<10:47:22, 5.52s/it][2025-06-19 18:07:51,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:07:51,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.79 | bwd_microstep: 3366.48 | bwd_inner_microstep: 3365.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 
18:07:51,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.79 | bwd: 3366.50 | bwd_inner: 3365.69 | bwd_allreduce: 0.76 | step: 6.67 30%|██▉ | 2960/10000 [4:38:12<10:47:46, 5.52s/it] {'loss': 0.0335, 'grad_norm': 0.4370097815990448, 'learning_rate': 3.302586414341231e-05, 'epoch': 2.96} 30%|██▉ | 2960/10000 [4:38:12<10:47:46, 5.52s/it][2025-06-19 18:07:57,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:07:57,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.91 | bwd_microstep: 3313.11 | bwd_inner_microstep: 3312.28 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.81 [2025-06-19 18:07:57,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.91 | bwd: 3313.12 | bwd_inner: 3312.28 | bwd_allreduce: 0.80 | step: 6.82 30%|██▉ | 2961/10000 [4:38:17<10:45:22, 5.50s/it] {'loss': 0.1093, 'grad_norm': 1.4169583320617676, 'learning_rate': 3.302094815573386e-05, 'epoch': 2.96} 30%|██▉ | 2961/10000 [4:38:17<10:45:22, 5.50s/it][2025-06-19 18:08:02,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:08:02,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.72 | bwd_microstep: 3325.01 | bwd_inner_microstep: 3324.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 18:08:02,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.72 | bwd: 3325.02 | bwd_inner: 3324.22 | bwd_allreduce: 0.76 | step: 6.66 30%|██▉ | 2962/10000 [4:38:23<10:43:48, 5.49s/it] {'loss': 0.0589, 'grad_norm': 0.6933485865592957, 'learning_rate': 3.301603080221838e-05, 'epoch': 2.96} 30%|██▉ | 2962/10000 [4:38:23<10:43:48, 5.49s/it][2025-06-19 18:08:08,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 
[2025-06-19 18:08:08,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.50 | bwd_microstep: 3317.52 | bwd_inner_microstep: 3316.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 18:08:08,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.50 | bwd: 3317.53 | bwd_inner: 3316.73 | bwd_allreduce: 0.76 | step: 6.66 30%|██▉ | 2963/10000 [4:38:28<10:42:21, 5.48s/it] {'loss': 0.1382, 'grad_norm': 1.5032042264938354, 'learning_rate': 3.301111208338167e-05, 'epoch': 2.96} 30%|██▉ | 2963/10000 [4:38:28<10:42:21, 5.48s/it][2025-06-19 18:08:13,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:08:13,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.94 | bwd_microstep: 3313.44 | bwd_inner_microstep: 3312.54 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.11 [2025-06-19 18:08:13,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.94 | bwd: 3313.45 | bwd_inner: 3312.54 | bwd_allreduce: 0.86 | step: 7.11 30%|██▉ | 2964/10000 [4:38:34<10:41:21, 5.47s/it] {'loss': 0.0281, 'grad_norm': 0.6445109248161316, 'learning_rate': 3.30061919997397e-05, 'epoch': 2.96} 30%|██▉ | 2964/10000 [4:38:34<10:41:21, 5.47s/it][2025-06-19 18:08:18,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:08:18,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.45 | bwd_microstep: 3362.34 | bwd_inner_microstep: 3361.54 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-19 18:08:18,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.45 | bwd: 3362.35 | bwd_inner: 3361.54 | bwd_allreduce: 0.77 | step: 6.92 30%|██▉ | 2965/10000 [4:38:39<10:43:16, 5.49s/it] {'loss': 0.0709, 'grad_norm': 1.3193880319595337, 'learning_rate': 3.300127055180855e-05, 
'epoch': 2.96} 30%|██▉ | 2965/10000 [4:38:39<10:43:16, 5.49s/it][2025-06-19 18:08:24,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:08:24,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.84 | bwd_microstep: 3316.92 | bwd_inner_microstep: 3316.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.59 [2025-06-19 18:08:24,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.84 | bwd: 3316.93 | bwd_inner: 3316.12 | bwd_allreduce: 0.76 | step: 6.60 30%|██▉ | 2966/10000 [4:38:45<10:42:01, 5.48s/it] {'loss': 0.1256, 'grad_norm': 2.1715810298919678, 'learning_rate': 3.299634774010445e-05, 'epoch': 2.97} 30%|██▉ | 2966/10000 [4:38:45<10:42:01, 5.48s/it][2025-06-19 18:08:29,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:08:29,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.52 | bwd_microstep: 3358.89 | bwd_inner_microstep: 3358.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 18:08:29,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.52 | bwd: 3358.91 | bwd_inner: 3358.10 | bwd_allreduce: 0.76 | step: 6.57 30%|██▉ | 2967/10000 [4:38:50<10:43:35, 5.49s/it] {'loss': 0.0957, 'grad_norm': 1.2130507230758667, 'learning_rate': 3.2991423565143805e-05, 'epoch': 2.97} 30%|██▉ | 2967/10000 [4:38:50<10:43:35, 5.49s/it][2025-06-19 18:08:35,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:08:35,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.72 | bwd_microstep: 3317.86 | bwd_inner_microstep: 3316.93 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.16 [2025-06-19 18:08:35,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2096.72 | bwd: 3317.87 | bwd_inner: 3316.93 | bwd_allreduce: 0.90 | step: 7.16 30%|██▉ | 2968/10000 [4:38:56<10:42:07, 5.48s/it] {'loss': 0.1011, 'grad_norm': 1.2769784927368164, 'learning_rate': 3.2986498027443117e-05, 'epoch': 2.97} 30%|██▉ | 2968/10000 [4:38:56<10:42:07, 5.48s/it][2025-06-19 18:08:40,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 18:08:40,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.42 | bwd_microstep: 3318.61 | bwd_inner_microstep: 3317.50 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.40 [2025-06-19 18:08:40,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.42 | bwd: 3318.62 | bwd_inner: 3317.50 | bwd_allreduce: 1.08 | step: 7.40 30%|██▉ | 2969/10000 [4:39:01<10:41:45, 5.48s/it] {'loss': 0.0553, 'grad_norm': 0.8435788750648499, 'learning_rate': 3.298157112751906e-05, 'epoch': 2.97} 30%|██▉ | 2969/10000 [4:39:01<10:41:45, 5.48s/it][2025-06-19 18:08:46,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:08:46,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.16 | bwd_microstep: 3312.58 | bwd_inner_microstep: 3311.58 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.64 [2025-06-19 18:08:46,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.16 | bwd: 3312.59 | bwd_inner: 3311.58 | bwd_allreduce: 0.97 | step: 7.65 30%|██▉ | 2970/10000 [4:39:07<10:41:25, 5.47s/it] {'loss': 0.0764, 'grad_norm': 1.064292550086975, 'learning_rate': 3.2976642865888436e-05, 'epoch': 2.97} 30%|██▉ | 2970/10000 [4:39:07<10:41:25, 5.47s/it][2025-06-19 18:08:51,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:08:51,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2128.66 | bwd_microstep: 3355.58 | bwd_inner_microstep: 3354.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-19 18:08:51,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.66 | bwd: 3355.59 | bwd_inner: 3354.79 | bwd_allreduce: 0.76 | step: 6.86 30%|██▉ | 2971/10000 [4:39:12<10:43:09, 5.49s/it] {'loss': 0.2328, 'grad_norm': 2.786815881729126, 'learning_rate': 3.2971713243068204e-05, 'epoch': 2.97} 30%|██▉ | 2971/10000 [4:39:12<10:43:09, 5.49s/it][2025-06-19 18:08:57,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:08:57,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.31 | bwd_microstep: 3325.56 | bwd_inner_microstep: 3324.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 18:08:57,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.31 | bwd: 3325.57 | bwd_inner: 3324.78 | bwd_allreduce: 0.76 | step: 6.62 30%|██▉ | 2972/10000 [4:39:18<10:42:24, 5.48s/it] {'loss': 0.1695, 'grad_norm': 1.701831340789795, 'learning_rate': 3.296678225957545e-05, 'epoch': 2.97} 30%|██▉ | 2972/10000 [4:39:18<10:42:24, 5.48s/it][2025-06-19 18:09:02,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:09:02,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.44 | bwd_microstep: 3316.82 | bwd_inner_microstep: 3315.90 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.47 [2025-06-19 18:09:02,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.44 | bwd: 3316.83 | bwd_inner: 3315.90 | bwd_allreduce: 0.89 | step: 7.47 30%|██▉ | 2973/10000 [4:39:23<10:41:26, 5.48s/it] {'loss': 0.0893, 'grad_norm': 1.1063247919082642, 'learning_rate': 3.296184991592744e-05, 'epoch': 2.97} 30%|██▉ | 2973/10000 [4:39:23<10:41:26, 5.48s/it][2025-06-19 
18:09:08,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:09:08,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.50 | bwd_microstep: 3377.94 | bwd_inner_microstep: 3377.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-19 18:09:08,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.50 | bwd: 3377.96 | bwd_inner: 3377.13 | bwd_allreduce: 0.78 | step: 6.91 30%|██▉ | 2974/10000 [4:39:29<10:43:54, 5.50s/it] {'loss': 0.0912, 'grad_norm': 0.9236240386962891, 'learning_rate': 3.295691621264151e-05, 'epoch': 2.97} 30%|██▉ | 2974/10000 [4:39:29<10:43:54, 5.50s/it][2025-06-19 18:09:13,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:09:13,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.46 | bwd_microstep: 3317.30 | bwd_inner_microstep: 3316.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 18:09:13,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.46 | bwd: 3317.31 | bwd_inner: 3316.51 | bwd_allreduce: 0.76 | step: 6.57 30%|██▉ | 2975/10000 [4:39:34<10:42:10, 5.48s/it] {'loss': 0.0518, 'grad_norm': 0.6350626349449158, 'learning_rate': 3.2951981150235205e-05, 'epoch': 2.98} 30%|██▉ | 2975/10000 [4:39:34<10:42:10, 5.48s/it][2025-06-19 18:09:19,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:09:19,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.50 | bwd_microstep: 3397.47 | bwd_inner_microstep: 3396.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 18:09:19,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.50 | bwd: 3397.49 | bwd_inner: 3396.68 | bwd_allreduce: 0.76 | step: 
6.89 30%|██▉ | 2976/10000 [4:39:40<10:44:49, 5.51s/it] {'loss': 0.0465, 'grad_norm': 0.7700843214988708, 'learning_rate': 3.29470447292262e-05, 'epoch': 2.98} 30%|██▉ | 2976/10000 [4:39:40<10:44:49, 5.51s/it][2025-06-19 18:09:24,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:09:24,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.73 | bwd_microstep: 3329.20 | bwd_inner_microstep: 3328.30 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.20 [2025-06-19 18:09:24,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.73 | bwd: 3329.21 | bwd_inner: 3328.30 | bwd_allreduce: 0.86 | step: 7.21 30%|██▉ | 2977/10000 [4:39:45<10:43:43, 5.50s/it] {'loss': 0.0564, 'grad_norm': 0.855527400970459, 'learning_rate': 3.294210695013228e-05, 'epoch': 2.98} 30%|██▉ | 2977/10000 [4:39:45<10:43:43, 5.50s/it][2025-06-19 18:09:30,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.84 [2025-06-19 18:09:30,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.63 | bwd_microstep: 3321.45 | bwd_inner_microstep: 3320.57 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.27 [2025-06-19 18:09:30,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.63 | bwd: 3321.47 | bwd_inner: 3320.57 | bwd_allreduce: 0.85 | step: 7.27 30%|██▉ | 2978/10000 [4:39:51<10:42:34, 5.49s/it] {'loss': 0.0819, 'grad_norm': 1.1809662580490112, 'learning_rate': 3.293716781347142e-05, 'epoch': 2.98} 30%|██▉ | 2978/10000 [4:39:51<10:42:34, 5.49s/it][2025-06-19 18:09:35,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:09:35,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.38 | bwd_microstep: 3324.74 | bwd_inner_microstep: 
3323.77 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.10 [2025-06-19 18:09:35,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.38 | bwd: 3324.76 | bwd_inner: 3323.77 | bwd_allreduce: 0.94 | step: 7.10 30%|██▉ | 2979/10000 [4:39:56<10:41:43, 5.48s/it] {'loss': 0.1263, 'grad_norm': 1.115727424621582, 'learning_rate': 3.293222731976169e-05, 'epoch': 2.98} 30%|██▉ | 2979/10000 [4:39:56<10:41:43, 5.48s/it][2025-06-19 18:09:41,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:09:41,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.17 | bwd_microstep: 3318.15 | bwd_inner_microstep: 3317.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 18:09:41,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.17 | bwd: 3318.16 | bwd_inner: 3317.35 | bwd_allreduce: 0.77 | step: 6.71 30%|██▉ | 2980/10000 [4:40:02<10:40:45, 5.48s/it] {'loss': 0.0425, 'grad_norm': 0.6611645817756653, 'learning_rate': 3.292728546952134e-05, 'epoch': 2.98} 30%|██▉ | 2980/10000 [4:40:02<10:40:45, 5.48s/it][2025-06-19 18:09:46,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:09:46,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.47 | bwd_microstep: 3317.20 | bwd_inner_microstep: 3316.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-19 18:09:46,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.47 | bwd: 3317.21 | bwd_inner: 3316.40 | bwd_allreduce: 0.77 | step: 6.91 30%|██▉ | 2981/10000 [4:40:07<10:40:10, 5.47s/it] {'loss': 0.1351, 'grad_norm': 1.5807896852493286, 'learning_rate': 3.292234226326874e-05, 'epoch': 2.98} 30%|██▉ | 2981/10000 [4:40:07<10:40:10, 5.47s/it][2025-06-19 18:09:52,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:09:52,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.64 | bwd_microstep: 3374.63 | bwd_inner_microstep: 3373.84 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 18:09:52,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.64 | bwd: 3374.64 | bwd_inner: 3373.84 | bwd_allreduce: 0.76 | step: 6.64 30%|██▉ | 2982/10000 [4:40:13<10:42:28, 5.49s/it] {'loss': 0.0783, 'grad_norm': 1.2549463510513306, 'learning_rate': 3.291739770152241e-05, 'epoch': 2.98} 30%|██▉ | 2982/10000 [4:40:13<10:42:28, 5.49s/it][2025-06-19 18:09:57,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:09:57,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.47 | bwd_microstep: 3329.11 | bwd_inner_microstep: 3328.10 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.19 [2025-06-19 18:09:57,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.47 | bwd: 3329.12 | bwd_inner: 3328.10 | bwd_allreduce: 0.98 | step: 7.20 30%|██▉ | 2983/10000 [4:40:18<10:41:35, 5.49s/it] {'loss': 0.0551, 'grad_norm': 1.2543765306472778, 'learning_rate': 3.291245178480101e-05, 'epoch': 2.98} 30%|██▉ | 2983/10000 [4:40:18<10:41:35, 5.49s/it][2025-06-19 18:10:03,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:10:03,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.49 | bwd_microstep: 3368.83 | bwd_inner_microstep: 3368.00 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.86 [2025-06-19 18:10:03,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.49 | bwd: 3368.84 | bwd_inner: 3368.00 | bwd_allreduce: 0.80 | step: 6.86 30%|██▉ | 2984/10000 [4:40:24<10:43:13, 5.50s/it] {'loss': 0.106, 
'grad_norm': 1.5776692628860474, 'learning_rate': 3.290750451362335e-05, 'epoch': 2.98} 30%|██▉ | 2984/10000 [4:40:24<10:43:13, 5.50s/it][2025-06-19 18:10:08,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:10:08,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.05 | bwd_microstep: 3324.17 | bwd_inner_microstep: 3323.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 18:10:08,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.05 | bwd: 3324.19 | bwd_inner: 3323.39 | bwd_allreduce: 0.76 | step: 6.62 30%|██▉ | 2985/10000 [4:40:29<10:41:59, 5.49s/it] {'loss': 0.0721, 'grad_norm': 0.8436490893363953, 'learning_rate': 3.290255588850837e-05, 'epoch': 2.98} 30%|██▉ | 2985/10000 [4:40:29<10:41:59, 5.49s/it][2025-06-19 18:10:14,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:10:14,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.71 | bwd_microstep: 3368.29 | bwd_inner_microstep: 3367.29 | bwd_allreduce_microstep: 0.95 | step_microstep: 6.86 [2025-06-19 18:10:14,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.71 | bwd: 3368.31 | bwd_inner: 3367.29 | bwd_allreduce: 0.97 | step: 6.87 30%|██▉ | 2986/10000 [4:40:35<10:43:42, 5.51s/it] {'loss': 0.0916, 'grad_norm': 1.0490764379501343, 'learning_rate': 3.289760590997516e-05, 'epoch': 2.99} 30%|██▉ | 2986/10000 [4:40:35<10:43:42, 5.51s/it][2025-06-19 18:10:19,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:10:19,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.00 | bwd_microstep: 3377.50 | bwd_inner_microstep: 3376.68 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.82 [2025-06-19 
18:10:19,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.00 | bwd: 3377.52 | bwd_inner: 3376.68 | bwd_allreduce: 0.79 | step: 6.82 30%|██▉ | 2987/10000 [4:40:40<10:45:18, 5.52s/it] {'loss': 0.0335, 'grad_norm': 0.41028183698654175, 'learning_rate': 3.2892654578542956e-05, 'epoch': 2.99} 30%|██▉ | 2987/10000 [4:40:40<10:45:18, 5.52s/it][2025-06-19 18:10:25,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.88 [2025-06-19 18:10:25,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.41 | bwd_microstep: 3404.09 | bwd_inner_microstep: 3403.10 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.26 [2025-06-19 18:10:25,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.41 | bwd: 3404.11 | bwd_inner: 3403.10 | bwd_allreduce: 0.95 | step: 7.27 30%|██▉ | 2988/10000 [4:40:46<10:47:39, 5.54s/it] {'loss': 0.048, 'grad_norm': 0.6568938493728638, 'learning_rate': 3.288770189473112e-05, 'epoch': 2.99} 30%|██▉ | 2988/10000 [4:40:46<10:47:39, 5.54s/it][2025-06-19 18:10:30,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:10:30,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.11 | bwd_microstep: 3320.17 | bwd_inner_microstep: 3319.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 18:10:30,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.11 | bwd: 3320.18 | bwd_inner: 3319.36 | bwd_allreduce: 0.78 | step: 7.08 30%|██▉ | 2989/10000 [4:40:51<10:45:08, 5.52s/it] {'loss': 0.0516, 'grad_norm': 0.6548559069633484, 'learning_rate': 3.2882747859059166e-05, 'epoch': 2.99} 30%|██▉ | 2989/10000 [4:40:51<10:45:08, 5.52s/it][2025-06-19 18:10:36,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 
[2025-06-19 18:10:36,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.26 | bwd_microstep: 3321.95 | bwd_inner_microstep: 3320.91 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.36 [2025-06-19 18:10:36,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.26 | bwd: 3321.97 | bwd_inner: 3320.91 | bwd_allreduce: 1.01 | step: 7.36 30%|██▉ | 2990/10000 [4:40:57<10:43:17, 5.51s/it] {'loss': 0.0513, 'grad_norm': 1.1231170892715454, 'learning_rate': 3.287779247204675e-05, 'epoch': 2.99} 30%|██▉ | 2990/10000 [4:40:57<10:43:17, 5.51s/it][2025-06-19 18:10:41,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:10:41,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.25 | bwd_microstep: 3319.88 | bwd_inner_microstep: 3318.93 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.97 [2025-06-19 18:10:41,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.25 | bwd: 3319.89 | bwd_inner: 3318.93 | bwd_allreduce: 0.92 | step: 6.97 30%|██▉ | 2991/10000 [4:41:02<10:41:49, 5.49s/it] {'loss': 0.09, 'grad_norm': 0.7315100431442261, 'learning_rate': 3.287283573421368e-05, 'epoch': 2.99} 30%|██▉ | 2991/10000 [4:41:02<10:41:49, 5.49s/it][2025-06-19 18:10:47,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:10:47,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.81 | bwd_microstep: 3372.15 | bwd_inner_microstep: 3371.36 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 18:10:47,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.81 | bwd: 3372.16 | bwd_inner: 3371.36 | bwd_allreduce: 0.76 | step: 6.65 30%|██▉ | 2992/10000 [4:41:08<10:43:14, 5.51s/it] {'loss': 0.0787, 'grad_norm': 0.6862033605575562, 'learning_rate': 3.2867877646079884e-05, 
'epoch': 2.99} 30%|██▉ | 2992/10000 [4:41:08<10:43:14, 5.51s/it][2025-06-19 18:10:52,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:10:52,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.18 | bwd_microstep: 3369.97 | bwd_inner_microstep: 3369.04 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.85 [2025-06-19 18:10:52,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.18 | bwd: 3369.98 | bwd_inner: 3369.04 | bwd_allreduce: 0.90 | step: 6.86 30%|██▉ | 2993/10000 [4:41:13<10:44:07, 5.52s/it] {'loss': 0.019, 'grad_norm': 0.1842752993106842, 'learning_rate': 3.286291820816543e-05, 'epoch': 2.99} 30%|██▉ | 2993/10000 [4:41:13<10:44:07, 5.52s/it][2025-06-19 18:10:58,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:10:58,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.37 | bwd_microstep: 3325.34 | bwd_inner_microstep: 3324.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 18:10:58,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.37 | bwd: 3325.35 | bwd_inner: 3324.56 | bwd_allreduce: 0.76 | step: 6.70 30%|██▉ | 2994/10000 [4:41:19<10:42:23, 5.50s/it] {'loss': 0.0499, 'grad_norm': 0.5188974142074585, 'learning_rate': 3.285795742099057e-05, 'epoch': 2.99} 30%|██▉ | 2994/10000 [4:41:19<10:42:23, 5.50s/it][2025-06-19 18:11:03,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:11:03,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.36 | bwd_microstep: 3317.37 | bwd_inner_microstep: 3316.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.04 [2025-06-19 18:11:03,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2106.36 | bwd: 3317.38 | bwd_inner: 3316.56 | bwd_allreduce: 0.77 | step: 7.04 30%|██▉ | 2995/10000 [4:41:24<10:40:51, 5.49s/it] {'loss': 0.0265, 'grad_norm': 0.4379063546657562, 'learning_rate': 3.285299528507565e-05, 'epoch': 3.0} 30%|██▉ | 2995/10000 [4:41:24<10:40:51, 5.49s/it][2025-06-19 18:11:09,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:11:09,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.22 | bwd_microstep: 3323.24 | bwd_inner_microstep: 3322.33 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.34 [2025-06-19 18:11:09,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.23 | bwd: 3323.25 | bwd_inner: 3322.33 | bwd_allreduce: 0.88 | step: 7.34 30%|██▉ | 2996/10000 [4:41:30<10:40:13, 5.48s/it] {'loss': 0.0296, 'grad_norm': 0.3985658288002014, 'learning_rate': 3.284803180094118e-05, 'epoch': 3.0} 30%|██▉ | 2996/10000 [4:41:30<10:40:13, 5.48s/it][2025-06-19 18:11:14,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:11:14,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.76 | bwd_microstep: 3374.83 | bwd_inner_microstep: 3374.05 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.52 [2025-06-19 18:11:14,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.76 | bwd: 3374.84 | bwd_inner: 3374.05 | bwd_allreduce: 0.75 | step: 6.52 30%|██▉ | 2997/10000 [4:41:35<10:42:28, 5.50s/it] {'loss': 0.0545, 'grad_norm': 1.0609190464019775, 'learning_rate': 3.28430669691078e-05, 'epoch': 3.0} 30%|██▉ | 2997/10000 [4:41:35<10:42:28, 5.50s/it][2025-06-19 18:11:20,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:11:20,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2105.50 | bwd_microstep: 3327.44 | bwd_inner_microstep: 3326.53 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.15 [2025-06-19 18:11:20,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.50 | bwd: 3327.45 | bwd_inner: 3326.53 | bwd_allreduce: 0.88 | step: 7.16 30%|██▉ | 2998/10000 [4:41:41<10:41:09, 5.49s/it] {'loss': 0.0782, 'grad_norm': 1.6229254007339478, 'learning_rate': 3.2838100790096294e-05, 'epoch': 3.0} 30%|██▉ | 2998/10000 [4:41:41<10:41:09, 5.49s/it][2025-06-19 18:11:25,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-19 18:11:25,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.10 | bwd_microstep: 3375.35 | bwd_inner_microstep: 3374.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.84 [2025-06-19 18:11:25,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.10 | bwd: 3375.37 | bwd_inner: 3374.57 | bwd_allreduce: 0.76 | step: 6.84 30%|██▉ | 2999/10000 [4:41:46<10:42:55, 5.51s/it] {'loss': 0.1582, 'grad_norm': 1.2476526498794556, 'learning_rate': 3.2833133264427605e-05, 'epoch': 3.0} 30%|██▉ | 2999/10000 [4:41:46<10:42:55, 5.51s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
[2025-06-19 18:11:33,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:11:33,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.66 | bwd_microstep: 3369.32 | bwd_inner_microstep: 3368.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 18:11:33,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.66 | bwd: 3369.34 | bwd_inner: 3368.53 | bwd_allreduce: 0.77 | step: 6.70 30%|███ | 3000/10000 [4:41:54<11:58:37, 6.16s/it] {'loss': 0.026, 'grad_norm': 0.4432789087295532, 'learning_rate': 3.2828164392622804e-05, 'epoch': 3.0} 30%|███ | 3000/10000 [4:41:54<11:58:37, 6.16s/it]evaluate! [INFO|trainer.py:3910] 2025-06-19 18:11:43,561 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 18:11:43,566 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 18:11:43,566 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 18:12:37,150 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 18:12:37,152 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 18:12:37,153 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 18:12:37,153 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json evaluate! 
[INFO|trainer.py:3910] 2025-06-19 18:12:55,746 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 18:12:55,754 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 18:12:55,755 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 18:13:57,803 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 18:13:57,807 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 18:13:57,807 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 18:13:57,808 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json [2025-06-19 18:14:02,479] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 18:14:08,479] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 18:14:14,335] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto 
detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 18:14:20,164] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 18:14:38,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:14:38,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2075.24 | bwd_microstep: 3260.61 | bwd_inner_microstep: 3259.72 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.65 [2025-06-19 18:14:38,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2075.19 | bwd: 3260.63 | bwd_inner: 3259.72 | bwd_allreduce: 0.87 | step: 7.65 30%|███ | 3001/10000 [4:44:59<116:12:51, 59.78s/it] {'loss': 0.0246, 'grad_norm': 0.3242437243461609, 'learning_rate': 3.2823194175203085e-05, 'epoch': 3.0} 30%|███ | 3001/10000 [4:44:59<116:12:51, 59.78s/it][2025-06-19 18:14:43,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:14:43,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.02 | bwd_microstep: 3316.29 | bwd_inner_microstep: 3315.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 18:14:43,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.02 | bwd: 3316.30 | bwd_inner: 3315.51 | bwd_allreduce: 0.75 | step: 
6.57 30%|███ | 3002/10000 [4:45:04<84:31:33, 43.48s/it] {'loss': 0.0451, 'grad_norm': 0.6230310797691345, 'learning_rate': 3.281822261268982e-05, 'epoch': 3.0} 30%|███ | 3002/10000 [4:45:04<84:31:33, 43.48s/it][2025-06-19 18:14:49,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:14:49,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2088.86 | bwd_microstep: 3273.48 | bwd_inner_microstep: 3272.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 18:14:49,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2088.86 | bwd: 3273.50 | bwd_inner: 3272.69 | bwd_allreduce: 0.77 | step: 6.67 30%|███ | 3003/10000 [4:45:10<62:18:31, 32.06s/it] {'loss': 0.1847, 'grad_norm': 1.3231362104415894, 'learning_rate': 3.2813249705604486e-05, 'epoch': 3.0} 30%|███ | 3003/10000 [4:45:10<62:18:31, 32.06s/it][2025-06-19 18:14:54,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:14:54,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.45 | bwd_microstep: 3331.56 | bwd_inner_microstep: 3330.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 18:14:54,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.45 | bwd: 3331.58 | bwd_inner: 3330.75 | bwd_allreduce: 0.78 | step: 7.12 30%|███ | 3004/10000 [4:45:15<46:48:13, 24.08s/it] {'loss': 0.0619, 'grad_norm': 0.8076927661895752, 'learning_rate': 3.280827545446873e-05, 'epoch': 3.0} 30%|███ | 3004/10000 [4:45:15<46:48:13, 24.08s/it][2025-06-19 18:15:00,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:15:00,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.75 | bwd_microstep: 3352.04 | bwd_inner_microstep: 
3351.16 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.92 [2025-06-19 18:15:00,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.75 | bwd: 3352.06 | bwd_inner: 3351.16 | bwd_allreduce: 0.85 | step: 6.93 30%|███ | 3005/10000 [4:45:21<35:58:02, 18.51s/it] {'loss': 0.034, 'grad_norm': 0.4536639451980591, 'learning_rate': 3.280329985980432e-05, 'epoch': 3.0} 30%|███ | 3005/10000 [4:45:21<35:58:02, 18.51s/it][2025-06-19 18:15:05,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:15:05,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.02 | bwd_microstep: 3349.03 | bwd_inner_microstep: 3348.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 18:15:05,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.02 | bwd: 3349.05 | bwd_inner: 3348.25 | bwd_allreduce: 0.75 | step: 6.54 30%|███ | 3006/10000 [4:45:26<28:23:18, 14.61s/it] {'loss': 0.0289, 'grad_norm': 0.276415079832077, 'learning_rate': 3.2798322922133186e-05, 'epoch': 3.01} 30%|███ | 3006/10000 [4:45:26<28:23:18, 14.61s/it][2025-06-19 18:15:11,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:15:11,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.59 | bwd_microstep: 3346.23 | bwd_inner_microstep: 3345.44 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 18:15:11,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.59 | bwd: 3346.24 | bwd_inner: 3345.44 | bwd_allreduce: 0.76 | step: 6.57 30%|███ | 3007/10000 [4:45:32<23:04:18, 11.88s/it] {'loss': 0.0575, 'grad_norm': 1.0033786296844482, 'learning_rate': 3.279334464197737e-05, 'epoch': 3.01} 30%|███ | 3007/10000 [4:45:32<23:04:18, 11.88s/it][2025-06-19 18:15:16,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:15:16,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.89 | bwd_microstep: 3282.52 | bwd_inner_microstep: 3281.62 | bwd_allreduce_microstep: 0.82 | step_microstep: 8.01 [2025-06-19 18:15:16,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.89 | bwd: 3282.55 | bwd_inner: 3281.62 | bwd_allreduce: 0.86 | step: 8.02 30%|███ | 3008/10000 [4:45:37<19:18:58, 9.95s/it] {'loss': 0.0495, 'grad_norm': 1.0819511413574219, 'learning_rate': 3.278836501985908e-05, 'epoch': 3.01} 30%|███ | 3008/10000 [4:45:37<19:18:58, 9.95s/it][2025-06-19 18:15:22,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:15:22,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.65 | bwd_microstep: 3334.34 | bwd_inner_microstep: 3333.48 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.25 [2025-06-19 18:15:22,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.65 | bwd: 3334.36 | bwd_inner: 3333.48 | bwd_allreduce: 0.82 | step: 7.25 30%|███ | 3009/10000 [4:45:43<16:43:26, 8.61s/it] {'loss': 0.0836, 'grad_norm': 1.2101727724075317, 'learning_rate': 3.2783384056300645e-05, 'epoch': 3.01} 30%|███ | 3009/10000 [4:45:43<16:43:26, 8.61s/it][2025-06-19 18:15:27,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:15:27,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2086.25 | bwd_microstep: 3281.56 | bwd_inner_microstep: 3280.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.57 [2025-06-19 18:15:27,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2086.25 | bwd: 3281.57 | bwd_inner: 3280.77 | bwd_allreduce: 0.76 | step: 6.58 30%|███ | 3010/10000 [4:45:48<14:51:14, 7.65s/it] {'loss': 
0.0931, 'grad_norm': 0.9348538517951965, 'learning_rate': 3.277840175182456e-05, 'epoch': 3.01} 30%|███ | 3010/10000 [4:45:48<14:51:14, 7.65s/it][2025-06-19 18:15:33,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:15:33,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.85 | bwd_microstep: 3337.69 | bwd_inner_microstep: 3336.91 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 18:15:33,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.85 | bwd: 3337.71 | bwd_inner: 3336.91 | bwd_allreduce: 0.76 | step: 6.58 30%|███ | 3011/10000 [4:45:53<13:35:20, 7.00s/it] {'loss': 0.0473, 'grad_norm': 1.2414675951004028, 'learning_rate': 3.277341810695343e-05, 'epoch': 3.01} 30%|███ | 3011/10000 [4:45:53<13:35:20, 7.00s/it][2025-06-19 18:15:38,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:15:38,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2088.39 | bwd_microstep: 3291.41 | bwd_inner_microstep: 3290.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 18:15:38,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2088.39 | bwd: 3291.42 | bwd_inner: 3290.63 | bwd_allreduce: 0.75 | step: 6.61 30%|███ | 3012/10000 [4:45:59<12:39:51, 6.52s/it] {'loss': 0.0254, 'grad_norm': 0.3819456696510315, 'learning_rate': 3.276843312221003e-05, 'epoch': 3.01} 30%|███ | 3012/10000 [4:45:59<12:39:51, 6.52s/it][2025-06-19 18:15:43,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:15:43,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2084.23 | bwd_microstep: 3283.60 | bwd_inner_microstep: 3282.78 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.24 
[2025-06-19 18:15:43,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2084.23 | bwd: 3283.62 | bwd_inner: 3282.78 | bwd_allreduce: 0.80 | step: 7.25 30%|███ | 3013/10000 [4:46:04<12:00:52, 6.19s/it] {'loss': 0.1238, 'grad_norm': 1.411462664604187, 'learning_rate': 3.2763446798117246e-05, 'epoch': 3.01} 30%|███ | 3013/10000 [4:46:04<12:00:52, 6.19s/it][2025-06-19 18:15:49,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:15:49,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2090.42 | bwd_microstep: 3291.41 | bwd_inner_microstep: 3290.52 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.57 [2025-06-19 18:15:49,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.42 | bwd: 3291.44 | bwd_inner: 3290.52 | bwd_allreduce: 0.85 | step: 7.57 30%|███ | 3014/10000 [4:46:10<11:34:02, 5.96s/it] {'loss': 0.0318, 'grad_norm': 0.48039349913597107, 'learning_rate': 3.2758459135198143e-05, 'epoch': 3.01} 30%|███ | 3014/10000 [4:46:10<11:34:02, 5.96s/it][2025-06-19 18:15:54,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-19 18:15:54,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.33 | bwd_microstep: 3339.56 | bwd_inner_microstep: 3338.42 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.09 [2025-06-19 18:15:54,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.33 | bwd: 3339.59 | bwd_inner: 3338.42 | bwd_allreduce: 1.10 | step: 8.10 30%|███ | 3015/10000 [4:46:15<11:19:08, 5.83s/it] {'loss': 0.018, 'grad_norm': 0.39379796385765076, 'learning_rate': 3.275347013397588e-05, 'epoch': 3.02} 30%|███ | 3015/10000 [4:46:15<11:19:08, 5.83s/it][2025-06-19 18:16:00,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 
2.72 [2025-06-19 18:16:00,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.32 | bwd_microstep: 3293.89 | bwd_inner_microstep: 3292.99 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.32 [2025-06-19 18:16:00,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.32 | bwd: 3293.92 | bwd_inner: 3292.99 | bwd_allreduce: 0.86 | step: 7.32 30%|███ | 3016/10000 [4:46:21<11:06:38, 5.73s/it] {'loss': 0.1329, 'grad_norm': 1.4634767770767212, 'learning_rate': 3.27484797949738e-05, 'epoch': 3.02} 30%|███ | 3016/10000 [4:46:21<11:06:38, 5.73s/it][2025-06-19 18:16:05,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:16:05,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.61 | bwd_microstep: 3303.94 | bwd_inner_microstep: 3303.08 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.10 [2025-06-19 18:16:05,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.61 | bwd: 3303.96 | bwd_inner: 3303.08 | bwd_allreduce: 0.82 | step: 7.10 30%|███ | 3017/10000 [4:46:26<10:58:00, 5.65s/it] {'loss': 0.0249, 'grad_norm': 0.6246158480644226, 'learning_rate': 3.274348811871535e-05, 'epoch': 3.02} 30%|███ | 3017/10000 [4:46:26<10:58:00, 5.65s/it][2025-06-19 18:16:11,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:16:11,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.77 | bwd_microstep: 3348.67 | bwd_inner_microstep: 3347.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.55 [2025-06-19 18:16:11,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.77 | bwd: 3348.69 | bwd_inner: 3347.87 | bwd_allreduce: 0.78 | step: 7.56 30%|███ | 3018/10000 [4:46:32<10:54:31, 5.62s/it] {'loss': 0.0456, 'grad_norm': 0.6808897256851196, 'learning_rate': 3.273849510572414e-05, 
'epoch': 3.02} 30%|███ | 3018/10000 [4:46:32<10:54:31, 5.62s/it][2025-06-19 18:16:16,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:16:16,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.50 | bwd_microstep: 3349.73 | bwd_inner_microstep: 3348.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.35 [2025-06-19 18:16:16,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.50 | bwd: 3349.75 | bwd_inner: 3348.92 | bwd_allreduce: 0.78 | step: 7.35 30%|███ | 3019/10000 [4:46:37<10:50:32, 5.59s/it] {'loss': 0.0635, 'grad_norm': 0.9736310243606567, 'learning_rate': 3.273350075652392e-05, 'epoch': 3.02} 30%|███ | 3019/10000 [4:46:37<10:50:32, 5.59s/it][2025-06-19 18:16:22,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:16:22,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.47 | bwd_microstep: 3352.17 | bwd_inner_microstep: 3351.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 18:16:22,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.47 | bwd: 3352.18 | bwd_inner: 3351.18 | bwd_allreduce: 0.76 | step: 6.73 30%|███ | 3020/10000 [4:46:43<10:47:42, 5.57s/it] {'loss': 0.0677, 'grad_norm': 0.8219600915908813, 'learning_rate': 3.272850507163857e-05, 'epoch': 3.02} 30%|███ | 3020/10000 [4:46:43<10:47:42, 5.57s/it][2025-06-19 18:16:27,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:16:27,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.20 | bwd_microstep: 3301.67 | bwd_inner_microstep: 3300.77 | bwd_allreduce_microstep: 0.82 | step_microstep: 8.24 [2025-06-19 18:16:27,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2113.20 | bwd: 3301.70 | bwd_inner: 3300.77 | bwd_allreduce: 0.85 | step: 8.24 30%|███ | 3021/10000 [4:46:48<10:43:39, 5.53s/it] {'loss': 0.0227, 'grad_norm': 0.3230370879173279, 'learning_rate': 3.27235080515921e-05, 'epoch': 3.02} 30%|███ | 3021/10000 [4:46:48<10:43:39, 5.53s/it][2025-06-19 18:16:33,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:16:33,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.23 | bwd_microstep: 3357.30 | bwd_inner_microstep: 3356.49 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 18:16:33,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.23 | bwd: 3357.32 | bwd_inner: 3356.49 | bwd_allreduce: 0.78 | step: 7.09 30%|███ | 3022/10000 [4:46:54<10:43:07, 5.53s/it] {'loss': 0.0392, 'grad_norm': 0.5389090776443481, 'learning_rate': 3.2718509696908704e-05, 'epoch': 3.02} 30%|███ | 3022/10000 [4:46:54<10:43:07, 5.53s/it][2025-06-19 18:16:38,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:16:38,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2092.59 | bwd_microstep: 3298.87 | bwd_inner_microstep: 3298.04 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.03 [2025-06-19 18:16:38,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2092.59 | bwd: 3298.88 | bwd_inner: 3298.04 | bwd_allreduce: 0.79 | step: 7.04 30%|███ | 3023/10000 [4:46:59<10:39:33, 5.50s/it] {'loss': 0.0398, 'grad_norm': 0.6487160921096802, 'learning_rate': 3.271351000811266e-05, 'epoch': 3.02} 30%|███ | 3023/10000 [4:46:59<10:39:33, 5.50s/it][2025-06-19 18:16:44,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:16:44,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2093.40 | bwd_microstep: 3306.01 | bwd_inner_microstep: 3305.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 18:16:44,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.40 | bwd: 3306.02 | bwd_inner: 3305.20 | bwd_allreduce: 0.78 | step: 7.23 30%|███ | 3024/10000 [4:47:05<10:37:15, 5.48s/it] {'loss': 0.0149, 'grad_norm': 0.3389451801776886, 'learning_rate': 3.270850898572842e-05, 'epoch': 3.02} 30%|███ | 3024/10000 [4:47:05<10:37:15, 5.48s/it][2025-06-19 18:16:49,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:16:49,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.51 | bwd_microstep: 3304.76 | bwd_inner_microstep: 3303.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.75 [2025-06-19 18:16:49,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.51 | bwd: 3304.77 | bwd_inner: 3303.95 | bwd_allreduce: 0.78 | step: 6.76 30%|███ | 3025/10000 [4:47:10<10:36:18, 5.47s/it] {'loss': 0.0372, 'grad_norm': 0.8119674324989319, 'learning_rate': 3.270350663028057e-05, 'epoch': 3.02} 30%|███ | 3025/10000 [4:47:10<10:36:18, 5.47s/it][2025-06-19 18:16:55,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:16:55,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.20 | bwd_microstep: 3343.83 | bwd_inner_microstep: 3343.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 18:16:55,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.20 | bwd: 3343.85 | bwd_inner: 3343.03 | bwd_allreduce: 0.77 | step: 6.94 30%|███ | 3026/10000 [4:47:16<10:37:33, 5.49s/it] {'loss': 0.0297, 'grad_norm': 2.025693416595459, 'learning_rate': 3.2698502942293834e-05, 'epoch': 3.03} 30%|███ | 3026/10000 [4:47:16<10:37:33, 5.49s/it][2025-06-19 
18:17:00,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:17:00,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.00 | bwd_microstep: 3354.76 | bwd_inner_microstep: 3353.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 18:17:00,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.00 | bwd: 3354.78 | bwd_inner: 3353.97 | bwd_allreduce: 0.77 | step: 6.95 30%|███ | 3027/10000 [4:47:21<10:38:34, 5.49s/it] {'loss': 0.056, 'grad_norm': 1.4691886901855469, 'learning_rate': 3.2693497922293085e-05, 'epoch': 3.03} 30%|███ | 3027/10000 [4:47:21<10:38:34, 5.49s/it][2025-06-19 18:17:06,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:17:06,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2090.80 | bwd_microstep: 3299.15 | bwd_inner_microstep: 3298.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 18:17:06,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.80 | bwd: 3299.16 | bwd_inner: 3298.36 | bwd_allreduce: 0.76 | step: 6.65 30%|███ | 3028/10000 [4:47:27<10:36:04, 5.47s/it] {'loss': 0.1231, 'grad_norm': 1.6038709878921509, 'learning_rate': 3.2688491570803305e-05, 'epoch': 3.03} 30%|███ | 3028/10000 [4:47:27<10:36:04, 5.47s/it][2025-06-19 18:17:11,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:17:11,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.40 | bwd_microstep: 3342.99 | bwd_inner_microstep: 3342.18 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.77 [2025-06-19 18:17:11,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.40 | bwd: 3343.01 | bwd_inner: 3342.18 | bwd_allreduce: 0.79 | step: 
6.78 30%|███ | 3029/10000 [4:47:32<10:36:37, 5.48s/it] {'loss': 0.0291, 'grad_norm': 0.3316851556301117, 'learning_rate': 3.268348388834965e-05, 'epoch': 3.03} 30%|███ | 3029/10000 [4:47:32<10:36:37, 5.48s/it][2025-06-19 18:17:17,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.74 [2025-06-19 18:17:17,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.62 | bwd_microstep: 3309.53 | bwd_inner_microstep: 3308.59 | bwd_allreduce_microstep: 0.89 | step_microstep: 8.23 [2025-06-19 18:17:17,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.62 | bwd: 3309.54 | bwd_inner: 3308.59 | bwd_allreduce: 0.91 | step: 8.25 30%|███ | 3030/10000 [4:47:37<10:36:01, 5.48s/it] {'loss': 0.0231, 'grad_norm': 0.6443019509315491, 'learning_rate': 3.26784748754574e-05, 'epoch': 3.03} 30%|███ | 3030/10000 [4:47:37<10:36:01, 5.48s/it][2025-06-19 18:17:22,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.81 [2025-06-19 18:17:22,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2092.55 | bwd_microstep: 3297.87 | bwd_inner_microstep: 3297.05 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.94 [2025-06-19 18:17:22,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2092.55 | bwd: 3297.88 | bwd_inner: 3297.05 | bwd_allreduce: 0.78 | step: 6.94 30%|███ | 3031/10000 [4:47:43<10:34:39, 5.46s/it] {'loss': 0.0424, 'grad_norm': 0.7027938961982727, 'learning_rate': 3.267346453265198e-05, 'epoch': 3.03} 30%|███ | 3031/10000 [4:47:43<10:34:39, 5.46s/it][2025-06-19 18:17:28,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:17:28,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2091.43 | bwd_microstep: 3295.29 | bwd_inner_microstep: 
3294.41 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.24 [2025-06-19 18:17:28,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2091.43 | bwd: 3295.32 | bwd_inner: 3294.41 | bwd_allreduce: 0.83 | step: 7.24 30%|███ | 3032/10000 [4:47:48<10:33:19, 5.45s/it] {'loss': 0.0228, 'grad_norm': 0.5745491981506348, 'learning_rate': 3.266845286045895e-05, 'epoch': 3.03} 30%|███ | 3032/10000 [4:47:48<10:33:19, 5.45s/it][2025-06-19 18:17:33,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:17:33,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2155.79 | bwd_microstep: 3345.09 | bwd_inner_microstep: 3344.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 18:17:33,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2155.79 | bwd: 3345.10 | bwd_inner: 3344.30 | bwd_allreduce: 0.76 | step: 6.74 30%|███ | 3033/10000 [4:47:54<10:36:23, 5.48s/it] {'loss': 0.018, 'grad_norm': 0.7170143723487854, 'learning_rate': 3.266343985940401e-05, 'epoch': 3.03} 30%|███ | 3033/10000 [4:47:54<10:36:23, 5.48s/it][2025-06-19 18:17:39,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:17:39,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.73 | bwd_microstep: 3297.02 | bwd_inner_microstep: 3296.18 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.79 [2025-06-19 18:17:39,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.73 | bwd: 3297.03 | bwd_inner: 3296.18 | bwd_allreduce: 0.81 | step: 6.80 30%|███ | 3034/10000 [4:47:59<10:35:01, 5.47s/it] {'loss': 0.0384, 'grad_norm': 0.6195054054260254, 'learning_rate': 3.2658425530013005e-05, 'epoch': 3.03} 30%|███ | 3034/10000 [4:47:59<10:35:01, 5.47s/it][2025-06-19 18:17:44,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:17:44,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.02 | bwd_microstep: 3306.16 | bwd_inner_microstep: 3305.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 18:17:44,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.02 | bwd: 3306.17 | bwd_inner: 3305.36 | bwd_allreduce: 0.77 | step: 7.03 30%|███ | 3035/10000 [4:48:05<10:34:08, 5.46s/it] {'loss': 0.036, 'grad_norm': 0.9853877425193787, 'learning_rate': 3.2653409872811907e-05, 'epoch': 3.04} 30%|███ | 3035/10000 [4:48:05<10:34:08, 5.46s/it][2025-06-19 18:17:49,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:17:49,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.90 | bwd_microstep: 3301.05 | bwd_inner_microstep: 3300.09 | bwd_allreduce_microstep: 0.91 | step_microstep: 6.73 [2025-06-19 18:17:49,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.90 | bwd: 3301.06 | bwd_inner: 3300.09 | bwd_allreduce: 0.93 | step: 6.74 30%|███ | 3036/10000 [4:48:10<10:33:10, 5.46s/it] {'loss': 0.1126, 'grad_norm': 1.345015048980713, 'learning_rate': 3.264839288832684e-05, 'epoch': 3.04} 30%|███ | 3036/10000 [4:48:10<10:33:10, 5.46s/it][2025-06-19 18:17:55,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:17:55,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.92 | bwd_microstep: 3349.26 | bwd_inner_microstep: 3348.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 18:17:55,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.92 | bwd: 3349.27 | bwd_inner: 3348.47 | bwd_allreduce: 0.76 | step: 6.73 30%|███ | 3037/10000 [4:48:16<10:34:36, 5.47s/it] {'loss': 0.1241, 
'grad_norm': 2.1008358001708984, 'learning_rate': 3.264337457708407e-05, 'epoch': 3.04} 30%|███ | 3037/10000 [4:48:16<10:34:36, 5.47s/it][2025-06-19 18:18:00,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:18:00,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.55 | bwd_microstep: 3309.43 | bwd_inner_microstep: 3308.62 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-19 18:18:00,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.55 | bwd: 3309.44 | bwd_inner: 3308.62 | bwd_allreduce: 0.78 | step: 7.29 30%|███ | 3038/10000 [4:48:21<10:33:55, 5.46s/it] {'loss': 0.0244, 'grad_norm': 0.6049602627754211, 'learning_rate': 3.263835493960998e-05, 'epoch': 3.04} 30%|███ | 3038/10000 [4:48:21<10:33:55, 5.46s/it][2025-06-19 18:18:06,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:18:06,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.09 | bwd_microstep: 3310.52 | bwd_inner_microstep: 3309.55 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.18 [2025-06-19 18:18:06,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.09 | bwd: 3310.53 | bwd_inner: 3309.55 | bwd_allreduce: 0.94 | step: 7.18 30%|███ | 3039/10000 [4:48:27<10:33:16, 5.46s/it] {'loss': 0.0161, 'grad_norm': 0.6556470394134521, 'learning_rate': 3.2633333976431116e-05, 'epoch': 3.04} 30%|███ | 3039/10000 [4:48:27<10:33:16, 5.46s/it][2025-06-19 18:18:11,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:18:11,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.52 | bwd_microstep: 3323.67 | bwd_inner_microstep: 3322.65 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.46 [2025-06-19 
18:18:11,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.52 | bwd: 3323.69 | bwd_inner: 3322.65 | bwd_allreduce: 0.99 | step: 7.47 30%|███ | 3040/10000 [4:48:32<10:33:34, 5.46s/it] {'loss': 0.0521, 'grad_norm': 1.5714585781097412, 'learning_rate': 3.262831168807415e-05, 'epoch': 3.04} 30%|███ | 3040/10000 [4:48:32<10:33:34, 5.46s/it][2025-06-19 18:18:17,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:18:17,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.64 | bwd_microstep: 3321.28 | bwd_inner_microstep: 3320.39 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.16 [2025-06-19 18:18:17,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.64 | bwd: 3321.30 | bwd_inner: 3320.39 | bwd_allreduce: 0.86 | step: 7.17 30%|███ | 3041/10000 [4:48:38<10:33:58, 5.47s/it] {'loss': 0.0277, 'grad_norm': 0.8897724747657776, 'learning_rate': 3.262328807506589e-05, 'epoch': 3.04} 30%|███ | 3041/10000 [4:48:38<10:33:58, 5.47s/it][2025-06-19 18:18:22,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:18:22,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.22 | bwd_microstep: 3316.93 | bwd_inner_microstep: 3316.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 18:18:22,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.22 | bwd: 3316.94 | bwd_inner: 3316.13 | bwd_allreduce: 0.77 | step: 6.84 30%|███ | 3042/10000 [4:48:43<10:33:48, 5.47s/it] {'loss': 0.0614, 'grad_norm': 2.3423447608947754, 'learning_rate': 3.26182631379333e-05, 'epoch': 3.04} 30%|███ | 3042/10000 [4:48:43<10:33:48, 5.47s/it][2025-06-19 18:18:28,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 
[2025-06-19 18:18:28,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.07 | bwd_microstep: 3308.36 | bwd_inner_microstep: 3307.52 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.38 [2025-06-19 18:18:28,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.07 | bwd: 3308.38 | bwd_inner: 3307.52 | bwd_allreduce: 0.80 | step: 7.39 30%|███ | 3043/10000 [4:48:48<10:33:29, 5.46s/it] {'loss': 0.0759, 'grad_norm': 1.6412723064422607, 'learning_rate': 3.2613236877203475e-05, 'epoch': 3.04} 30%|███ | 3043/10000 [4:48:48<10:33:29, 5.46s/it][2025-06-19 18:18:33,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:18:33,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.68 | bwd_microstep: 3363.67 | bwd_inner_microstep: 3362.86 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.01 [2025-06-19 18:18:33,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.68 | bwd: 3363.69 | bwd_inner: 3362.86 | bwd_allreduce: 0.79 | step: 7.02 30%|███ | 3044/10000 [4:48:54<10:35:54, 5.49s/it] {'loss': 0.0156, 'grad_norm': 0.6863116025924683, 'learning_rate': 3.2608209293403636e-05, 'epoch': 3.04} 30%|███ | 3044/10000 [4:48:54<10:35:54, 5.49s/it][2025-06-19 18:18:39,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:18:39,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.73 | bwd_microstep: 3315.47 | bwd_inner_microstep: 3314.65 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.05 [2025-06-19 18:18:39,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.73 | bwd: 3315.48 | bwd_inner: 3314.65 | bwd_allreduce: 0.79 | step: 7.05 30%|███ | 3045/10000 [4:48:59<10:34:49, 5.48s/it] {'loss': 0.0084, 'grad_norm': 0.16479024291038513, 'learning_rate': 3.260318038706117e-05, 
'epoch': 3.04} 30%|███ | 3045/10000 [4:48:59<10:34:49, 5.48s/it][2025-06-19 18:18:44,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:18:44,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.48 | bwd_microstep: 3310.91 | bwd_inner_microstep: 3310.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 18:18:44,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.48 | bwd: 3310.93 | bwd_inner: 3310.12 | bwd_allreduce: 0.76 | step: 6.69 30%|███ | 3046/10000 [4:49:05<10:33:57, 5.47s/it] {'loss': 0.0358, 'grad_norm': 1.3727360963821411, 'learning_rate': 3.259815015870357e-05, 'epoch': 3.05} 30%|███ | 3046/10000 [4:49:05<10:33:57, 5.47s/it][2025-06-19 18:18:50,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:18:50,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.46 | bwd_microstep: 3321.79 | bwd_inner_microstep: 3320.80 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.36 [2025-06-19 18:18:50,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.46 | bwd: 3321.81 | bwd_inner: 3320.81 | bwd_allreduce: 0.95 | step: 7.36 30%|███ | 3047/10000 [4:49:10<10:33:43, 5.47s/it] {'loss': 0.0414, 'grad_norm': 1.3134448528289795, 'learning_rate': 3.2593118608858484e-05, 'epoch': 3.05} 30%|███ | 3047/10000 [4:49:10<10:33:43, 5.47s/it][2025-06-19 18:18:55,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:18:55,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.96 | bwd_microstep: 3372.02 | bwd_inner_microstep: 3371.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 18:18:55,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2127.96 | bwd: 3372.04 | bwd_inner: 3371.23 | bwd_allreduce: 0.77 | step: 6.92 30%|███ | 3048/10000 [4:49:16<10:36:04, 5.49s/it] {'loss': 0.0937, 'grad_norm': 1.6349172592163086, 'learning_rate': 3.25880857380537e-05, 'epoch': 3.05} 30%|███ | 3048/10000 [4:49:16<10:36:04, 5.49s/it][2025-06-19 18:19:01,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:19:01,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.15 | bwd_microstep: 3397.95 | bwd_inner_microstep: 3397.14 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.82 [2025-06-19 18:19:01,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.15 | bwd: 3397.96 | bwd_inner: 3397.14 | bwd_allreduce: 0.78 | step: 6.84 30%|███ | 3049/10000 [4:49:22<10:39:10, 5.52s/it] {'loss': 0.0273, 'grad_norm': 0.6891468167304993, 'learning_rate': 3.258305154681715e-05, 'epoch': 3.05} 30%|███ | 3049/10000 [4:49:22<10:39:10, 5.52s/it][2025-06-19 18:19:06,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:19:06,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.82 | bwd_microstep: 3327.20 | bwd_inner_microstep: 3326.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 18:19:06,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.82 | bwd: 3327.21 | bwd_inner: 3326.40 | bwd_allreduce: 0.76 | step: 6.65 30%|███ | 3050/10000 [4:49:27<10:37:30, 5.50s/it] {'loss': 0.0151, 'grad_norm': 0.7540111541748047, 'learning_rate': 3.2578016035676895e-05, 'epoch': 3.05} 30%|███ | 3050/10000 [4:49:27<10:37:30, 5.50s/it][2025-06-19 18:19:12,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:19:12,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2127.86 | bwd_microstep: 3378.43 | bwd_inner_microstep: 3377.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 18:19:12,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.86 | bwd: 3378.45 | bwd_inner: 3377.65 | bwd_allreduce: 0.76 | step: 6.64 31%|███ | 3051/10000 [4:49:33<10:38:44, 5.52s/it] {'loss': 0.0931, 'grad_norm': 2.292304754257202, 'learning_rate': 3.257297920516113e-05, 'epoch': 3.05} 31%|███ | 3051/10000 [4:49:33<10:38:44, 5.52s/it][2025-06-19 18:19:17,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:19:17,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.72 | bwd_microstep: 3313.69 | bwd_inner_microstep: 3312.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 18:19:17,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.72 | bwd: 3313.70 | bwd_inner: 3312.90 | bwd_allreduce: 0.76 | step: 6.64 31%|███ | 3052/10000 [4:49:38<10:36:39, 5.50s/it] {'loss': 0.1583, 'grad_norm': 2.574383020401001, 'learning_rate': 3.25679410557982e-05, 'epoch': 3.05} 31%|███ | 3052/10000 [4:49:38<10:36:39, 5.50s/it][2025-06-19 18:19:23,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:19:23,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.33 | bwd_microstep: 3329.43 | bwd_inner_microstep: 3328.42 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.80 [2025-06-19 18:19:23,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.33 | bwd: 3329.45 | bwd_inner: 3328.42 | bwd_allreduce: 0.97 | step: 7.80 31%|███ | 3053/10000 [4:49:43<10:35:53, 5.49s/it] {'loss': 0.069, 'grad_norm': 3.6732776165008545, 'learning_rate': 3.256290158811658e-05, 'epoch': 3.05} 31%|███ | 3053/10000 [4:49:43<10:35:53, 5.49s/it][2025-06-19 
18:19:28,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:19:28,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.58 | bwd_microstep: 3329.53 | bwd_inner_microstep: 3328.72 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.47 [2025-06-19 18:19:28,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.58 | bwd: 3329.55 | bwd_inner: 3328.72 | bwd_allreduce: 0.79 | step: 7.47 31%|███ | 3054/10000 [4:49:49<10:35:38, 5.49s/it] {'loss': 0.0311, 'grad_norm': 1.0477384328842163, 'learning_rate': 3.255786080264489e-05, 'epoch': 3.05} 31%|███ | 3054/10000 [4:49:49<10:35:38, 5.49s/it][2025-06-19 18:19:34,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:19:34,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.21 | bwd_microstep: 3381.47 | bwd_inner_microstep: 3380.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 18:19:34,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.21 | bwd: 3381.48 | bwd_inner: 3380.67 | bwd_allreduce: 0.76 | step: 6.71 31%|███ | 3055/10000 [4:49:54<10:37:37, 5.51s/it] {'loss': 0.0887, 'grad_norm': 2.3002004623413086, 'learning_rate': 3.255281869991189e-05, 'epoch': 3.06} 31%|███ | 3055/10000 [4:49:54<10:37:37, 5.51s/it][2025-06-19 18:19:39,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:19:39,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.38 | bwd_microstep: 3331.43 | bwd_inner_microstep: 3330.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 18:19:39,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.38 | bwd: 3331.45 | bwd_inner: 3330.64 | bwd_allreduce: 0.76 | step: 
6.67 31%|███ | 3056/10000 [4:50:00<10:36:13, 5.50s/it] {'loss': 0.0344, 'grad_norm': 0.8626468777656555, 'learning_rate': 3.254777528044646e-05, 'epoch': 3.06} 31%|███ | 3056/10000 [4:50:00<10:36:13, 5.50s/it][2025-06-19 18:19:45,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:19:45,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.03 | bwd_microstep: 3325.02 | bwd_inner_microstep: 3324.19 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.08 [2025-06-19 18:19:45,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.03 | bwd: 3325.03 | bwd_inner: 3324.19 | bwd_allreduce: 0.79 | step: 7.09 31%|███ | 3057/10000 [4:50:05<10:35:08, 5.49s/it] {'loss': 0.0353, 'grad_norm': 0.8459868431091309, 'learning_rate': 3.2542730544777654e-05, 'epoch': 3.06} 31%|███ | 3057/10000 [4:50:05<10:35:08, 5.49s/it][2025-06-19 18:19:50,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:19:50,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.47 | bwd_microstep: 3334.63 | bwd_inner_microstep: 3333.82 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 18:19:50,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.47 | bwd: 3334.64 | bwd_inner: 3333.82 | bwd_allreduce: 0.77 | step: 7.00 31%|███ | 3058/10000 [4:50:11<10:34:53, 5.49s/it] {'loss': 0.0366, 'grad_norm': 0.9793805480003357, 'learning_rate': 3.253768449343461e-05, 'epoch': 3.06} 31%|███ | 3058/10000 [4:50:11<10:34:53, 5.49s/it][2025-06-19 18:19:56,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:19:56,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.80 | bwd_microstep: 3371.23 | bwd_inner_microstep: 
3370.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 18:19:56,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.80 | bwd: 3371.25 | bwd_inner: 3370.44 | bwd_allreduce: 0.77 | step: 6.69 31%|███ | 3059/10000 [4:50:16<10:36:33, 5.50s/it] {'loss': 0.0656, 'grad_norm': 2.0603761672973633, 'learning_rate': 3.253263712694666e-05, 'epoch': 3.06} 31%|███ | 3059/10000 [4:50:16<10:36:33, 5.50s/it][2025-06-19 18:20:01,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:20:01,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.59 | bwd_microstep: 3384.69 | bwd_inner_microstep: 3383.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 18:20:01,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.59 | bwd: 3384.70 | bwd_inner: 3383.90 | bwd_allreduce: 0.76 | step: 6.67 31%|███ | 3060/10000 [4:50:22<10:38:07, 5.52s/it] {'loss': 0.0086, 'grad_norm': 0.3396371304988861, 'learning_rate': 3.252758844584324e-05, 'epoch': 3.06} 31%|███ | 3060/10000 [4:50:22<10:38:07, 5.52s/it][2025-06-19 18:20:07,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:20:07,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.22 | bwd_microstep: 3371.03 | bwd_inner_microstep: 3370.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 18:20:07,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.22 | bwd: 3371.05 | bwd_inner: 3370.24 | bwd_allreduce: 0.77 | step: 6.99 31%|███ | 3061/10000 [4:50:28<10:38:41, 5.52s/it] {'loss': 0.041, 'grad_norm': 0.9000378251075745, 'learning_rate': 3.2522538450653935e-05, 'epoch': 3.06} 31%|███ | 3061/10000 [4:50:28<10:38:41, 5.52s/it][2025-06-19 18:20:12,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:20:12,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.89 | bwd_microstep: 3326.83 | bwd_inner_microstep: 3325.94 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.95 [2025-06-19 18:20:12,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.89 | bwd: 3326.85 | bwd_inner: 3325.94 | bwd_allreduce: 0.86 | step: 6.95 31%|███ | 3062/10000 [4:50:33<10:36:49, 5.51s/it] {'loss': 0.0642, 'grad_norm': 3.041079521179199, 'learning_rate': 3.251748714190847e-05, 'epoch': 3.06} 31%|███ | 3062/10000 [4:50:33<10:36:49, 5.51s/it][2025-06-19 18:20:18,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:20:18,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.42 | bwd_microstep: 3329.96 | bwd_inner_microstep: 3329.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.63 [2025-06-19 18:20:18,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.42 | bwd: 3329.98 | bwd_inner: 3329.16 | bwd_allreduce: 0.78 | step: 6.63 31%|███ | 3063/10000 [4:50:39<10:35:54, 5.50s/it] {'loss': 0.0147, 'grad_norm': 0.3374073803424835, 'learning_rate': 3.251243452013669e-05, 'epoch': 3.06} 31%|███ | 3063/10000 [4:50:39<10:35:54, 5.50s/it][2025-06-19 18:20:23,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:20:23,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.90 | bwd_microstep: 3331.40 | bwd_inner_microstep: 3330.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 18:20:23,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.90 | bwd: 3331.42 | bwd_inner: 3330.63 | bwd_allreduce: 0.75 | step: 6.57 31%|███ | 3064/10000 [4:50:44<10:35:12, 5.49s/it] {'loss': 0.0323, 
'grad_norm': 0.9276639223098755, 'learning_rate': 3.2507380585868605e-05, 'epoch': 3.06} 31%|███ | 3064/10000 [4:50:44<10:35:12, 5.49s/it][2025-06-19 18:20:29,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:20:29,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.93 | bwd_microstep: 3332.66 | bwd_inner_microstep: 3331.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.93 [2025-06-19 18:20:29,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.93 | bwd: 3332.67 | bwd_inner: 3331.87 | bwd_allreduce: 0.76 | step: 6.93 31%|███ | 3065/10000 [4:50:49<10:34:48, 5.49s/it] {'loss': 0.0252, 'grad_norm': 1.151438593864441, 'learning_rate': 3.2502325339634344e-05, 'epoch': 3.06} 31%|███ | 3065/10000 [4:50:49<10:34:48, 5.49s/it][2025-06-19 18:20:34,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:20:34,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.68 | bwd_microstep: 3325.50 | bwd_inner_microstep: 3324.32 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.27 [2025-06-19 18:20:34,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.68 | bwd: 3325.52 | bwd_inner: 3324.32 | bwd_allreduce: 1.14 | step: 7.27 31%|███ | 3066/10000 [4:50:55<10:33:57, 5.49s/it] {'loss': 0.0548, 'grad_norm': 1.686381459236145, 'learning_rate': 3.249726878196418e-05, 'epoch': 3.07} 31%|███ | 3066/10000 [4:50:55<10:33:57, 5.49s/it][2025-06-19 18:20:40,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:20:40,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.85 | bwd_microstep: 3322.63 | bwd_inner_microstep: 3321.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 
18:20:40,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.85 | bwd: 3322.65 | bwd_inner: 3321.84 | bwd_allreduce: 0.77 | step: 6.63 31%|███ | 3067/10000 [4:51:00<10:33:48, 5.49s/it] {'loss': 0.0092, 'grad_norm': 0.49463534355163574, 'learning_rate': 3.249221091338853e-05, 'epoch': 3.07} 31%|███ | 3067/10000 [4:51:00<10:33:48, 5.49s/it][2025-06-19 18:20:45,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:20:45,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.50 | bwd_microstep: 3379.22 | bwd_inner_microstep: 3378.41 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.74 [2025-06-19 18:20:45,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.50 | bwd: 3379.24 | bwd_inner: 3378.41 | bwd_allreduce: 0.79 | step: 6.74 31%|███ | 3068/10000 [4:51:06<10:35:59, 5.50s/it] {'loss': 0.0178, 'grad_norm': 0.7754970192909241, 'learning_rate': 3.248715173443792e-05, 'epoch': 3.07} 31%|███ | 3068/10000 [4:51:06<10:35:59, 5.50s/it][2025-06-19 18:20:51,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:20:51,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.82 | bwd_microstep: 3323.08 | bwd_inner_microstep: 3322.24 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.80 [2025-06-19 18:20:51,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.82 | bwd: 3323.11 | bwd_inner: 3322.24 | bwd_allreduce: 0.81 | step: 6.80 31%|███ | 3069/10000 [4:51:11<10:34:51, 5.50s/it] {'loss': 0.0641, 'grad_norm': 2.6342053413391113, 'learning_rate': 3.248209124564305e-05, 'epoch': 3.07} 31%|███ | 3069/10000 [4:51:11<10:34:51, 5.50s/it][2025-06-19 18:20:56,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 
[2025-06-19 18:20:56,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.78 | bwd_microstep: 3323.02 | bwd_inner_microstep: 3322.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 18:20:56,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.78 | bwd: 3323.03 | bwd_inner: 3322.24 | bwd_allreduce: 0.75 | step: 6.65 31%|███ | 3070/10000 [4:51:17<10:33:51, 5.49s/it] {'loss': 0.1106, 'grad_norm': 2.8440616130828857, 'learning_rate': 3.2477029447534744e-05, 'epoch': 3.07} 31%|███ | 3070/10000 [4:51:17<10:33:51, 5.49s/it][2025-06-19 18:21:02,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:21:02,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.85 | bwd_microstep: 3378.98 | bwd_inner_microstep: 3378.16 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.17 [2025-06-19 18:21:02,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.85 | bwd: 3379.00 | bwd_inner: 3378.16 | bwd_allreduce: 0.79 | step: 7.17 31%|███ | 3071/10000 [4:51:22<10:35:49, 5.51s/it] {'loss': 0.0157, 'grad_norm': 1.2233357429504395, 'learning_rate': 3.247196634064396e-05, 'epoch': 3.07} 31%|███ | 3071/10000 [4:51:22<10:35:49, 5.51s/it][2025-06-19 18:21:07,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:21:07,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.00 | bwd_microstep: 3324.07 | bwd_inner_microstep: 3323.03 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.41 [2025-06-19 18:21:07,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.00 | bwd: 3324.09 | bwd_inner: 3323.03 | bwd_allreduce: 1.01 | step: 7.41 31%|███ | 3072/10000 [4:51:28<10:34:47, 5.50s/it] {'loss': 0.013, 'grad_norm': 0.6696852445602417, 'learning_rate': 3.246690192550179e-05, 
'epoch': 3.07} 31%|███ | 3072/10000 [4:51:28<10:34:47, 5.50s/it][2025-06-19 18:21:13,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:21:13,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.45 | bwd_microstep: 3328.01 | bwd_inner_microstep: 3327.06 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.06 [2025-06-19 18:21:13,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.45 | bwd: 3328.03 | bwd_inner: 3327.06 | bwd_allreduce: 0.92 | step: 7.08 31%|███ | 3073/10000 [4:51:33<10:34:18, 5.49s/it] {'loss': 0.1075, 'grad_norm': 2.286181926727295, 'learning_rate': 3.2461836202639466e-05, 'epoch': 3.07} 31%|███ | 3073/10000 [4:51:33<10:34:18, 5.49s/it][2025-06-19 18:21:18,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:21:18,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.91 | bwd_microstep: 3328.88 | bwd_inner_microstep: 3328.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 18:21:18,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.91 | bwd: 3328.89 | bwd_inner: 3328.09 | bwd_allreduce: 0.76 | step: 6.64 31%|███ | 3074/10000 [4:51:39<10:33:25, 5.49s/it] {'loss': 0.0223, 'grad_norm': 0.7672392129898071, 'learning_rate': 3.2456769172588356e-05, 'epoch': 3.07} 31%|███ | 3074/10000 [4:51:39<10:33:25, 5.49s/it][2025-06-19 18:21:24,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:21:24,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.96 | bwd_microstep: 3317.00 | bwd_inner_microstep: 3316.01 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.22 [2025-06-19 18:21:24,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2103.96 | bwd: 3317.02 | bwd_inner: 3316.01 | bwd_allreduce: 0.95 | step: 7.23 31%|███ | 3075/10000 [4:51:44<10:32:25, 5.48s/it] {'loss': 0.1855, 'grad_norm': 4.090266704559326, 'learning_rate': 3.245170083587998e-05, 'epoch': 3.08} 31%|███ | 3075/10000 [4:51:44<10:32:25, 5.48s/it][2025-06-19 18:21:29,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:21:29,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.39 | bwd_microstep: 3332.98 | bwd_inner_microstep: 3332.18 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 18:21:29,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.39 | bwd: 3333.00 | bwd_inner: 3332.18 | bwd_allreduce: 0.77 | step: 7.21 31%|███ | 3076/10000 [4:51:50<10:32:36, 5.48s/it] {'loss': 0.0051, 'grad_norm': 0.3022397458553314, 'learning_rate': 3.244663119304598e-05, 'epoch': 3.08} 31%|███ | 3076/10000 [4:51:50<10:32:36, 5.48s/it][2025-06-19 18:21:35,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:21:35,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.35 | bwd_microstep: 3379.42 | bwd_inner_microstep: 3378.64 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 18:21:35,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.35 | bwd: 3379.43 | bwd_inner: 3378.64 | bwd_allreduce: 0.76 | step: 6.58 31%|███ | 3077/10000 [4:51:55<10:34:47, 5.50s/it] {'loss': 0.0893, 'grad_norm': 1.863778829574585, 'learning_rate': 3.244156024461813e-05, 'epoch': 3.08} 31%|███ | 3077/10000 [4:51:55<10:34:47, 5.50s/it][2025-06-19 18:21:40,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:21:40,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2133.45 | bwd_microstep: 3375.45 | bwd_inner_microstep: 3374.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.45 [2025-06-19 18:21:40,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.45 | bwd: 3375.47 | bwd_inner: 3374.65 | bwd_allreduce: 0.77 | step: 7.45 31%|███ | 3078/10000 [4:52:01<10:36:15, 5.52s/it] {'loss': 0.139, 'grad_norm': 3.566577196121216, 'learning_rate': 3.2436487991128354e-05, 'epoch': 3.08} 31%|███ | 3078/10000 [4:52:01<10:36:15, 5.52s/it][2025-06-19 18:21:46,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.78 [2025-06-19 18:21:46,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.08 | bwd_microstep: 3375.70 | bwd_inner_microstep: 3374.82 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.44 [2025-06-19 18:21:46,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.08 | bwd: 3375.72 | bwd_inner: 3374.82 | bwd_allreduce: 0.83 | step: 7.44 31%|███ | 3079/10000 [4:52:06<10:37:26, 5.53s/it] {'loss': 0.0784, 'grad_norm': 2.2845101356506348, 'learning_rate': 3.24314144331087e-05, 'epoch': 3.08} 31%|███ | 3079/10000 [4:52:07<10:37:26, 5.53s/it][2025-06-19 18:21:51,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:21:51,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.56 | bwd_microstep: 3328.89 | bwd_inner_microstep: 3328.09 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 18:21:51,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.57 | bwd: 3328.91 | bwd_inner: 3328.09 | bwd_allreduce: 0.77 | step: 7.02 31%|███ | 3080/10000 [4:52:12<10:35:36, 5.51s/it] {'loss': 0.0138, 'grad_norm': 0.6454886794090271, 'learning_rate': 3.2426339571091377e-05, 'epoch': 3.08} 31%|███ | 3080/10000 [4:52:12<10:35:36, 5.51s/it][2025-06-19 
18:21:57,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:21:57,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.48 | bwd_microstep: 3329.41 | bwd_inner_microstep: 3328.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 18:21:57,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.48 | bwd: 3329.42 | bwd_inner: 3328.62 | bwd_allreduce: 0.76 | step: 6.71 31%|███ | 3081/10000 [4:52:17<10:34:23, 5.50s/it] {'loss': 0.0378, 'grad_norm': 1.1640686988830566, 'learning_rate': 3.2421263405608706e-05, 'epoch': 3.08} 31%|███ | 3081/10000 [4:52:17<10:34:23, 5.50s/it][2025-06-19 18:22:02,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:22:02,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.92 | bwd_microstep: 3373.77 | bwd_inner_microstep: 3372.79 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.46 [2025-06-19 18:22:02,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.92 | bwd: 3373.79 | bwd_inner: 3372.79 | bwd_allreduce: 0.95 | step: 7.47 31%|███ | 3082/10000 [4:52:23<10:35:52, 5.51s/it] {'loss': 0.0374, 'grad_norm': 1.3188446760177612, 'learning_rate': 3.241618593719315e-05, 'epoch': 3.08} 31%|███ | 3082/10000 [4:52:23<10:35:52, 5.51s/it][2025-06-19 18:22:08,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:22:08,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.46 | bwd_microstep: 3323.07 | bwd_inner_microstep: 3322.23 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.86 [2025-06-19 18:22:08,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.46 | bwd: 3323.08 | bwd_inner: 3322.23 | bwd_allreduce: 0.80 | step: 
6.86 31%|███ | 3083/10000 [4:52:28<10:34:32, 5.50s/it] {'loss': 0.1184, 'grad_norm': 3.0299315452575684, 'learning_rate': 3.2411107166377306e-05, 'epoch': 3.08} 31%|███ | 3083/10000 [4:52:28<10:34:32, 5.50s/it][2025-06-19 18:22:13,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:22:13,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.61 | bwd_microstep: 3370.95 | bwd_inner_microstep: 3370.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.88 [2025-06-19 18:22:13,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.61 | bwd: 3370.96 | bwd_inner: 3370.16 | bwd_allreduce: 0.76 | step: 6.88 31%|███ | 3084/10000 [4:52:34<10:35:34, 5.51s/it] {'loss': 0.0175, 'grad_norm': 0.6226794719696045, 'learning_rate': 3.2406027093693934e-05, 'epoch': 3.08} 31%|███ | 3084/10000 [4:52:34<10:35:34, 5.51s/it][2025-06-19 18:22:19,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.78 [2025-06-19 18:22:19,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.51 | bwd_microstep: 3374.05 | bwd_inner_microstep: 3373.08 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.20 [2025-06-19 18:22:19,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.51 | bwd: 3374.06 | bwd_inner: 3373.08 | bwd_allreduce: 0.94 | step: 7.20 31%|███ | 3085/10000 [4:52:40<10:36:22, 5.52s/it] {'loss': 0.1389, 'grad_norm': 3.2228593826293945, 'learning_rate': 3.240094571967589e-05, 'epoch': 3.08} 31%|███ | 3085/10000 [4:52:40<10:36:22, 5.52s/it][2025-06-19 18:22:24,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:22:24,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.09 | bwd_microstep: 3375.95 | bwd_inner_microstep: 
3375.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 18:22:24,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.09 | bwd: 3375.97 | bwd_inner: 3375.16 | bwd_allreduce: 0.76 | step: 6.73 31%|███ | 3086/10000 [4:52:45<10:37:17, 5.53s/it] {'loss': 0.0647, 'grad_norm': 1.5226327180862427, 'learning_rate': 3.23958630448562e-05, 'epoch': 3.09} 31%|███ | 3086/10000 [4:52:45<10:37:17, 5.53s/it][2025-06-19 18:22:30,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:22:30,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.56 | bwd_microstep: 3319.89 | bwd_inner_microstep: 3319.06 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.86 [2025-06-19 18:22:30,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.56 | bwd: 3319.91 | bwd_inner: 3319.06 | bwd_allreduce: 0.79 | step: 6.86 31%|███ | 3087/10000 [4:52:51<10:34:49, 5.51s/it] {'loss': 0.0187, 'grad_norm': 0.5869436264038086, 'learning_rate': 3.239077906976801e-05, 'epoch': 3.09} 31%|███ | 3087/10000 [4:52:51<10:34:49, 5.51s/it][2025-06-19 18:22:35,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:22:35,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.47 | bwd_microstep: 3375.24 | bwd_inner_microstep: 3374.37 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.34 [2025-06-19 18:22:35,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.47 | bwd: 3375.26 | bwd_inner: 3374.37 | bwd_allreduce: 0.84 | step: 7.35 31%|███ | 3088/10000 [4:52:56<10:36:08, 5.52s/it] {'loss': 0.008, 'grad_norm': 0.372374564409256, 'learning_rate': 3.23856937949446e-05, 'epoch': 3.09} 31%|███ | 3088/10000 [4:52:56<10:36:08, 5.52s/it][2025-06-19 18:22:41,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:22:41,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.24 | bwd_microstep: 3323.58 | bwd_inner_microstep: 3322.75 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.00 [2025-06-19 18:22:41,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.24 | bwd: 3323.59 | bwd_inner: 3322.76 | bwd_allreduce: 0.79 | step: 7.00 31%|███ | 3089/10000 [4:53:02<10:34:03, 5.50s/it] {'loss': 0.0585, 'grad_norm': 1.5515575408935547, 'learning_rate': 3.238060722091939e-05, 'epoch': 3.09} 31%|███ | 3089/10000 [4:53:02<10:34:03, 5.50s/it][2025-06-19 18:22:46,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:22:46,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.29 | bwd_microstep: 3322.01 | bwd_inner_microstep: 3321.07 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.25 [2025-06-19 18:22:46,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.29 | bwd: 3322.03 | bwd_inner: 3321.07 | bwd_allreduce: 0.91 | step: 7.25 31%|███ | 3090/10000 [4:53:07<10:33:02, 5.50s/it] {'loss': 0.0147, 'grad_norm': 0.7290933132171631, 'learning_rate': 3.2375519348225945e-05, 'epoch': 3.09} 31%|███ | 3090/10000 [4:53:07<10:33:02, 5.50s/it][2025-06-19 18:22:52,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:22:52,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.12 | bwd_microstep: 3380.39 | bwd_inner_microstep: 3379.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 18:22:52,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.13 | bwd: 3380.40 | bwd_inner: 3379.59 | bwd_allreduce: 0.76 | step: 7.05 31%|███ | 3091/10000 [4:53:13<10:34:32, 5.51s/it] {'loss': 
0.1829, 'grad_norm': 3.90712308883667, 'learning_rate': 3.237043017739796e-05, 'epoch': 3.09} 31%|███ | 3091/10000 [4:53:13<10:34:32, 5.51s/it][2025-06-19 18:22:57,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:22:57,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.28 | bwd_microstep: 3379.88 | bwd_inner_microstep: 3379.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.87 [2025-06-19 18:22:57,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.28 | bwd: 3379.89 | bwd_inner: 3379.09 | bwd_allreduce: 0.76 | step: 6.87 31%|███ | 3092/10000 [4:53:18<10:35:59, 5.52s/it] {'loss': 0.0114, 'grad_norm': 0.48158028721809387, 'learning_rate': 3.236533970896926e-05, 'epoch': 3.09} 31%|███ | 3092/10000 [4:53:18<10:35:59, 5.52s/it][2025-06-19 18:23:03,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:23:03,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.06 | bwd_microstep: 3320.63 | bwd_inner_microstep: 3319.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 18:23:03,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.06 | bwd: 3320.64 | bwd_inner: 3319.84 | bwd_allreduce: 0.76 | step: 6.54 31%|███ | 3093/10000 [4:53:24<10:33:35, 5.50s/it] {'loss': 0.0147, 'grad_norm': 0.695866584777832, 'learning_rate': 3.236024794347381e-05, 'epoch': 3.09} 31%|███ | 3093/10000 [4:53:24<10:33:35, 5.50s/it][2025-06-19 18:23:08,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:23:08,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.77 | bwd_microstep: 3324.17 | bwd_inner_microstep: 3323.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 
[2025-06-19 18:23:08,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.77 | bwd: 3324.19 | bwd_inner: 3323.39 | bwd_allreduce: 0.75 | step: 6.55 31%|███ | 3094/10000 [4:53:29<10:32:18, 5.49s/it] {'loss': 0.0285, 'grad_norm': 1.2251781225204468, 'learning_rate': 3.2355154881445726e-05, 'epoch': 3.09} 31%|███ | 3094/10000 [4:53:29<10:32:18, 5.49s/it][2025-06-19 18:23:14,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:23:14,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.95 | bwd_microstep: 3325.33 | bwd_inner_microstep: 3324.37 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.39 [2025-06-19 18:23:14,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.95 | bwd: 3325.34 | bwd_inner: 3324.37 | bwd_allreduce: 0.93 | step: 7.39 31%|███ | 3095/10000 [4:53:35<10:31:22, 5.49s/it] {'loss': 0.0299, 'grad_norm': 1.8640538454055786, 'learning_rate': 3.235006052341923e-05, 'epoch': 3.1} 31%|███ | 3095/10000 [4:53:35<10:31:22, 5.49s/it][2025-06-19 18:23:19,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:23:19,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.95 | bwd_microstep: 3326.58 | bwd_inner_microstep: 3325.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 18:23:19,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.95 | bwd: 3326.59 | bwd_inner: 3325.79 | bwd_allreduce: 0.76 | step: 6.67 31%|███ | 3096/10000 [4:53:40<10:30:49, 5.48s/it] {'loss': 0.0475, 'grad_norm': 2.282165765762329, 'learning_rate': 3.2344964869928706e-05, 'epoch': 3.1} 31%|███ | 3096/10000 [4:53:40<10:30:49, 5.48s/it][2025-06-19 18:23:25,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 
2.91 [2025-06-19 18:23:25,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.31 | bwd_microstep: 3327.39 | bwd_inner_microstep: 3326.51 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.95 [2025-06-19 18:23:25,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.31 | bwd: 3327.40 | bwd_inner: 3326.51 | bwd_allreduce: 0.85 | step: 6.95 31%|███ | 3097/10000 [4:53:46<10:30:28, 5.48s/it] {'loss': 0.0271, 'grad_norm': 1.3158217668533325, 'learning_rate': 3.233986792150866e-05, 'epoch': 3.1} 31%|███ | 3097/10000 [4:53:46<10:30:28, 5.48s/it][2025-06-19 18:23:30,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:23:30,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.31 | bwd_microstep: 3319.72 | bwd_inner_microstep: 3318.78 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.05 [2025-06-19 18:23:30,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.31 | bwd: 3319.73 | bwd_inner: 3318.78 | bwd_allreduce: 0.90 | step: 7.05 31%|███ | 3098/10000 [4:53:51<10:29:54, 5.48s/it] {'loss': 0.0693, 'grad_norm': 1.9230278730392456, 'learning_rate': 3.2334769678693744e-05, 'epoch': 3.1} 31%|███ | 3098/10000 [4:53:51<10:29:54, 5.48s/it][2025-06-19 18:23:36,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:23:36,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.46 | bwd_microstep: 3366.48 | bwd_inner_microstep: 3365.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 18:23:36,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.46 | bwd: 3366.49 | bwd_inner: 3365.68 | bwd_allreduce: 0.77 | step: 6.67 31%|███ | 3099/10000 [4:53:57<10:31:48, 5.49s/it] {'loss': 0.0209, 'grad_norm': 1.223156452178955, 'learning_rate': 3.232967014201873e-05, 
'epoch': 3.1} 31%|███ | 3099/10000 [4:53:57<10:31:48, 5.49s/it][2025-06-19 18:23:41,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:23:41,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.81 | bwd_microstep: 3411.27 | bwd_inner_microstep: 3410.32 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.14 [2025-06-19 18:23:41,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.81 | bwd: 3411.28 | bwd_inner: 3410.32 | bwd_allreduce: 0.91 | step: 7.14 31%|███ | 3100/10000 [4:54:02<10:35:21, 5.52s/it] {'loss': 0.0472, 'grad_norm': 1.3142427206039429, 'learning_rate': 3.232456931201855e-05, 'epoch': 3.1} 31%|███ | 3100/10000 [4:54:02<10:35:21, 5.52s/it][2025-06-19 18:23:47,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:23:47,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.57 | bwd_microstep: 3329.39 | bwd_inner_microstep: 3328.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-19 18:23:47,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.57 | bwd: 3329.40 | bwd_inner: 3328.59 | bwd_allreduce: 0.77 | step: 7.02 31%|███ | 3101/10000 [4:54:08<10:33:46, 5.51s/it] {'loss': 0.0362, 'grad_norm': 1.7653170824050903, 'learning_rate': 3.231946718922824e-05, 'epoch': 3.1} 31%|███ | 3101/10000 [4:54:08<10:33:46, 5.51s/it][2025-06-19 18:23:52,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:23:52,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.01 | bwd_microstep: 3314.39 | bwd_inner_microstep: 3313.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 18:23:52,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2108.01 | bwd: 3314.41 | bwd_inner: 3313.61 | bwd_allreduce: 0.75 | step: 6.68 31%|███ | 3102/10000 [4:54:13<10:31:57, 5.50s/it] {'loss': 0.1252, 'grad_norm': 2.4124062061309814, 'learning_rate': 3.231436377418301e-05, 'epoch': 3.1} 31%|███ | 3102/10000 [4:54:13<10:31:57, 5.50s/it][2025-06-19 18:23:58,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:23:58,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.33 | bwd_microstep: 3313.21 | bwd_inner_microstep: 3312.40 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.01 [2025-06-19 18:23:58,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.33 | bwd: 3313.23 | bwd_inner: 3312.40 | bwd_allreduce: 0.78 | step: 7.02 31%|███ | 3103/10000 [4:54:19<10:30:56, 5.49s/it] {'loss': 0.02, 'grad_norm': 1.3046114444732666, 'learning_rate': 3.2309259067418165e-05, 'epoch': 3.1} 31%|███ | 3103/10000 [4:54:19<10:30:56, 5.49s/it][2025-06-19 18:24:03,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:24:03,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.94 | bwd_microstep: 3320.89 | bwd_inner_microstep: 3320.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 18:24:03,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.94 | bwd: 3320.90 | bwd_inner: 3320.08 | bwd_allreduce: 0.77 | step: 6.95 31%|███ | 3104/10000 [4:54:24<10:30:06, 5.48s/it] {'loss': 0.0619, 'grad_norm': 2.283109188079834, 'learning_rate': 3.230415306946917e-05, 'epoch': 3.1} 31%|███ | 3104/10000 [4:54:24<10:30:06, 5.48s/it][2025-06-19 18:24:09,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:24:09,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2103.95 | bwd_microstep: 3326.18 | bwd_inner_microstep: 3325.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 18:24:09,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.95 | bwd: 3326.19 | bwd_inner: 3325.37 | bwd_allreduce: 0.77 | step: 7.18 31%|███ | 3105/10000 [4:54:29<10:29:33, 5.48s/it] {'loss': 0.0467, 'grad_norm': 1.5364607572555542, 'learning_rate': 3.229904578087163e-05, 'epoch': 3.1} 31%|███ | 3105/10000 [4:54:29<10:29:33, 5.48s/it][2025-06-19 18:24:14,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:24:14,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.51 | bwd_microstep: 3311.36 | bwd_inner_microstep: 3310.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.59 [2025-06-19 18:24:14,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.51 | bwd: 3311.38 | bwd_inner: 3310.57 | bwd_allreduce: 0.76 | step: 6.59 31%|███ | 3106/10000 [4:54:35<10:28:34, 5.47s/it] {'loss': 0.0491, 'grad_norm': 1.6496503353118896, 'learning_rate': 3.2293937202161266e-05, 'epoch': 3.11} 31%|███ | 3106/10000 [4:54:35<10:28:34, 5.47s/it][2025-06-19 18:24:20,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:24:20,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.19 | bwd_microstep: 3321.44 | bwd_inner_microstep: 3320.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 18:24:20,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.19 | bwd: 3321.46 | bwd_inner: 3320.65 | bwd_allreduce: 0.76 | step: 6.67 31%|███ | 3107/10000 [4:54:40<10:28:07, 5.47s/it] {'loss': 0.0763, 'grad_norm': 2.4146881103515625, 'learning_rate': 3.228882733387394e-05, 'epoch': 3.11} 31%|███ | 3107/10000 [4:54:40<10:28:07, 5.47s/it][2025-06-19 
18:24:25,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:24:25,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.58 | bwd_microstep: 3368.54 | bwd_inner_microstep: 3367.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 18:24:25,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.58 | bwd: 3368.55 | bwd_inner: 3367.73 | bwd_allreduce: 0.78 | step: 7.17 31%|███ | 3108/10000 [4:54:46<10:30:24, 5.49s/it] {'loss': 0.1938, 'grad_norm': 3.410863161087036, 'learning_rate': 3.2283716176545676e-05, 'epoch': 3.11} 31%|███ | 3108/10000 [4:54:46<10:30:24, 5.49s/it][2025-06-19 18:24:31,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:24:31,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.20 | bwd_microstep: 3316.38 | bwd_inner_microstep: 3315.12 | bwd_allreduce_microstep: 1.20 | step_microstep: 7.26 [2025-06-19 18:24:31,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.20 | bwd: 3316.39 | bwd_inner: 3315.12 | bwd_allreduce: 1.22 | step: 7.26 31%|███ | 3109/10000 [4:54:51<10:29:27, 5.48s/it] {'loss': 0.0473, 'grad_norm': 1.4216861724853516, 'learning_rate': 3.2278603730712584e-05, 'epoch': 3.11} 31%|███ | 3109/10000 [4:54:51<10:29:27, 5.48s/it][2025-06-19 18:24:36,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:24:36,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.49 | bwd_microstep: 3320.45 | bwd_inner_microstep: 3319.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.50 [2025-06-19 18:24:36,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.49 | bwd: 3320.46 | bwd_inner: 3319.67 | bwd_allreduce: 0.75 | step: 
6.51 31%|███ | 3110/10000 [4:54:57<10:29:00, 5.48s/it] {'loss': 0.0234, 'grad_norm': 1.1815526485443115, 'learning_rate': 3.2273489996910953e-05, 'epoch': 3.11} 31%|███ | 3110/10000 [4:54:57<10:29:00, 5.48s/it][2025-06-19 18:24:42,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:24:42,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.20 | bwd_microstep: 3375.05 | bwd_inner_microstep: 3374.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 18:24:42,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.20 | bwd: 3375.07 | bwd_inner: 3374.26 | bwd_allreduce: 0.76 | step: 6.65 31%|███ | 3111/10000 [4:55:02<10:31:03, 5.50s/it] {'loss': 0.0683, 'grad_norm': 1.8236217498779297, 'learning_rate': 3.226837497567719e-05, 'epoch': 3.11} 31%|███ | 3111/10000 [4:55:02<10:31:03, 5.50s/it][2025-06-19 18:24:47,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:24:47,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.61 | bwd_microstep: 3320.42 | bwd_inner_microstep: 3319.46 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.58 [2025-06-19 18:24:47,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.61 | bwd: 3320.44 | bwd_inner: 3319.46 | bwd_allreduce: 0.93 | step: 7.58 31%|███ | 3112/10000 [4:55:08<10:29:50, 5.49s/it] {'loss': 0.0459, 'grad_norm': 1.6027214527130127, 'learning_rate': 3.2263258667547816e-05, 'epoch': 3.11} 31%|███ | 3112/10000 [4:55:08<10:29:50, 5.49s/it][2025-06-19 18:24:53,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:24:53,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.67 | bwd_microstep: 3360.94 | bwd_inner_microstep: 
3360.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 18:24:53,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.67 | bwd: 3360.96 | bwd_inner: 3360.16 | bwd_allreduce: 0.75 | step: 6.65 31%|███ | 3113/10000 [4:55:13<10:31:07, 5.50s/it] {'loss': 0.0626, 'grad_norm': 1.0819038152694702, 'learning_rate': 3.2258141073059533e-05, 'epoch': 3.11} 31%|███ | 3113/10000 [4:55:13<10:31:07, 5.50s/it][2025-06-19 18:24:58,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:24:58,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.84 | bwd_microstep: 3334.00 | bwd_inner_microstep: 3333.03 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.26 [2025-06-19 18:24:58,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.84 | bwd: 3334.01 | bwd_inner: 3333.03 | bwd_allreduce: 0.94 | step: 7.26 31%|███ | 3114/10000 [4:55:19<10:30:18, 5.49s/it] {'loss': 0.0459, 'grad_norm': 1.958612322807312, 'learning_rate': 3.225302219274914e-05, 'epoch': 3.11} 31%|███ | 3114/10000 [4:55:19<10:30:18, 5.49s/it][2025-06-19 18:25:04,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:25:04,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.46 | bwd_microstep: 3325.23 | bwd_inner_microstep: 3324.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 18:25:04,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.46 | bwd: 3325.24 | bwd_inner: 3324.45 | bwd_allreduce: 0.75 | step: 6.56 31%|███ | 3115/10000 [4:55:24<10:29:39, 5.49s/it] {'loss': 0.0589, 'grad_norm': 1.213114619255066, 'learning_rate': 3.224790202715359e-05, 'epoch': 3.12} 31%|███ | 3115/10000 [4:55:24<10:29:39, 5.49s/it][2025-06-19 18:25:09,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:25:09,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.55 | bwd_microstep: 3317.33 | bwd_inner_microstep: 3316.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 18:25:09,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.55 | bwd: 3317.35 | bwd_inner: 3316.54 | bwd_allreduce: 0.77 | step: 6.70 31%|███ | 3116/10000 [4:55:30<10:28:37, 5.48s/it] {'loss': 0.0414, 'grad_norm': 1.4284560680389404, 'learning_rate': 3.224278057680996e-05, 'epoch': 3.12} 31%|███ | 3116/10000 [4:55:30<10:28:37, 5.48s/it][2025-06-19 18:25:15,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:25:15,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.30 | bwd_microstep: 3373.44 | bwd_inner_microstep: 3372.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 18:25:15,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.30 | bwd: 3373.46 | bwd_inner: 3372.66 | bwd_allreduce: 0.75 | step: 6.59 31%|███ | 3117/10000 [4:55:35<10:30:37, 5.50s/it] {'loss': 0.0087, 'grad_norm': 0.18560580909252167, 'learning_rate': 3.223765784225547e-05, 'epoch': 3.12} 31%|███ | 3117/10000 [4:55:35<10:30:37, 5.50s/it][2025-06-19 18:25:20,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:25:20,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.50 | bwd_microstep: 3360.93 | bwd_inner_microstep: 3359.96 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.94 [2025-06-19 18:25:20,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.50 | bwd: 3360.94 | bwd_inner: 3359.96 | bwd_allreduce: 0.94 | step: 6.94 31%|███ | 3118/10000 [4:55:41<10:31:27, 5.51s/it] {'loss': 
0.1595, 'grad_norm': 2.2025227546691895, 'learning_rate': 3.223253382402747e-05, 'epoch': 3.12} 31%|███ | 3118/10000 [4:55:41<10:31:27, 5.51s/it][2025-06-19 18:25:26,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:25:26,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.26 | bwd_microstep: 3406.58 | bwd_inner_microstep: 3405.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 18:25:26,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.26 | bwd: 3406.60 | bwd_inner: 3405.80 | bwd_allreduce: 0.76 | step: 6.78 31%|███ | 3119/10000 [4:55:46<10:34:07, 5.53s/it] {'loss': 0.0424, 'grad_norm': 2.035496234893799, 'learning_rate': 3.222740852266344e-05, 'epoch': 3.12} 31%|███ | 3119/10000 [4:55:46<10:34:07, 5.53s/it][2025-06-19 18:25:31,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:25:31,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.74 | bwd_microstep: 3320.93 | bwd_inner_microstep: 3319.90 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.26 [2025-06-19 18:25:31,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.74 | bwd: 3320.94 | bwd_inner: 3319.90 | bwd_allreduce: 1.00 | step: 7.28 31%|███ | 3120/10000 [4:55:52<10:31:49, 5.51s/it] {'loss': 0.0617, 'grad_norm': 1.8654974699020386, 'learning_rate': 3.222228193870101e-05, 'epoch': 3.12} 31%|███ | 3120/10000 [4:55:52<10:31:49, 5.51s/it][2025-06-19 18:25:37,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:25:37,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.45 | bwd_microstep: 3369.04 | bwd_inner_microstep: 3368.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 
[2025-06-19 18:25:37,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.45 | bwd: 3369.05 | bwd_inner: 3368.24 | bwd_allreduce: 0.77 | step: 6.63 31%|███ | 3121/10000 [4:55:57<10:32:42, 5.52s/it] {'loss': 0.0071, 'grad_norm': 0.24353036284446716, 'learning_rate': 3.221715407267792e-05, 'epoch': 3.12} 31%|███ | 3121/10000 [4:55:57<10:32:42, 5.52s/it][2025-06-19 18:25:42,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:25:42,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.06 | bwd_microstep: 3318.29 | bwd_inner_microstep: 3317.45 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.90 [2025-06-19 18:25:42,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.06 | bwd: 3318.30 | bwd_inner: 3317.45 | bwd_allreduce: 0.81 | step: 6.91 31%|███ | 3122/10000 [4:56:03<10:30:32, 5.50s/it] {'loss': 0.0847, 'grad_norm': 0.9775769114494324, 'learning_rate': 3.2212024925132084e-05, 'epoch': 3.12} 31%|███ | 3122/10000 [4:56:03<10:30:32, 5.50s/it][2025-06-19 18:25:48,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:25:48,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.66 | bwd_microstep: 3383.87 | bwd_inner_microstep: 3383.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 18:25:48,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.66 | bwd: 3383.88 | bwd_inner: 3383.07 | bwd_allreduce: 0.77 | step: 6.85 31%|███ | 3123/10000 [4:56:08<10:32:12, 5.52s/it] {'loss': 0.013, 'grad_norm': 0.9364885687828064, 'learning_rate': 3.220689449660151e-05, 'epoch': 3.12} 31%|███ | 3123/10000 [4:56:08<10:32:12, 5.52s/it][2025-06-19 18:25:53,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 
2.72 [2025-06-19 18:25:53,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.47 | bwd_microstep: 3363.59 | bwd_inner_microstep: 3362.77 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.14 [2025-06-19 18:25:53,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.47 | bwd: 3363.60 | bwd_inner: 3362.77 | bwd_allreduce: 0.79 | step: 7.15 31%|███ | 3124/10000 [4:56:14<10:32:27, 5.52s/it] {'loss': 0.1376, 'grad_norm': 2.637720823287964, 'learning_rate': 3.220176278762434e-05, 'epoch': 3.12} 31%|███ | 3124/10000 [4:56:14<10:32:27, 5.52s/it][2025-06-19 18:25:59,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:25:59,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.44 | bwd_microstep: 3316.92 | bwd_inner_microstep: 3316.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 18:25:59,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.44 | bwd: 3316.94 | bwd_inner: 3316.12 | bwd_allreduce: 0.77 | step: 6.82 31%|███▏ | 3125/10000 [4:56:19<10:30:11, 5.50s/it] {'loss': 0.0313, 'grad_norm': 1.0477015972137451, 'learning_rate': 3.219662979873889e-05, 'epoch': 3.12} 31%|███▏ | 3125/10000 [4:56:19<10:30:11, 5.50s/it][2025-06-19 18:26:04,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:26:04,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.52 | bwd_microstep: 3363.71 | bwd_inner_microstep: 3362.89 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.34 [2025-06-19 18:26:04,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.52 | bwd: 3363.73 | bwd_inner: 3362.89 | bwd_allreduce: 0.79 | step: 7.34 31%|███▏ | 3126/10000 [4:56:25<10:31:18, 5.51s/it] {'loss': 0.1584, 'grad_norm': 3.4709768295288086, 'learning_rate': 
3.2191495530483594e-05, 'epoch': 3.13} 31%|███▏ | 3126/10000 [4:56:25<10:31:18, 5.51s/it][2025-06-19 18:26:10,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:26:10,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.38 | bwd_microstep: 3357.52 | bwd_inner_microstep: 3356.72 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 18:26:10,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.38 | bwd: 3357.54 | bwd_inner: 3356.72 | bwd_allreduce: 0.77 | step: 6.83 31%|███▏ | 3127/10000 [4:56:30<10:31:51, 5.52s/it] {'loss': 0.0226, 'grad_norm': 0.7364174723625183, 'learning_rate': 3.218635998339699e-05, 'epoch': 3.13} 31%|███▏ | 3127/10000 [4:56:30<10:31:51, 5.52s/it][2025-06-19 18:26:15,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 18:26:15,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.54 | bwd_microstep: 3365.09 | bwd_inner_microstep: 3364.01 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.50 [2025-06-19 18:26:15,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.54 | bwd: 3365.11 | bwd_inner: 3364.01 | bwd_allreduce: 1.04 | step: 7.51 31%|███▏ | 3128/10000 [4:56:36<10:32:24, 5.52s/it] {'loss': 0.048, 'grad_norm': 1.0106103420257568, 'learning_rate': 3.218122315801778e-05, 'epoch': 3.13} 31%|███▏ | 3128/10000 [4:56:36<10:32:24, 5.52s/it][2025-06-19 18:26:21,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:26:21,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.41 | bwd_microstep: 3363.30 | bwd_inner_microstep: 3362.48 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.33 [2025-06-19 18:26:21,266] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2143.41 | bwd: 3363.32 | bwd_inner: 3362.48 | bwd_allreduce: 0.79 | step: 7.33 31%|███▏ | 3129/10000 [4:56:42<10:33:13, 5.53s/it] {'loss': 0.0439, 'grad_norm': 1.366094708442688, 'learning_rate': 3.21760850548848e-05, 'epoch': 3.13} 31%|███▏ | 3129/10000 [4:56:42<10:33:13, 5.53s/it][2025-06-19 18:26:26,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:26:26,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.72 | bwd_microstep: 3377.70 | bwd_inner_microstep: 3376.87 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.92 [2025-06-19 18:26:26,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.72 | bwd: 3377.72 | bwd_inner: 3376.87 | bwd_allreduce: 0.80 | step: 6.92 31%|███▏ | 3130/10000 [4:56:47<10:33:42, 5.53s/it] {'loss': 0.0742, 'grad_norm': 2.248351573944092, 'learning_rate': 3.217094567453701e-05, 'epoch': 3.13} 31%|███▏ | 3130/10000 [4:56:47<10:33:42, 5.53s/it][2025-06-19 18:26:32,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 18:26:32,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.89 | bwd_microstep: 3311.13 | bwd_inner_microstep: 3310.03 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.67 [2025-06-19 18:26:32,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.89 | bwd: 3311.15 | bwd_inner: 3310.03 | bwd_allreduce: 1.07 | step: 7.67 31%|███▏ | 3131/10000 [4:56:53<10:30:49, 5.51s/it] {'loss': 0.0599, 'grad_norm': 1.1802703142166138, 'learning_rate': 3.21658050175135e-05, 'epoch': 3.13} 31%|███▏ | 3131/10000 [4:56:53<10:30:49, 5.51s/it][2025-06-19 18:26:37,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:26:37,797] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.27 | bwd_microstep: 3362.05 | bwd_inner_microstep: 3361.01 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.20 [2025-06-19 18:26:37,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.27 | bwd: 3362.07 | bwd_inner: 3361.01 | bwd_allreduce: 1.00 | step: 7.20 31%|███▏ | 3132/10000 [4:56:58<10:31:29, 5.52s/it] {'loss': 0.1852, 'grad_norm': 2.1023712158203125, 'learning_rate': 3.216066308435351e-05, 'epoch': 3.13} 31%|███▏ | 3132/10000 [4:56:58<10:31:29, 5.52s/it][2025-06-19 18:26:43,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:26:43,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.81 | bwd_microstep: 3315.07 | bwd_inner_microstep: 3314.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 18:26:43,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.81 | bwd: 3315.09 | bwd_inner: 3314.27 | bwd_allreduce: 0.77 | step: 6.68 31%|███▏ | 3133/10000 [4:57:04<10:29:48, 5.50s/it] {'loss': 0.017, 'grad_norm': 0.4050639569759369, 'learning_rate': 3.21555198755964e-05, 'epoch': 3.13} 31%|███▏ | 3133/10000 [4:57:04<10:29:48, 5.50s/it][2025-06-19 18:26:48,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:26:48,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2162.58 | bwd_microstep: 3309.20 | bwd_inner_microstep: 3308.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.21 [2025-06-19 18:26:48,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2162.58 | bwd: 3309.21 | bwd_inner: 3308.38 | bwd_allreduce: 0.79 | step: 7.22 31%|███▏ | 3134/10000 [4:57:09<10:29:55, 5.50s/it] {'loss': 0.1778, 'grad_norm': 3.910189390182495, 'learning_rate': 3.2150375391781674e-05, 'epoch': 3.13} 31%|███▏ | 
3134/10000 [4:57:09<10:29:55, 5.50s/it][2025-06-19 18:26:54,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:26:54,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.16 | bwd_microstep: 3311.56 | bwd_inner_microstep: 3310.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.99 [2025-06-19 18:26:54,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.16 | bwd: 3311.57 | bwd_inner: 3310.78 | bwd_allreduce: 0.76 | step: 6.99 31%|███▏ | 3135/10000 [4:57:15<10:28:05, 5.49s/it] {'loss': 0.0697, 'grad_norm': 1.3453394174575806, 'learning_rate': 3.214522963344896e-05, 'epoch': 3.13} 31%|███▏ | 3135/10000 [4:57:15<10:28:05, 5.49s/it][2025-06-19 18:26:59,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:26:59,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.82 | bwd_microstep: 3322.84 | bwd_inner_microstep: 3322.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 18:26:59,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.82 | bwd: 3322.85 | bwd_inner: 3322.04 | bwd_allreduce: 0.77 | step: 7.18 31%|███▏ | 3136/10000 [4:57:20<10:27:24, 5.48s/it] {'loss': 0.0234, 'grad_norm': 0.6197291016578674, 'learning_rate': 3.214008260113803e-05, 'epoch': 3.14} 31%|███▏ | 3136/10000 [4:57:20<10:27:24, 5.48s/it][2025-06-19 18:27:05,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:27:05,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.84 | bwd_microstep: 3309.26 | bwd_inner_microstep: 3308.19 | bwd_allreduce_microstep: 1.02 | step_microstep: 6.80 [2025-06-19 18:27:05,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.84 | bwd: 3309.27 
| bwd_inner: 3308.19 | bwd_allreduce: 1.04 | step: 6.80 31%|███▏ | 3137/10000 [4:57:25<10:26:03, 5.47s/it] {'loss': 0.107, 'grad_norm': 2.5381152629852295, 'learning_rate': 3.2134934295388764e-05, 'epoch': 3.14} 31%|███▏ | 3137/10000 [4:57:25<10:26:03, 5.47s/it][2025-06-19 18:27:10,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:27:10,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.78 | bwd_microstep: 3319.13 | bwd_inner_microstep: 3318.31 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.30 [2025-06-19 18:27:10,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.78 | bwd: 3319.14 | bwd_inner: 3318.31 | bwd_allreduce: 0.78 | step: 7.31 31%|███▏ | 3138/10000 [4:57:31<10:25:21, 5.47s/it] {'loss': 0.0419, 'grad_norm': 1.4857933521270752, 'learning_rate': 3.2129784716741216e-05, 'epoch': 3.14} 31%|███▏ | 3138/10000 [4:57:31<10:25:21, 5.47s/it][2025-06-19 18:27:16,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:27:16,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.49 | bwd_microstep: 3316.94 | bwd_inner_microstep: 3315.96 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.05 [2025-06-19 18:27:16,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.49 | bwd: 3316.95 | bwd_inner: 3315.96 | bwd_allreduce: 0.95 | step: 7.06 31%|███▏ | 3139/10000 [4:57:36<10:24:59, 5.47s/it] {'loss': 0.0848, 'grad_norm': 1.4157516956329346, 'learning_rate': 3.2124633865735554e-05, 'epoch': 3.14} 31%|███▏ | 3139/10000 [4:57:36<10:24:59, 5.47s/it][2025-06-19 18:27:21,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:27:21,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2099.44 | bwd_microstep: 3312.00 | bwd_inner_microstep: 3311.01 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.88 [2025-06-19 18:27:21,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.44 | bwd: 3312.02 | bwd_inner: 3311.01 | bwd_allreduce: 0.96 | step: 7.88 31%|███▏ | 3140/10000 [4:57:42<10:24:36, 5.46s/it] {'loss': 0.0834, 'grad_norm': 1.8804112672805786, 'learning_rate': 3.211948174291207e-05, 'epoch': 3.14} 31%|███▏ | 3140/10000 [4:57:42<10:24:36, 5.46s/it][2025-06-19 18:27:26,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:27:26,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.00 | bwd_microstep: 3306.56 | bwd_inner_microstep: 3305.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-19 18:27:26,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.00 | bwd: 3306.58 | bwd_inner: 3305.78 | bwd_allreduce: 0.75 | step: 6.81 31%|███▏ | 3141/10000 [4:57:47<10:24:18, 5.46s/it] {'loss': 0.033, 'grad_norm': 0.77082759141922, 'learning_rate': 3.2114328348811205e-05, 'epoch': 3.14} 31%|███▏ | 3141/10000 [4:57:47<10:24:18, 5.46s/it][2025-06-19 18:27:32,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:27:32,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.37 | bwd_microstep: 3319.55 | bwd_inner_microstep: 3318.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-19 18:27:32,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.38 | bwd: 3319.56 | bwd_inner: 3318.75 | bwd_allreduce: 0.77 | step: 7.06 31%|███▏ | 3142/10000 [4:57:53<10:24:17, 5.46s/it] {'loss': 0.0376, 'grad_norm': 1.1111516952514648, 'learning_rate': 3.210917368397351e-05, 'epoch': 3.14} 31%|███▏ | 3142/10000 [4:57:53<10:24:17, 5.46s/it][2025-06-19 18:27:37,910] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:27:37,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.96 | bwd_microstep: 3316.77 | bwd_inner_microstep: 3315.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 18:27:37,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.96 | bwd: 3316.79 | bwd_inner: 3315.99 | bwd_allreduce: 0.75 | step: 6.65 31%|███▏ | 3143/10000 [4:57:58<10:24:15, 5.46s/it] {'loss': 0.0586, 'grad_norm': 1.0243494510650635, 'learning_rate': 3.21040177489397e-05, 'epoch': 3.14} 31%|███▏ | 3143/10000 [4:57:58<10:24:15, 5.46s/it][2025-06-19 18:27:43,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:27:43,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.70 | bwd_microstep: 3361.85 | bwd_inner_microstep: 3361.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 18:27:43,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.70 | bwd: 3361.86 | bwd_inner: 3361.07 | bwd_allreduce: 0.75 | step: 6.55 31%|███▏ | 3144/10000 [4:58:04<10:26:14, 5.48s/it] {'loss': 0.1007, 'grad_norm': 2.0279767513275146, 'learning_rate': 3.20988605442506e-05, 'epoch': 3.14} 31%|███▏ | 3144/10000 [4:58:04<10:26:14, 5.48s/it][2025-06-19 18:27:48,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:27:48,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.97 | bwd_microstep: 3363.08 | bwd_inner_microstep: 3362.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 18:27:48,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.97 | bwd: 3363.09 | bwd_inner: 3362.28 | bwd_allreduce: 0.77 | step: 6.65 31%|███▏ | 
3145/10000 [4:58:09<10:27:36, 5.49s/it] {'loss': 0.0941, 'grad_norm': 1.9182559251785278, 'learning_rate': 3.209370207044719e-05, 'epoch': 3.15} 31%|███▏ | 3145/10000 [4:58:09<10:27:36, 5.49s/it][2025-06-19 18:27:54,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:27:54,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.80 | bwd_microstep: 3365.87 | bwd_inner_microstep: 3365.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 18:27:54,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.80 | bwd: 3365.88 | bwd_inner: 3365.09 | bwd_allreduce: 0.75 | step: 6.55 31%|███▏ | 3146/10000 [4:58:15<10:28:54, 5.51s/it] {'loss': 0.0723, 'grad_norm': 1.654557466506958, 'learning_rate': 3.2088542328070556e-05, 'epoch': 3.15} 31%|███▏ | 3146/10000 [4:58:15<10:28:54, 5.51s/it][2025-06-19 18:27:59,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:27:59,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.82 | bwd_microstep: 3315.36 | bwd_inner_microstep: 3314.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 18:27:59,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.82 | bwd: 3315.38 | bwd_inner: 3314.56 | bwd_allreduce: 0.78 | step: 6.93 31%|███▏ | 3147/10000 [4:58:20<10:27:08, 5.49s/it] {'loss': 0.0481, 'grad_norm': 1.800477147102356, 'learning_rate': 3.208338131766194e-05, 'epoch': 3.15} 31%|███▏ | 3147/10000 [4:58:20<10:27:08, 5.49s/it][2025-06-19 18:28:05,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:28:05,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.72 | bwd_microstep: 3314.74 | bwd_inner_microstep: 3313.91 | 
bwd_allreduce_microstep: 0.77 | step_microstep: 6.87 [2025-06-19 18:28:05,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.72 | bwd: 3314.76 | bwd_inner: 3313.91 | bwd_allreduce: 0.80 | step: 6.87 31%|███▏ | 3148/10000 [4:58:26<10:25:48, 5.48s/it] {'loss': 0.0152, 'grad_norm': 0.524182915687561, 'learning_rate': 3.20782190397627e-05, 'epoch': 3.15} 31%|███▏ | 3148/10000 [4:58:26<10:25:48, 5.48s/it][2025-06-19 18:28:10,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.77 [2025-06-19 18:28:10,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.85 | bwd_microstep: 3312.23 | bwd_inner_microstep: 3311.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 18:28:10,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.85 | bwd: 3312.24 | bwd_inner: 3311.44 | bwd_allreduce: 0.76 | step: 6.77 31%|███▏ | 3149/10000 [4:58:31<10:24:53, 5.47s/it] {'loss': 0.0964, 'grad_norm': 1.4704285860061646, 'learning_rate': 3.2073055494914344e-05, 'epoch': 3.15} 31%|███▏ | 3149/10000 [4:58:31<10:24:53, 5.47s/it][2025-06-19 18:28:16,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:28:16,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.93 | bwd_microstep: 3307.98 | bwd_inner_microstep: 3307.20 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.58 [2025-06-19 18:28:16,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.93 | bwd: 3308.00 | bwd_inner: 3307.20 | bwd_allreduce: 0.75 | step: 6.58 32%|███▏ | 3150/10000 [4:58:37<10:24:00, 5.47s/it] {'loss': 0.1453, 'grad_norm': 2.053530216217041, 'learning_rate': 3.20678906836585e-05, 'epoch': 3.15} 32%|███▏ | 3150/10000 [4:58:37<10:24:00, 5.47s/it][2025-06-19 18:28:21,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:28:21,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.06 | bwd_microstep: 3370.56 | bwd_inner_microstep: 3369.71 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.70 [2025-06-19 18:28:21,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.06 | bwd: 3370.57 | bwd_inner: 3369.71 | bwd_allreduce: 0.81 | step: 6.71 32%|███▏ | 3151/10000 [4:58:42<10:26:25, 5.49s/it] {'loss': 0.0746, 'grad_norm': 3.7594618797302246, 'learning_rate': 3.206272460653693e-05, 'epoch': 3.15} 32%|███▏ | 3151/10000 [4:58:42<10:26:25, 5.49s/it][2025-06-19 18:28:27,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:28:27,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.16 | bwd_microstep: 3309.48 | bwd_inner_microstep: 3308.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 18:28:27,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.16 | bwd: 3309.49 | bwd_inner: 3308.68 | bwd_allreduce: 0.77 | step: 6.95 32%|███▏ | 3152/10000 [4:58:48<10:25:13, 5.48s/it] {'loss': 0.1466, 'grad_norm': 2.1356444358825684, 'learning_rate': 3.205755726409152e-05, 'epoch': 3.15} 32%|███▏ | 3152/10000 [4:58:48<10:25:13, 5.48s/it][2025-06-19 18:28:32,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 18:28:32,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.56 | bwd_microstep: 3310.39 | bwd_inner_microstep: 3309.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 18:28:32,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.56 | bwd: 3310.41 | bwd_inner: 3309.60 | bwd_allreduce: 0.76 | step: 6.62 32%|███▏ | 3153/10000 [4:58:53<10:24:08, 5.47s/it] {'loss': 
0.0549, 'grad_norm': 1.456600308418274, 'learning_rate': 3.205238865686433e-05, 'epoch': 3.15} 32%|███▏ | 3153/10000 [4:58:53<10:24:08, 5.47s/it][2025-06-19 18:28:38,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:28:38,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.45 | bwd_microstep: 3313.34 | bwd_inner_microstep: 3312.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-19 18:28:38,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.45 | bwd: 3313.35 | bwd_inner: 3312.55 | bwd_allreduce: 0.76 | step: 6.73 32%|███▏ | 3154/10000 [4:58:59<10:23:43, 5.47s/it] {'loss': 0.0232, 'grad_norm': 0.5577664971351624, 'learning_rate': 3.204721878539751e-05, 'epoch': 3.15} 32%|███▏ | 3154/10000 [4:58:59<10:23:43, 5.47s/it][2025-06-19 18:28:43,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 18:28:43,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.53 | bwd_microstep: 3364.99 | bwd_inner_microstep: 3364.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 18:28:43,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.53 | bwd: 3365.00 | bwd_inner: 3364.20 | bwd_allreduce: 0.76 | step: 6.52 32%|███▏ | 3155/10000 [4:59:04<10:25:49, 5.49s/it] {'loss': 0.0765, 'grad_norm': 1.199361801147461, 'learning_rate': 3.2042047650233344e-05, 'epoch': 3.15} 32%|███▏ | 3155/10000 [4:59:04<10:25:49, 5.49s/it][2025-06-19 18:28:49,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:28:49,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.60 | bwd_microstep: 3373.31 | bwd_inner_microstep: 3372.39 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.90 
[2025-06-19 18:28:49,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.60 | bwd: 3373.32 | bwd_inner: 3372.39 | bwd_allreduce: 0.88 | step: 6.90 32%|███▏ | 3156/10000 [4:59:10<10:27:19, 5.50s/it] {'loss': 0.0661, 'grad_norm': 1.0704492330551147, 'learning_rate': 3.203687525191427e-05, 'epoch': 3.16} 32%|███▏ | 3156/10000 [4:59:10<10:27:19, 5.50s/it][2025-06-19 18:28:54,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:28:54,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.34 | bwd_microstep: 3359.73 | bwd_inner_microstep: 3358.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 18:28:54,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.34 | bwd: 3359.74 | bwd_inner: 3358.94 | bwd_allreduce: 0.76 | step: 6.66 32%|███▏ | 3157/10000 [4:59:15<10:28:21, 5.51s/it] {'loss': 0.0418, 'grad_norm': 0.7110612988471985, 'learning_rate': 3.203170159098284e-05, 'epoch': 3.16} 32%|███▏ | 3157/10000 [4:59:15<10:28:21, 5.51s/it][2025-06-19 18:29:00,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:29:00,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.92 | bwd_microstep: 3369.45 | bwd_inner_microstep: 3368.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 18:29:00,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.92 | bwd: 3369.46 | bwd_inner: 3368.66 | bwd_allreduce: 0.75 | step: 6.61 32%|███▏ | 3158/10000 [4:59:21<10:29:02, 5.52s/it] {'loss': 0.0491, 'grad_norm': 1.0689897537231445, 'learning_rate': 3.202652666798176e-05, 'epoch': 3.16} 32%|███▏ | 3158/10000 [4:59:21<10:29:02, 5.52s/it][2025-06-19 18:29:05,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.72 [2025-06-19 18:29:05,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.51 | bwd_microstep: 3328.63 | bwd_inner_microstep: 3327.67 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.43 [2025-06-19 18:29:05,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.51 | bwd: 3328.64 | bwd_inner: 3327.67 | bwd_allreduce: 0.93 | step: 7.44 32%|███▏ | 3159/10000 [4:59:26<10:27:30, 5.50s/it] {'loss': 0.0421, 'grad_norm': 0.7226155400276184, 'learning_rate': 3.202135048345386e-05, 'epoch': 3.16} 32%|███▏ | 3159/10000 [4:59:26<10:27:30, 5.50s/it][2025-06-19 18:29:11,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:29:11,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.05 | bwd_microstep: 3323.65 | bwd_inner_microstep: 3322.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 18:29:11,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.05 | bwd: 3323.66 | bwd_inner: 3322.87 | bwd_allreduce: 0.75 | step: 6.70 32%|███▏ | 3160/10000 [4:59:32<10:26:42, 5.50s/it] {'loss': 0.0386, 'grad_norm': 0.8805099725723267, 'learning_rate': 3.2016173037942075e-05, 'epoch': 3.16} 32%|███▏ | 3160/10000 [4:59:32<10:26:42, 5.50s/it][2025-06-19 18:29:16,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:29:16,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.88 | bwd_microstep: 3370.31 | bwd_inner_microstep: 3369.53 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.52 [2025-06-19 18:29:16,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.88 | bwd: 3370.32 | bwd_inner: 3369.53 | bwd_allreduce: 0.75 | step: 6.53 32%|███▏ | 3161/10000 [4:59:37<10:27:55, 5.51s/it] {'loss': 0.0511, 'grad_norm': 1.5652904510498047, 'learning_rate': 
3.201099433198951e-05, 'epoch': 3.16} 32%|███▏ | 3161/10000 [4:59:37<10:27:55, 5.51s/it][2025-06-19 18:29:22,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:29:22,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.98 | bwd_microstep: 3324.38 | bwd_inner_microstep: 3323.54 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.91 [2025-06-19 18:29:22,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.98 | bwd: 3324.40 | bwd_inner: 3323.54 | bwd_allreduce: 0.81 | step: 6.92 32%|███▏ | 3162/10000 [4:59:43<10:26:50, 5.50s/it] {'loss': 0.0517, 'grad_norm': 1.0061869621276855, 'learning_rate': 3.2005814366139386e-05, 'epoch': 3.16} 32%|███▏ | 3162/10000 [4:59:43<10:26:50, 5.50s/it][2025-06-19 18:29:27,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:29:27,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.85 | bwd_microstep: 3323.52 | bwd_inner_microstep: 3322.69 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.90 [2025-06-19 18:29:27,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.85 | bwd: 3323.54 | bwd_inner: 3322.69 | bwd_allreduce: 0.80 | step: 6.91 32%|███▏ | 3163/10000 [4:59:48<10:25:52, 5.49s/it] {'loss': 0.0589, 'grad_norm': 1.0316789150238037, 'learning_rate': 3.200063314093505e-05, 'epoch': 3.16} 32%|███▏ | 3163/10000 [4:59:48<10:25:52, 5.49s/it][2025-06-19 18:29:33,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:29:33,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.79 | bwd_microstep: 3321.55 | bwd_inner_microstep: 3320.71 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.53 [2025-06-19 18:29:33,249] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.79 | bwd: 3321.57 | bwd_inner: 3320.71 | bwd_allreduce: 0.81 | step: 7.54 32%|███▏ | 3164/10000 [4:59:54<10:24:56, 5.49s/it] {'loss': 0.0515, 'grad_norm': 1.3169221878051758, 'learning_rate': 3.199545065692e-05, 'epoch': 3.16} 32%|███▏ | 3164/10000 [4:59:54<10:24:56, 5.49s/it][2025-06-19 18:29:38,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:29:38,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.16 | bwd_microstep: 3323.98 | bwd_inner_microstep: 3322.90 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.42 [2025-06-19 18:29:38,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.16 | bwd: 3324.00 | bwd_inner: 3322.90 | bwd_allreduce: 1.05 | step: 7.42 32%|███▏ | 3165/10000 [4:59:59<10:24:28, 5.48s/it] {'loss': 0.0814, 'grad_norm': 1.431631088256836, 'learning_rate': 3.199026691463784e-05, 'epoch': 3.17} 32%|███▏ | 3165/10000 [4:59:59<10:24:28, 5.48s/it][2025-06-19 18:29:44,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:29:44,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.93 | bwd_microstep: 3369.18 | bwd_inner_microstep: 3368.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.01 [2025-06-19 18:29:44,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.93 | bwd: 3369.20 | bwd_inner: 3368.39 | bwd_allreduce: 0.76 | step: 7.02 32%|███▏ | 3166/10000 [5:00:05<10:26:30, 5.50s/it] {'loss': 0.1162, 'grad_norm': 3.307774066925049, 'learning_rate': 3.198508191463234e-05, 'epoch': 3.17} 32%|███▏ | 3166/10000 [5:00:05<10:26:30, 5.50s/it][2025-06-19 18:29:49,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:29:49,739] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.72 | bwd_microstep: 3323.49 | bwd_inner_microstep: 3322.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.88 [2025-06-19 18:29:49,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.72 | bwd: 3323.50 | bwd_inner: 3322.70 | bwd_allreduce: 0.76 | step: 6.89 32%|███▏ | 3167/10000 [5:00:10<10:25:18, 5.49s/it] {'loss': 0.0222, 'grad_norm': 0.4805852472782135, 'learning_rate': 3.1979895657447365e-05, 'epoch': 3.17} 32%|███▏ | 3167/10000 [5:00:10<10:25:18, 5.49s/it][2025-06-19 18:29:55,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:29:55,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.81 | bwd_microstep: 3376.31 | bwd_inner_microstep: 3375.37 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.98 [2025-06-19 18:29:55,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.81 | bwd: 3376.33 | bwd_inner: 3375.37 | bwd_allreduce: 0.91 | step: 6.99 32%|███▏ | 3168/10000 [5:00:16<10:27:13, 5.51s/it] {'loss': 0.0194, 'grad_norm': 0.5492081046104431, 'learning_rate': 3.197470814362694e-05, 'epoch': 3.17} 32%|███▏ | 3168/10000 [5:00:16<10:27:13, 5.51s/it][2025-06-19 18:30:00,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:30:00,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.09 | bwd_microstep: 3378.78 | bwd_inner_microstep: 3377.95 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.04 [2025-06-19 18:30:00,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.09 | bwd: 3378.79 | bwd_inner: 3377.95 | bwd_allreduce: 0.79 | step: 7.04 32%|███▏ | 3169/10000 [5:00:21<10:28:33, 5.52s/it] {'loss': 0.1089, 'grad_norm': 1.4624778032302856, 'learning_rate': 3.1969519373715203e-05, 'epoch': 3.17} 
32%|███▏ | 3169/10000 [5:00:21<10:28:33, 5.52s/it][2025-06-19 18:30:06,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:30:06,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.53 | bwd_microstep: 3320.47 | bwd_inner_microstep: 3319.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 18:30:06,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.53 | bwd: 3320.48 | bwd_inner: 3319.67 | bwd_allreduce: 0.77 | step: 7.02 32%|███▏ | 3170/10000 [5:00:27<10:26:38, 5.50s/it] {'loss': 0.0925, 'grad_norm': 1.4167181253433228, 'learning_rate': 3.196432934825644e-05, 'epoch': 3.17} 32%|███▏ | 3170/10000 [5:00:27<10:26:38, 5.50s/it][2025-06-19 18:30:11,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:30:11,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.08 | bwd_microstep: 3365.04 | bwd_inner_microstep: 3364.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 18:30:11,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.08 | bwd: 3365.06 | bwd_inner: 3364.26 | bwd_allreduce: 0.76 | step: 6.65 32%|███▏ | 3171/10000 [5:00:32<10:27:17, 5.51s/it] {'loss': 0.1709, 'grad_norm': 2.696077346801758, 'learning_rate': 3.1959138067795054e-05, 'epoch': 3.17} 32%|███▏ | 3171/10000 [5:00:32<10:27:17, 5.51s/it][2025-06-19 18:30:17,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:30:17,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.87 | bwd_microstep: 3327.78 | bwd_inner_microstep: 3326.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 18:30:17,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.87 | 
bwd: 3327.80 | bwd_inner: 3326.99 | bwd_allreduce: 0.77 | step: 6.71 32%|███▏ | 3172/10000 [5:00:38<10:25:46, 5.50s/it] {'loss': 0.0752, 'grad_norm': 1.9102656841278076, 'learning_rate': 3.195394553287559e-05, 'epoch': 3.17} 32%|███▏ | 3172/10000 [5:00:38<10:25:46, 5.50s/it][2025-06-19 18:30:22,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:30:22,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.04 | bwd_microstep: 3337.80 | bwd_inner_microstep: 3336.69 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.38 [2025-06-19 18:30:22,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.04 | bwd: 3337.82 | bwd_inner: 3336.69 | bwd_allreduce: 1.06 | step: 7.36 32%|███▏ | 3173/10000 [5:00:43<10:25:17, 5.50s/it] {'loss': 0.0549, 'grad_norm': 1.3165092468261719, 'learning_rate': 3.194875174404272e-05, 'epoch': 3.17} 32%|███▏ | 3173/10000 [5:00:43<10:25:17, 5.50s/it][2025-06-19 18:30:28,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:30:28,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.82 | bwd_microstep: 3370.64 | bwd_inner_microstep: 3369.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 18:30:28,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.81 | bwd: 3370.65 | bwd_inner: 3369.85 | bwd_allreduce: 0.76 | step: 6.76 32%|███▏ | 3174/10000 [5:00:49<10:26:58, 5.51s/it] {'loss': 0.0239, 'grad_norm': 0.5881420373916626, 'learning_rate': 3.1943556701841244e-05, 'epoch': 3.17} 32%|███▏ | 3174/10000 [5:00:49<10:26:58, 5.51s/it][2025-06-19 18:30:33,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:30:33,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2121.88 | bwd_microstep: 3321.61 | bwd_inner_microstep: 3320.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 18:30:33,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.88 | bwd: 3321.63 | bwd_inner: 3320.83 | bwd_allreduce: 0.76 | step: 6.58 32%|███▏ | 3175/10000 [5:00:54<10:25:50, 5.50s/it] {'loss': 0.0217, 'grad_norm': 0.5931423902511597, 'learning_rate': 3.1938360406816104e-05, 'epoch': 3.17} 32%|███▏ | 3175/10000 [5:00:54<10:25:50, 5.50s/it][2025-06-19 18:30:39,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:30:39,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.66 | bwd_microstep: 3372.17 | bwd_inner_microstep: 3371.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 18:30:39,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.66 | bwd: 3372.18 | bwd_inner: 3371.39 | bwd_allreduce: 0.76 | step: 6.56 32%|███▏ | 3176/10000 [5:01:00<10:27:07, 5.51s/it] {'loss': 0.0313, 'grad_norm': 0.996884286403656, 'learning_rate': 3.193316285951236e-05, 'epoch': 3.18} 32%|███▏ | 3176/10000 [5:01:00<10:27:07, 5.51s/it][2025-06-19 18:30:44,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:30:44,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.61 | bwd_microstep: 3373.92 | bwd_inner_microstep: 3373.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 18:30:44,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.61 | bwd: 3373.93 | bwd_inner: 3373.13 | bwd_allreduce: 0.76 | step: 6.69 32%|███▏ | 3177/10000 [5:01:05<10:27:45, 5.52s/it] {'loss': 0.0561, 'grad_norm': 1.6738982200622559, 'learning_rate': 3.192796406047524e-05, 'epoch': 3.18} 32%|███▏ | 3177/10000 [5:01:05<10:27:45, 5.52s/it][2025-06-19 
18:30:50,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:30:50,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.97 | bwd_microstep: 3330.68 | bwd_inner_microstep: 3329.64 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.57 [2025-06-19 18:30:50,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.97 | bwd: 3330.70 | bwd_inner: 3329.64 | bwd_allreduce: 1.01 | step: 7.57 32%|███▏ | 3178/10000 [5:01:11<10:26:31, 5.51s/it] {'loss': 0.1246, 'grad_norm': 1.6909925937652588, 'learning_rate': 3.192276401025002e-05, 'epoch': 3.18} 32%|███▏ | 3178/10000 [5:01:11<10:26:31, 5.51s/it][2025-06-19 18:30:55,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:30:55,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.58 | bwd_microstep: 3407.13 | bwd_inner_microstep: 3406.01 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.01 [2025-06-19 18:30:55,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.59 | bwd: 3407.15 | bwd_inner: 3406.01 | bwd_allreduce: 1.08 | step: 7.01 32%|███▏ | 3179/10000 [5:01:16<10:29:10, 5.53s/it] {'loss': 0.0181, 'grad_norm': 0.2860783040523529, 'learning_rate': 3.19175627093822e-05, 'epoch': 3.18} 32%|███▏ | 3179/10000 [5:01:16<10:29:10, 5.53s/it][2025-06-19 18:31:01,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:31:01,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.55 | bwd_microstep: 3329.54 | bwd_inner_microstep: 3328.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 18:31:01,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.55 | bwd: 3329.56 | bwd_inner: 3328.76 | bwd_allreduce: 0.75 | step: 
6.62 32%|███▏ | 3180/10000 [5:01:22<10:27:14, 5.52s/it] {'loss': 0.0828, 'grad_norm': 1.7270081043243408, 'learning_rate': 3.191236015841737e-05, 'epoch': 3.18} 32%|███▏ | 3180/10000 [5:01:22<10:27:14, 5.52s/it][2025-06-19 18:31:06,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:31:06,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.21 | bwd_microstep: 3328.37 | bwd_inner_microstep: 3327.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 18:31:06,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.21 | bwd: 3328.38 | bwd_inner: 3327.58 | bwd_allreduce: 0.76 | step: 6.59 32%|███▏ | 3181/10000 [5:01:27<10:25:49, 5.51s/it] {'loss': 0.0164, 'grad_norm': 0.5107736587524414, 'learning_rate': 3.1907156357901234e-05, 'epoch': 3.18} 32%|███▏ | 3181/10000 [5:01:27<10:25:49, 5.51s/it][2025-06-19 18:31:12,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:31:12,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.46 | bwd_microstep: 3372.45 | bwd_inner_microstep: 3371.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 18:31:12,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.46 | bwd: 3372.46 | bwd_inner: 3371.65 | bwd_allreduce: 0.76 | step: 6.64 32%|███▏ | 3182/10000 [5:01:33<10:26:54, 5.52s/it] {'loss': 0.0498, 'grad_norm': 1.2480859756469727, 'learning_rate': 3.1901951308379664e-05, 'epoch': 3.18} 32%|███▏ | 3182/10000 [5:01:33<10:26:54, 5.52s/it][2025-06-19 18:31:17,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:31:17,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.36 | bwd_microstep: 3328.11 | 
bwd_inner_microstep: 3327.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-19 18:31:17,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.36 | bwd: 3328.13 | bwd_inner: 3327.31 | bwd_allreduce: 0.78 | step: 7.25 32%|███▏ | 3183/10000 [5:01:38<10:25:28, 5.51s/it] {'loss': 0.0112, 'grad_norm': 0.42907220125198364, 'learning_rate': 3.189674501039865e-05, 'epoch': 3.18} 32%|███▏ | 3183/10000 [5:01:38<10:25:28, 5.51s/it][2025-06-19 18:31:23,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:31:23,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.95 | bwd_microstep: 3329.88 | bwd_inner_microstep: 3329.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 18:31:23,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.95 | bwd: 3329.90 | bwd_inner: 3329.08 | bwd_allreduce: 0.77 | step: 6.85 32%|███▏ | 3184/10000 [5:01:44<10:25:04, 5.50s/it] {'loss': 0.1728, 'grad_norm': 3.067744493484497, 'learning_rate': 3.189153746450429e-05, 'epoch': 3.18} 32%|███▏ | 3184/10000 [5:01:44<10:25:04, 5.50s/it][2025-06-19 18:31:29,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:31:29,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.09 | bwd_microstep: 3399.78 | bwd_inner_microstep: 3398.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.84 [2025-06-19 18:31:29,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.09 | bwd: 3399.79 | bwd_inner: 3398.99 | bwd_allreduce: 0.75 | step: 6.85 32%|███▏ | 3185/10000 [5:01:49<10:27:43, 5.53s/it] {'loss': 0.0329, 'grad_norm': 1.3944422006607056, 'learning_rate': 3.1886328671242836e-05, 'epoch': 3.19} 32%|███▏ | 3185/10000 [5:01:49<10:27:43, 5.53s/it][2025-06-19 18:31:34,503] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:31:34,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.40 | bwd_microstep: 3330.59 | bwd_inner_microstep: 3329.63 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.85 [2025-06-19 18:31:34,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.40 | bwd: 3330.60 | bwd_inner: 3329.63 | bwd_allreduce: 0.93 | step: 6.86 32%|███▏ | 3186/10000 [5:01:55<10:25:52, 5.51s/it] {'loss': 0.0129, 'grad_norm': 0.2454889863729477, 'learning_rate': 3.1881118631160676e-05, 'epoch': 3.19} 32%|███▏ | 3186/10000 [5:01:55<10:25:52, 5.51s/it][2025-06-19 18:31:40,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:31:40,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.64 | bwd_microstep: 3402.22 | bwd_inner_microstep: 3401.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 18:31:40,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.64 | bwd: 3402.23 | bwd_inner: 3401.43 | bwd_allreduce: 0.76 | step: 6.71 32%|███▏ | 3187/10000 [5:02:00<10:28:08, 5.53s/it] {'loss': 0.0535, 'grad_norm': 1.2022919654846191, 'learning_rate': 3.1875907344804314e-05, 'epoch': 3.19} 32%|███▏ | 3187/10000 [5:02:00<10:28:08, 5.53s/it][2025-06-19 18:31:45,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 18:31:45,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.83 | bwd_microstep: 3326.94 | bwd_inner_microstep: 3325.71 | bwd_allreduce_microstep: 1.15 | step_microstep: 8.86 [2025-06-19 18:31:45,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.83 | bwd: 3326.97 | bwd_inner: 3325.71 | bwd_allreduce: 1.19 | step: 8.87 32%|███▏ | 3188/10000 
[5:02:06<10:26:21, 5.52s/it] {'loss': 0.0379, 'grad_norm': 0.6043281555175781, 'learning_rate': 3.1870694812720384e-05, 'epoch': 3.19} 32%|███▏ | 3188/10000 [5:02:06<10:26:21, 5.52s/it][2025-06-19 18:31:51,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:31:51,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.08 | bwd_microstep: 3379.37 | bwd_inner_microstep: 3378.55 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.47 [2025-06-19 18:31:51,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.08 | bwd: 3379.39 | bwd_inner: 3378.55 | bwd_allreduce: 0.79 | step: 7.47 32%|███▏ | 3189/10000 [5:02:11<10:28:04, 5.53s/it] {'loss': 0.0145, 'grad_norm': 0.9587416648864746, 'learning_rate': 3.186548103545567e-05, 'epoch': 3.19} 32%|███▏ | 3189/10000 [5:02:11<10:28:04, 5.53s/it][2025-06-19 18:31:56,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:31:56,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.23 | bwd_microstep: 3371.00 | bwd_inner_microstep: 3370.14 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.99 [2025-06-19 18:31:56,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.23 | bwd: 3371.02 | bwd_inner: 3370.14 | bwd_allreduce: 0.82 | step: 6.99 32%|███▏ | 3190/10000 [5:02:17<10:28:40, 5.54s/it] {'loss': 0.0188, 'grad_norm': 0.9541571736335754, 'learning_rate': 3.186026601355706e-05, 'epoch': 3.19} 32%|███▏ | 3190/10000 [5:02:17<10:28:40, 5.54s/it][2025-06-19 18:32:02,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:32:02,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.72 | bwd_microstep: 3327.25 | bwd_inner_microstep: 3326.38 | 
bwd_allreduce_microstep: 0.80 | step_microstep: 7.42 [2025-06-19 18:32:02,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.72 | bwd: 3327.27 | bwd_inner: 3326.38 | bwd_allreduce: 0.83 | step: 7.42 32%|███▏ | 3191/10000 [5:02:22<10:26:51, 5.52s/it] {'loss': 0.0553, 'grad_norm': 2.3367245197296143, 'learning_rate': 3.1855049747571594e-05, 'epoch': 3.19} 32%|███▏ | 3191/10000 [5:02:22<10:26:51, 5.52s/it][2025-06-19 18:32:07,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:32:07,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.60 | bwd_microstep: 3329.09 | bwd_inner_microstep: 3328.14 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.37 [2025-06-19 18:32:07,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.60 | bwd: 3329.11 | bwd_inner: 3328.14 | bwd_allreduce: 0.92 | step: 7.38 32%|███▏ | 3192/10000 [5:02:28<10:25:29, 5.51s/it] {'loss': 0.0625, 'grad_norm': 7.045386791229248, 'learning_rate': 3.184983223804643e-05, 'epoch': 3.19} 32%|███▏ | 3192/10000 [5:02:28<10:25:29, 5.51s/it][2025-06-19 18:32:13,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:32:13,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.31 | bwd_microstep: 3389.66 | bwd_inner_microstep: 3388.61 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.20 [2025-06-19 18:32:13,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.31 | bwd: 3389.68 | bwd_inner: 3388.61 | bwd_allreduce: 1.02 | step: 7.20 32%|███▏ | 3193/10000 [5:02:34<10:27:21, 5.53s/it] {'loss': 0.0399, 'grad_norm': 1.39585280418396, 'learning_rate': 3.184461348552885e-05, 'epoch': 3.19} 32%|███▏ | 3193/10000 [5:02:34<10:27:21, 5.53s/it][2025-06-19 18:32:18,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:32:18,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.22 | bwd_microstep: 3340.38 | bwd_inner_microstep: 3339.41 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.64 [2025-06-19 18:32:18,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.22 | bwd: 3340.40 | bwd_inner: 3339.41 | bwd_allreduce: 0.94 | step: 7.64 32%|███▏ | 3194/10000 [5:02:39<10:26:06, 5.52s/it] {'loss': 0.027, 'grad_norm': 1.0651094913482666, 'learning_rate': 3.183939349056631e-05, 'epoch': 3.19} 32%|███▏ | 3194/10000 [5:02:39<10:26:06, 5.52s/it][2025-06-19 18:32:24,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 18:32:24,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.92 | bwd_microstep: 3380.14 | bwd_inner_microstep: 3379.25 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.86 [2025-06-19 18:32:24,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.92 | bwd: 3380.16 | bwd_inner: 3379.25 | bwd_allreduce: 0.85 | step: 7.86 32%|███▏ | 3195/10000 [5:02:45<10:27:17, 5.53s/it] {'loss': 0.0788, 'grad_norm': 0.9114205837249756, 'learning_rate': 3.183417225370632e-05, 'epoch': 3.19} 32%|███▏ | 3195/10000 [5:02:45<10:27:17, 5.53s/it][2025-06-19 18:32:29,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:32:29,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.24 | bwd_microstep: 3325.70 | bwd_inner_microstep: 3324.89 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.26 [2025-06-19 18:32:29,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.24 | bwd: 3325.72 | bwd_inner: 3324.89 | bwd_allreduce: 0.78 | step: 7.26 32%|███▏ | 3196/10000 [5:02:50<10:25:19, 5.51s/it] {'loss': 
0.0189, 'grad_norm': 1.1371736526489258, 'learning_rate': 3.1828949775496594e-05, 'epoch': 3.2} 32%|███▏ | 3196/10000 [5:02:50<10:25:19, 5.51s/it][2025-06-19 18:32:35,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:32:35,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.16 | bwd_microstep: 3327.39 | bwd_inner_microstep: 3326.44 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.56 [2025-06-19 18:32:35,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.16 | bwd: 3327.41 | bwd_inner: 3326.44 | bwd_allreduce: 0.92 | step: 7.56 32%|███▏ | 3197/10000 [5:02:56<10:23:50, 5.50s/it] {'loss': 0.0518, 'grad_norm': 3.3797354698181152, 'learning_rate': 3.182372605648494e-05, 'epoch': 3.2} 32%|███▏ | 3197/10000 [5:02:56<10:23:50, 5.50s/it][2025-06-19 18:32:40,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:32:40,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.48 | bwd_microstep: 3327.07 | bwd_inner_microstep: 3326.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 18:32:40,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.48 | bwd: 3327.09 | bwd_inner: 3326.26 | bwd_allreduce: 0.78 | step: 7.24 32%|███▏ | 3198/10000 [5:03:01<10:23:18, 5.50s/it] {'loss': 0.1676, 'grad_norm': 3.0548131465911865, 'learning_rate': 3.181850109721929e-05, 'epoch': 3.2} 32%|███▏ | 3198/10000 [5:03:01<10:23:18, 5.50s/it][2025-06-19 18:32:46,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:32:46,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.73 | bwd_microstep: 3339.79 | bwd_inner_microstep: 3338.75 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.38 
[2025-06-19 18:32:46,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.73 | bwd: 3339.80 | bwd_inner: 3338.75 | bwd_allreduce: 1.00 | step: 7.39 32%|███▏ | 3199/10000 [5:03:07<10:23:00, 5.50s/it] {'loss': 0.0685, 'grad_norm': 1.6093120574951172, 'learning_rate': 3.181327489824773e-05, 'epoch': 3.2} 32%|███▏ | 3199/10000 [5:03:07<10:23:00, 5.50s/it][2025-06-19 18:32:51,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:32:51,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.41 | bwd_microstep: 3378.65 | bwd_inner_microstep: 3377.84 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-19 18:32:51,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.41 | bwd: 3378.67 | bwd_inner: 3377.84 | bwd_allreduce: 0.78 | step: 7.31 32%|███▏ | 3200/10000 [5:03:12<10:24:47, 5.51s/it] {'loss': 0.0493, 'grad_norm': 1.363076090812683, 'learning_rate': 3.1808047460118454e-05, 'epoch': 3.2} 32%|███▏ | 3200/10000 [5:03:12<10:24:47, 5.51s/it][2025-06-19 18:32:57,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:32:57,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.99 | bwd_microstep: 3378.96 | bwd_inner_microstep: 3378.01 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.03 [2025-06-19 18:32:57,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.99 | bwd: 3378.98 | bwd_inner: 3378.01 | bwd_allreduce: 0.92 | step: 7.04 32%|███▏ | 3201/10000 [5:03:18<10:26:16, 5.53s/it] {'loss': 0.0129, 'grad_norm': 0.3961624801158905, 'learning_rate': 3.180281878337981e-05, 'epoch': 3.2} 32%|███▏ | 3201/10000 [5:03:18<10:26:16, 5.53s/it][2025-06-19 18:33:02,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | 
optimizer_step: 2.73 [2025-06-19 18:33:02,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.64 | bwd_microstep: 3333.52 | bwd_inner_microstep: 3332.53 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.68 [2025-06-19 18:33:02,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.64 | bwd: 3333.54 | bwd_inner: 3332.53 | bwd_allreduce: 0.96 | step: 7.68 32%|███▏ | 3202/10000 [5:03:23<10:24:44, 5.51s/it] {'loss': 0.0381, 'grad_norm': 1.1240400075912476, 'learning_rate': 3.179758886858025e-05, 'epoch': 3.2} 32%|███▏ | 3202/10000 [5:03:23<10:24:44, 5.51s/it][2025-06-19 18:33:08,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:33:08,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.62 | bwd_microstep: 3369.34 | bwd_inner_microstep: 3368.34 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.13 [2025-06-19 18:33:08,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.62 | bwd: 3369.37 | bwd_inner: 3368.34 | bwd_allreduce: 0.96 | step: 7.13 32%|███▏ | 3203/10000 [5:03:29<10:26:03, 5.53s/it] {'loss': 0.0253, 'grad_norm': 0.9631059169769287, 'learning_rate': 3.179235771626837e-05, 'epoch': 3.2} 32%|███▏ | 3203/10000 [5:03:29<10:26:03, 5.53s/it][2025-06-19 18:33:13,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:33:13,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.76 | bwd_microstep: 3313.28 | bwd_inner_microstep: 3312.44 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.08 [2025-06-19 18:33:13,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.76 | bwd: 3313.30 | bwd_inner: 3312.44 | bwd_allreduce: 0.81 | step: 7.08 32%|███▏ | 3204/10000 [5:03:34<10:23:58, 5.51s/it] {'loss': 0.0361, 'grad_norm': 1.222818374633789, 'learning_rate': 
3.17871253269929e-05, 'epoch': 3.2} 32%|███▏ | 3204/10000 [5:03:34<10:23:58, 5.51s/it][2025-06-19 18:33:19,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:33:19,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.99 | bwd_microstep: 3323.42 | bwd_inner_microstep: 3322.58 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.89 [2025-06-19 18:33:19,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.99 | bwd: 3323.44 | bwd_inner: 3322.58 | bwd_allreduce: 0.80 | step: 6.90 32%|███▏ | 3205/10000 [5:03:40<10:22:40, 5.50s/it] {'loss': 0.0298, 'grad_norm': 0.9164281487464905, 'learning_rate': 3.1781891701302694e-05, 'epoch': 3.21} 32%|███▏ | 3205/10000 [5:03:40<10:22:40, 5.50s/it][2025-06-19 18:33:24,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:33:24,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.66 | bwd_microstep: 3368.57 | bwd_inner_microstep: 3367.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 18:33:24,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.67 | bwd: 3368.58 | bwd_inner: 3367.78 | bwd_allreduce: 0.76 | step: 6.62 32%|███▏ | 3206/10000 [5:03:45<10:24:02, 5.51s/it] {'loss': 0.0504, 'grad_norm': 1.160908818244934, 'learning_rate': 3.1776656839746726e-05, 'epoch': 3.21} 32%|███▏ | 3206/10000 [5:03:45<10:24:02, 5.51s/it][2025-06-19 18:33:30,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:33:30,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.32 | bwd_microstep: 3317.03 | bwd_inner_microstep: 3316.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 18:33:30,315] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2108.32 | bwd: 3317.04 | bwd_inner: 3316.22 | bwd_allreduce: 0.78 | step: 7.03 32%|███▏ | 3207/10000 [5:03:51<10:22:20, 5.50s/it] {'loss': 0.0094, 'grad_norm': 0.3719661235809326, 'learning_rate': 3.177142074287411e-05, 'epoch': 3.21} 32%|███▏ | 3207/10000 [5:03:51<10:22:20, 5.50s/it][2025-06-19 18:33:35,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:33:35,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.81 | bwd_microstep: 3327.90 | bwd_inner_microstep: 3327.07 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.89 [2025-06-19 18:33:35,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.81 | bwd: 3327.92 | bwd_inner: 3327.07 | bwd_allreduce: 0.79 | step: 6.91 32%|███▏ | 3208/10000 [5:03:56<10:21:36, 5.49s/it] {'loss': 0.1018, 'grad_norm': 3.582984209060669, 'learning_rate': 3.176618341123409e-05, 'epoch': 3.21} 32%|███▏ | 3208/10000 [5:03:56<10:21:36, 5.49s/it][2025-06-19 18:33:41,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:33:41,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.26 | bwd_microstep: 3375.37 | bwd_inner_microstep: 3374.49 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.85 [2025-06-19 18:33:41,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.26 | bwd: 3375.38 | bwd_inner: 3374.49 | bwd_allreduce: 0.85 | step: 6.85 32%|███▏ | 3209/10000 [5:04:02<10:23:15, 5.51s/it] {'loss': 0.1096, 'grad_norm': 2.8472228050231934, 'learning_rate': 3.1760944845376046e-05, 'epoch': 3.21} 32%|███▏ | 3209/10000 [5:04:02<10:23:15, 5.51s/it][2025-06-19 18:33:46,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:33:46,873] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.57 | bwd_microstep: 3365.10 | bwd_inner_microstep: 3364.28 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.81 [2025-06-19 18:33:46,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.57 | bwd: 3365.12 | bwd_inner: 3364.28 | bwd_allreduce: 0.79 | step: 6.81 32%|███▏ | 3210/10000 [5:04:07<10:24:16, 5.52s/it] {'loss': 0.0808, 'grad_norm': 1.9721418619155884, 'learning_rate': 3.1755705045849465e-05, 'epoch': 3.21} 32%|███▏ | 3210/10000 [5:04:07<10:24:16, 5.52s/it][2025-06-19 18:33:52,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.74 [2025-06-19 18:33:52,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.99 | bwd_microstep: 3323.30 | bwd_inner_microstep: 3322.48 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.92 [2025-06-19 18:33:52,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.99 | bwd: 3323.31 | bwd_inner: 3322.48 | bwd_allreduce: 0.78 | step: 6.92 32%|███▏ | 3211/10000 [5:04:13<10:22:27, 5.50s/it] {'loss': 0.0174, 'grad_norm': 0.7584818005561829, 'learning_rate': 3.1750464013203985e-05, 'epoch': 3.21} 32%|███▏ | 3211/10000 [5:04:13<10:22:27, 5.50s/it][2025-06-19 18:33:57,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:33:57,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.34 | bwd_microstep: 3375.18 | bwd_inner_microstep: 3374.20 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.39 [2025-06-19 18:33:57,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.34 | bwd: 3375.19 | bwd_inner: 3374.20 | bwd_allreduce: 0.95 | step: 7.40 32%|███▏ | 3212/10000 [5:04:18<10:24:10, 5.52s/it] {'loss': 0.1831, 'grad_norm': 3.8884387016296387, 'learning_rate': 3.174522174798937e-05, 'epoch': 3.21} 32%|███▏ | 
3212/10000 [5:04:18<10:24:10, 5.52s/it][2025-06-19 18:34:03,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:34:03,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.38 | bwd_microstep: 3330.23 | bwd_inner_microstep: 3329.31 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.20 [2025-06-19 18:34:03,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.37 | bwd: 3330.25 | bwd_inner: 3329.31 | bwd_allreduce: 0.90 | step: 7.20 32%|███▏ | 3213/10000 [5:04:24<10:22:49, 5.51s/it] {'loss': 0.0625, 'grad_norm': 1.8150845766067505, 'learning_rate': 3.17399782507555e-05, 'epoch': 3.21} 32%|███▏ | 3213/10000 [5:04:24<10:22:49, 5.51s/it][2025-06-19 18:34:08,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:34:08,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.58 | bwd_microstep: 3313.51 | bwd_inner_microstep: 3312.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 18:34:08,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.59 | bwd: 3313.52 | bwd_inner: 3312.73 | bwd_allreduce: 0.75 | step: 6.56 32%|███▏ | 3214/10000 [5:04:29<10:21:12, 5.49s/it] {'loss': 0.0525, 'grad_norm': 2.88140869140625, 'learning_rate': 3.1734733522052396e-05, 'epoch': 3.21} 32%|███▏ | 3214/10000 [5:04:29<10:21:12, 5.49s/it][2025-06-19 18:34:14,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 18:34:14,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.08 | bwd_microstep: 3320.89 | bwd_inner_microstep: 3319.90 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.27 [2025-06-19 18:34:14,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.08 | bwd: 3320.91 | 
bwd_inner: 3319.90 | bwd_allreduce: 0.95 | step: 7.27 32%|███▏ | 3215/10000 [5:04:35<10:20:34, 5.49s/it] {'loss': 0.1273, 'grad_norm': 2.7613799571990967, 'learning_rate': 3.172948756243022e-05, 'epoch': 3.21} 32%|███▏ | 3215/10000 [5:04:35<10:20:34, 5.49s/it][2025-06-19 18:34:19,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:34:19,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.86 | bwd_microstep: 3326.24 | bwd_inner_microstep: 3325.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-19 18:34:19,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.86 | bwd: 3326.26 | bwd_inner: 3325.44 | bwd_allreduce: 0.77 | step: 6.87 32%|███▏ | 3216/10000 [5:04:40<10:20:07, 5.48s/it] {'loss': 0.0545, 'grad_norm': 3.1918222904205322, 'learning_rate': 3.172424037243923e-05, 'epoch': 3.22} 32%|███▏ | 3216/10000 [5:04:40<10:20:07, 5.48s/it][2025-06-19 18:34:25,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:34:25,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.72 | bwd_microstep: 3326.99 | bwd_inner_microstep: 3325.82 | bwd_allreduce_microstep: 1.10 | step_microstep: 7.25 [2025-06-19 18:34:25,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.72 | bwd: 3327.01 | bwd_inner: 3325.82 | bwd_allreduce: 1.13 | step: 7.25 32%|███▏ | 3217/10000 [5:04:46<10:19:25, 5.48s/it] {'loss': 0.0675, 'grad_norm': 13.72026252746582, 'learning_rate': 3.1718991952629835e-05, 'epoch': 3.22} 32%|███▏ | 3217/10000 [5:04:46<10:19:25, 5.48s/it][2025-06-19 18:34:30,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 18:34:30,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2127.33 | bwd_microstep: 3373.04 | bwd_inner_microstep: 3372.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 18:34:30,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.33 | bwd: 3373.05 | bwd_inner: 3372.26 | bwd_allreduce: 0.75 | step: 6.66 32%|███▏ | 3218/10000 [5:04:51<10:21:22, 5.50s/it] {'loss': 0.0049, 'grad_norm': 0.19581764936447144, 'learning_rate': 3.171374230355258e-05, 'epoch': 3.22} 32%|███▏ | 3218/10000 [5:04:51<10:21:22, 5.50s/it][2025-06-19 18:34:36,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:34:36,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.69 | bwd_microstep: 3313.24 | bwd_inner_microstep: 3312.24 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.30 [2025-06-19 18:34:36,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.69 | bwd: 3313.26 | bwd_inner: 3312.24 | bwd_allreduce: 0.96 | step: 7.30 32%|███▏ | 3219/10000 [5:04:57<10:19:56, 5.49s/it] {'loss': 0.0436, 'grad_norm': 2.902647018432617, 'learning_rate': 3.170849142575812e-05, 'epoch': 3.22} 32%|███▏ | 3219/10000 [5:04:57<10:19:56, 5.49s/it][2025-06-19 18:34:41,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:34:41,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.10 | bwd_microstep: 3378.05 | bwd_inner_microstep: 3377.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 18:34:41,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.10 | bwd: 3378.06 | bwd_inner: 3377.26 | bwd_allreduce: 0.76 | step: 6.64 32%|███▏ | 3220/10000 [5:05:02<10:22:01, 5.50s/it] {'loss': 0.1535, 'grad_norm': 5.813228130340576, 'learning_rate': 3.170323931979725e-05, 'epoch': 3.22} 32%|███▏ | 3220/10000 [5:05:02<10:22:01, 5.50s/it][2025-06-19 18:34:47,388] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:34:47,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.27 | bwd_microstep: 3407.69 | bwd_inner_microstep: 3406.74 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.98 [2025-06-19 18:34:47,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.27 | bwd: 3407.70 | bwd_inner: 3406.74 | bwd_allreduce: 0.92 | step: 6.99 32%|███▏ | 3221/10000 [5:05:08<10:24:38, 5.53s/it] {'loss': 0.1088, 'grad_norm': 6.814005374908447, 'learning_rate': 3.169798598622089e-05, 'epoch': 3.22} 32%|███▏ | 3221/10000 [5:05:08<10:24:38, 5.53s/it][2025-06-19 18:34:52,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:34:52,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.23 | bwd_microstep: 3332.45 | bwd_inner_microstep: 3331.47 | bwd_allreduce_microstep: 0.94 | step_microstep: 6.70 [2025-06-19 18:34:52,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.23 | bwd: 3332.46 | bwd_inner: 3331.47 | bwd_allreduce: 0.95 | step: 6.70 32%|███▏ | 3222/10000 [5:05:13<10:22:59, 5.51s/it] {'loss': 0.0422, 'grad_norm': 1.9776759147644043, 'learning_rate': 3.16927314255801e-05, 'epoch': 3.22} 32%|███▏ | 3222/10000 [5:05:13<10:22:59, 5.51s/it][2025-06-19 18:34:58,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:34:58,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.09 | bwd_microstep: 3319.12 | bwd_inner_microstep: 3318.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-19 18:34:58,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.09 | bwd: 3319.14 | bwd_inner: 3318.31 | bwd_allreduce: 0.78 | step: 6.98 32%|███▏ | 
3223/10000 [5:05:19<10:21:04, 5.50s/it] {'loss': 0.017, 'grad_norm': 0.7795557975769043, 'learning_rate': 3.1687475638426035e-05, 'epoch': 3.22} 32%|███▏ | 3223/10000 [5:05:19<10:21:04, 5.50s/it][2025-06-19 18:35:03,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:35:03,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.14 | bwd_microstep: 3324.94 | bwd_inner_microstep: 3323.70 | bwd_allreduce_microstep: 1.18 | step_microstep: 7.54 [2025-06-19 18:35:03,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.14 | bwd: 3324.96 | bwd_inner: 3323.70 | bwd_allreduce: 1.21 | step: 7.56 32%|███▏ | 3224/10000 [5:05:24<10:20:15, 5.49s/it] {'loss': 0.131, 'grad_norm': 3.149022340774536, 'learning_rate': 3.1682218625310034e-05, 'epoch': 3.22} 32%|███▏ | 3224/10000 [5:05:24<10:20:15, 5.49s/it][2025-06-19 18:35:09,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:35:09,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.81 | bwd_microstep: 3365.68 | bwd_inner_microstep: 3364.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 18:35:09,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.81 | bwd: 3365.69 | bwd_inner: 3364.71 | bwd_allreduce: 0.76 | step: 6.66 32%|███▏ | 3225/10000 [5:05:30<10:21:45, 5.51s/it] {'loss': 0.069, 'grad_norm': 1.8016082048416138, 'learning_rate': 3.1676960386783507e-05, 'epoch': 3.23} 32%|███▏ | 3225/10000 [5:05:30<10:21:45, 5.51s/it][2025-06-19 18:35:14,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:35:14,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.17 | bwd_microstep: 3324.13 | bwd_inner_microstep: 3323.34 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 18:35:14,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.17 | bwd: 3324.14 | bwd_inner: 3323.34 | bwd_allreduce: 0.76 | step: 6.62 32%|███▏ | 3226/10000 [5:05:35<10:20:37, 5.50s/it] {'loss': 0.0281, 'grad_norm': 2.653866767883301, 'learning_rate': 3.167170092339804e-05, 'epoch': 3.23} 32%|███▏ | 3226/10000 [5:05:35<10:20:37, 5.50s/it][2025-06-19 18:35:20,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:35:20,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.01 | bwd_microstep: 3318.23 | bwd_inner_microstep: 3317.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 18:35:20,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.01 | bwd: 3318.24 | bwd_inner: 3317.45 | bwd_allreduce: 0.75 | step: 6.55 32%|███▏ | 3227/10000 [5:05:41<10:19:28, 5.49s/it] {'loss': 0.0284, 'grad_norm': 2.3814761638641357, 'learning_rate': 3.166644023570531e-05, 'epoch': 3.23} 32%|███▏ | 3227/10000 [5:05:41<10:19:28, 5.49s/it][2025-06-19 18:35:25,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:35:25,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.08 | bwd_microstep: 3324.98 | bwd_inner_microstep: 3324.18 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 18:35:25,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.08 | bwd: 3324.99 | bwd_inner: 3324.18 | bwd_allreduce: 0.77 | step: 6.79 32%|███▏ | 3228/10000 [5:05:46<10:18:50, 5.48s/it] {'loss': 0.0111, 'grad_norm': 0.47365081310272217, 'learning_rate': 3.166117832425715e-05, 'epoch': 3.23} 32%|███▏ | 3228/10000 [5:05:46<10:18:50, 5.48s/it][2025-06-19 18:35:31,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:35:31,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.61 | bwd_microstep: 3373.29 | bwd_inner_microstep: 3372.52 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.69 [2025-06-19 18:35:31,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.61 | bwd: 3373.31 | bwd_inner: 3372.52 | bwd_allreduce: 0.75 | step: 6.69 32%|███▏ | 3229/10000 [5:05:52<10:21:05, 5.50s/it] {'loss': 0.0606, 'grad_norm': 1.759642481803894, 'learning_rate': 3.16559151896055e-05, 'epoch': 3.23} 32%|███▏ | 3229/10000 [5:05:52<10:21:05, 5.50s/it][2025-06-19 18:35:36,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:35:36,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.44 | bwd_microstep: 3372.11 | bwd_inner_microstep: 3371.32 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 18:35:36,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.44 | bwd: 3372.13 | bwd_inner: 3371.32 | bwd_allreduce: 0.77 | step: 6.62 32%|███▏ | 3230/10000 [5:05:57<10:22:06, 5.51s/it] {'loss': 0.0233, 'grad_norm': 0.8446597456932068, 'learning_rate': 3.165065083230244e-05, 'epoch': 3.23} 32%|███▏ | 3230/10000 [5:05:57<10:22:06, 5.51s/it][2025-06-19 18:35:42,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:35:42,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.49 | bwd_microstep: 3326.16 | bwd_inner_microstep: 3325.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 18:35:42,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.50 | bwd: 3326.17 | bwd_inner: 3325.36 | bwd_allreduce: 0.77 | step: 6.72 32%|███▏ | 3231/10000 [5:06:03<10:20:57, 5.50s/it] {'loss': 
0.0068, 'grad_norm': 0.2438255250453949, 'learning_rate': 3.164538525290019e-05, 'epoch': 3.23} 32%|███▏ | 3231/10000 [5:06:03<10:20:57, 5.50s/it][2025-06-19 18:35:47,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:35:47,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.39 | bwd_microstep: 3368.02 | bwd_inner_microstep: 3367.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 18:35:47,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.39 | bwd: 3368.03 | bwd_inner: 3367.21 | bwd_allreduce: 0.78 | step: 7.08 32%|███▏ | 3232/10000 [5:06:08<10:21:49, 5.51s/it] {'loss': 0.09, 'grad_norm': 0.8781144618988037, 'learning_rate': 3.164011845195106e-05, 'epoch': 3.23} 32%|███▏ | 3232/10000 [5:06:08<10:21:49, 5.51s/it][2025-06-19 18:35:53,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:35:53,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.42 | bwd_microstep: 3376.37 | bwd_inner_microstep: 3375.54 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.79 [2025-06-19 18:35:53,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.42 | bwd: 3376.39 | bwd_inner: 3375.54 | bwd_allreduce: 0.80 | step: 6.79 32%|███▏ | 3233/10000 [5:06:14<10:22:54, 5.52s/it] {'loss': 0.0347, 'grad_norm': 1.6773316860198975, 'learning_rate': 3.1634850430007545e-05, 'epoch': 3.23} 32%|███▏ | 3233/10000 [5:06:14<10:22:54, 5.52s/it][2025-06-19 18:35:58,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:35:58,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.84 | bwd_microstep: 3312.74 | bwd_inner_microstep: 3311.80 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.90 
[2025-06-19 18:35:58,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.84 | bwd: 3312.76 | bwd_inner: 3311.80 | bwd_allreduce: 0.91 | step: 6.90 32%|███▏ | 3234/10000 [5:06:19<10:20:27, 5.50s/it] {'loss': 0.0433, 'grad_norm': 1.0467487573623657, 'learning_rate': 3.162958118762222e-05, 'epoch': 3.23} 32%|███▏ | 3234/10000 [5:06:19<10:20:27, 5.50s/it][2025-06-19 18:36:04,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:36:04,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.23 | bwd_microstep: 3331.76 | bwd_inner_microstep: 3330.79 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.07 [2025-06-19 18:36:04,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.23 | bwd: 3331.78 | bwd_inner: 3330.79 | bwd_allreduce: 0.93 | step: 7.07 32%|███▏ | 3235/10000 [5:06:25<10:19:39, 5.50s/it] {'loss': 0.0245, 'grad_norm': 1.2176592350006104, 'learning_rate': 3.162431072534779e-05, 'epoch': 3.23} 32%|███▏ | 3235/10000 [5:06:25<10:19:39, 5.50s/it][2025-06-19 18:36:09,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:36:09,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2150.23 | bwd_microstep: 3375.10 | bwd_inner_microstep: 3374.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 18:36:09,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2150.23 | bwd: 3375.12 | bwd_inner: 3374.29 | bwd_allreduce: 0.78 | step: 7.20 32%|███▏ | 3236/10000 [5:06:30<10:21:56, 5.52s/it] {'loss': 0.0056, 'grad_norm': 0.24863922595977783, 'learning_rate': 3.161903904373712e-05, 'epoch': 3.24} 32%|███▏ | 3236/10000 [5:06:30<10:21:56, 5.52s/it][2025-06-19 18:36:15,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | 
optimizer_step: 2.72 [2025-06-19 18:36:15,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.07 | bwd_microstep: 3316.85 | bwd_inner_microstep: 3316.00 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.96 [2025-06-19 18:36:15,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.07 | bwd: 3316.88 | bwd_inner: 3316.00 | bwd_allreduce: 0.81 | step: 6.96 32%|███▏ | 3237/10000 [5:06:36<10:20:22, 5.50s/it] {'loss': 0.2585, 'grad_norm': 2.5540759563446045, 'learning_rate': 3.1613766143343175e-05, 'epoch': 3.24} 32%|███▏ | 3237/10000 [5:06:36<10:20:22, 5.50s/it][2025-06-19 18:36:20,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:36:20,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.89 | bwd_microstep: 3315.52 | bwd_inner_microstep: 3314.71 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 18:36:20,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.89 | bwd: 3315.53 | bwd_inner: 3314.71 | bwd_allreduce: 0.78 | step: 7.13 32%|███▏ | 3238/10000 [5:06:41<10:18:42, 5.49s/it] {'loss': 0.0573, 'grad_norm': 1.2651472091674805, 'learning_rate': 3.160849202471907e-05, 'epoch': 3.24} 32%|███▏ | 3238/10000 [5:06:41<10:18:42, 5.49s/it][2025-06-19 18:36:26,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:36:26,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.38 | bwd_microstep: 3402.76 | bwd_inner_microstep: 3401.76 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.23 [2025-06-19 18:36:26,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.38 | bwd: 3402.77 | bwd_inner: 3401.76 | bwd_allreduce: 0.96 | step: 7.23 32%|███▏ | 3239/10000 [5:06:47<10:21:31, 5.52s/it] {'loss': 0.1199, 'grad_norm': 2.423081159591675, 'learning_rate': 
3.1603216688418025e-05, 'epoch': 3.24} 32%|███▏ | 3239/10000 [5:06:47<10:21:31, 5.52s/it][2025-06-19 18:36:31,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:36:31,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.20 | bwd_microstep: 3318.72 | bwd_inner_microstep: 3317.85 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.14 [2025-06-19 18:36:31,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.20 | bwd: 3318.74 | bwd_inner: 3317.85 | bwd_allreduce: 0.84 | step: 7.14 32%|███▏ | 3240/10000 [5:06:52<10:19:54, 5.50s/it] {'loss': 0.2577, 'grad_norm': 3.7413549423217773, 'learning_rate': 3.15979401349934e-05, 'epoch': 3.24} 32%|███▏ | 3240/10000 [5:06:52<10:19:54, 5.50s/it][2025-06-19 18:36:37,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:36:37,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.06 | bwd_microstep: 3364.81 | bwd_inner_microstep: 3364.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 18:36:37,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.06 | bwd: 3364.82 | bwd_inner: 3364.01 | bwd_allreduce: 0.77 | step: 7.02 32%|███▏ | 3241/10000 [5:06:58<10:21:12, 5.51s/it] {'loss': 0.1012, 'grad_norm': 2.3957595825195312, 'learning_rate': 3.159266236499868e-05, 'epoch': 3.24} 32%|███▏ | 3241/10000 [5:06:58<10:21:12, 5.51s/it][2025-06-19 18:36:42,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:36:42,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.14 | bwd_microstep: 3361.25 | bwd_inner_microstep: 3360.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 18:36:42,954] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2121.14 | bwd: 3361.26 | bwd_inner: 3360.46 | bwd_allreduce: 0.76 | step: 6.59 32%|███▏ | 3242/10000 [5:07:03<10:21:18, 5.52s/it] {'loss': 0.0485, 'grad_norm': 1.3920397758483887, 'learning_rate': 3.1587383378987486e-05, 'epoch': 3.24} 32%|███▏ | 3242/10000 [5:07:03<10:21:18, 5.52s/it][2025-06-19 18:36:48,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:36:48,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.62 | bwd_microstep: 3362.44 | bwd_inner_microstep: 3361.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 18:36:48,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.62 | bwd: 3362.45 | bwd_inner: 3361.65 | bwd_allreduce: 0.76 | step: 6.64 32%|███▏ | 3243/10000 [5:07:09<10:21:32, 5.52s/it] {'loss': 0.0517, 'grad_norm': 1.3092482089996338, 'learning_rate': 3.1582103177513554e-05, 'epoch': 3.24} 32%|███▏ | 3243/10000 [5:07:09<10:21:32, 5.52s/it][2025-06-19 18:36:54,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:36:54,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.69 | bwd_microstep: 3376.17 | bwd_inner_microstep: 3375.18 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.13 [2025-06-19 18:36:54,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.69 | bwd: 3376.19 | bwd_inner: 3375.19 | bwd_allreduce: 0.95 | step: 7.14 32%|███▏ | 3244/10000 [5:07:14<10:22:28, 5.53s/it] {'loss': 0.0358, 'grad_norm': 1.032659888267517, 'learning_rate': 3.157682176113076e-05, 'epoch': 3.24} 32%|███▏ | 3244/10000 [5:07:14<10:22:28, 5.53s/it][2025-06-19 18:36:59,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:36:59,499] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.23 | bwd_microstep: 3323.16 | bwd_inner_microstep: 3322.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 18:36:59,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.23 | bwd: 3323.18 | bwd_inner: 3322.35 | bwd_allreduce: 0.78 | step: 7.20 32%|███▏ | 3245/10000 [5:07:20<10:20:27, 5.51s/it] {'loss': 0.1143, 'grad_norm': 2.7821922302246094, 'learning_rate': 3.1571539130393086e-05, 'epoch': 3.25} 32%|███▏ | 3245/10000 [5:07:20<10:20:27, 5.51s/it][2025-06-19 18:37:04,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:37:04,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.40 | bwd_microstep: 3312.98 | bwd_inner_microstep: 3312.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 18:37:04,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.40 | bwd: 3312.99 | bwd_inner: 3312.19 | bwd_allreduce: 0.76 | step: 6.64 32%|███▏ | 3246/10000 [5:07:25<10:18:32, 5.49s/it] {'loss': 0.072, 'grad_norm': 1.5093152523040771, 'learning_rate': 3.156625528585466e-05, 'epoch': 3.25} 32%|███▏ | 3246/10000 [5:07:25<10:18:32, 5.49s/it][2025-06-19 18:37:10,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:37:10,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.51 | bwd_microstep: 3352.16 | bwd_inner_microstep: 3351.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 18:37:10,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.51 | bwd: 3352.17 | bwd_inner: 3351.34 | bwd_allreduce: 0.78 | step: 7.12 32%|███▏ | 3247/10000 [5:07:31<10:18:59, 5.50s/it] {'loss': 0.1245, 'grad_norm': 2.00549054145813, 'learning_rate': 3.1560970228069736e-05, 'epoch': 3.25} 32%|███▏ | 
3247/10000 [5:07:31<10:18:59, 5.50s/it][2025-06-19 18:37:15,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:37:15,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.54 | bwd_microstep: 3358.40 | bwd_inner_microstep: 3357.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 18:37:15,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.55 | bwd: 3358.41 | bwd_inner: 3357.60 | bwd_allreduce: 0.76 | step: 6.76 32%|███▏ | 3248/10000 [5:07:36<10:19:39, 5.51s/it] {'loss': 0.0275, 'grad_norm': 1.3409919738769531, 'learning_rate': 3.1555683957592695e-05, 'epoch': 3.25} 32%|███▏ | 3248/10000 [5:07:36<10:19:39, 5.51s/it][2025-06-19 18:37:21,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:37:21,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.05 | bwd_microstep: 3366.78 | bwd_inner_microstep: 3365.96 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.11 [2025-06-19 18:37:21,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.05 | bwd: 3366.80 | bwd_inner: 3365.96 | bwd_allreduce: 0.79 | step: 7.11 32%|███▏ | 3249/10000 [5:07:42<10:20:23, 5.51s/it] {'loss': 0.0419, 'grad_norm': 0.7002978324890137, 'learning_rate': 3.155039647497803e-05, 'epoch': 3.25} 32%|███▏ | 3249/10000 [5:07:42<10:20:23, 5.51s/it][2025-06-19 18:37:27,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:37:27,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.00 | bwd_microstep: 3358.05 | bwd_inner_microstep: 3357.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 18:37:27,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.00 | bwd: 
3358.06 | bwd_inner: 3357.26 | bwd_allreduce: 0.76 | step: 6.72 32%|███▎ | 3250/10000 [5:07:47<10:20:35, 5.52s/it] {'loss': 0.105, 'grad_norm': 2.595438003540039, 'learning_rate': 3.154510778078039e-05, 'epoch': 3.25} 32%|███▎ | 3250/10000 [5:07:47<10:20:35, 5.52s/it][2025-06-19 18:37:32,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:37:32,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.30 | bwd_microstep: 3385.67 | bwd_inner_microstep: 3384.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 18:37:32,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.30 | bwd: 3385.68 | bwd_inner: 3384.88 | bwd_allreduce: 0.76 | step: 6.67 33%|███▎ | 3251/10000 [5:07:53<10:21:50, 5.53s/it] {'loss': 0.0553, 'grad_norm': 0.9413166046142578, 'learning_rate': 3.1539817875554524e-05, 'epoch': 3.25} 33%|███▎ | 3251/10000 [5:07:53<10:21:50, 5.53s/it][2025-06-19 18:37:38,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:37:38,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.27 | bwd_microstep: 3356.90 | bwd_inner_microstep: 3356.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 18:37:38,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.27 | bwd: 3356.92 | bwd_inner: 3356.10 | bwd_allreduce: 0.77 | step: 7.07 33%|███▎ | 3252/10000 [5:07:58<10:21:28, 5.53s/it] {'loss': 0.0128, 'grad_norm': 0.44593873620033264, 'learning_rate': 3.1534526759855326e-05, 'epoch': 3.25} 33%|███▎ | 3252/10000 [5:07:58<10:21:28, 5.53s/it][2025-06-19 18:37:43,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:37:43,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2136.18 | bwd_microstep: 3398.40 | bwd_inner_microstep: 3397.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 18:37:43,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.18 | bwd: 3398.41 | bwd_inner: 3397.60 | bwd_allreduce: 0.77 | step: 6.73 33%|███▎ | 3253/10000 [5:08:04<10:22:55, 5.54s/it] {'loss': 0.0636, 'grad_norm': 1.17919921875, 'learning_rate': 3.15292344342378e-05, 'epoch': 3.25} 33%|███▎ | 3253/10000 [5:08:04<10:22:55, 5.54s/it][2025-06-19 18:37:49,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:37:49,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.60 | bwd_microstep: 3317.13 | bwd_inner_microstep: 3316.19 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.05 [2025-06-19 18:37:49,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.60 | bwd: 3317.15 | bwd_inner: 3316.19 | bwd_allreduce: 0.91 | step: 7.05 33%|███▎ | 3254/10000 [5:08:09<10:19:57, 5.51s/it] {'loss': 0.025, 'grad_norm': 1.015483021736145, 'learning_rate': 3.152394089925708e-05, 'epoch': 3.25} 33%|███▎ | 3254/10000 [5:08:09<10:19:57, 5.51s/it][2025-06-19 18:37:54,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:37:54,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.61 | bwd_microstep: 3321.77 | bwd_inner_microstep: 3320.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 18:37:54,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.61 | bwd: 3321.79 | bwd_inner: 3320.98 | bwd_allreduce: 0.77 | step: 6.85 33%|███▎ | 3255/10000 [5:08:15<10:18:33, 5.50s/it] {'loss': 0.0568, 'grad_norm': 1.5855695009231567, 'learning_rate': 3.1518646155468465e-05, 'epoch': 3.25} 33%|███▎ | 3255/10000 [5:08:15<10:18:33, 5.50s/it][2025-06-19 
18:38:00,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:38:00,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.84 | bwd_microstep: 3330.23 | bwd_inner_microstep: 3329.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 18:38:00,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.84 | bwd: 3330.24 | bwd_inner: 3329.44 | bwd_allreduce: 0.76 | step: 6.82 33%|███▎ | 3256/10000 [5:08:20<10:17:34, 5.49s/it] {'loss': 0.0237, 'grad_norm': 0.6996119022369385, 'learning_rate': 3.1513350203427314e-05, 'epoch': 3.26} 33%|███▎ | 3256/10000 [5:08:20<10:17:34, 5.49s/it][2025-06-19 18:38:05,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.89 [2025-06-19 18:38:05,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.66 | bwd_microstep: 3316.19 | bwd_inner_microstep: 3315.12 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.45 [2025-06-19 18:38:05,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.66 | bwd: 3316.21 | bwd_inner: 3315.12 | bwd_allreduce: 1.03 | step: 7.45 33%|███▎ | 3257/10000 [5:08:26<10:16:34, 5.49s/it] {'loss': 0.0341, 'grad_norm': 1.5693557262420654, 'learning_rate': 3.150805304368916e-05, 'epoch': 3.26} 33%|███▎ | 3257/10000 [5:08:26<10:16:34, 5.49s/it][2025-06-19 18:38:11,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:38:11,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.66 | bwd_microstep: 3313.72 | bwd_inner_microstep: 3312.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.70 [2025-06-19 18:38:11,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.66 | bwd: 3313.74 | bwd_inner: 3312.90 | bwd_allreduce: 0.78 | 
step: 6.71 33%|███▎ | 3258/10000 [5:08:31<10:15:30, 5.48s/it] {'loss': 0.0348, 'grad_norm': 1.6166480779647827, 'learning_rate': 3.150275467680966e-05, 'epoch': 3.26} 33%|███▎ | 3258/10000 [5:08:31<10:15:30, 5.48s/it][2025-06-19 18:38:16,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:38:16,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.13 | bwd_microstep: 3369.04 | bwd_inner_microstep: 3368.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 18:38:16,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.13 | bwd: 3369.05 | bwd_inner: 3368.26 | bwd_allreduce: 0.75 | step: 6.55 33%|███▎ | 3259/10000 [5:08:37<10:17:19, 5.49s/it] {'loss': 0.0226, 'grad_norm': 1.2307543754577637, 'learning_rate': 3.149745510334458e-05, 'epoch': 3.26} 33%|███▎ | 3259/10000 [5:08:37<10:17:19, 5.49s/it][2025-06-19 18:38:22,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:38:22,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.61 | bwd_microstep: 3377.40 | bwd_inner_microstep: 3376.49 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.14 [2025-06-19 18:38:22,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.61 | bwd: 3377.42 | bwd_inner: 3376.49 | bwd_allreduce: 0.88 | step: 7.15 33%|███▎ | 3260/10000 [5:08:42<10:18:55, 5.51s/it] {'loss': 0.0607, 'grad_norm': 1.4439572095870972, 'learning_rate': 3.149215432384981e-05, 'epoch': 3.26} 33%|███▎ | 3260/10000 [5:08:42<10:18:55, 5.51s/it][2025-06-19 18:38:27,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.79 [2025-06-19 18:38:27,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.33 | bwd_microstep: 3317.98 | 
bwd_inner_microstep: 3317.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 18:38:27,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.33 | bwd: 3317.99 | bwd_inner: 3317.20 | bwd_allreduce: 0.75 | step: 6.60 33%|███▎ | 3261/10000 [5:08:48<10:17:27, 5.50s/it] {'loss': 0.0559, 'grad_norm': 1.4157016277313232, 'learning_rate': 3.14868523388814e-05, 'epoch': 3.26} 33%|███▎ | 3261/10000 [5:08:48<10:17:27, 5.50s/it][2025-06-19 18:38:33,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:38:33,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.13 | bwd_microstep: 3369.19 | bwd_inner_microstep: 3368.13 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.88 [2025-06-19 18:38:33,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.13 | bwd: 3369.22 | bwd_inner: 3368.13 | bwd_allreduce: 1.02 | step: 7.89 33%|███▎ | 3262/10000 [5:08:53<10:18:41, 5.51s/it] {'loss': 0.0433, 'grad_norm': 1.1333987712860107, 'learning_rate': 3.1481549148995495e-05, 'epoch': 3.26} 33%|███▎ | 3262/10000 [5:08:53<10:18:41, 5.51s/it][2025-06-19 18:38:38,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 18:38:38,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.19 | bwd_microstep: 3365.58 | bwd_inner_microstep: 3364.81 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.61 [2025-06-19 18:38:38,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.19 | bwd: 3365.60 | bwd_inner: 3364.81 | bwd_allreduce: 0.75 | step: 6.62 33%|███▎ | 3263/10000 [5:08:59<10:19:15, 5.52s/it] {'loss': 0.1303, 'grad_norm': 2.526442050933838, 'learning_rate': 3.1476244754748364e-05, 'epoch': 3.26} 33%|███▎ | 3263/10000 [5:08:59<10:19:15, 5.52s/it][2025-06-19 18:38:44,178] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:38:44,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.10 | bwd_microstep: 3372.70 | bwd_inner_microstep: 3371.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 18:38:44,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.10 | bwd: 3372.71 | bwd_inner: 3371.92 | bwd_allreduce: 0.76 | step: 6.57 33%|███▎ | 3264/10000 [5:09:04<10:20:06, 5.52s/it] {'loss': 0.0108, 'grad_norm': 0.4396548867225647, 'learning_rate': 3.147093915669642e-05, 'epoch': 3.26} 33%|███▎ | 3264/10000 [5:09:04<10:20:06, 5.52s/it][2025-06-19 18:38:49,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:38:49,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.42 | bwd_microstep: 3329.95 | bwd_inner_microstep: 3329.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.68 [2025-06-19 18:38:49,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.42 | bwd: 3329.96 | bwd_inner: 3329.14 | bwd_allreduce: 0.78 | step: 6.68 33%|███▎ | 3265/10000 [5:09:10<10:18:21, 5.51s/it] {'loss': 0.0493, 'grad_norm': 2.0600790977478027, 'learning_rate': 3.146563235539619e-05, 'epoch': 3.27} 33%|███▎ | 3265/10000 [5:09:10<10:18:21, 5.51s/it][2025-06-19 18:38:55,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:38:55,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.12 | bwd_microstep: 3371.75 | bwd_inner_microstep: 3370.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 18:38:55,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.12 | bwd: 3371.58 | bwd_inner: 3370.78 | bwd_allreduce: 0.76 | step: 6.55 33%|███▎ | 3266/10000 [5:09:15<10:19:20, 
5.52s/it] {'loss': 0.0586, 'grad_norm': 1.1839048862457275, 'learning_rate': 3.146032435140436e-05, 'epoch': 3.27} 33%|███▎ | 3266/10000 [5:09:15<10:19:20, 5.52s/it][2025-06-19 18:39:00,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:39:00,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.57 | bwd_microstep: 3321.54 | bwd_inner_microstep: 3320.65 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.04 [2025-06-19 18:39:00,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.57 | bwd: 3321.56 | bwd_inner: 3320.65 | bwd_allreduce: 0.85 | step: 7.05 33%|███▎ | 3267/10000 [5:09:21<10:17:26, 5.50s/it] {'loss': 0.0314, 'grad_norm': 0.9657690525054932, 'learning_rate': 3.1455015145277674e-05, 'epoch': 3.27} 33%|███▎ | 3267/10000 [5:09:21<10:17:26, 5.50s/it][2025-06-19 18:39:06,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:39:06,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.81 | bwd_microstep: 3321.83 | bwd_inner_microstep: 3320.91 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.24 [2025-06-19 18:39:06,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.81 | bwd: 3321.85 | bwd_inner: 3320.91 | bwd_allreduce: 0.89 | step: 7.24 33%|███▎ | 3268/10000 [5:09:26<10:16:25, 5.49s/it] {'loss': 0.0506, 'grad_norm': 1.6293103694915771, 'learning_rate': 3.144970473757308e-05, 'epoch': 3.27} 33%|███▎ | 3268/10000 [5:09:26<10:16:25, 5.49s/it][2025-06-19 18:39:11,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:39:11,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.67 | bwd_microstep: 3324.81 | bwd_inner_microstep: 3324.02 | bwd_allreduce_microstep: 0.74 | 
step_microstep: 6.67 [2025-06-19 18:39:11,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.67 | bwd: 3324.82 | bwd_inner: 3324.02 | bwd_allreduce: 0.75 | step: 6.67 33%|███▎ | 3269/10000 [5:09:32<10:15:14, 5.48s/it] {'loss': 0.0278, 'grad_norm': 0.9168758392333984, 'learning_rate': 3.144439312884758e-05, 'epoch': 3.27} 33%|███▎ | 3269/10000 [5:09:32<10:15:14, 5.48s/it][2025-06-19 18:39:17,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 18:39:17,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.10 | bwd_microstep: 3372.77 | bwd_inner_microstep: 3371.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-19 18:39:17,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.10 | bwd: 3372.78 | bwd_inner: 3371.99 | bwd_allreduce: 0.75 | step: 6.80 33%|███▎ | 3270/10000 [5:09:37<10:17:06, 5.50s/it] {'loss': 0.04, 'grad_norm': 1.186710238456726, 'learning_rate': 3.143908031965837e-05, 'epoch': 3.27} 33%|███▎ | 3270/10000 [5:09:37<10:17:06, 5.50s/it][2025-06-19 18:39:22,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:39:22,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.11 | bwd_microstep: 3377.03 | bwd_inner_microstep: 3376.18 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.99 [2025-06-19 18:39:22,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.11 | bwd: 3377.05 | bwd_inner: 3376.18 | bwd_allreduce: 0.82 | step: 6.99 33%|███▎ | 3271/10000 [5:09:43<10:18:37, 5.52s/it] {'loss': 0.1409, 'grad_norm': 1.7572276592254639, 'learning_rate': 3.143376631056273e-05, 'epoch': 3.27} 33%|███▎ | 3271/10000 [5:09:43<10:18:37, 5.52s/it][2025-06-19 18:39:28,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.53 | optimizer_step: 2.73 [2025-06-19 18:39:28,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.73 | bwd_microstep: 3325.20 | bwd_inner_microstep: 3324.26 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.87 [2025-06-19 18:39:28,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.73 | bwd: 3325.21 | bwd_inner: 3324.26 | bwd_allreduce: 0.91 | step: 6.88 33%|███▎ | 3272/10000 [5:09:48<10:17:15, 5.50s/it] {'loss': 0.0468, 'grad_norm': 0.7999803423881531, 'learning_rate': 3.142845110211805e-05, 'epoch': 3.27} 33%|███▎ | 3272/10000 [5:09:48<10:17:15, 5.50s/it][2025-06-19 18:39:33,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:39:33,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.66 | bwd_microstep: 3328.03 | bwd_inner_microstep: 3327.17 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.06 [2025-06-19 18:39:33,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.66 | bwd: 3328.06 | bwd_inner: 3327.17 | bwd_allreduce: 0.82 | step: 7.07 33%|███▎ | 3273/10000 [5:09:54<10:16:24, 5.50s/it] {'loss': 0.081, 'grad_norm': 1.9397459030151367, 'learning_rate': 3.142313469488191e-05, 'epoch': 3.27} 33%|███▎ | 3273/10000 [5:09:54<10:16:24, 5.50s/it][2025-06-19 18:39:39,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 18:39:39,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.24 | bwd_microstep: 3329.80 | bwd_inner_microstep: 3328.87 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.37 [2025-06-19 18:39:39,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.24 | bwd: 3329.81 | bwd_inner: 3328.87 | bwd_allreduce: 0.90 | step: 7.38 33%|███▎ | 3274/10000 [5:09:59<10:15:48, 5.49s/it] {'loss': 0.0973, 'grad_norm': 2.2026970386505127, 
'learning_rate': 3.1417817089411947e-05, 'epoch': 3.27} 33%|███▎ | 3274/10000 [5:09:59<10:15:48, 5.49s/it][2025-06-19 18:39:44,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:39:44,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.66 | bwd_microstep: 3325.81 | bwd_inner_microstep: 3325.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.59 [2025-06-19 18:39:44,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.66 | bwd: 3325.82 | bwd_inner: 3325.02 | bwd_allreduce: 0.76 | step: 6.59 33%|███▎ | 3275/10000 [5:10:05<10:15:10, 5.49s/it] {'loss': 0.0883, 'grad_norm': 1.8611304759979248, 'learning_rate': 3.1412498286265964e-05, 'epoch': 3.27} 33%|███▎ | 3275/10000 [5:10:05<10:15:10, 5.49s/it][2025-06-19 18:39:50,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:39:50,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.42 | bwd_microstep: 3326.55 | bwd_inner_microstep: 3325.77 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 18:39:50,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.42 | bwd: 3326.56 | bwd_inner: 3325.77 | bwd_allreduce: 0.75 | step: 6.65 33%|███▎ | 3276/10000 [5:10:10<10:14:38, 5.48s/it] {'loss': 0.0147, 'grad_norm': 0.6763681173324585, 'learning_rate': 3.140717828600188e-05, 'epoch': 3.28} 33%|███▎ | 3276/10000 [5:10:10<10:14:38, 5.48s/it][2025-06-19 18:39:55,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:39:55,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.68 | bwd_microstep: 3320.51 | bwd_inner_microstep: 3319.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.58 [2025-06-19 18:39:55,554] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.68 | bwd: 3320.53 | bwd_inner: 3319.72 | bwd_allreduce: 0.76 | step: 6.58 33%|███▎ | 3277/10000 [5:10:16<10:14:04, 5.48s/it] {'loss': 0.0089, 'grad_norm': 0.40452587604522705, 'learning_rate': 3.1401857089177733e-05, 'epoch': 3.28} 33%|███▎ | 3277/10000 [5:10:16<10:14:04, 5.48s/it][2025-06-19 18:40:01,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:40:01,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.80 | bwd_microstep: 3374.87 | bwd_inner_microstep: 3374.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 18:40:01,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.80 | bwd: 3374.88 | bwd_inner: 3374.08 | bwd_allreduce: 0.75 | step: 6.55 33%|███▎ | 3278/10000 [5:10:21<10:15:59, 5.50s/it] {'loss': 0.0215, 'grad_norm': 1.2261583805084229, 'learning_rate': 3.1396534696351705e-05, 'epoch': 3.28} 33%|███▎ | 3278/10000 [5:10:21<10:15:59, 5.50s/it][2025-06-19 18:40:06,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:40:06,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.82 | bwd_microstep: 3323.69 | bwd_inner_microstep: 3322.85 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.79 [2025-06-19 18:40:06,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.82 | bwd: 3323.70 | bwd_inner: 3322.85 | bwd_allreduce: 0.82 | step: 6.80 33%|███▎ | 3279/10000 [5:10:27<10:14:52, 5.49s/it] {'loss': 0.0111, 'grad_norm': 0.2998252511024475, 'learning_rate': 3.139121110808207e-05, 'epoch': 3.28} 33%|███▎ | 3279/10000 [5:10:27<10:14:52, 5.49s/it][2025-06-19 18:40:12,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
18:40:12,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.55 | bwd_microstep: 3381.95 | bwd_inner_microstep: 3381.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 18:40:12,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.55 | bwd: 3381.96 | bwd_inner: 3381.17 | bwd_allreduce: 0.76 | step: 6.60 33%|███▎ | 3280/10000 [5:10:32<10:16:52, 5.51s/it] {'loss': 0.021, 'grad_norm': 0.7831490635871887, 'learning_rate': 3.1385886324927255e-05, 'epoch': 3.28} 33%|███▎ | 3280/10000 [5:10:32<10:16:52, 5.51s/it][2025-06-19 18:40:17,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:40:17,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.56 | bwd_microstep: 3335.41 | bwd_inner_microstep: 3334.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-19 18:40:17,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.56 | bwd: 3335.42 | bwd_inner: 3334.61 | bwd_allreduce: 0.77 | step: 6.93 33%|███▎ | 3281/10000 [5:10:38<10:16:31, 5.51s/it] {'loss': 0.0141, 'grad_norm': 0.9027436971664429, 'learning_rate': 3.138056034744581e-05, 'epoch': 3.28} 33%|███▎ | 3281/10000 [5:10:38<10:16:31, 5.51s/it][2025-06-19 18:40:23,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:40:23,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.32 | bwd_microstep: 3371.09 | bwd_inner_microstep: 3370.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 18:40:23,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.32 | bwd: 3371.11 | bwd_inner: 3370.31 | bwd_allreduce: 0.76 | step: 6.71 33%|███▎ | 3282/10000 [5:10:43<10:17:25, 5.51s/it] {'loss': 0.0898, 'grad_norm': 2.0863234996795654, 'learning_rate': 3.137523317619641e-05, 'epoch': 
3.28} 33%|███▎ | 3282/10000 [5:10:43<10:17:25, 5.51s/it][2025-06-19 18:40:28,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:40:28,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.41 | bwd_microstep: 3328.78 | bwd_inner_microstep: 3327.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 18:40:28,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.41 | bwd: 3328.79 | bwd_inner: 3327.99 | bwd_allreduce: 0.76 | step: 6.61 33%|███▎ | 3283/10000 [5:10:49<10:16:16, 5.50s/it] {'loss': 0.1008, 'grad_norm': 2.127136468887329, 'learning_rate': 3.136990481173785e-05, 'epoch': 3.28} 33%|███▎ | 3283/10000 [5:10:49<10:16:16, 5.50s/it][2025-06-19 18:40:34,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:40:34,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.19 | bwd_microstep: 3323.65 | bwd_inner_microstep: 3322.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 18:40:34,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.19 | bwd: 3323.67 | bwd_inner: 3322.87 | bwd_allreduce: 0.75 | step: 6.58 33%|███▎ | 3284/10000 [5:10:54<10:15:26, 5.50s/it] {'loss': 0.006, 'grad_norm': 0.20929160714149475, 'learning_rate': 3.136457525462903e-05, 'epoch': 3.28} 33%|███▎ | 3284/10000 [5:10:54<10:15:26, 5.50s/it][2025-06-19 18:40:39,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:40:39,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.61 | bwd_microstep: 3376.07 | bwd_inner_microstep: 3375.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 18:40:39,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2137.61 | bwd: 3376.09 | bwd_inner: 3375.29 | bwd_allreduce: 0.75 | step: 6.57 33%|███▎ | 3285/10000 [5:11:00<10:17:07, 5.51s/it] {'loss': 0.0855, 'grad_norm': 1.382115364074707, 'learning_rate': 3.1359244505429015e-05, 'epoch': 3.29} 33%|███▎ | 3285/10000 [5:11:00<10:17:07, 5.51s/it][2025-06-19 18:40:45,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:40:45,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.18 | bwd_microstep: 3382.54 | bwd_inner_microstep: 3381.72 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.35 [2025-06-19 18:40:45,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.18 | bwd: 3382.56 | bwd_inner: 3381.72 | bwd_allreduce: 0.79 | step: 7.35 33%|███▎ | 3286/10000 [5:11:06<10:18:07, 5.52s/it] {'loss': 0.0225, 'grad_norm': 0.8675946593284607, 'learning_rate': 3.1353912564696975e-05, 'epoch': 3.29} 33%|███▎ | 3286/10000 [5:11:06<10:18:07, 5.52s/it][2025-06-19 18:40:50,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:40:50,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.34 | bwd_microstep: 3377.61 | bwd_inner_microstep: 3376.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 18:40:50,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.34 | bwd: 3377.63 | bwd_inner: 3376.83 | bwd_allreduce: 0.76 | step: 6.80 33%|███▎ | 3287/10000 [5:11:11<10:18:51, 5.53s/it] {'loss': 0.0335, 'grad_norm': 1.2063300609588623, 'learning_rate': 3.13485794329922e-05, 'epoch': 3.29} 33%|███▎ | 3287/10000 [5:11:11<10:18:51, 5.53s/it][2025-06-19 18:40:56,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:40:56,295] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2128.17 | bwd_microstep: 3371.23 | bwd_inner_microstep: 3370.40 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.75 [2025-06-19 18:40:56,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.17 | bwd: 3371.24 | bwd_inner: 3370.40 | bwd_allreduce: 0.80 | step: 6.75 33%|███▎ | 3288/10000 [5:11:17<10:18:59, 5.53s/it] {'loss': 0.0253, 'grad_norm': 1.5510934591293335, 'learning_rate': 3.134324511087411e-05, 'epoch': 3.29} 33%|███▎ | 3288/10000 [5:11:17<10:18:59, 5.53s/it][2025-06-19 18:41:01,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:41:01,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.80 | bwd_microstep: 3378.05 | bwd_inner_microstep: 3377.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 18:41:01,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.80 | bwd: 3378.06 | bwd_inner: 3377.26 | bwd_allreduce: 0.76 | step: 6.66 33%|███▎ | 3289/10000 [5:11:22<10:19:27, 5.54s/it] {'loss': 0.0469, 'grad_norm': 1.2048205137252808, 'learning_rate': 3.133790959890226e-05, 'epoch': 3.29} 33%|███▎ | 3289/10000 [5:11:22<10:19:27, 5.54s/it][2025-06-19 18:41:07,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:41:07,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.13 | bwd_microstep: 3325.19 | bwd_inner_microstep: 3324.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 18:41:07,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.13 | bwd: 3325.20 | bwd_inner: 3324.40 | bwd_allreduce: 0.76 | step: 6.54 33%|███▎ | 3290/10000 [5:11:28<10:17:03, 5.52s/it] {'loss': 0.0386, 'grad_norm': 1.8140016794204712, 'learning_rate': 3.1332572897636304e-05, 'epoch': 3.29} 33%|███▎ | 3290/10000 [5:11:28<10:17:03, 
5.52s/it][2025-06-19 18:41:12,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 18:41:12,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.84 | bwd_microstep: 3325.14 | bwd_inner_microstep: 3324.15 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.10 [2025-06-19 18:41:12,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.84 | bwd: 3325.16 | bwd_inner: 3324.15 | bwd_allreduce: 0.95 | step: 7.10 33%|███▎ | 3291/10000 [5:11:33<10:15:45, 5.51s/it] {'loss': 0.0331, 'grad_norm': 1.6496241092681885, 'learning_rate': 3.132723500763605e-05, 'epoch': 3.29} 33%|███▎ | 3291/10000 [5:11:33<10:15:45, 5.51s/it][2025-06-19 18:41:18,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:41:18,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.11 | bwd_microstep: 3326.94 | bwd_inner_microstep: 3326.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 18:41:18,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.11 | bwd: 3326.96 | bwd_inner: 3326.15 | bwd_allreduce: 0.76 | step: 6.70 33%|███▎ | 3292/10000 [5:11:39<10:14:39, 5.50s/it] {'loss': 0.0448, 'grad_norm': 1.0862895250320435, 'learning_rate': 3.132189592946142e-05, 'epoch': 3.29} 33%|███▎ | 3292/10000 [5:11:39<10:14:39, 5.50s/it][2025-06-19 18:41:23,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:41:23,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.78 | bwd_microstep: 3325.87 | bwd_inner_microstep: 3325.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-19 18:41:23,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.78 | bwd: 3325.88 | bwd_inner: 3325.09 | 
bwd_allreduce: 0.75 | step: 6.74 33%|███▎ | 3293/10000 [5:11:44<10:13:42, 5.49s/it] {'loss': 0.0051, 'grad_norm': 0.22653527557849884, 'learning_rate': 3.1316555663672453e-05, 'epoch': 3.29} 33%|███▎ | 3293/10000 [5:11:44<10:13:42, 5.49s/it][2025-06-19 18:41:29,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:41:29,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.14 | bwd_microstep: 3336.29 | bwd_inner_microstep: 3335.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 18:41:29,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.14 | bwd: 3336.31 | bwd_inner: 3335.50 | bwd_allreduce: 0.77 | step: 6.77 33%|███▎ | 3294/10000 [5:11:50<10:13:16, 5.49s/it] {'loss': 0.015, 'grad_norm': 0.7067720890045166, 'learning_rate': 3.131121421082932e-05, 'epoch': 3.29} 33%|███▎ | 3294/10000 [5:11:50<10:13:16, 5.49s/it][2025-06-19 18:41:34,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 18:41:34,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.27 | bwd_microstep: 3386.77 | bwd_inner_microstep: 3385.83 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.04 [2025-06-19 18:41:34,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.27 | bwd: 3386.79 | bwd_inner: 3385.83 | bwd_allreduce: 0.91 | step: 7.04 33%|███▎ | 3295/10000 [5:11:55<10:15:50, 5.51s/it] {'loss': 0.0056, 'grad_norm': 0.14301015436649323, 'learning_rate': 3.13058715714923e-05, 'epoch': 3.29} 33%|███▎ | 3295/10000 [5:11:55<10:15:50, 5.51s/it][2025-06-19 18:41:40,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:41:40,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.30 | bwd_microstep: 
3324.44 | bwd_inner_microstep: 3323.49 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.13 [2025-06-19 18:41:40,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.30 | bwd: 3324.45 | bwd_inner: 3323.49 | bwd_allreduce: 0.92 | step: 7.13 33%|███▎ | 3296/10000 [5:12:01<10:14:38, 5.50s/it] {'loss': 0.0254, 'grad_norm': 1.1978834867477417, 'learning_rate': 3.130052774622184e-05, 'epoch': 3.3} 33%|███▎ | 3296/10000 [5:12:01<10:14:38, 5.50s/it][2025-06-19 18:41:45,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:41:45,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.28 | bwd_microstep: 3412.71 | bwd_inner_microstep: 3411.66 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.22 [2025-06-19 18:41:45,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.28 | bwd: 3412.73 | bwd_inner: 3411.66 | bwd_allreduce: 1.01 | step: 7.22 33%|███▎ | 3297/10000 [5:12:06<10:17:57, 5.53s/it] {'loss': 0.0218, 'grad_norm': 0.717685341835022, 'learning_rate': 3.129518273557846e-05, 'epoch': 3.3} 33%|███▎ | 3297/10000 [5:12:06<10:17:57, 5.53s/it][2025-06-19 18:41:51,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:41:51,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.69 | bwd_microstep: 3323.31 | bwd_inner_microstep: 3322.48 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.81 [2025-06-19 18:41:51,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.69 | bwd: 3323.33 | bwd_inner: 3322.48 | bwd_allreduce: 0.81 | step: 6.81 33%|███▎ | 3298/10000 [5:12:12<10:15:56, 5.51s/it] {'loss': 0.0043, 'grad_norm': 0.15014182031154633, 'learning_rate': 3.1289836540122836e-05, 'epoch': 3.3} 33%|███▎ | 3298/10000 [5:12:12<10:15:56, 5.51s/it][2025-06-19 18:41:56,836] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:41:56,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.78 | bwd_microstep: 3327.46 | bwd_inner_microstep: 3326.60 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.98 [2025-06-19 18:41:56,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.78 | bwd: 3327.47 | bwd_inner: 3326.60 | bwd_allreduce: 0.84 | step: 6.98 33%|███▎ | 3299/10000 [5:12:17<10:15:00, 5.51s/it] {'loss': 0.0401, 'grad_norm': 1.4415019750595093, 'learning_rate': 3.128448916041575e-05, 'epoch': 3.3} 33%|███▎ | 3299/10000 [5:12:17<10:15:00, 5.51s/it][2025-06-19 18:42:02,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 18:42:02,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.17 | bwd_microstep: 3376.52 | bwd_inner_microstep: 3375.38 | bwd_allreduce_microstep: 1.08 | step_microstep: 8.00 [2025-06-19 18:42:02,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.17 | bwd: 3376.54 | bwd_inner: 3375.38 | bwd_allreduce: 1.11 | step: 8.01 33%|███▎ | 3300/10000 [5:12:23<10:16:28, 5.52s/it] {'loss': 0.1646, 'grad_norm': 2.576951503753662, 'learning_rate': 3.1279140597018135e-05, 'epoch': 3.3} 33%|███▎ | 3300/10000 [5:12:23<10:16:28, 5.52s/it][2025-06-19 18:42:07,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:42:07,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.94 | bwd_microstep: 3375.52 | bwd_inner_microstep: 3374.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 18:42:07,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.94 | bwd: 3375.53 | bwd_inner: 3374.72 | bwd_allreduce: 0.77 | step: 6.89 33%|███▎ | 
3301/10000 [5:12:28<10:17:44, 5.53s/it] {'loss': 0.0425, 'grad_norm': 2.6137378215789795, 'learning_rate': 3.1273790850491015e-05, 'epoch': 3.3} 33%|███▎ | 3301/10000 [5:12:28<10:17:44, 5.53s/it][2025-06-19 18:42:13,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.87 [2025-06-19 18:42:13,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.08 | bwd_microstep: 3321.26 | bwd_inner_microstep: 3320.45 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-19 18:42:13,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.08 | bwd: 3321.27 | bwd_inner: 3320.45 | bwd_allreduce: 0.78 | step: 7.30 33%|███▎ | 3302/10000 [5:12:34<10:15:20, 5.51s/it] {'loss': 0.0486, 'grad_norm': 1.6719104051589966, 'learning_rate': 3.1268439921395556e-05, 'epoch': 3.3} 33%|███▎ | 3302/10000 [5:12:34<10:15:20, 5.51s/it][2025-06-19 18:42:18,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.75 [2025-06-19 18:42:18,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.09 | bwd_microstep: 3329.14 | bwd_inner_microstep: 3328.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 18:42:18,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.09 | bwd: 3329.15 | bwd_inner: 3328.35 | bwd_allreduce: 0.76 | step: 6.77 33%|███▎ | 3303/10000 [5:12:39<10:13:59, 5.50s/it] {'loss': 0.0589, 'grad_norm': 2.1375908851623535, 'learning_rate': 3.126308781029304e-05, 'epoch': 3.3} 33%|███▎ | 3303/10000 [5:12:39<10:13:59, 5.50s/it][2025-06-19 18:42:24,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:42:24,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.76 | bwd_microstep: 3399.71 | bwd_inner_microstep: 3398.92 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 18:42:24,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.76 | bwd: 3399.72 | bwd_inner: 3398.92 | bwd_allreduce: 0.76 | step: 6.61 33%|███▎ | 3304/10000 [5:12:45<10:16:20, 5.52s/it] {'loss': 0.0499, 'grad_norm': 1.4053192138671875, 'learning_rate': 3.12577345177449e-05, 'epoch': 3.3} 33%|███▎ | 3304/10000 [5:12:45<10:16:20, 5.52s/it][2025-06-19 18:42:30,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:42:30,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.58 | bwd_microstep: 3402.26 | bwd_inner_microstep: 3401.43 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.24 [2025-06-19 18:42:30,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.58 | bwd: 3402.28 | bwd_inner: 3401.43 | bwd_allreduce: 0.80 | step: 7.24 33%|███▎ | 3305/10000 [5:12:50<10:18:07, 5.54s/it] {'loss': 0.064, 'grad_norm': 2.7569754123687744, 'learning_rate': 3.1252380044312655e-05, 'epoch': 3.31} 33%|███▎ | 3305/10000 [5:12:50<10:18:07, 5.54s/it][2025-06-19 18:42:35,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:42:35,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.61 | bwd_microstep: 3321.76 | bwd_inner_microstep: 3320.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 18:42:35,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.61 | bwd: 3321.77 | bwd_inner: 3320.98 | bwd_allreduce: 0.75 | step: 6.57 33%|███▎ | 3306/10000 [5:12:56<10:15:50, 5.52s/it] {'loss': 0.1423, 'grad_norm': 1.8962876796722412, 'learning_rate': 3.124702439055797e-05, 'epoch': 3.31} 33%|███▎ | 3306/10000 [5:12:56<10:15:50, 5.52s/it][2025-06-19 18:42:40,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:42:40,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.52 | bwd_microstep: 3324.29 | bwd_inner_microstep: 3323.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 18:42:40,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.52 | bwd: 3324.30 | bwd_inner: 3323.51 | bwd_allreduce: 0.75 | step: 6.58 33%|███▎ | 3307/10000 [5:13:01<10:13:53, 5.50s/it] {'loss': 0.0167, 'grad_norm': 0.5122536420822144, 'learning_rate': 3.124166755704261e-05, 'epoch': 3.31} 33%|███▎ | 3307/10000 [5:13:01<10:13:53, 5.50s/it][2025-06-19 18:42:46,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:42:46,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.36 | bwd_microstep: 3323.99 | bwd_inner_microstep: 3323.18 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 18:42:46,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.36 | bwd: 3324.00 | bwd_inner: 3323.18 | bwd_allreduce: 0.78 | step: 7.20 33%|███▎ | 3308/10000 [5:13:07<10:12:43, 5.49s/it] {'loss': 0.0484, 'grad_norm': 1.030550479888916, 'learning_rate': 3.1236309544328514e-05, 'epoch': 3.31} 33%|███▎ | 3308/10000 [5:13:07<10:12:43, 5.49s/it][2025-06-19 18:42:51,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:42:51,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.81 | bwd_microstep: 3368.50 | bwd_inner_microstep: 3367.64 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.92 [2025-06-19 18:42:51,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.81 | bwd: 3368.52 | bwd_inner: 3367.64 | bwd_allreduce: 0.84 | step: 6.92 33%|███▎ | 3309/10000 [5:13:12<10:14:07, 5.51s/it] {'loss': 
0.0508, 'grad_norm': 2.03356671333313, 'learning_rate': 3.123095035297769e-05, 'epoch': 3.31} 33%|███▎ | 3309/10000 [5:13:12<10:14:07, 5.51s/it][2025-06-19 18:42:57,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:42:57,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.68 | bwd_microstep: 3328.90 | bwd_inner_microstep: 3328.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 18:42:57,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.68 | bwd: 3328.92 | bwd_inner: 3328.11 | bwd_allreduce: 0.76 | step: 6.77 33%|███▎ | 3310/10000 [5:13:18<10:13:15, 5.50s/it] {'loss': 0.0506, 'grad_norm': 1.1595230102539062, 'learning_rate': 3.1225589983552295e-05, 'epoch': 3.31} 33%|███▎ | 3310/10000 [5:13:18<10:13:15, 5.50s/it][2025-06-19 18:43:02,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:43:02,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.56 | bwd_microstep: 3328.16 | bwd_inner_microstep: 3327.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 18:43:02,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.56 | bwd: 3328.17 | bwd_inner: 3327.35 | bwd_allreduce: 0.78 | step: 7.20 33%|███▎ | 3311/10000 [5:13:23<10:12:18, 5.49s/it] {'loss': 0.121, 'grad_norm': 1.7813760042190552, 'learning_rate': 3.122022843661462e-05, 'epoch': 3.31} 33%|███▎ | 3311/10000 [5:13:23<10:12:18, 5.49s/it][2025-06-19 18:43:08,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:43:08,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.00 | bwd_microstep: 3368.58 | bwd_inner_microstep: 3367.74 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.82 
[2025-06-19 18:43:08,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.00 | bwd: 3368.59 | bwd_inner: 3367.74 | bwd_allreduce: 0.81 | step: 6.84 33%|███▎ | 3312/10000 [5:13:29<10:13:58, 5.51s/it] {'loss': 0.0112, 'grad_norm': 0.32815417647361755, 'learning_rate': 3.1214865712727047e-05, 'epoch': 3.31} 33%|███▎ | 3312/10000 [5:13:29<10:13:58, 5.51s/it][2025-06-19 18:43:13,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:43:13,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.70 | bwd_microstep: 3329.31 | bwd_inner_microstep: 3328.49 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 18:43:13,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.70 | bwd: 3329.32 | bwd_inner: 3328.49 | bwd_allreduce: 0.78 | step: 7.11 33%|███▎ | 3313/10000 [5:13:34<10:12:58, 5.50s/it] {'loss': 0.0125, 'grad_norm': 0.2661892771720886, 'learning_rate': 3.120950181245211e-05, 'epoch': 3.31} 33%|███▎ | 3313/10000 [5:13:34<10:12:58, 5.50s/it][2025-06-19 18:43:19,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:43:19,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.35 | bwd_microstep: 3325.49 | bwd_inner_microstep: 3324.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 18:43:19,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.35 | bwd: 3325.50 | bwd_inner: 3324.70 | bwd_allreduce: 0.76 | step: 6.76 33%|███▎ | 3314/10000 [5:13:40<10:11:53, 5.49s/it] {'loss': 0.0376, 'grad_norm': 1.6228915452957153, 'learning_rate': 3.120413673635247e-05, 'epoch': 3.31} 33%|███▎ | 3314/10000 [5:13:40<10:11:53, 5.49s/it][2025-06-19 18:43:25,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | 
optimizer_step: 2.73 [2025-06-19 18:43:25,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.06 | bwd_microstep: 3405.86 | bwd_inner_microstep: 3405.02 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.83 [2025-06-19 18:43:25,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.06 | bwd: 3405.87 | bwd_inner: 3405.02 | bwd_allreduce: 0.82 | step: 6.84 33%|███▎ | 3315/10000 [5:13:45<10:14:54, 5.52s/it] {'loss': 0.0255, 'grad_norm': 1.2484804391860962, 'learning_rate': 3.1198770484990874e-05, 'epoch': 3.31} 33%|███▎ | 3315/10000 [5:13:45<10:14:54, 5.52s/it][2025-06-19 18:43:30,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:43:30,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.73 | bwd_microstep: 3316.10 | bwd_inner_microstep: 3315.29 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.15 [2025-06-19 18:43:30,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.73 | bwd: 3316.12 | bwd_inner: 3315.29 | bwd_allreduce: 0.79 | step: 7.16 33%|███▎ | 3316/10000 [5:13:51<10:12:46, 5.50s/it] {'loss': 0.0152, 'grad_norm': 0.2820320427417755, 'learning_rate': 3.119340305893024e-05, 'epoch': 3.32} 33%|███▎ | 3316/10000 [5:13:51<10:12:46, 5.50s/it][2025-06-19 18:43:35,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:43:35,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.71 | bwd_microstep: 3315.59 | bwd_inner_microstep: 3314.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-19 18:43:35,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.71 | bwd: 3315.60 | bwd_inner: 3314.80 | bwd_allreduce: 0.76 | step: 6.79 33%|███▎ | 3317/10000 [5:13:56<10:11:12, 5.49s/it] {'loss': 0.0169, 'grad_norm': 1.0371426343917847, 'learning_rate': 
3.118803445873356e-05, 'epoch': 3.32} 33%|███▎ | 3317/10000 [5:13:56<10:11:12, 5.49s/it][2025-06-19 18:43:41,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:43:41,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.38 | bwd_microstep: 3322.23 | bwd_inner_microstep: 3321.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 18:43:41,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.38 | bwd: 3322.25 | bwd_inner: 3321.42 | bwd_allreduce: 0.78 | step: 7.21 33%|███▎ | 3318/10000 [5:14:02<10:10:27, 5.48s/it] {'loss': 0.0053, 'grad_norm': 0.1666955202817917, 'learning_rate': 3.1182664684964005e-05, 'epoch': 3.32} 33%|███▎ | 3318/10000 [5:14:02<10:10:27, 5.48s/it][2025-06-19 18:43:46,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:43:46,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.32 | bwd_microstep: 3323.58 | bwd_inner_microstep: 3322.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 18:43:46,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.32 | bwd: 3323.59 | bwd_inner: 3322.79 | bwd_allreduce: 0.76 | step: 6.72 33%|███▎ | 3319/10000 [5:14:07<10:09:44, 5.48s/it] {'loss': 0.0811, 'grad_norm': 1.0991029739379883, 'learning_rate': 3.117729373818482e-05, 'epoch': 3.32} 33%|███▎ | 3319/10000 [5:14:07<10:09:44, 5.48s/it][2025-06-19 18:43:52,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.83 [2025-06-19 18:43:52,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.91 | bwd_microstep: 3317.31 | bwd_inner_microstep: 3316.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-19 18:43:52,334] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.91 | bwd: 3317.32 | bwd_inner: 3316.52 | bwd_allreduce: 0.76 | step: 6.76 33%|███▎ | 3320/10000 [5:14:13<10:09:03, 5.47s/it] {'loss': 0.0492, 'grad_norm': 1.6882425546646118, 'learning_rate': 3.1171921618959395e-05, 'epoch': 3.32} 33%|███▎ | 3320/10000 [5:14:13<10:09:03, 5.47s/it][2025-06-19 18:43:57,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:43:57,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.49 | bwd_microstep: 3323.73 | bwd_inner_microstep: 3322.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 18:43:57,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.49 | bwd: 3323.74 | bwd_inner: 3322.94 | bwd_allreduce: 0.76 | step: 6.64 33%|███▎ | 3321/10000 [5:14:18<10:08:35, 5.47s/it] {'loss': 0.0358, 'grad_norm': 2.316314220428467, 'learning_rate': 3.116654832785124e-05, 'epoch': 3.32} 33%|███▎ | 3321/10000 [5:14:18<10:08:35, 5.47s/it][2025-06-19 18:44:03,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:44:03,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.30 | bwd_microstep: 3317.42 | bwd_inner_microstep: 3316.61 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 18:44:03,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.30 | bwd: 3317.43 | bwd_inner: 3316.61 | bwd_allreduce: 0.78 | step: 7.08 33%|███▎ | 3322/10000 [5:14:24<10:08:18, 5.47s/it] {'loss': 0.0827, 'grad_norm': 2.6147375106811523, 'learning_rate': 3.1161173865424e-05, 'epoch': 3.32} 33%|███▎ | 3322/10000 [5:14:24<10:08:18, 5.47s/it][2025-06-19 18:44:08,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:44:08,712] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.29 | bwd_microstep: 3315.34 | bwd_inner_microstep: 3314.49 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.96 [2025-06-19 18:44:08,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.29 | bwd: 3315.36 | bwd_inner: 3314.49 | bwd_allreduce: 0.81 | step: 6.96 33%|███▎ | 3323/10000 [5:14:29<10:07:57, 5.46s/it] {'loss': 0.0424, 'grad_norm': 2.247715950012207, 'learning_rate': 3.1155798232241417e-05, 'epoch': 3.32} 33%|███▎ | 3323/10000 [5:14:29<10:07:57, 5.46s/it][2025-06-19 18:44:14,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 18:44:14,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.21 | bwd_microstep: 3333.04 | bwd_inner_microstep: 3331.98 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.14 [2025-06-19 18:44:14,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.22 | bwd: 3333.06 | bwd_inner: 3331.98 | bwd_allreduce: 1.03 | step: 7.15 33%|███▎ | 3324/10000 [5:14:35<10:09:10, 5.47s/it] {'loss': 0.0106, 'grad_norm': 0.4511306881904602, 'learning_rate': 3.1150421428867374e-05, 'epoch': 3.32} 33%|███▎ | 3324/10000 [5:14:35<10:09:10, 5.47s/it][2025-06-19 18:44:19,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:44:19,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.30 | bwd_microstep: 3319.71 | bwd_inner_microstep: 3318.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 18:44:19,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.30 | bwd: 3319.72 | bwd_inner: 3318.92 | bwd_allreduce: 0.76 | step: 6.59 33%|███▎ | 3325/10000 [5:14:40<10:08:56, 5.47s/it] {'loss': 0.0218, 'grad_norm': 1.146333932876587, 'learning_rate': 3.114504345586587e-05, 'epoch': 3.33} 33%|███▎ 
| 3325/10000 [5:14:40<10:08:56, 5.47s/it][2025-06-19 18:44:25,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:44:25,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.83 | bwd_microstep: 3328.48 | bwd_inner_microstep: 3327.48 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.29 [2025-06-19 18:44:25,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.83 | bwd: 3328.49 | bwd_inner: 3327.48 | bwd_allreduce: 0.97 | step: 7.31 33%|███▎ | 3326/10000 [5:14:45<10:08:47, 5.47s/it] {'loss': 0.0103, 'grad_norm': 0.3948013484477997, 'learning_rate': 3.113966431380104e-05, 'epoch': 3.33} 33%|███▎ | 3326/10000 [5:14:45<10:08:47, 5.47s/it][2025-06-19 18:44:30,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:44:30,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.11 | bwd_microstep: 3372.88 | bwd_inner_microstep: 3372.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-19 18:44:30,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.11 | bwd: 3372.90 | bwd_inner: 3372.08 | bwd_allreduce: 0.78 | step: 6.91 33%|███▎ | 3327/10000 [5:14:51<10:11:05, 5.49s/it] {'loss': 0.0159, 'grad_norm': 0.40042099356651306, 'learning_rate': 3.113428400323712e-05, 'epoch': 3.33} 33%|███▎ | 3327/10000 [5:14:51<10:11:05, 5.49s/it][2025-06-19 18:44:36,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:44:36,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.41 | bwd_microstep: 3370.47 | bwd_inner_microstep: 3369.62 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.92 [2025-06-19 18:44:36,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.42 | bwd: 
3370.49 | bwd_inner: 3369.62 | bwd_allreduce: 0.82 | step: 6.93 33%|███▎ | 3328/10000 [5:14:57<10:12:29, 5.51s/it] {'loss': 0.0746, 'grad_norm': 3.0815186500549316, 'learning_rate': 3.1128902524738495e-05, 'epoch': 3.33} 33%|███▎ | 3328/10000 [5:14:57<10:12:29, 5.51s/it][2025-06-19 18:44:41,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:44:41,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.41 | bwd_microstep: 3320.27 | bwd_inner_microstep: 3319.43 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.95 [2025-06-19 18:44:41,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.41 | bwd: 3320.29 | bwd_inner: 3319.43 | bwd_allreduce: 0.81 | step: 6.96 33%|███▎ | 3329/10000 [5:15:02<10:11:14, 5.50s/it] {'loss': 0.1609, 'grad_norm': 4.195224285125732, 'learning_rate': 3.1123519878869636e-05, 'epoch': 3.33} 33%|███▎ | 3329/10000 [5:15:02<10:11:14, 5.50s/it][2025-06-19 18:44:47,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:44:47,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.09 | bwd_microstep: 3315.47 | bwd_inner_microstep: 3314.68 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.54 [2025-06-19 18:44:47,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.09 | bwd: 3315.48 | bwd_inner: 3314.68 | bwd_allreduce: 0.75 | step: 6.54 33%|███▎ | 3330/10000 [5:15:07<10:09:44, 5.48s/it] {'loss': 0.046, 'grad_norm': 2.5181384086608887, 'learning_rate': 3.1118136066195165e-05, 'epoch': 3.33} 33%|███▎ | 3330/10000 [5:15:07<10:09:44, 5.48s/it][2025-06-19 18:44:52,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:44:52,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2101.90 | bwd_microstep: 3314.92 | bwd_inner_microstep: 3314.03 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.01 [2025-06-19 18:44:52,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.90 | bwd: 3314.93 | bwd_inner: 3314.03 | bwd_allreduce: 0.86 | step: 7.01 33%|███▎ | 3331/10000 [5:15:13<10:08:35, 5.48s/it] {'loss': 0.0704, 'grad_norm': 2.011169195175171, 'learning_rate': 3.1112751087279824e-05, 'epoch': 3.33} 33%|███▎ | 3331/10000 [5:15:13<10:08:35, 5.48s/it][2025-06-19 18:44:58,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:44:58,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.01 | bwd_microstep: 3321.71 | bwd_inner_microstep: 3320.80 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.96 [2025-06-19 18:44:58,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.01 | bwd: 3321.72 | bwd_inner: 3320.80 | bwd_allreduce: 0.88 | step: 6.96 33%|███▎ | 3332/10000 [5:15:18<10:08:09, 5.47s/it] {'loss': 0.1586, 'grad_norm': 4.049959659576416, 'learning_rate': 3.1107364942688474e-05, 'epoch': 3.33} 33%|███▎ | 3332/10000 [5:15:18<10:08:09, 5.47s/it][2025-06-19 18:45:03,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:45:03,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.58 | bwd_microstep: 3320.14 | bwd_inner_microstep: 3319.01 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.77 [2025-06-19 18:45:03,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.58 | bwd: 3320.16 | bwd_inner: 3319.01 | bwd_allreduce: 1.10 | step: 7.78 33%|███▎ | 3333/10000 [5:15:24<10:08:07, 5.47s/it] {'loss': 0.0753, 'grad_norm': 2.3983609676361084, 'learning_rate': 3.110197763298609e-05, 'epoch': 3.33} 33%|███▎ | 3333/10000 [5:15:24<10:08:07, 5.47s/it][2025-06-19 
18:45:09,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:45:09,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.38 | bwd_microstep: 3318.57 | bwd_inner_microstep: 3317.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 18:45:09,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.38 | bwd: 3318.58 | bwd_inner: 3317.79 | bwd_allreduce: 0.76 | step: 6.64 33%|███▎ | 3334/10000 [5:15:29<10:08:01, 5.47s/it] {'loss': 0.0567, 'grad_norm': 2.157637119293213, 'learning_rate': 3.109658915873778e-05, 'epoch': 3.33} 33%|███▎ | 3334/10000 [5:15:29<10:08:01, 5.47s/it][2025-06-19 18:45:14,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:45:14,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.36 | bwd_microstep: 3361.55 | bwd_inner_microstep: 3360.77 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 18:45:14,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.36 | bwd: 3361.57 | bwd_inner: 3360.77 | bwd_allreduce: 0.76 | step: 6.60 33%|███▎ | 3335/10000 [5:15:35<10:09:51, 5.49s/it] {'loss': 0.0189, 'grad_norm': 0.8385804891586304, 'learning_rate': 3.109119952050876e-05, 'epoch': 3.33} 33%|███▎ | 3335/10000 [5:15:35<10:09:51, 5.49s/it][2025-06-19 18:45:20,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.79 [2025-06-19 18:45:20,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.04 | bwd_microstep: 3370.63 | bwd_inner_microstep: 3369.57 | bwd_allreduce_microstep: 1.01 | step_microstep: 6.98 [2025-06-19 18:45:20,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.04 | bwd: 3370.64 | bwd_inner: 3369.57 | bwd_allreduce: 1.03 | step: 
6.98 33%|███▎ | 3336/10000 [5:15:40<10:11:26, 5.51s/it] {'loss': 0.0896, 'grad_norm': 2.4046638011932373, 'learning_rate': 3.1085808718864396e-05, 'epoch': 3.34} 33%|███▎ | 3336/10000 [5:15:40<10:11:26, 5.51s/it][2025-06-19 18:45:25,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:45:25,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.04 | bwd_microstep: 3322.45 | bwd_inner_microstep: 3321.64 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-19 18:45:25,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.04 | bwd: 3322.47 | bwd_inner: 3321.64 | bwd_allreduce: 0.79 | step: 7.18 33%|███▎ | 3337/10000 [5:15:46<10:10:05, 5.49s/it] {'loss': 0.0315, 'grad_norm': 1.6846396923065186, 'learning_rate': 3.1080416754370143e-05, 'epoch': 3.34} 33%|███▎ | 3337/10000 [5:15:46<10:10:05, 5.49s/it][2025-06-19 18:45:31,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:45:31,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.67 | bwd_microstep: 3360.05 | bwd_inner_microstep: 3359.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 18:45:31,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.67 | bwd: 3360.06 | bwd_inner: 3359.27 | bwd_allreduce: 0.75 | step: 6.66 33%|███▎ | 3338/10000 [5:15:51<10:11:21, 5.51s/it] {'loss': 0.0429, 'grad_norm': 2.316042184829712, 'learning_rate': 3.10750236275916e-05, 'epoch': 3.34} 33%|███▎ | 3338/10000 [5:15:51<10:11:21, 5.51s/it][2025-06-19 18:45:36,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:45:36,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.88 | bwd_microstep: 3370.51 | 
bwd_inner_microstep: 3369.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 18:45:36,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.88 | bwd: 3370.52 | bwd_inner: 3369.72 | bwd_allreduce: 0.76 | step: 6.65 33%|███▎ | 3339/10000 [5:15:57<10:12:11, 5.51s/it] {'loss': 0.0572, 'grad_norm': 1.7479852437973022, 'learning_rate': 3.106962933909448e-05, 'epoch': 3.34} 33%|███▎ | 3339/10000 [5:15:57<10:12:11, 5.51s/it][2025-06-19 18:45:42,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:45:42,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.35 | bwd_microstep: 3375.72 | bwd_inner_microstep: 3374.81 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.90 [2025-06-19 18:45:42,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.35 | bwd: 3375.74 | bwd_inner: 3374.81 | bwd_allreduce: 0.88 | step: 6.90 33%|███▎ | 3340/10000 [5:16:02<10:13:07, 5.52s/it] {'loss': 0.0444, 'grad_norm': 0.7757589221000671, 'learning_rate': 3.1064233889444615e-05, 'epoch': 3.34} 33%|███▎ | 3340/10000 [5:16:02<10:13:07, 5.52s/it][2025-06-19 18:45:47,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:45:47,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.01 | bwd_microstep: 3333.55 | bwd_inner_microstep: 3332.72 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.27 [2025-06-19 18:45:47,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.01 | bwd: 3333.57 | bwd_inner: 3332.72 | bwd_allreduce: 0.80 | step: 7.27 33%|███▎ | 3341/10000 [5:16:08<10:11:39, 5.51s/it] {'loss': 0.0281, 'grad_norm': 1.4933053255081177, 'learning_rate': 3.1058837279207975e-05, 'epoch': 3.34} 33%|███▎ | 3341/10000 [5:16:08<10:11:39, 5.51s/it][2025-06-19 18:45:53,136] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:45:53,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.07 | bwd_microstep: 3323.02 | bwd_inner_microstep: 3322.20 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.92 [2025-06-19 18:45:53,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.07 | bwd: 3323.04 | bwd_inner: 3322.20 | bwd_allreduce: 0.79 | step: 6.92 33%|███▎ | 3342/10000 [5:16:13<10:10:14, 5.50s/it] {'loss': 0.2347, 'grad_norm': 4.741446018218994, 'learning_rate': 3.105343950895062e-05, 'epoch': 3.34} 33%|███▎ | 3342/10000 [5:16:13<10:10:14, 5.50s/it][2025-06-19 18:45:58,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:45:58,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.21 | bwd_microstep: 3368.95 | bwd_inner_microstep: 3368.05 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.81 [2025-06-19 18:45:58,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.21 | bwd: 3368.96 | bwd_inner: 3368.05 | bwd_allreduce: 0.86 | step: 6.81 33%|███▎ | 3343/10000 [5:16:19<10:11:19, 5.51s/it] {'loss': 0.0191, 'grad_norm': 0.5223246216773987, 'learning_rate': 3.1048040579238766e-05, 'epoch': 3.34} 33%|███▎ | 3343/10000 [5:16:19<10:11:19, 5.51s/it][2025-06-19 18:46:04,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:46:04,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.17 | bwd_microstep: 3365.76 | bwd_inner_microstep: 3364.93 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.81 [2025-06-19 18:46:04,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.17 | bwd: 3365.78 | bwd_inner: 3364.93 | bwd_allreduce: 0.79 | step: 6.82 33%|███▎ | 3344/10000 [5:16:25<10:11:54, 
5.52s/it] {'loss': 0.0305, 'grad_norm': 1.6049754619598389, 'learning_rate': 3.104264049063873e-05, 'epoch': 3.34} 33%|███▎ | 3344/10000 [5:16:25<10:11:54, 5.52s/it][2025-06-19 18:46:09,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:46:09,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.31 | bwd_microstep: 3327.12 | bwd_inner_microstep: 3326.27 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.82 [2025-06-19 18:46:09,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.31 | bwd: 3327.13 | bwd_inner: 3326.27 | bwd_allreduce: 0.82 | step: 6.82 33%|███▎ | 3345/10000 [5:16:30<10:10:37, 5.51s/it] {'loss': 0.0689, 'grad_norm': 2.9094810485839844, 'learning_rate': 3.103723924371695e-05, 'epoch': 3.34} 33%|███▎ | 3345/10000 [5:16:30<10:10:37, 5.51s/it][2025-06-19 18:46:15,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:46:15,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.79 | bwd_microstep: 3321.35 | bwd_inner_microstep: 3320.36 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.57 [2025-06-19 18:46:15,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.79 | bwd: 3321.36 | bwd_inner: 3320.36 | bwd_allreduce: 0.95 | step: 7.57 33%|███▎ | 3346/10000 [5:16:35<10:09:27, 5.50s/it] {'loss': 0.0213, 'grad_norm': 0.7493094801902771, 'learning_rate': 3.103183683904e-05, 'epoch': 3.35} 33%|███▎ | 3346/10000 [5:16:35<10:09:27, 5.50s/it][2025-06-19 18:46:20,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:46:20,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.57 | bwd_microstep: 3319.99 | bwd_inner_microstep: 3319.00 | bwd_allreduce_microstep: 0.93 | 
step_microstep: 7.53 [2025-06-19 18:46:20,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.57 | bwd: 3320.01 | bwd_inner: 3319.00 | bwd_allreduce: 0.95 | step: 7.53 33%|███▎ | 3347/10000 [5:16:41<10:08:32, 5.49s/it] {'loss': 0.1588, 'grad_norm': 3.754507064819336, 'learning_rate': 3.102643327717457e-05, 'epoch': 3.35} 33%|███▎ | 3347/10000 [5:16:41<10:08:32, 5.49s/it][2025-06-19 18:46:26,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:46:26,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.28 | bwd_microstep: 3310.65 | bwd_inner_microstep: 3309.88 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.52 [2025-06-19 18:46:26,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.28 | bwd: 3310.67 | bwd_inner: 3309.88 | bwd_allreduce: 0.75 | step: 6.52 33%|███▎ | 3348/10000 [5:16:46<10:07:12, 5.48s/it] {'loss': 0.0969, 'grad_norm': 3.82338547706604, 'learning_rate': 3.102102855868747e-05, 'epoch': 3.35} 33%|███▎ | 3348/10000 [5:16:46<10:07:12, 5.48s/it][2025-06-19 18:46:31,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:46:31,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.79 | bwd_microstep: 3324.15 | bwd_inner_microstep: 3323.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 18:46:31,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.79 | bwd: 3324.16 | bwd_inner: 3323.36 | bwd_allreduce: 0.76 | step: 6.80 33%|███▎ | 3349/10000 [5:16:52<10:06:45, 5.47s/it] {'loss': 0.0096, 'grad_norm': 0.21372760832309723, 'learning_rate': 3.101562268414561e-05, 'epoch': 3.35} 33%|███▎ | 3349/10000 [5:16:52<10:06:45, 5.47s/it][2025-06-19 18:46:37,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.51 | optimizer_step: 2.73 [2025-06-19 18:46:37,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.80 | bwd_microstep: 3316.87 | bwd_inner_microstep: 3316.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 18:46:37,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.80 | bwd: 3316.89 | bwd_inner: 3316.08 | bwd_allreduce: 0.77 | step: 6.93 34%|███▎ | 3350/10000 [5:16:57<10:06:03, 5.47s/it] {'loss': 0.0614, 'grad_norm': 1.2140556573867798, 'learning_rate': 3.1010215654116075e-05, 'epoch': 3.35} 34%|███▎ | 3350/10000 [5:16:57<10:06:03, 5.47s/it][2025-06-19 18:46:42,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:46:42,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.20 | bwd_microstep: 3368.50 | bwd_inner_microstep: 3367.44 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.30 [2025-06-19 18:46:42,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.19 | bwd: 3368.52 | bwd_inner: 3367.44 | bwd_allreduce: 1.03 | step: 7.31 34%|███▎ | 3351/10000 [5:17:03<10:08:05, 5.49s/it] {'loss': 0.0182, 'grad_norm': 1.0947840213775635, 'learning_rate': 3.1004807469166004e-05, 'epoch': 3.35} 34%|███▎ | 3351/10000 [5:17:03<10:08:05, 5.49s/it][2025-06-19 18:46:48,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:46:48,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.49 | bwd_microstep: 3365.02 | bwd_inner_microstep: 3364.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 18:46:48,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.49 | bwd: 3365.03 | bwd_inner: 3364.24 | bwd_allreduce: 0.75 | step: 6.55 34%|███▎ | 3352/10000 [5:17:08<10:09:27, 5.50s/it] {'loss': 0.1264, 'grad_norm': 3.210862159729004, 
'learning_rate': 3.099939812986271e-05, 'epoch': 3.35} 34%|███▎ | 3352/10000 [5:17:08<10:09:27, 5.50s/it][2025-06-19 18:46:53,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:46:53,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.00 | bwd_microstep: 3365.41 | bwd_inner_microstep: 3364.60 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.91 [2025-06-19 18:46:53,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.00 | bwd: 3365.43 | bwd_inner: 3364.60 | bwd_allreduce: 0.79 | step: 6.91 34%|███▎ | 3353/10000 [5:17:14<10:10:28, 5.51s/it] {'loss': 0.0996, 'grad_norm': 1.8049708604812622, 'learning_rate': 3.09939876367736e-05, 'epoch': 3.35} 34%|███▎ | 3353/10000 [5:17:14<10:10:28, 5.51s/it][2025-06-19 18:46:59,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:46:59,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.98 | bwd_microstep: 3367.44 | bwd_inner_microstep: 3366.64 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 18:46:59,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.98 | bwd: 3367.45 | bwd_inner: 3366.64 | bwd_allreduce: 0.77 | step: 7.05 34%|███▎ | 3354/10000 [5:17:19<10:11:09, 5.52s/it] {'loss': 0.0151, 'grad_norm': 0.8569462299346924, 'learning_rate': 3.0988575990466215e-05, 'epoch': 3.35} 34%|███▎ | 3354/10000 [5:17:19<10:11:09, 5.52s/it][2025-06-19 18:47:04,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:47:04,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.50 | bwd_microstep: 3359.18 | bwd_inner_microstep: 3358.21 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.74 [2025-06-19 18:47:04,658] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.50 | bwd: 3359.20 | bwd_inner: 3358.21 | bwd_allreduce: 0.92 | step: 7.74 34%|███▎ | 3355/10000 [5:17:25<10:11:30, 5.52s/it] {'loss': 0.0082, 'grad_norm': 0.3485162854194641, 'learning_rate': 3.0983163191508204e-05, 'epoch': 3.35} 34%|███▎ | 3355/10000 [5:17:25<10:11:30, 5.52s/it][2025-06-19 18:47:10,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.86 [2025-06-19 18:47:10,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.40 | bwd_microstep: 3310.09 | bwd_inner_microstep: 3309.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 18:47:10,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.40 | bwd: 3310.10 | bwd_inner: 3309.28 | bwd_allreduce: 0.78 | step: 7.20 34%|███▎ | 3356/10000 [5:17:30<10:09:06, 5.50s/it] {'loss': 0.1066, 'grad_norm': 1.1408718824386597, 'learning_rate': 3.097774924046735e-05, 'epoch': 3.36} 34%|███▎ | 3356/10000 [5:17:30<10:09:06, 5.50s/it][2025-06-19 18:47:15,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:47:15,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.34 | bwd_microstep: 3305.36 | bwd_inner_microstep: 3304.58 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.59 [2025-06-19 18:47:15,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.34 | bwd: 3305.37 | bwd_inner: 3304.58 | bwd_allreduce: 0.75 | step: 6.59 34%|███▎ | 3357/10000 [5:17:36<10:07:04, 5.48s/it] {'loss': 0.0049, 'grad_norm': 0.24265791475772858, 'learning_rate': 3.097233413791155e-05, 'epoch': 3.36} 34%|███▎ | 3357/10000 [5:17:36<10:07:04, 5.48s/it][2025-06-19 18:47:21,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 
18:47:21,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.16 | bwd_microstep: 3368.04 | bwd_inner_microstep: 3367.10 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.47 [2025-06-19 18:47:21,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.16 | bwd: 3368.05 | bwd_inner: 3367.10 | bwd_allreduce: 0.90 | step: 7.47 34%|███▎ | 3358/10000 [5:17:41<10:08:29, 5.50s/it] {'loss': 0.3441, 'grad_norm': 165.00628662109375, 'learning_rate': 3.0966917884408815e-05, 'epoch': 3.36} 34%|███▎ | 3358/10000 [5:17:41<10:08:29, 5.50s/it][2025-06-19 18:47:26,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:47:26,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.75 | bwd_microstep: 3314.84 | bwd_inner_microstep: 3313.71 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.20 [2025-06-19 18:47:26,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.75 | bwd: 3314.86 | bwd_inner: 3313.71 | bwd_allreduce: 1.09 | step: 7.20 34%|███▎ | 3359/10000 [5:17:47<10:07:14, 5.49s/it] {'loss': 0.1228, 'grad_norm': 1.9464153051376343, 'learning_rate': 3.0961500480527305e-05, 'epoch': 3.36} 34%|███▎ | 3359/10000 [5:17:47<10:07:14, 5.49s/it][2025-06-19 18:47:32,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:47:32,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.80 | bwd_microstep: 3374.25 | bwd_inner_microstep: 3373.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 18:47:32,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.80 | bwd: 3374.27 | bwd_inner: 3373.44 | bwd_allreduce: 0.78 | step: 6.94 34%|███▎ | 3360/10000 [5:17:52<10:09:45, 5.51s/it] {'loss': 0.2125, 'grad_norm': 2.6004843711853027, 'learning_rate': 3.095608192683526e-05, 'epoch': 
3.36} 34%|███▎ | 3360/10000 [5:17:52<10:09:45, 5.51s/it][2025-06-19 18:47:37,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:47:37,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.56 | bwd_microstep: 3366.24 | bwd_inner_microstep: 3365.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 18:47:37,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.56 | bwd: 3366.25 | bwd_inner: 3365.43 | bwd_allreduce: 0.78 | step: 7.17 34%|███▎ | 3361/10000 [5:17:58<10:10:19, 5.52s/it] {'loss': 0.0055, 'grad_norm': 0.16468703746795654, 'learning_rate': 3.0950662223901075e-05, 'epoch': 3.36} 34%|███▎ | 3361/10000 [5:17:58<10:10:19, 5.52s/it][2025-06-19 18:47:43,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:47:43,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.10 | bwd_microstep: 3316.88 | bwd_inner_microstep: 3316.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 18:47:43,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.10 | bwd: 3316.89 | bwd_inner: 3316.08 | bwd_allreduce: 0.77 | step: 6.75 34%|███▎ | 3362/10000 [5:18:03<10:08:07, 5.50s/it] {'loss': 0.0411, 'grad_norm': 3.1857876777648926, 'learning_rate': 3.094524137229325e-05, 'epoch': 3.36} 34%|███▎ | 3362/10000 [5:18:03<10:08:07, 5.50s/it][2025-06-19 18:47:48,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:47:48,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.81 | bwd_microstep: 3321.48 | bwd_inner_microstep: 3320.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 18:47:48,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2101.81 | bwd: 3321.49 | bwd_inner: 3320.67 | bwd_allreduce: 0.77 | step: 7.05 34%|███▎ | 3363/10000 [5:18:09<10:06:51, 5.49s/it] {'loss': 0.1136, 'grad_norm': 2.496755599975586, 'learning_rate': 3.0939819372580395e-05, 'epoch': 3.36} 34%|███▎ | 3363/10000 [5:18:09<10:06:51, 5.49s/it][2025-06-19 18:47:54,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:47:54,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.39 | bwd_microstep: 3362.95 | bwd_inner_microstep: 3362.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-19 18:47:54,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.39 | bwd: 3362.96 | bwd_inner: 3362.15 | bwd_allreduce: 0.77 | step: 7.01 34%|███▎ | 3364/10000 [5:18:14<10:08:37, 5.50s/it] {'loss': 0.0508, 'grad_norm': 1.8323733806610107, 'learning_rate': 3.0934396225331264e-05, 'epoch': 3.36} 34%|███▎ | 3364/10000 [5:18:14<10:08:37, 5.50s/it][2025-06-19 18:47:59,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:47:59,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.48 | bwd_microstep: 3311.16 | bwd_inner_microstep: 3310.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 18:47:59,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.48 | bwd: 3311.17 | bwd_inner: 3310.36 | bwd_allreduce: 0.77 | step: 6.71 34%|███▎ | 3365/10000 [5:18:20<10:06:48, 5.49s/it] {'loss': 0.0106, 'grad_norm': 0.46784570813179016, 'learning_rate': 3.092897193111472e-05, 'epoch': 3.37} 34%|███▎ | 3365/10000 [5:18:20<10:06:48, 5.49s/it][2025-06-19 18:48:05,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:48:05,006] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2101.79 | bwd_microstep: 3318.15 | bwd_inner_microstep: 3317.37 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 18:48:05,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.79 | bwd: 3318.17 | bwd_inner: 3317.37 | bwd_allreduce: 0.76 | step: 6.68 34%|███▎ | 3366/10000 [5:18:25<10:05:41, 5.48s/it] {'loss': 0.0415, 'grad_norm': 1.3697954416275024, 'learning_rate': 3.092354649049973e-05, 'epoch': 3.37} 34%|███▎ | 3366/10000 [5:18:25<10:05:41, 5.48s/it][2025-06-19 18:48:10,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:48:10,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.24 | bwd_microstep: 3309.58 | bwd_inner_microstep: 3308.61 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.11 [2025-06-19 18:48:10,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.24 | bwd: 3309.59 | bwd_inner: 3308.62 | bwd_allreduce: 0.93 | step: 7.11 34%|███▎ | 3367/10000 [5:18:31<10:04:28, 5.47s/it] {'loss': 0.0707, 'grad_norm': 2.159486770629883, 'learning_rate': 3.091811990405543e-05, 'epoch': 3.37} 34%|███▎ | 3367/10000 [5:18:31<10:04:28, 5.47s/it][2025-06-19 18:48:15,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.74 [2025-06-19 18:48:15,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.51 | bwd_microstep: 3367.98 | bwd_inner_microstep: 3366.97 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.12 [2025-06-19 18:48:15,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.51 | bwd: 3368.00 | bwd_inner: 3366.97 | bwd_allreduce: 0.98 | step: 7.13 34%|███▎ | 3368/10000 [5:18:36<10:06:58, 5.49s/it] {'loss': 0.0333, 'grad_norm': 0.7480292916297913, 'learning_rate': 3.0912692172351016e-05, 'epoch': 3.37} 34%|███▎ | 3368/10000 [5:18:36<10:06:58, 
5.49s/it][2025-06-19 18:48:21,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:48:21,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.33 | bwd_microstep: 3363.83 | bwd_inner_microstep: 3362.79 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.46 [2025-06-19 18:48:21,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.33 | bwd: 3363.85 | bwd_inner: 3362.79 | bwd_allreduce: 1.00 | step: 7.46 34%|███▎ | 3369/10000 [5:18:42<10:08:15, 5.50s/it] {'loss': 0.0332, 'grad_norm': 0.9724850654602051, 'learning_rate': 3.090726329595584e-05, 'epoch': 3.37} 34%|███▎ | 3369/10000 [5:18:42<10:08:15, 5.50s/it][2025-06-19 18:48:26,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:48:26,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.28 | bwd_microstep: 3311.70 | bwd_inner_microstep: 3310.91 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 18:48:26,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.28 | bwd: 3311.71 | bwd_inner: 3310.91 | bwd_allreduce: 0.76 | step: 6.65 34%|███▎ | 3370/10000 [5:18:47<10:06:26, 5.49s/it] {'loss': 0.1197, 'grad_norm': 2.121659994125366, 'learning_rate': 3.0901833275439366e-05, 'epoch': 3.37} 34%|███▎ | 3370/10000 [5:18:47<10:06:26, 5.49s/it][2025-06-19 18:48:32,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:48:32,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.86 | bwd_microstep: 3316.58 | bwd_inner_microstep: 3315.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 18:48:32,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.86 | bwd: 3316.60 | bwd_inner: 3315.80 | 
bwd_allreduce: 0.76 | step: 6.64 34%|███▎ | 3371/10000 [5:18:53<10:05:30, 5.48s/it] {'loss': 0.0131, 'grad_norm': 0.39349740743637085, 'learning_rate': 3.089640211137118e-05, 'epoch': 3.37} 34%|███▎ | 3371/10000 [5:18:53<10:05:30, 5.48s/it][2025-06-19 18:48:37,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 18:48:37,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.81 | bwd_microstep: 3314.36 | bwd_inner_microstep: 3313.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 18:48:37,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.81 | bwd: 3314.38 | bwd_inner: 3313.58 | bwd_allreduce: 0.76 | step: 6.58 34%|███▎ | 3372/10000 [5:18:58<10:04:29, 5.47s/it] {'loss': 0.0637, 'grad_norm': 1.9644607305526733, 'learning_rate': 3.089096980432099e-05, 'epoch': 3.37} 34%|███▎ | 3372/10000 [5:18:58<10:04:29, 5.47s/it][2025-06-19 18:48:43,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:48:43,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.19 | bwd_microstep: 3326.04 | bwd_inner_microstep: 3325.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 18:48:43,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.19 | bwd: 3326.06 | bwd_inner: 3325.24 | bwd_allreduce: 0.78 | step: 7.06 34%|███▎ | 3373/10000 [5:19:04<10:04:38, 5.47s/it] {'loss': 0.0278, 'grad_norm': 1.0393849611282349, 'learning_rate': 3.0885536354858605e-05, 'epoch': 3.37} 34%|███▎ | 3373/10000 [5:19:04<10:04:38, 5.47s/it][2025-06-19 18:48:48,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:48:48,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.60 | bwd_microstep: 
3365.47 | bwd_inner_microstep: 3364.64 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.94 [2025-06-19 18:48:48,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.60 | bwd: 3365.48 | bwd_inner: 3364.64 | bwd_allreduce: 0.80 | step: 6.94 34%|███▎ | 3374/10000 [5:19:09<10:06:46, 5.49s/it] {'loss': 0.0317, 'grad_norm': 0.8666461706161499, 'learning_rate': 3.088010176355398e-05, 'epoch': 3.37} 34%|███▎ | 3374/10000 [5:19:09<10:06:46, 5.49s/it][2025-06-19 18:48:54,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:48:54,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.29 | bwd_microstep: 3318.03 | bwd_inner_microstep: 3317.10 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.98 [2025-06-19 18:48:54,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.29 | bwd: 3318.04 | bwd_inner: 3317.10 | bwd_allreduce: 0.90 | step: 6.98 34%|███▍ | 3375/10000 [5:19:15<10:05:32, 5.48s/it] {'loss': 0.088, 'grad_norm': 2.119910717010498, 'learning_rate': 3.087466603097717e-05, 'epoch': 3.38} 34%|███▍ | 3375/10000 [5:19:15<10:05:32, 5.48s/it][2025-06-19 18:48:59,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:48:59,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.16 | bwd_microstep: 3323.22 | bwd_inner_microstep: 3322.40 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.83 [2025-06-19 18:48:59,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.16 | bwd: 3323.23 | bwd_inner: 3322.40 | bwd_allreduce: 0.79 | step: 6.83 34%|███▍ | 3376/10000 [5:19:20<10:05:11, 5.48s/it] {'loss': 0.0777, 'grad_norm': 1.5896655321121216, 'learning_rate': 3.086922915769838e-05, 'epoch': 3.38} 34%|███▍ | 3376/10000 [5:19:20<10:05:11, 5.48s/it][2025-06-19 18:49:05,394] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:49:05,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.12 | bwd_microstep: 3373.13 | bwd_inner_microstep: 3372.31 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 18:49:05,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.12 | bwd: 3373.14 | bwd_inner: 3372.31 | bwd_allreduce: 0.79 | step: 7.29 34%|███▍ | 3377/10000 [5:19:26<10:07:06, 5.50s/it] {'loss': 0.0373, 'grad_norm': 1.8027348518371582, 'learning_rate': 3.0863791144287885e-05, 'epoch': 3.38} 34%|███▍ | 3377/10000 [5:19:26<10:07:06, 5.50s/it][2025-06-19 18:49:10,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:49:10,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.81 | bwd_microstep: 3322.33 | bwd_inner_microstep: 3321.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 18:49:10,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.81 | bwd: 3322.35 | bwd_inner: 3321.55 | bwd_allreduce: 0.75 | step: 6.63 34%|███▍ | 3378/10000 [5:19:31<10:06:00, 5.49s/it] {'loss': 0.1398, 'grad_norm': 2.8156750202178955, 'learning_rate': 3.085835199131612e-05, 'epoch': 3.38} 34%|███▍ | 3378/10000 [5:19:31<10:06:00, 5.49s/it][2025-06-19 18:49:16,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:49:16,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.20 | bwd_microstep: 3380.47 | bwd_inner_microstep: 3379.27 | bwd_allreduce_microstep: 1.14 | step_microstep: 7.75 [2025-06-19 18:49:16,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.20 | bwd: 3380.49 | bwd_inner: 3379.27 | bwd_allreduce: 1.17 | step: 7.76 34%|███▍ | 
3379/10000 [5:19:37<10:08:16, 5.51s/it] {'loss': 0.0358, 'grad_norm': 0.9931831955909729, 'learning_rate': 3.085291169935363e-05, 'epoch': 3.38} 34%|███▍ | 3379/10000 [5:19:37<10:08:16, 5.51s/it][2025-06-19 18:49:21,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:49:21,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.69 | bwd_microstep: 3318.22 | bwd_inner_microstep: 3317.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.58 [2025-06-19 18:49:21,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.69 | bwd: 3318.24 | bwd_inner: 3317.43 | bwd_allreduce: 0.76 | step: 6.59 34%|███▍ | 3380/10000 [5:19:42<10:06:37, 5.50s/it] {'loss': 0.0664, 'grad_norm': 2.7153496742248535, 'learning_rate': 3.0847470268971074e-05, 'epoch': 3.38} 34%|███▍ | 3380/10000 [5:19:42<10:06:37, 5.50s/it][2025-06-19 18:49:27,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 18:49:27,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.02 | bwd_microstep: 3316.28 | bwd_inner_microstep: 3315.37 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.43 [2025-06-19 18:49:27,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.02 | bwd: 3316.30 | bwd_inner: 3315.37 | bwd_allreduce: 0.88 | step: 7.43 34%|███▍ | 3381/10000 [5:19:48<10:05:13, 5.49s/it] {'loss': 0.0215, 'grad_norm': 0.8904911279678345, 'learning_rate': 3.0842027700739225e-05, 'epoch': 3.38} 34%|███▍ | 3381/10000 [5:19:48<10:05:13, 5.49s/it][2025-06-19 18:49:32,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:49:32,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.10 | bwd_microstep: 3326.21 | bwd_inner_microstep: 3325.41 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 18:49:32,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.10 | bwd: 3326.22 | bwd_inner: 3325.41 | bwd_allreduce: 0.77 | step: 6.82 34%|███▍ | 3382/10000 [5:19:53<10:04:43, 5.48s/it] {'loss': 0.01, 'grad_norm': 0.24873219430446625, 'learning_rate': 3.0836583995228994e-05, 'epoch': 3.38} 34%|███▍ | 3382/10000 [5:19:53<10:04:43, 5.48s/it][2025-06-19 18:49:38,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:49:38,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.41 | bwd_microstep: 3326.63 | bwd_inner_microstep: 3325.81 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.33 [2025-06-19 18:49:38,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.41 | bwd: 3326.65 | bwd_inner: 3325.81 | bwd_allreduce: 0.79 | step: 7.34 34%|███▍ | 3383/10000 [5:19:59<10:04:18, 5.48s/it] {'loss': 0.0296, 'grad_norm': 0.8165910243988037, 'learning_rate': 3.083113915301139e-05, 'epoch': 3.38} 34%|███▍ | 3383/10000 [5:19:59<10:04:18, 5.48s/it][2025-06-19 18:49:43,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:49:43,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.80 | bwd_microstep: 3323.61 | bwd_inner_microstep: 3322.77 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.82 [2025-06-19 18:49:43,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.80 | bwd: 3323.62 | bwd_inner: 3322.77 | bwd_allreduce: 0.81 | step: 6.83 34%|███▍ | 3384/10000 [5:20:04<10:03:57, 5.48s/it] {'loss': 0.0474, 'grad_norm': 2.1039083003997803, 'learning_rate': 3.082569317465756e-05, 'epoch': 3.38} 34%|███▍ | 3384/10000 [5:20:04<10:03:57, 5.48s/it][2025-06-19 18:49:49,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:49:49,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.81 | bwd_microstep: 3379.79 | bwd_inner_microstep: 3378.97 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 18:49:49,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.81 | bwd: 3379.81 | bwd_inner: 3378.97 | bwd_allreduce: 0.79 | step: 7.29 34%|███▍ | 3385/10000 [5:20:10<10:06:11, 5.50s/it] {'loss': 0.0257, 'grad_norm': 0.8944030404090881, 'learning_rate': 3.0820246060738774e-05, 'epoch': 3.38} 34%|███▍ | 3385/10000 [5:20:10<10:06:11, 5.50s/it][2025-06-19 18:49:54,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:49:54,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.87 | bwd_microstep: 3332.17 | bwd_inner_microstep: 3331.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 18:49:54,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.87 | bwd: 3332.19 | bwd_inner: 3331.38 | bwd_allreduce: 0.76 | step: 6.75 34%|███▍ | 3386/10000 [5:20:15<10:05:22, 5.49s/it] {'loss': 0.0713, 'grad_norm': 2.6391239166259766, 'learning_rate': 3.0814797811826375e-05, 'epoch': 3.39} 34%|███▍ | 3386/10000 [5:20:15<10:05:22, 5.49s/it][2025-06-19 18:50:00,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:50:00,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.90 | bwd_microstep: 3332.75 | bwd_inner_microstep: 3331.93 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.26 [2025-06-19 18:50:00,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.90 | bwd: 3332.77 | bwd_inner: 3331.93 | bwd_allreduce: 0.79 | step: 7.26 34%|███▍ | 3387/10000 [5:20:21<10:04:48, 5.49s/it] {'loss': 
0.0604, 'grad_norm': 1.7066857814788818, 'learning_rate': 3.080934842849189e-05, 'epoch': 3.39} 34%|███▍ | 3387/10000 [5:20:21<10:04:48, 5.49s/it][2025-06-19 18:50:05,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:50:05,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.16 | bwd_microstep: 3323.48 | bwd_inner_microstep: 3322.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 18:50:05,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.16 | bwd: 3323.49 | bwd_inner: 3322.68 | bwd_allreduce: 0.77 | step: 6.72 34%|███▍ | 3388/10000 [5:20:26<10:04:23, 5.48s/it] {'loss': 0.0222, 'grad_norm': 0.7779116034507751, 'learning_rate': 3.080389791130692e-05, 'epoch': 3.39} 34%|███▍ | 3388/10000 [5:20:26<10:04:23, 5.48s/it][2025-06-19 18:50:11,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:50:11,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.20 | bwd_microstep: 3376.54 | bwd_inner_microstep: 3375.72 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.29 [2025-06-19 18:50:11,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.20 | bwd: 3376.56 | bwd_inner: 3375.72 | bwd_allreduce: 0.80 | step: 7.29 34%|███▍ | 3389/10000 [5:20:32<10:06:28, 5.50s/it] {'loss': 0.0357, 'grad_norm': 1.074419379234314, 'learning_rate': 3.0798446260843205e-05, 'epoch': 3.39} 34%|███▍ | 3389/10000 [5:20:32<10:06:28, 5.50s/it][2025-06-19 18:50:16,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:50:16,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.70 | bwd_microstep: 3329.69 | bwd_inner_microstep: 3328.86 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.87 
[2025-06-19 18:50:16,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.70 | bwd: 3329.71 | bwd_inner: 3328.86 | bwd_allreduce: 0.79 | step: 6.87 34%|███▍ | 3390/10000 [5:20:37<10:05:33, 5.50s/it] {'loss': 0.0267, 'grad_norm': 1.1516259908676147, 'learning_rate': 3.0792993477672585e-05, 'epoch': 3.39} 34%|███▍ | 3390/10000 [5:20:37<10:05:33, 5.50s/it][2025-06-19 18:50:22,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:50:22,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.16 | bwd_microstep: 3387.65 | bwd_inner_microstep: 3386.83 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.09 [2025-06-19 18:50:22,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.16 | bwd: 3387.66 | bwd_inner: 3386.83 | bwd_allreduce: 0.79 | step: 7.09 34%|███▍ | 3391/10000 [5:20:43<10:07:45, 5.52s/it] {'loss': 0.0256, 'grad_norm': 0.7304566502571106, 'learning_rate': 3.078753956236705e-05, 'epoch': 3.39} 34%|███▍ | 3391/10000 [5:20:43<10:07:45, 5.52s/it][2025-06-19 18:50:27,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:50:27,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.68 | bwd_microstep: 3376.28 | bwd_inner_microstep: 3375.39 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.80 [2025-06-19 18:50:27,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.68 | bwd: 3376.30 | bwd_inner: 3375.39 | bwd_allreduce: 0.86 | step: 6.81 34%|███▍ | 3392/10000 [5:20:48<10:08:39, 5.53s/it] {'loss': 0.1583, 'grad_norm': 2.250382900238037, 'learning_rate': 3.0782084515498676e-05, 'epoch': 3.39} 34%|███▍ | 3392/10000 [5:20:48<10:08:39, 5.53s/it][2025-06-19 18:50:33,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | 
optimizer_step: 2.73 [2025-06-19 18:50:33,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.06 | bwd_microstep: 3332.28 | bwd_inner_microstep: 3331.25 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.67 [2025-06-19 18:50:33,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.06 | bwd: 3332.29 | bwd_inner: 3331.25 | bwd_allreduce: 0.99 | step: 7.68 34%|███▍ | 3393/10000 [5:20:54<10:07:00, 5.51s/it] {'loss': 0.0311, 'grad_norm': 0.8585457801818848, 'learning_rate': 3.0776628337639675e-05, 'epoch': 3.39} 34%|███▍ | 3393/10000 [5:20:54<10:07:00, 5.51s/it][2025-06-19 18:50:38,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:50:38,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.18 | bwd_microstep: 3328.18 | bwd_inner_microstep: 3327.28 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.00 [2025-06-19 18:50:38,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.18 | bwd: 3328.20 | bwd_inner: 3327.28 | bwd_allreduce: 0.87 | step: 7.00 34%|███▍ | 3394/10000 [5:20:59<10:06:03, 5.50s/it] {'loss': 0.0912, 'grad_norm': 2.8981051445007324, 'learning_rate': 3.0771171029362385e-05, 'epoch': 3.39} 34%|███▍ | 3394/10000 [5:20:59<10:06:03, 5.50s/it][2025-06-19 18:50:44,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:50:44,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.44 | bwd_microstep: 3381.24 | bwd_inner_microstep: 3380.42 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-19 18:50:44,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.44 | bwd: 3381.25 | bwd_inner: 3380.42 | bwd_allreduce: 0.79 | step: 7.23 34%|███▍ | 3395/10000 [5:21:05<10:07:37, 5.52s/it] {'loss': 0.1479, 'grad_norm': 2.9165842533111572, 
'learning_rate': 3.0765712591239245e-05, 'epoch': 3.4} 34%|███▍ | 3395/10000 [5:21:05<10:07:37, 5.52s/it][2025-06-19 18:50:50,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:50:50,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.02 | bwd_microstep: 3409.68 | bwd_inner_microstep: 3408.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 18:50:50,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.02 | bwd: 3409.70 | bwd_inner: 3408.87 | bwd_allreduce: 0.78 | step: 6.97 34%|███▍ | 3396/10000 [5:21:10<10:10:15, 5.54s/it] {'loss': 0.0335, 'grad_norm': 1.39617919921875, 'learning_rate': 3.076025302384281e-05, 'epoch': 3.4} 34%|███▍ | 3396/10000 [5:21:10<10:10:15, 5.54s/it][2025-06-19 18:50:55,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:50:55,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.72 | bwd_microstep: 3404.14 | bwd_inner_microstep: 3403.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 18:50:55,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.72 | bwd: 3404.15 | bwd_inner: 3403.34 | bwd_allreduce: 0.77 | step: 7.19 34%|███▍ | 3397/10000 [5:21:16<10:11:46, 5.56s/it] {'loss': 0.0954, 'grad_norm': 2.351620674133301, 'learning_rate': 3.075479232774578e-05, 'epoch': 3.4} 34%|███▍ | 3397/10000 [5:21:16<10:11:46, 5.56s/it][2025-06-19 18:51:01,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:51:01,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.44 | bwd_microstep: 3329.58 | bwd_inner_microstep: 3328.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 18:51:01,083] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.44 | bwd: 3329.60 | bwd_inner: 3328.79 | bwd_allreduce: 0.76 | step: 6.67 34%|███▍ | 3398/10000 [5:21:21<10:08:56, 5.53s/it] {'loss': 0.0069, 'grad_norm': 0.193309485912323, 'learning_rate': 3.074933050352095e-05, 'epoch': 3.4} 34%|███▍ | 3398/10000 [5:21:21<10:08:56, 5.53s/it][2025-06-19 18:51:06,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:51:06,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.30 | bwd_microstep: 3321.75 | bwd_inner_microstep: 3320.92 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.39 [2025-06-19 18:51:06,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.30 | bwd: 3321.76 | bwd_inner: 3320.92 | bwd_allreduce: 0.79 | step: 7.39 34%|███▍ | 3399/10000 [5:21:27<10:06:46, 5.52s/it] {'loss': 0.0614, 'grad_norm': 2.372499704360962, 'learning_rate': 3.0743867551741235e-05, 'epoch': 3.4} 34%|███▍ | 3399/10000 [5:21:27<10:06:46, 5.52s/it][2025-06-19 18:51:12,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 18:51:12,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.98 | bwd_microstep: 3335.91 | bwd_inner_microstep: 3334.90 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.37 [2025-06-19 18:51:12,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.98 | bwd: 3335.92 | bwd_inner: 3334.90 | bwd_allreduce: 0.98 | step: 7.37 34%|███▍ | 3400/10000 [5:21:32<10:05:52, 5.51s/it] {'loss': 0.0145, 'grad_norm': 0.47097641229629517, 'learning_rate': 3.073840347297968e-05, 'epoch': 3.4} 34%|███▍ | 3400/10000 [5:21:32<10:05:52, 5.51s/it][2025-06-19 18:51:17,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.92 [2025-06-19 18:51:17,611] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.87 | bwd_microstep: 3379.87 | bwd_inner_microstep: 3379.04 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.44 [2025-06-19 18:51:17,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.87 | bwd: 3379.89 | bwd_inner: 3379.04 | bwd_allreduce: 0.79 | step: 7.44 34%|███▍ | 3401/10000 [5:21:38<10:07:44, 5.53s/it] {'loss': 0.0503, 'grad_norm': 2.638115882873535, 'learning_rate': 3.073293826780944e-05, 'epoch': 3.4} 34%|███▍ | 3401/10000 [5:21:38<10:07:44, 5.53s/it][2025-06-19 18:51:23,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:51:23,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.36 | bwd_microstep: 3377.99 | bwd_inner_microstep: 3377.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 18:51:23,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.36 | bwd: 3378.00 | bwd_inner: 3377.20 | bwd_allreduce: 0.76 | step: 6.73 34%|███▍ | 3402/10000 [5:21:43<10:08:32, 5.53s/it] {'loss': 0.0103, 'grad_norm': 0.2976968288421631, 'learning_rate': 3.0727471936803783e-05, 'epoch': 3.4} 34%|███▍ | 3402/10000 [5:21:43<10:08:32, 5.53s/it][2025-06-19 18:51:28,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:51:28,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.45 | bwd_microstep: 3329.37 | bwd_inner_microstep: 3328.59 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 18:51:28,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.45 | bwd: 3329.38 | bwd_inner: 3328.59 | bwd_allreduce: 0.75 | step: 6.58 34%|███▍ | 3403/10000 [5:21:49<10:06:29, 5.52s/it] {'loss': 0.0847, 'grad_norm': 2.8220412731170654, 'learning_rate': 3.0722004480536115e-05, 'epoch': 3.4} 34%|███▍ | 
3403/10000 [5:21:49<10:06:29, 5.52s/it][2025-06-19 18:51:34,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:51:34,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.65 | bwd_microstep: 3327.57 | bwd_inner_microstep: 3326.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 18:51:34,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.65 | bwd: 3327.59 | bwd_inner: 3326.77 | bwd_allreduce: 0.78 | step: 7.16 34%|███▍ | 3404/10000 [5:21:54<10:05:13, 5.51s/it] {'loss': 0.0558, 'grad_norm': 2.3621177673339844, 'learning_rate': 3.0716535899579936e-05, 'epoch': 3.4} 34%|███▍ | 3404/10000 [5:21:54<10:05:13, 5.51s/it][2025-06-19 18:51:39,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:51:39,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.11 | bwd_microstep: 3324.46 | bwd_inner_microstep: 3323.62 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.31 [2025-06-19 18:51:39,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.11 | bwd: 3324.48 | bwd_inner: 3323.62 | bwd_allreduce: 0.81 | step: 7.31 34%|███▍ | 3405/10000 [5:22:00<10:04:15, 5.50s/it] {'loss': 0.108, 'grad_norm': 2.28344464302063, 'learning_rate': 3.071106619450888e-05, 'epoch': 3.41} 34%|███▍ | 3405/10000 [5:22:00<10:04:15, 5.50s/it][2025-06-19 18:51:45,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:51:45,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.42 | bwd_microstep: 3329.84 | bwd_inner_microstep: 3329.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 18:51:45,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.42 | bwd: 3329.85 | 
bwd_inner: 3329.05 | bwd_allreduce: 0.76 | step: 6.68 34%|███▍ | 3406/10000 [5:22:05<10:03:26, 5.49s/it] {'loss': 0.0184, 'grad_norm': 0.6079795956611633, 'learning_rate': 3.070559536589669e-05, 'epoch': 3.41} 34%|███▍ | 3406/10000 [5:22:05<10:03:26, 5.49s/it][2025-06-19 18:51:50,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:51:50,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.79 | bwd_microstep: 3317.24 | bwd_inner_microstep: 3316.37 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.96 [2025-06-19 18:51:50,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.79 | bwd: 3317.25 | bwd_inner: 3316.37 | bwd_allreduce: 0.83 | step: 6.97 34%|███▍ | 3407/10000 [5:22:11<10:02:24, 5.48s/it] {'loss': 0.0418, 'grad_norm': 2.3897011280059814, 'learning_rate': 3.070012341431723e-05, 'epoch': 3.41} 34%|███▍ | 3407/10000 [5:22:11<10:02:24, 5.48s/it][2025-06-19 18:51:56,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:51:56,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.22 | bwd_microstep: 3329.05 | bwd_inner_microstep: 3328.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 18:51:56,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.22 | bwd: 3329.07 | bwd_inner: 3328.27 | bwd_allreduce: 0.76 | step: 6.60 34%|███▍ | 3408/10000 [5:22:16<10:02:02, 5.48s/it] {'loss': 0.0633, 'grad_norm': 1.4625223875045776, 'learning_rate': 3.069465034034449e-05, 'epoch': 3.41} 34%|███▍ | 3408/10000 [5:22:16<10:02:02, 5.48s/it][2025-06-19 18:52:01,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:52:01,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2106.96 | bwd_microstep: 3328.65 | bwd_inner_microstep: 3327.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 18:52:01,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.96 | bwd: 3328.67 | bwd_inner: 3327.85 | bwd_allreduce: 0.77 | step: 7.04 34%|███▍ | 3409/10000 [5:22:22<10:01:43, 5.48s/it] {'loss': 0.2604, 'grad_norm': 2.9700796604156494, 'learning_rate': 3.068917614455256e-05, 'epoch': 3.41} 34%|███▍ | 3409/10000 [5:22:22<10:01:43, 5.48s/it][2025-06-19 18:52:06,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:52:06,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.16 | bwd_microstep: 3325.64 | bwd_inner_microstep: 3324.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 18:52:06,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.16 | bwd: 3325.65 | bwd_inner: 3324.86 | bwd_allreduce: 0.75 | step: 6.62 34%|███▍ | 3410/10000 [5:22:27<10:01:46, 5.48s/it] {'loss': 0.0236, 'grad_norm': 0.9963544011116028, 'learning_rate': 3.068370082751567e-05, 'epoch': 3.41} 34%|███▍ | 3410/10000 [5:22:27<10:01:46, 5.48s/it][2025-06-19 18:52:12,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:52:12,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.73 | bwd_microstep: 3320.79 | bwd_inner_microstep: 3320.01 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 18:52:12,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.73 | bwd: 3320.80 | bwd_inner: 3320.01 | bwd_allreduce: 0.75 | step: 6.59 34%|███▍ | 3411/10000 [5:22:33<10:01:03, 5.47s/it] {'loss': 0.0056, 'grad_norm': 0.16581495106220245, 'learning_rate': 3.067822438980813e-05, 'epoch': 3.41} 34%|███▍ | 3411/10000 [5:22:33<10:01:03, 5.47s/it][2025-06-19 18:52:17,891] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:52:17,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.70 | bwd_microstep: 3324.39 | bwd_inner_microstep: 3323.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 18:52:17,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.70 | bwd: 3324.40 | bwd_inner: 3323.60 | bwd_allreduce: 0.76 | step: 6.90 34%|███▍ | 3412/10000 [5:22:38<10:00:47, 5.47s/it] {'loss': 0.0568, 'grad_norm': 2.646207332611084, 'learning_rate': 3.067274683200442e-05, 'epoch': 3.41} 34%|███▍ | 3412/10000 [5:22:38<10:00:47, 5.47s/it][2025-06-19 18:52:23,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 18:52:23,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.27 | bwd_microstep: 3326.39 | bwd_inner_microstep: 3325.50 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.47 [2025-06-19 18:52:23,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.27 | bwd: 3326.41 | bwd_inner: 3325.50 | bwd_allreduce: 0.85 | step: 7.47 34%|███▍ | 3413/10000 [5:22:44<10:00:38, 5.47s/it] {'loss': 0.0947, 'grad_norm': 2.0922343730926514, 'learning_rate': 3.0667268154679115e-05, 'epoch': 3.41} 34%|███▍ | 3413/10000 [5:22:44<10:00:38, 5.47s/it][2025-06-19 18:52:28,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:52:28,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.30 | bwd_microstep: 3376.81 | bwd_inner_microstep: 3375.84 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.29 [2025-06-19 18:52:28,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.30 | bwd: 3376.83 | bwd_inner: 3375.84 | bwd_allreduce: 0.94 | step: 7.29 34%|███▍ 
| 3414/10000 [5:22:49<10:03:38, 5.50s/it] {'loss': 0.0983, 'grad_norm': 1.8783650398254395, 'learning_rate': 3.066178835840687e-05, 'epoch': 3.41} 34%|███▍ | 3414/10000 [5:22:49<10:03:38, 5.50s/it][2025-06-19 18:52:34,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:52:34,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.88 | bwd_microstep: 3325.96 | bwd_inner_microstep: 3325.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 18:52:34,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.88 | bwd: 3325.98 | bwd_inner: 3325.17 | bwd_allreduce: 0.76 | step: 6.75 34%|███▍ | 3415/10000 [5:22:55<10:02:39, 5.49s/it] {'loss': 0.2117, 'grad_norm': 3.461667060852051, 'learning_rate': 3.065630744376252e-05, 'epoch': 3.42} 34%|███▍ | 3415/10000 [5:22:55<10:02:39, 5.49s/it][2025-06-19 18:52:39,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:52:39,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.32 | bwd_microstep: 3367.53 | bwd_inner_microstep: 3366.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 18:52:39,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.32 | bwd: 3367.54 | bwd_inner: 3366.73 | bwd_allreduce: 0.77 | step: 6.83 34%|███▍ | 3416/10000 [5:23:00<10:04:02, 5.50s/it] {'loss': 0.0801, 'grad_norm': 1.537629246711731, 'learning_rate': 3.065082541132098e-05, 'epoch': 3.42} 34%|███▍ | 3416/10000 [5:23:00<10:04:02, 5.50s/it][2025-06-19 18:52:45,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:52:45,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.73 | bwd_microstep: 3317.80 | bwd_inner_microstep: 3317.00 | 
bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 18:52:45,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.73 | bwd: 3317.81 | bwd_inner: 3317.00 | bwd_allreduce: 0.77 | step: 7.18 34%|███▍ | 3417/10000 [5:23:06<10:02:42, 5.49s/it] {'loss': 0.0274, 'grad_norm': 1.1611100435256958, 'learning_rate': 3.064534226165727e-05, 'epoch': 3.42} 34%|███▍ | 3417/10000 [5:23:06<10:02:42, 5.49s/it][2025-06-19 18:52:50,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:52:50,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.68 | bwd_microstep: 3319.13 | bwd_inner_microstep: 3318.20 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.51 [2025-06-19 18:52:50,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.68 | bwd: 3319.14 | bwd_inner: 3318.20 | bwd_allreduce: 0.89 | step: 7.51 34%|███▍ | 3418/10000 [5:23:11<10:01:56, 5.49s/it] {'loss': 0.047, 'grad_norm': 1.277902603149414, 'learning_rate': 3.063985799534658e-05, 'epoch': 3.42} 34%|███▍ | 3418/10000 [5:23:11<10:01:56, 5.49s/it][2025-06-19 18:52:56,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:52:56,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.68 | bwd_microstep: 3371.59 | bwd_inner_microstep: 3370.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 18:52:56,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.68 | bwd: 3371.60 | bwd_inner: 3370.80 | bwd_allreduce: 0.76 | step: 6.74 34%|███▍ | 3419/10000 [5:23:17<10:03:42, 5.50s/it] {'loss': 0.0664, 'grad_norm': 1.982891321182251, 'learning_rate': 3.0634372612964164e-05, 'epoch': 3.42} 34%|███▍ | 3419/10000 [5:23:17<10:03:42, 5.50s/it][2025-06-19 18:53:01,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:53:01,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.06 | bwd_microstep: 3313.69 | bwd_inner_microstep: 3312.85 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.86 [2025-06-19 18:53:01,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.06 | bwd: 3313.71 | bwd_inner: 3312.85 | bwd_allreduce: 0.81 | step: 6.86 34%|███▍ | 3420/10000 [5:23:22<10:01:56, 5.49s/it] {'loss': 0.1178, 'grad_norm': 2.202023506164551, 'learning_rate': 3.062888611508541e-05, 'epoch': 3.42} 34%|███▍ | 3420/10000 [5:23:22<10:01:56, 5.49s/it][2025-06-19 18:53:07,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:53:07,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.78 | bwd_microstep: 3403.50 | bwd_inner_microstep: 3402.51 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.77 [2025-06-19 18:53:07,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.78 | bwd: 3403.52 | bwd_inner: 3402.51 | bwd_allreduce: 0.96 | step: 7.77 34%|███▍ | 3421/10000 [5:23:28<10:04:49, 5.52s/it] {'loss': 0.0716, 'grad_norm': 2.374516487121582, 'learning_rate': 3.062339850228583e-05, 'epoch': 3.42} 34%|███▍ | 3421/10000 [5:23:28<10:04:49, 5.52s/it][2025-06-19 18:53:12,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 18:53:12,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.79 | bwd_microstep: 3327.09 | bwd_inner_microstep: 3326.16 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.09 [2025-06-19 18:53:12,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.79 | bwd: 3327.11 | bwd_inner: 3326.16 | bwd_allreduce: 0.91 | step: 7.10 34%|███▍ | 3422/10000 [5:23:33<10:03:22, 5.50s/it] {'loss': 
0.0598, 'grad_norm': 1.733668327331543, 'learning_rate': 3.061790977514106e-05, 'epoch': 3.42} 34%|███▍ | 3422/10000 [5:23:33<10:03:22, 5.50s/it][2025-06-19 18:53:18,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:53:18,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.17 | bwd_microstep: 3374.29 | bwd_inner_microstep: 3373.47 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-19 18:53:18,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.17 | bwd: 3374.31 | bwd_inner: 3373.47 | bwd_allreduce: 0.79 | step: 7.31 34%|███▍ | 3423/10000 [5:23:39<10:04:45, 5.52s/it] {'loss': 0.0542, 'grad_norm': 1.0712800025939941, 'learning_rate': 3.061241993422684e-05, 'epoch': 3.42} 34%|███▍ | 3423/10000 [5:23:39<10:04:45, 5.52s/it][2025-06-19 18:53:23,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:53:23,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.61 | bwd_microstep: 3321.21 | bwd_inner_microstep: 3320.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 18:53:23,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.61 | bwd: 3321.22 | bwd_inner: 3320.42 | bwd_allreduce: 0.76 | step: 6.77 34%|███▍ | 3424/10000 [5:23:44<10:03:02, 5.50s/it] {'loss': 0.027, 'grad_norm': 1.2529568672180176, 'learning_rate': 3.060692898011901e-05, 'epoch': 3.42} 34%|███▍ | 3424/10000 [5:23:44<10:03:02, 5.50s/it][2025-06-19 18:53:29,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:53:29,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.10 | bwd_microstep: 3368.28 | bwd_inner_microstep: 3367.47 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.33 
[2025-06-19 18:53:29,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.10 | bwd: 3368.30 | bwd_inner: 3367.46 | bwd_allreduce: 0.79 | step: 7.33 34%|███▍ | 3425/10000 [5:23:50<10:03:55, 5.51s/it] {'loss': 0.0123, 'grad_norm': 0.4784323573112488, 'learning_rate': 3.060143691339356e-05, 'epoch': 3.42} 34%|███▍ | 3425/10000 [5:23:50<10:03:55, 5.51s/it][2025-06-19 18:53:34,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:53:34,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.60 | bwd_microstep: 3317.75 | bwd_inner_microstep: 3316.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 18:53:34,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.60 | bwd: 3317.77 | bwd_inner: 3316.94 | bwd_allreduce: 0.78 | step: 6.77 34%|███▍ | 3426/10000 [5:23:55<10:02:03, 5.49s/it] {'loss': 0.0641, 'grad_norm': 1.676806926727295, 'learning_rate': 3.059594373462659e-05, 'epoch': 3.43} 34%|███▍ | 3426/10000 [5:23:55<10:02:03, 5.49s/it][2025-06-19 18:53:40,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:53:40,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.25 | bwd_microstep: 3366.26 | bwd_inner_microstep: 3365.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 18:53:40,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.25 | bwd: 3366.27 | bwd_inner: 3365.46 | bwd_allreduce: 0.76 | step: 6.72 34%|███▍ | 3427/10000 [5:24:01<10:03:11, 5.51s/it] {'loss': 0.0626, 'grad_norm': 2.180283546447754, 'learning_rate': 3.059044944439429e-05, 'epoch': 3.43} 34%|███▍ | 3427/10000 [5:24:01<10:03:11, 5.51s/it][2025-06-19 18:53:45,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | 
optimizer_step: 2.73 [2025-06-19 18:53:45,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.87 | bwd_microstep: 3319.19 | bwd_inner_microstep: 3318.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 18:53:45,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.88 | bwd: 3319.21 | bwd_inner: 3318.38 | bwd_allreduce: 0.79 | step: 7.12 34%|███▍ | 3428/10000 [5:24:06<10:01:32, 5.49s/it] {'loss': 0.0355, 'grad_norm': 1.3245129585266113, 'learning_rate': 3.0584954043272996e-05, 'epoch': 3.43} 34%|███▍ | 3428/10000 [5:24:06<10:01:32, 5.49s/it][2025-06-19 18:53:51,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:53:51,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.97 | bwd_microstep: 3360.64 | bwd_inner_microstep: 3359.84 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 18:53:51,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.97 | bwd: 3360.66 | bwd_inner: 3359.84 | bwd_allreduce: 0.77 | step: 6.78 34%|███▍ | 3429/10000 [5:24:12<10:02:33, 5.50s/it] {'loss': 0.1322, 'grad_norm': 1.9753398895263672, 'learning_rate': 3.057945753183915e-05, 'epoch': 3.43} 34%|███▍ | 3429/10000 [5:24:12<10:02:33, 5.50s/it][2025-06-19 18:53:56,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:53:56,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.96 | bwd_microstep: 3323.97 | bwd_inner_microstep: 3323.07 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.89 [2025-06-19 18:53:56,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.96 | bwd: 3323.98 | bwd_inner: 3323.07 | bwd_allreduce: 0.87 | step: 6.90 34%|███▍ | 3430/10000 [5:24:17<10:01:15, 5.49s/it] {'loss': 0.0679, 'grad_norm': 1.3071006536483765, 'learning_rate': 
3.057395991066931e-05, 'epoch': 3.43} 34%|███▍ | 3430/10000 [5:24:17<10:01:15, 5.49s/it][2025-06-19 18:54:02,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:54:02,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.63 | bwd_microstep: 3365.34 | bwd_inner_microstep: 3364.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 18:54:02,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.63 | bwd: 3365.35 | bwd_inner: 3364.55 | bwd_allreduce: 0.76 | step: 6.64 34%|███▍ | 3431/10000 [5:24:23<10:02:33, 5.50s/it] {'loss': 0.0611, 'grad_norm': 1.310109257698059, 'learning_rate': 3.056846118034015e-05, 'epoch': 3.43} 34%|███▍ | 3431/10000 [5:24:23<10:02:33, 5.50s/it][2025-06-19 18:54:07,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:54:07,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.27 | bwd_microstep: 3313.74 | bwd_inner_microstep: 3312.73 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.92 [2025-06-19 18:54:07,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.27 | bwd: 3313.75 | bwd_inner: 3312.73 | bwd_allreduce: 0.98 | step: 7.92 34%|███▍ | 3432/10000 [5:24:28<10:00:50, 5.49s/it] {'loss': 0.0651, 'grad_norm': 1.4524022340774536, 'learning_rate': 3.056296134142846e-05, 'epoch': 3.43} 34%|███▍ | 3432/10000 [5:24:28<10:00:50, 5.49s/it][2025-06-19 18:54:13,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.80 [2025-06-19 18:54:13,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.71 | bwd_microstep: 3311.97 | bwd_inner_microstep: 3311.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 18:54:13,356] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2099.71 | bwd: 3311.99 | bwd_inner: 3311.18 | bwd_allreduce: 0.76 | step: 6.74 34%|███▍ | 3433/10000 [5:24:34<9:59:37, 5.48s/it] {'loss': 0.0225, 'grad_norm': 0.8642359972000122, 'learning_rate': 3.0557460394511156e-05, 'epoch': 3.43} 34%|███▍ | 3433/10000 [5:24:34<9:59:37, 5.48s/it][2025-06-19 18:54:18,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:54:18,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.02 | bwd_microstep: 3364.01 | bwd_inner_microstep: 3363.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 18:54:18,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.02 | bwd: 3364.02 | bwd_inner: 3363.22 | bwd_allreduce: 0.76 | step: 6.73 34%|███▍ | 3434/10000 [5:24:39<10:00:59, 5.49s/it] {'loss': 0.0166, 'grad_norm': 0.44007769227027893, 'learning_rate': 3.0551958340165254e-05, 'epoch': 3.43} 34%|███▍ | 3434/10000 [5:24:39<10:00:59, 5.49s/it][2025-06-19 18:54:24,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.78 [2025-06-19 18:54:24,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.42 | bwd_microstep: 3325.66 | bwd_inner_microstep: 3324.67 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.55 [2025-06-19 18:54:24,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.42 | bwd: 3325.68 | bwd_inner: 3324.67 | bwd_allreduce: 0.96 | step: 7.56 34%|███▍ | 3435/10000 [5:24:45<10:00:08, 5.48s/it] {'loss': 0.0522, 'grad_norm': 1.2098431587219238, 'learning_rate': 3.054645517896789e-05, 'epoch': 3.44} 34%|███▍ | 3435/10000 [5:24:45<10:00:08, 5.48s/it][2025-06-19 18:54:29,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:54:29,808] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.97 | bwd_microstep: 3316.55 | bwd_inner_microstep: 3315.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.26 [2025-06-19 18:54:29,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.97 | bwd: 3316.57 | bwd_inner: 3315.75 | bwd_allreduce: 0.77 | step: 7.26 34%|███▍ | 3436/10000 [5:24:50<9:59:17, 5.48s/it] {'loss': 0.059, 'grad_norm': 1.2625620365142822, 'learning_rate': 3.054095091149633e-05, 'epoch': 3.44} 34%|███▍ | 3436/10000 [5:24:50<9:59:17, 5.48s/it][2025-06-19 18:54:35,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:54:35,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.38 | bwd_microstep: 3363.95 | bwd_inner_microstep: 3363.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 18:54:35,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.38 | bwd: 3363.97 | bwd_inner: 3363.15 | bwd_allreduce: 0.77 | step: 6.73 34%|███▍ | 3437/10000 [5:24:56<10:00:52, 5.49s/it] {'loss': 0.0351, 'grad_norm': 1.2695515155792236, 'learning_rate': 3.053544553832794e-05, 'epoch': 3.44} 34%|███▍ | 3437/10000 [5:24:56<10:00:52, 5.49s/it][2025-06-19 18:54:40,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:54:40,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.95 | bwd_microstep: 3366.86 | bwd_inner_microstep: 3366.03 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.31 [2025-06-19 18:54:40,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.95 | bwd: 3366.88 | bwd_inner: 3366.03 | bwd_allreduce: 0.80 | step: 7.32 34%|███▍ | 3438/10000 [5:25:01<10:02:06, 5.51s/it] {'loss': 0.0145, 'grad_norm': 0.567937433719635, 'learning_rate': 3.052993906004021e-05, 'epoch': 3.44} 34%|███▍ | 
3438/10000 [5:25:01<10:02:06, 5.51s/it][2025-06-19 18:54:46,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:54:46,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.42 | bwd_microstep: 3319.29 | bwd_inner_microstep: 3318.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 18:54:46,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.42 | bwd: 3319.31 | bwd_inner: 3318.50 | bwd_allreduce: 0.76 | step: 6.67 34%|███▍ | 3439/10000 [5:25:07<10:00:39, 5.49s/it] {'loss': 0.0428, 'grad_norm': 0.9450468420982361, 'learning_rate': 3.052443147721074e-05, 'epoch': 3.44} 34%|███▍ | 3439/10000 [5:25:07<10:00:39, 5.49s/it][2025-06-19 18:54:51,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:54:51,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.72 | bwd_microstep: 3366.36 | bwd_inner_microstep: 3365.40 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.74 [2025-06-19 18:54:51,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.72 | bwd: 3366.38 | bwd_inner: 3365.40 | bwd_allreduce: 0.93 | step: 7.75 34%|███▍ | 3440/10000 [5:25:12<10:01:56, 5.51s/it] {'loss': 0.0557, 'grad_norm': 0.8934339880943298, 'learning_rate': 3.0518922790417255e-05, 'epoch': 3.44} 34%|███▍ | 3440/10000 [5:25:12<10:01:56, 5.51s/it][2025-06-19 18:54:57,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:54:57,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.83 | bwd_microstep: 3371.69 | bwd_inner_microstep: 3370.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 18:54:57,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.83 | bwd: 
3371.71 | bwd_inner: 3370.90 | bwd_allreduce: 0.76 | step: 6.83 34%|███▍ | 3441/10000 [5:25:18<10:03:17, 5.52s/it] {'loss': 0.0314, 'grad_norm': 0.7026512622833252, 'learning_rate': 3.0513413000237597e-05, 'epoch': 3.44} 34%|███▍ | 3441/10000 [5:25:18<10:03:17, 5.52s/it][2025-06-19 18:55:02,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:55:02,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.23 | bwd_microstep: 3362.83 | bwd_inner_microstep: 3362.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-19 18:55:02,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.23 | bwd: 3362.84 | bwd_inner: 3362.01 | bwd_allreduce: 0.78 | step: 7.30 34%|███▍ | 3442/10000 [5:25:23<10:03:45, 5.52s/it] {'loss': 0.0438, 'grad_norm': 1.6129276752471924, 'learning_rate': 3.0507902107249704e-05, 'epoch': 3.44} 34%|███▍ | 3442/10000 [5:25:23<10:03:45, 5.52s/it][2025-06-19 18:55:08,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:55:08,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.03 | bwd_microstep: 3311.12 | bwd_inner_microstep: 3310.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 18:55:08,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.03 | bwd: 3311.13 | bwd_inner: 3310.31 | bwd_allreduce: 0.77 | step: 6.90 34%|███▍ | 3443/10000 [5:25:29<10:01:22, 5.50s/it] {'loss': 0.043, 'grad_norm': 0.9098308682441711, 'learning_rate': 3.0502390112031644e-05, 'epoch': 3.44} 34%|███▍ | 3443/10000 [5:25:29<10:01:22, 5.50s/it][2025-06-19 18:55:13,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:55:13,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2101.70 | bwd_microstep: 3324.78 | bwd_inner_microstep: 3323.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 18:55:13,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.70 | bwd: 3324.80 | bwd_inner: 3323.97 | bwd_allreduce: 0.78 | step: 7.01 34%|███▍ | 3444/10000 [5:25:34<10:00:10, 5.49s/it] {'loss': 0.0302, 'grad_norm': 0.7655925750732422, 'learning_rate': 3.049687701516161e-05, 'epoch': 3.44} 34%|███▍ | 3444/10000 [5:25:34<10:00:10, 5.49s/it][2025-06-19 18:55:19,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:55:19,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.69 | bwd_microstep: 3366.25 | bwd_inner_microstep: 3365.33 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.20 [2025-06-19 18:55:19,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.69 | bwd: 3366.27 | bwd_inner: 3365.33 | bwd_allreduce: 0.90 | step: 7.20 34%|███▍ | 3445/10000 [5:25:40<10:01:26, 5.51s/it] {'loss': 0.0837, 'grad_norm': 2.4739959239959717, 'learning_rate': 3.0491362817217893e-05, 'epoch': 3.44} 34%|███▍ | 3445/10000 [5:25:40<10:01:26, 5.51s/it][2025-06-19 18:55:24,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:55:24,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.96 | bwd_microstep: 3327.18 | bwd_inner_microstep: 3326.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 18:55:24,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.96 | bwd: 3327.20 | bwd_inner: 3326.39 | bwd_allreduce: 0.76 | step: 6.71 34%|███▍ | 3446/10000 [5:25:45<10:00:14, 5.49s/it] {'loss': 0.0156, 'grad_norm': 0.5064905285835266, 'learning_rate': 3.048584751877891e-05, 'epoch': 3.45} 34%|███▍ | 3446/10000 [5:25:45<10:00:14, 5.49s/it][2025-06-19 
18:55:30,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:55:30,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.14 | bwd_microstep: 3318.19 | bwd_inner_microstep: 3317.18 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.60 [2025-06-19 18:55:30,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.14 | bwd: 3318.21 | bwd_inner: 3317.18 | bwd_allreduce: 0.97 | step: 7.60 34%|███▍ | 3447/10000 [5:25:51<9:58:58, 5.48s/it] {'loss': 0.0609, 'grad_norm': 2.2034249305725098, 'learning_rate': 3.048033112042318e-05, 'epoch': 3.45} 34%|███▍ | 3447/10000 [5:25:51<9:58:58, 5.48s/it][2025-06-19 18:55:35,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:55:35,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.27 | bwd_microstep: 3328.83 | bwd_inner_microstep: 3328.00 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.97 [2025-06-19 18:55:35,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.27 | bwd: 3328.84 | bwd_inner: 3328.00 | bwd_allreduce: 0.80 | step: 6.97 34%|███▍ | 3448/10000 [5:25:56<9:58:34, 5.48s/it] {'loss': 0.0221, 'grad_norm': 0.9247041940689087, 'learning_rate': 3.0474813622729364e-05, 'epoch': 3.45} 34%|███▍ | 3448/10000 [5:25:56<9:58:34, 5.48s/it][2025-06-19 18:55:41,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:55:41,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.59 | bwd_microstep: 3376.63 | bwd_inner_microstep: 3375.78 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.85 [2025-06-19 18:55:41,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.59 | bwd: 3376.65 | bwd_inner: 3375.78 | bwd_allreduce: 0.82 | step: 
6.85 34%|███▍ | 3449/10000 [5:26:02<10:00:49, 5.50s/it] {'loss': 0.0154, 'grad_norm': 0.29597097635269165, 'learning_rate': 3.0469295026276212e-05, 'epoch': 3.45} 34%|███▍ | 3449/10000 [5:26:02<10:00:49, 5.50s/it][2025-06-19 18:55:46,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:55:46,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.75 | bwd_microstep: 3373.77 | bwd_inner_microstep: 3372.90 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.94 [2025-06-19 18:55:46,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.75 | bwd: 3373.78 | bwd_inner: 3372.90 | bwd_allreduce: 0.84 | step: 6.95 34%|███▍ | 3450/10000 [5:26:07<10:02:19, 5.52s/it] {'loss': 0.1037, 'grad_norm': 2.2934210300445557, 'learning_rate': 3.04637753316426e-05, 'epoch': 3.45} 34%|███▍ | 3450/10000 [5:26:07<10:02:19, 5.52s/it][2025-06-19 18:55:52,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.89 [2025-06-19 18:55:52,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.52 | bwd_microstep: 3314.30 | bwd_inner_microstep: 3313.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 18:55:52,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.52 | bwd: 3314.31 | bwd_inner: 3313.51 | bwd_allreduce: 0.76 | step: 6.84 35%|███▍ | 3451/10000 [5:26:13<10:00:12, 5.50s/it] {'loss': 0.0189, 'grad_norm': 0.859178364276886, 'learning_rate': 3.0458254539407525e-05, 'epoch': 3.45} 35%|███▍ | 3451/10000 [5:26:13<10:00:12, 5.50s/it][2025-06-19 18:55:57,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:55:57,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.49 | bwd_microstep: 3357.49 | 
bwd_inner_microstep: 3356.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 18:55:57,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.49 | bwd: 3357.50 | bwd_inner: 3356.69 | bwd_allreduce: 0.76 | step: 6.67 35%|███▍ | 3452/10000 [5:26:18<10:00:41, 5.50s/it] {'loss': 0.0288, 'grad_norm': 1.5189627408981323, 'learning_rate': 3.0452732650150084e-05, 'epoch': 3.45} 35%|███▍ | 3452/10000 [5:26:18<10:00:41, 5.50s/it][2025-06-19 18:56:03,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:56:03,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.41 | bwd_microstep: 3377.08 | bwd_inner_microstep: 3376.12 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.30 [2025-06-19 18:56:03,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.41 | bwd: 3377.09 | bwd_inner: 3376.12 | bwd_allreduce: 0.93 | step: 7.30 35%|███▍ | 3453/10000 [5:26:24<10:01:55, 5.52s/it] {'loss': 0.0188, 'grad_norm': 0.8892201781272888, 'learning_rate': 3.04472096644495e-05, 'epoch': 3.45} 35%|███▍ | 3453/10000 [5:26:24<10:01:55, 5.52s/it][2025-06-19 18:56:08,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:56:08,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.39 | bwd_microstep: 3311.44 | bwd_inner_microstep: 3310.61 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.83 [2025-06-19 18:56:08,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.39 | bwd: 3311.46 | bwd_inner: 3310.61 | bwd_allreduce: 0.79 | step: 6.83 35%|███▍ | 3454/10000 [5:26:29<9:59:51, 5.50s/it] {'loss': 0.0344, 'grad_norm': 1.6771349906921387, 'learning_rate': 3.0441685582885115e-05, 'epoch': 3.45} 35%|███▍ | 3454/10000 [5:26:29<9:59:51, 5.50s/it][2025-06-19 18:56:14,432] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 18:56:14,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.96 | bwd_microstep: 3372.13 | bwd_inner_microstep: 3371.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 18:56:14,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.96 | bwd: 3372.14 | bwd_inner: 3371.35 | bwd_allreduce: 0.75 | step: 6.66 35%|███▍ | 3455/10000 [5:26:35<10:01:00, 5.51s/it] {'loss': 0.0787, 'grad_norm': 2.5040950775146484, 'learning_rate': 3.0436160406036362e-05, 'epoch': 3.46} 35%|███▍ | 3455/10000 [5:26:35<10:01:00, 5.51s/it][2025-06-19 18:56:19,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:56:19,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.21 | bwd_microstep: 3312.21 | bwd_inner_microstep: 3311.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-19 18:56:19,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.21 | bwd: 3312.22 | bwd_inner: 3311.41 | bwd_allreduce: 0.77 | step: 6.79 35%|███▍ | 3456/10000 [5:26:40<9:59:10, 5.49s/it] {'loss': 0.042, 'grad_norm': 1.6549522876739502, 'learning_rate': 3.043063413448283e-05, 'epoch': 3.46} 35%|███▍ | 3456/10000 [5:26:40<9:59:10, 5.49s/it][2025-06-19 18:56:25,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:56:25,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.28 | bwd_microstep: 3371.06 | bwd_inner_microstep: 3370.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 18:56:25,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.29 | bwd: 3371.08 | bwd_inner: 3370.26 | bwd_allreduce: 0.77 | step: 6.79 35%|███▍ | 3457/10000 [5:26:46<10:00:13, 
5.50s/it] {'loss': 0.1983, 'grad_norm': 3.1894569396972656, 'learning_rate': 3.0425106768804177e-05, 'epoch': 3.46} 35%|███▍ | 3457/10000 [5:26:46<10:00:13, 5.50s/it][2025-06-19 18:56:30,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:56:30,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.37 | bwd_microstep: 3311.14 | bwd_inner_microstep: 3310.33 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 18:56:30,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.37 | bwd: 3311.15 | bwd_inner: 3310.33 | bwd_allreduce: 0.78 | step: 7.12 35%|███▍ | 3458/10000 [5:26:51<9:58:18, 5.49s/it] {'loss': 0.0476, 'grad_norm': 1.8471767902374268, 'learning_rate': 3.0419578309580214e-05, 'epoch': 3.46} 35%|███▍ | 3458/10000 [5:26:51<9:58:18, 5.49s/it][2025-06-19 18:56:36,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:56:36,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.58 | bwd_microstep: 3318.09 | bwd_inner_microstep: 3317.14 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.06 [2025-06-19 18:56:36,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.58 | bwd: 3318.10 | bwd_inner: 3317.14 | bwd_allreduce: 0.92 | step: 7.06 35%|███▍ | 3459/10000 [5:26:57<9:57:36, 5.48s/it] {'loss': 0.0115, 'grad_norm': 0.8283393979072571, 'learning_rate': 3.0414048757390846e-05, 'epoch': 3.46} 35%|███▍ | 3459/10000 [5:26:57<9:57:36, 5.48s/it][2025-06-19 18:56:41,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:56:41,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.43 | bwd_microstep: 3364.78 | bwd_inner_microstep: 3363.89 | bwd_allreduce_microstep: 0.84 | 
step_microstep: 7.01 [2025-06-19 18:56:41,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.43 | bwd: 3364.79 | bwd_inner: 3363.89 | bwd_allreduce: 0.86 | step: 7.01 35%|███▍ | 3460/10000 [5:27:02<9:59:21, 5.50s/it] {'loss': 0.0228, 'grad_norm': 0.8112232089042664, 'learning_rate': 3.0408518112816092e-05, 'epoch': 3.46} 35%|███▍ | 3460/10000 [5:27:02<9:59:21, 5.50s/it][2025-06-19 18:56:47,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:56:47,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.74 | bwd_microstep: 3312.26 | bwd_inner_microstep: 3311.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.09 [2025-06-19 18:56:47,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.74 | bwd: 3312.28 | bwd_inner: 3311.46 | bwd_allreduce: 0.77 | step: 7.09 35%|███▍ | 3461/10000 [5:27:08<9:57:47, 5.49s/it] {'loss': 0.1577, 'grad_norm': 2.586785078048706, 'learning_rate': 3.04029863764361e-05, 'epoch': 3.46} 35%|███▍ | 3461/10000 [5:27:08<9:57:47, 5.49s/it][2025-06-19 18:56:52,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:56:52,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.62 | bwd_microstep: 3318.56 | bwd_inner_microstep: 3317.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 18:56:52,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.62 | bwd: 3318.57 | bwd_inner: 3317.77 | bwd_allreduce: 0.77 | step: 6.68 35%|███▍ | 3462/10000 [5:27:13<9:57:03, 5.48s/it] {'loss': 0.0157, 'grad_norm': 1.107101559638977, 'learning_rate': 3.039745354883112e-05, 'epoch': 3.46} 35%|███▍ | 3462/10000 [5:27:13<9:57:03, 5.48s/it][2025-06-19 18:56:58,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 
| optimizer_step: 2.73 [2025-06-19 18:56:58,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.63 | bwd_microstep: 3316.55 | bwd_inner_microstep: 3315.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 18:56:58,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.63 | bwd: 3316.56 | bwd_inner: 3315.75 | bwd_allreduce: 0.77 | step: 6.75 35%|███▍ | 3463/10000 [5:27:19<9:56:12, 5.47s/it] {'loss': 0.0109, 'grad_norm': 0.6134714484214783, 'learning_rate': 3.0391919630581516e-05, 'epoch': 3.46} 35%|███▍ | 3463/10000 [5:27:19<9:56:12, 5.47s/it][2025-06-19 18:57:03,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:57:03,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.87 | bwd_microstep: 3311.89 | bwd_inner_microstep: 3311.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 18:57:03,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.87 | bwd: 3311.90 | bwd_inner: 3311.08 | bwd_allreduce: 0.78 | step: 7.03 35%|███▍ | 3464/10000 [5:27:24<9:55:40, 5.47s/it] {'loss': 0.0078, 'grad_norm': 0.20506452023983002, 'learning_rate': 3.0386384622267768e-05, 'epoch': 3.46} 35%|███▍ | 3464/10000 [5:27:24<9:55:40, 5.47s/it][2025-06-19 18:57:09,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:57:09,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.09 | bwd_microstep: 3316.61 | bwd_inner_microstep: 3315.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 18:57:09,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.09 | bwd: 3316.62 | bwd_inner: 3315.82 | bwd_allreduce: 0.76 | step: 6.69 35%|███▍ | 3465/10000 [5:27:29<9:55:28, 5.47s/it] {'loss': 0.0612, 'grad_norm': 2.552304267883301, 'learning_rate': 
3.0380848524470482e-05, 'epoch': 3.46} 35%|███▍ | 3465/10000 [5:27:29<9:55:28, 5.47s/it][2025-06-19 18:57:14,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:57:14,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.59 | bwd_microstep: 3357.85 | bwd_inner_microstep: 3356.90 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.19 [2025-06-19 18:57:14,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.59 | bwd: 3357.87 | bwd_inner: 3356.90 | bwd_allreduce: 0.93 | step: 7.20 35%|███▍ | 3466/10000 [5:27:35<9:57:01, 5.48s/it] {'loss': 0.1827, 'grad_norm': 3.851355791091919, 'learning_rate': 3.0375311337770373e-05, 'epoch': 3.47} 35%|███▍ | 3466/10000 [5:27:35<9:57:01, 5.48s/it][2025-06-19 18:57:20,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:57:20,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.15 | bwd_microstep: 3312.17 | bwd_inner_microstep: 3311.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 18:57:20,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.15 | bwd: 3312.18 | bwd_inner: 3311.38 | bwd_allreduce: 0.76 | step: 6.58 35%|███▍ | 3467/10000 [5:27:40<9:56:29, 5.48s/it] {'loss': 0.0261, 'grad_norm': 0.6338393688201904, 'learning_rate': 3.0369773062748246e-05, 'epoch': 3.47} 35%|███▍ | 3467/10000 [5:27:40<9:56:29, 5.48s/it][2025-06-19 18:57:25,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 18:57:25,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.46 | bwd_microstep: 3362.91 | bwd_inner_microstep: 3361.92 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.66 [2025-06-19 18:57:25,701] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2133.46 | bwd: 3362.93 | bwd_inner: 3361.92 | bwd_allreduce: 0.96 | step: 7.67 35%|███▍ | 3468/10000 [5:27:46<9:58:33, 5.50s/it] {'loss': 0.0474, 'grad_norm': 1.0393528938293457, 'learning_rate': 3.0364233699985056e-05, 'epoch': 3.47} 35%|███▍ | 3468/10000 [5:27:46<9:58:33, 5.50s/it][2025-06-19 18:57:31,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 18:57:31,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.58 | bwd_microstep: 3303.69 | bwd_inner_microstep: 3302.64 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.36 [2025-06-19 18:57:31,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.58 | bwd: 3303.71 | bwd_inner: 3302.64 | bwd_allreduce: 1.02 | step: 7.37 35%|███▍ | 3469/10000 [5:27:51<9:56:49, 5.48s/it] {'loss': 0.0179, 'grad_norm': 1.583141565322876, 'learning_rate': 3.0358693250061858e-05, 'epoch': 3.47} 35%|███▍ | 3469/10000 [5:27:51<9:56:49, 5.48s/it][2025-06-19 18:57:36,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:57:36,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.26 | bwd_microstep: 3313.33 | bwd_inner_microstep: 3312.43 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.86 [2025-06-19 18:57:36,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.26 | bwd: 3313.34 | bwd_inner: 3312.43 | bwd_allreduce: 0.87 | step: 6.86 35%|███▍ | 3470/10000 [5:27:57<9:55:51, 5.47s/it] {'loss': 0.0185, 'grad_norm': 0.4847051203250885, 'learning_rate': 3.0353151713559808e-05, 'epoch': 3.47} 35%|███▍ | 3470/10000 [5:27:57<9:55:51, 5.47s/it][2025-06-19 18:57:42,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:57:42,136] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.77 | bwd_microstep: 3366.44 | bwd_inner_microstep: 3365.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 18:57:42,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.77 | bwd: 3366.45 | bwd_inner: 3365.65 | bwd_allreduce: 0.75 | step: 6.58 35%|███▍ | 3471/10000 [5:28:02<9:57:32, 5.49s/it] {'loss': 0.0664, 'grad_norm': 3.126659870147705, 'learning_rate': 3.0347609091060194e-05, 'epoch': 3.47} 35%|███▍ | 3471/10000 [5:28:02<9:57:32, 5.49s/it][2025-06-19 18:57:47,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:57:47,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.27 | bwd_microstep: 3312.93 | bwd_inner_microstep: 3312.08 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.48 [2025-06-19 18:57:47,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.27 | bwd: 3312.95 | bwd_inner: 3312.08 | bwd_allreduce: 0.81 | step: 7.48 35%|███▍ | 3472/10000 [5:28:08<9:56:17, 5.48s/it] {'loss': 0.0255, 'grad_norm': 1.3943889141082764, 'learning_rate': 3.0342065383144413e-05, 'epoch': 3.47} 35%|███▍ | 3472/10000 [5:28:08<9:56:17, 5.48s/it][2025-06-19 18:57:53,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 18:57:53,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.49 | bwd_microstep: 3312.82 | bwd_inner_microstep: 3312.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 18:57:53,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.49 | bwd: 3312.83 | bwd_inner: 3312.02 | bwd_allreduce: 0.76 | step: 6.67 35%|███▍ | 3473/10000 [5:28:13<9:55:26, 5.47s/it] {'loss': 0.0467, 'grad_norm': 2.222874879837036, 'learning_rate': 3.0336520590393977e-05, 'epoch': 3.47} 35%|███▍ | 
3473/10000 [5:28:13<9:55:26, 5.47s/it][2025-06-19 18:57:58,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 18:57:58,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.80 | bwd_microstep: 3310.16 | bwd_inner_microstep: 3309.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 18:57:58,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.80 | bwd: 3310.17 | bwd_inner: 3309.35 | bwd_allreduce: 0.78 | step: 7.19 35%|███▍ | 3474/10000 [5:28:19<9:54:40, 5.47s/it] {'loss': 0.0117, 'grad_norm': 0.672968327999115, 'learning_rate': 3.0330974713390495e-05, 'epoch': 3.47} 35%|███▍ | 3474/10000 [5:28:19<9:54:40, 5.47s/it][2025-06-19 18:58:03,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:58:03,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.12 | bwd_microstep: 3320.64 | bwd_inner_microstep: 3319.81 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.91 [2025-06-19 18:58:03,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.12 | bwd: 3320.66 | bwd_inner: 3319.81 | bwd_allreduce: 0.79 | step: 6.91 35%|███▍ | 3475/10000 [5:28:24<9:54:19, 5.47s/it] {'loss': 0.0787, 'grad_norm': 1.6815568208694458, 'learning_rate': 3.032542775271572e-05, 'epoch': 3.48} 35%|███▍ | 3475/10000 [5:28:24<9:54:19, 5.47s/it][2025-06-19 18:58:09,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:58:09,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.35 | bwd_microstep: 3322.02 | bwd_inner_microstep: 3321.13 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.52 [2025-06-19 18:58:09,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.35 | bwd: 3322.05 | 
bwd_inner: 3321.13 | bwd_allreduce: 0.86 | step: 7.52 35%|███▍ | 3476/10000 [5:28:30<9:54:22, 5.47s/it] {'loss': 0.0262, 'grad_norm': 0.9475711584091187, 'learning_rate': 3.0319879708951486e-05, 'epoch': 3.48} 35%|███▍ | 3476/10000 [5:28:30<9:54:22, 5.47s/it][2025-06-19 18:58:14,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:58:14,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.81 | bwd_microstep: 3309.60 | bwd_inner_microstep: 3308.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 18:58:14,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.81 | bwd: 3309.62 | bwd_inner: 3308.81 | bwd_allreduce: 0.77 | step: 6.80 35%|███▍ | 3477/10000 [5:28:35<9:54:07, 5.46s/it] {'loss': 0.0501, 'grad_norm': 2.2182886600494385, 'learning_rate': 3.031433058267978e-05, 'epoch': 3.48} 35%|███▍ | 3477/10000 [5:28:35<9:54:07, 5.46s/it][2025-06-19 18:58:20,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:58:20,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.45 | bwd_microstep: 3321.76 | bwd_inner_microstep: 3320.97 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 18:58:20,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.45 | bwd: 3321.77 | bwd_inner: 3320.97 | bwd_allreduce: 0.76 | step: 6.71 35%|███▍ | 3478/10000 [5:28:41<9:54:08, 5.47s/it] {'loss': 0.016, 'grad_norm': 0.9689628481864929, 'learning_rate': 3.0308780374482657e-05, 'epoch': 3.48} 35%|███▍ | 3478/10000 [5:28:41<9:54:08, 5.47s/it][2025-06-19 18:58:25,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 18:58:25,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.11 | 
bwd_microstep: 3320.68 | bwd_inner_microstep: 3319.73 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.52 [2025-06-19 18:58:25,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.11 | bwd: 3320.69 | bwd_inner: 3319.73 | bwd_allreduce: 0.91 | step: 7.53 35%|███▍ | 3479/10000 [5:28:46<9:54:13, 5.47s/it] {'loss': 0.0807, 'grad_norm': 2.830620527267456, 'learning_rate': 3.030322908494232e-05, 'epoch': 3.48} 35%|███▍ | 3479/10000 [5:28:46<9:54:13, 5.47s/it][2025-06-19 18:58:31,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:58:31,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.73 | bwd_microstep: 3310.13 | bwd_inner_microstep: 3309.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 18:58:31,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.73 | bwd: 3310.14 | bwd_inner: 3309.34 | bwd_allreduce: 0.76 | step: 6.69 35%|███▍ | 3480/10000 [5:28:52<9:53:38, 5.46s/it] {'loss': 0.0132, 'grad_norm': 0.5609567761421204, 'learning_rate': 3.0297676714641075e-05, 'epoch': 3.48} 35%|███▍ | 3480/10000 [5:28:52<9:53:38, 5.46s/it][2025-06-19 18:58:36,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:58:36,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.22 | bwd_microstep: 3328.80 | bwd_inner_microstep: 3328.01 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 18:58:36,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.22 | bwd: 3328.81 | bwd_inner: 3328.01 | bwd_allreduce: 0.76 | step: 6.67 35%|███▍ | 3481/10000 [5:28:57<9:53:45, 5.46s/it] {'loss': 0.1125, 'grad_norm': 2.388497829437256, 'learning_rate': 3.029212326416133e-05, 'epoch': 3.48} 35%|███▍ | 3481/10000 [5:28:57<9:53:45, 5.46s/it][2025-06-19 18:58:42,294] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:58:42,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.15 | bwd_microstep: 3365.03 | bwd_inner_microstep: 3364.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.16 [2025-06-19 18:58:42,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.15 | bwd: 3365.04 | bwd_inner: 3364.23 | bwd_allreduce: 0.77 | step: 7.16 35%|███▍ | 3482/10000 [5:29:03<9:56:11, 5.49s/it] {'loss': 0.0444, 'grad_norm': 1.787619709968567, 'learning_rate': 3.0286568734085618e-05, 'epoch': 3.48} 35%|███▍ | 3482/10000 [5:29:03<9:56:11, 5.49s/it][2025-06-19 18:58:47,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:58:47,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.74 | bwd_microstep: 3370.43 | bwd_inner_microstep: 3369.44 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.40 [2025-06-19 18:58:47,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.74 | bwd: 3370.44 | bwd_inner: 3369.44 | bwd_allreduce: 0.96 | step: 7.41 35%|███▍ | 3483/10000 [5:29:08<9:57:50, 5.50s/it] {'loss': 0.0315, 'grad_norm': 0.9949180483818054, 'learning_rate': 3.0281013124996594e-05, 'epoch': 3.48} 35%|███▍ | 3483/10000 [5:29:08<9:57:50, 5.50s/it][2025-06-19 18:58:53,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 18:58:53,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.73 | bwd_microstep: 3315.38 | bwd_inner_microstep: 3314.51 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.56 [2025-06-19 18:58:53,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.73 | bwd: 3315.40 | bwd_inner: 3314.51 | bwd_allreduce: 0.84 | step: 6.57 35%|███▍ | 
3484/10000 [5:29:14<9:56:34, 5.49s/it] {'loss': 0.1434, 'grad_norm': 2.060917854309082, 'learning_rate': 3.0275456437477008e-05, 'epoch': 3.48} 35%|███▍ | 3484/10000 [5:29:14<9:56:34, 5.49s/it][2025-06-19 18:58:58,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:58:58,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.98 | bwd_microstep: 3366.15 | bwd_inner_microstep: 3365.34 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.99 [2025-06-19 18:58:58,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.98 | bwd: 3366.17 | bwd_inner: 3365.34 | bwd_allreduce: 0.78 | step: 6.99 35%|███▍ | 3485/10000 [5:29:19<9:57:33, 5.50s/it] {'loss': 0.0213, 'grad_norm': 1.0769877433776855, 'learning_rate': 3.026989867210973e-05, 'epoch': 3.48} 35%|███▍ | 3485/10000 [5:29:19<9:57:33, 5.50s/it][2025-06-19 18:59:04,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:59:04,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.72 | bwd_microstep: 3323.47 | bwd_inner_microstep: 3322.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-19 18:59:04,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.72 | bwd: 3323.49 | bwd_inner: 3322.67 | bwd_allreduce: 0.78 | step: 6.89 35%|███▍ | 3486/10000 [5:29:25<9:56:36, 5.50s/it] {'loss': 0.0117, 'grad_norm': 0.6567574739456177, 'learning_rate': 3.026433982947775e-05, 'epoch': 3.49} 35%|███▍ | 3486/10000 [5:29:25<9:56:36, 5.50s/it][2025-06-19 18:59:09,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 18:59:09,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.19 | bwd_microstep: 3314.77 | bwd_inner_microstep: 3313.87 | 
bwd_allreduce_microstep: 0.86 | step_microstep: 6.97 [2025-06-19 18:59:09,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.19 | bwd: 3314.79 | bwd_inner: 3313.87 | bwd_allreduce: 0.87 | step: 6.98 35%|███▍ | 3487/10000 [5:29:30<9:55:17, 5.48s/it] {'loss': 0.0051, 'grad_norm': 0.22981835901737213, 'learning_rate': 3.0258779910164154e-05, 'epoch': 3.49} 35%|███▍ | 3487/10000 [5:29:30<9:55:17, 5.48s/it][2025-06-19 18:59:15,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 18:59:15,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.66 | bwd_microstep: 3400.34 | bwd_inner_microstep: 3399.46 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.84 [2025-06-19 18:59:15,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.66 | bwd: 3400.36 | bwd_inner: 3399.46 | bwd_allreduce: 0.85 | step: 6.84 35%|███▍ | 3488/10000 [5:29:36<9:58:25, 5.51s/it] {'loss': 0.0464, 'grad_norm': 2.0264365673065186, 'learning_rate': 3.025321891475216e-05, 'epoch': 3.49} 35%|███▍ | 3488/10000 [5:29:36<9:58:25, 5.51s/it][2025-06-19 18:59:20,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:59:20,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.19 | bwd_microstep: 3319.29 | bwd_inner_microstep: 3318.48 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 18:59:20,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.19 | bwd: 3319.30 | bwd_inner: 3318.48 | bwd_allreduce: 0.78 | step: 7.03 35%|███▍ | 3489/10000 [5:29:41<9:56:38, 5.50s/it] {'loss': 0.0058, 'grad_norm': 0.2518353760242462, 'learning_rate': 3.024765684382509e-05, 'epoch': 3.49} 35%|███▍ | 3489/10000 [5:29:41<9:56:38, 5.50s/it][2025-06-19 18:59:26,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 18:59:26,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.49 | bwd_microstep: 3371.66 | bwd_inner_microstep: 3370.80 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.99 [2025-06-19 18:59:26,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.49 | bwd: 3371.67 | bwd_inner: 3370.80 | bwd_allreduce: 0.82 | step: 7.00 35%|███▍ | 3490/10000 [5:29:47<9:57:59, 5.51s/it] {'loss': 0.0058, 'grad_norm': 0.36692196130752563, 'learning_rate': 3.024209369796638e-05, 'epoch': 3.49} 35%|███▍ | 3490/10000 [5:29:47<9:57:59, 5.51s/it][2025-06-19 18:59:31,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 18:59:31,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.69 | bwd_microstep: 3319.57 | bwd_inner_microstep: 3318.61 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.36 [2025-06-19 18:59:31,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.69 | bwd: 3319.59 | bwd_inner: 3318.61 | bwd_allreduce: 0.93 | step: 7.36 35%|███▍ | 3491/10000 [5:29:52<9:56:35, 5.50s/it] {'loss': 0.0169, 'grad_norm': 0.8342733383178711, 'learning_rate': 3.023652947775957e-05, 'epoch': 3.49} 35%|███▍ | 3491/10000 [5:29:52<9:56:35, 5.50s/it][2025-06-19 18:59:37,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 18:59:37,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.62 | bwd_microstep: 3376.94 | bwd_inner_microstep: 3375.90 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.22 [2025-06-19 18:59:37,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.62 | bwd: 3376.95 | bwd_inner: 3375.90 | bwd_allreduce: 1.00 | step: 7.22 35%|███▍ | 3492/10000 [5:29:58<9:58:24, 5.52s/it] {'loss': 
0.0566, 'grad_norm': 1.904350996017456, 'learning_rate': 3.023096418378833e-05, 'epoch': 3.49} 35%|███▍ | 3492/10000 [5:29:58<9:58:24, 5.52s/it][2025-06-19 18:59:42,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 18:59:42,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.09 | bwd_microstep: 3318.44 | bwd_inner_microstep: 3317.66 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.54 [2025-06-19 18:59:42,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.09 | bwd: 3318.46 | bwd_inner: 3317.67 | bwd_allreduce: 0.75 | step: 6.55 35%|███▍ | 3493/10000 [5:30:03<9:56:43, 5.50s/it] {'loss': 0.0441, 'grad_norm': 1.9839634895324707, 'learning_rate': 3.0225397816636427e-05, 'epoch': 3.49} 35%|███▍ | 3493/10000 [5:30:03<9:56:43, 5.50s/it][2025-06-19 18:59:48,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 18:59:48,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.26 | bwd_microstep: 3318.13 | bwd_inner_microstep: 3317.00 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.75 [2025-06-19 18:59:48,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.26 | bwd: 3318.15 | bwd_inner: 3317.00 | bwd_allreduce: 1.09 | step: 7.77 35%|███▍ | 3494/10000 [5:30:09<9:55:22, 5.49s/it] {'loss': 0.0965, 'grad_norm': 3.4522502422332764, 'learning_rate': 3.0219830376887754e-05, 'epoch': 3.49} 35%|███▍ | 3494/10000 [5:30:09<9:55:22, 5.49s/it][2025-06-19 18:59:53,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:59:53,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.33 | bwd_microstep: 3325.08 | bwd_inner_microstep: 3324.11 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.07 
[2025-06-19 18:59:53,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.33 | bwd: 3325.09 | bwd_inner: 3324.11 | bwd_allreduce: 0.92 | step: 7.07 35%|███▍ | 3495/10000 [5:30:14<9:55:06, 5.49s/it] {'loss': 0.0571, 'grad_norm': 1.9242349863052368, 'learning_rate': 3.0214261865126303e-05, 'epoch': 3.5} 35%|███▍ | 3495/10000 [5:30:14<9:55:06, 5.49s/it][2025-06-19 18:59:59,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 18:59:59,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.19 | bwd_microstep: 3323.85 | bwd_inner_microstep: 3322.86 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.19 [2025-06-19 18:59:59,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.19 | bwd: 3323.87 | bwd_inner: 3322.86 | bwd_allreduce: 0.97 | step: 7.20 35%|███▍ | 3496/10000 [5:30:20<9:54:51, 5.49s/it] {'loss': 0.0276, 'grad_norm': 1.8631874322891235, 'learning_rate': 3.0208692281936193e-05, 'epoch': 3.5} 35%|███▍ | 3496/10000 [5:30:20<9:54:51, 5.49s/it][2025-06-19 19:00:04,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:00:04,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.09 | bwd_microstep: 3373.13 | bwd_inner_microstep: 3372.32 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 19:00:04,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.09 | bwd: 3373.14 | bwd_inner: 3372.32 | bwd_allreduce: 0.78 | step: 6.77 35%|███▍ | 3497/10000 [5:30:25<9:56:53, 5.51s/it] {'loss': 0.0104, 'grad_norm': 0.7713314890861511, 'learning_rate': 3.0203121627901644e-05, 'epoch': 3.5} 35%|███▍ | 3497/10000 [5:30:25<9:56:53, 5.51s/it][2025-06-19 19:00:10,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 
2.73 [2025-06-19 19:00:10,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.25 | bwd_microstep: 3313.29 | bwd_inner_microstep: 3312.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 19:00:10,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.25 | bwd: 3313.30 | bwd_inner: 3312.50 | bwd_allreduce: 0.76 | step: 6.53 35%|███▍ | 3498/10000 [5:30:31<9:55:17, 5.49s/it] {'loss': 0.0333, 'grad_norm': 2.0082170963287354, 'learning_rate': 3.0197549903606983e-05, 'epoch': 3.5} 35%|███▍ | 3498/10000 [5:30:31<9:55:17, 5.49s/it][2025-06-19 19:00:15,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:00:15,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.54 | bwd_microstep: 3321.33 | bwd_inner_microstep: 3320.48 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.06 [2025-06-19 19:00:15,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.54 | bwd: 3321.35 | bwd_inner: 3320.48 | bwd_allreduce: 0.82 | step: 7.06 35%|███▍ | 3499/10000 [5:30:36<9:57:47, 5.52s/it] {'loss': 0.081, 'grad_norm': 3.0428578853607178, 'learning_rate': 3.019197710963667e-05, 'epoch': 3.5} 35%|███▍ | 3499/10000 [5:30:36<9:57:47, 5.52s/it][2025-06-19 19:00:21,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:00:21,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.72 | bwd_microstep: 3320.99 | bwd_inner_microstep: 3320.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 19:00:21,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.72 | bwd: 3321.00 | bwd_inner: 3320.20 | bwd_allreduce: 0.76 | step: 6.76 35%|███▌ | 3500/10000 [5:30:42<9:56:05, 5.50s/it] {'loss': 0.0255, 'grad_norm': 0.9840183258056641, 'learning_rate': 3.0186403246575263e-05, 
'epoch': 3.5} 35%|███▌ | 3500/10000 [5:30:42<9:56:05, 5.50s/it][2025-06-19 19:00:26,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:00:26,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.55 | bwd_microstep: 3406.50 | bwd_inner_microstep: 3405.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 19:00:26,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.55 | bwd: 3406.51 | bwd_inner: 3405.71 | bwd_allreduce: 0.76 | step: 6.60 35%|███▌ | 3501/10000 [5:30:47<9:58:37, 5.53s/it] {'loss': 0.0147, 'grad_norm': 0.45031410455703735, 'learning_rate': 3.018082831500743e-05, 'epoch': 3.5} 35%|███▌ | 3501/10000 [5:30:47<9:58:37, 5.53s/it][2025-06-19 19:00:32,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:00:32,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.53 | bwd_microstep: 3324.54 | bwd_inner_microstep: 3323.62 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.98 [2025-06-19 19:00:32,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.53 | bwd: 3324.55 | bwd_inner: 3323.62 | bwd_allreduce: 0.88 | step: 6.98 35%|███▌ | 3502/10000 [5:30:53<9:56:41, 5.51s/it] {'loss': 0.069, 'grad_norm': 2.9841699600219727, 'learning_rate': 3.0175252315517958e-05, 'epoch': 3.5} 35%|███▌ | 3502/10000 [5:30:53<9:56:41, 5.51s/it][2025-06-19 19:00:37,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:00:37,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.99 | bwd_microstep: 3327.68 | bwd_inner_microstep: 3326.71 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.28 [2025-06-19 19:00:37,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2105.99 | bwd: 3327.70 | bwd_inner: 3326.71 | bwd_allreduce: 0.95 | step: 7.29 35%|███▌ | 3503/10000 [5:30:58<9:55:30, 5.50s/it] {'loss': 0.0485, 'grad_norm': 2.5758767127990723, 'learning_rate': 3.0169675248691738e-05, 'epoch': 3.5} 35%|███▌ | 3503/10000 [5:30:58<9:55:30, 5.50s/it][2025-06-19 19:00:43,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:00:43,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.55 | bwd_microstep: 3329.34 | bwd_inner_microstep: 3328.52 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.63 [2025-06-19 19:00:43,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.55 | bwd: 3329.35 | bwd_inner: 3328.52 | bwd_allreduce: 0.79 | step: 6.64 35%|███▌ | 3504/10000 [5:31:04<9:54:59, 5.50s/it] {'loss': 0.118, 'grad_norm': 2.6218655109405518, 'learning_rate': 3.0164097115113783e-05, 'epoch': 3.5} 35%|███▌ | 3504/10000 [5:31:04<9:54:59, 5.50s/it][2025-06-19 19:00:48,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:00:48,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.44 | bwd_microstep: 3332.85 | bwd_inner_microstep: 3331.92 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.23 [2025-06-19 19:00:48,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.44 | bwd: 3332.87 | bwd_inner: 3331.92 | bwd_allreduce: 0.90 | step: 7.24 35%|███▌ | 3505/10000 [5:31:09<9:54:25, 5.49s/it] {'loss': 0.0018, 'grad_norm': 0.05050263553857803, 'learning_rate': 3.015851791536922e-05, 'epoch': 3.5} 35%|███▌ | 3505/10000 [5:31:09<9:54:25, 5.49s/it][2025-06-19 19:00:54,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:00:54,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2111.51 | bwd_microstep: 3318.67 | bwd_inner_microstep: 3317.72 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.85 [2025-06-19 19:00:54,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.51 | bwd: 3318.69 | bwd_inner: 3317.72 | bwd_allreduce: 0.92 | step: 6.86 35%|███▌ | 3506/10000 [5:31:15<9:53:44, 5.49s/it] {'loss': 0.1083, 'grad_norm': 1.7886781692504883, 'learning_rate': 3.0152937650043274e-05, 'epoch': 3.51} 35%|███▌ | 3506/10000 [5:31:15<9:53:44, 5.49s/it][2025-06-19 19:00:59,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:00:59,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.06 | bwd_microstep: 3332.80 | bwd_inner_microstep: 3331.97 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.39 [2025-06-19 19:00:59,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.06 | bwd: 3332.82 | bwd_inner: 3331.97 | bwd_allreduce: 0.80 | step: 7.39 35%|███▌ | 3507/10000 [5:31:20<9:53:36, 5.49s/it] {'loss': 0.0758, 'grad_norm': 2.9567530155181885, 'learning_rate': 3.0147356319721292e-05, 'epoch': 3.51} 35%|███▌ | 3507/10000 [5:31:20<9:53:36, 5.49s/it][2025-06-19 19:01:05,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:01:05,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.31 | bwd_microstep: 3385.58 | bwd_inner_microstep: 3384.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 19:01:05,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.31 | bwd: 3385.59 | bwd_inner: 3384.77 | bwd_allreduce: 0.77 | step: 6.78 35%|███▌ | 3508/10000 [5:31:26<9:55:35, 5.50s/it] {'loss': 0.0585, 'grad_norm': 2.546558380126953, 'learning_rate': 3.014177392498872e-05, 'epoch': 3.51} 35%|███▌ | 3508/10000 [5:31:26<9:55:35, 5.50s/it][2025-06-19 
19:01:10,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:01:10,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.64 | bwd_microstep: 3325.29 | bwd_inner_microstep: 3324.47 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.14 [2025-06-19 19:01:10,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.64 | bwd: 3325.30 | bwd_inner: 3324.47 | bwd_allreduce: 0.79 | step: 7.15 35%|███▌ | 3509/10000 [5:31:31<9:54:31, 5.50s/it] {'loss': 0.1418, 'grad_norm': 2.415848731994629, 'learning_rate': 3.0136190466431138e-05, 'epoch': 3.51} 35%|███▌ | 3509/10000 [5:31:31<9:54:31, 5.50s/it][2025-06-19 19:01:16,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 19:01:16,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.00 | bwd_microstep: 3377.31 | bwd_inner_microstep: 3376.13 | bwd_allreduce_microstep: 1.11 | step_microstep: 7.97 [2025-06-19 19:01:16,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.00 | bwd: 3377.33 | bwd_inner: 3376.13 | bwd_allreduce: 1.14 | step: 7.98 35%|███▌ | 3510/10000 [5:31:37<9:56:17, 5.51s/it] {'loss': 0.152, 'grad_norm': 1.9802181720733643, 'learning_rate': 3.0130605944634215e-05, 'epoch': 3.51} 35%|███▌ | 3510/10000 [5:31:37<9:56:17, 5.51s/it][2025-06-19 19:01:21,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:01:21,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.47 | bwd_microstep: 3377.44 | bwd_inner_microstep: 3376.61 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.01 [2025-06-19 19:01:21,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.47 | bwd: 3377.46 | bwd_inner: 3376.61 | bwd_allreduce: 0.80 | step: 
7.01 35%|███▌ | 3511/10000 [5:31:42<9:57:44, 5.53s/it] {'loss': 0.0386, 'grad_norm': 1.5985934734344482, 'learning_rate': 3.0125020360183747e-05, 'epoch': 3.51} 35%|███▌ | 3511/10000 [5:31:42<9:57:44, 5.53s/it][2025-06-19 19:01:27,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 19:01:27,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.29 | bwd_microstep: 3324.49 | bwd_inner_microstep: 3323.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 19:01:27,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.29 | bwd: 3324.51 | bwd_inner: 3323.71 | bwd_allreduce: 0.76 | step: 6.61 35%|███▌ | 3512/10000 [5:31:48<9:55:45, 5.51s/it] {'loss': 0.0216, 'grad_norm': 2.0990476608276367, 'learning_rate': 3.011943371366564e-05, 'epoch': 3.51} 35%|███▌ | 3512/10000 [5:31:48<9:55:45, 5.51s/it][2025-06-19 19:01:32,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:01:32,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.59 | bwd_microstep: 3382.90 | bwd_inner_microstep: 3382.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 19:01:32,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.59 | bwd: 3382.91 | bwd_inner: 3382.10 | bwd_allreduce: 0.76 | step: 6.66 35%|███▌ | 3513/10000 [5:31:53<9:57:07, 5.52s/it] {'loss': 0.0435, 'grad_norm': 2.1497981548309326, 'learning_rate': 3.01138460056659e-05, 'epoch': 3.51} 35%|███▌ | 3513/10000 [5:31:53<9:57:07, 5.52s/it][2025-06-19 19:01:38,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:01:38,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.38 | bwd_microstep: 3326.20 | bwd_inner_microstep: 
3325.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 19:01:38,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.38 | bwd: 3326.22 | bwd_inner: 3325.42 | bwd_allreduce: 0.75 | step: 6.59 35%|███▌ | 3514/10000 [5:31:59<9:55:44, 5.51s/it] {'loss': 0.0206, 'grad_norm': 0.7176262140274048, 'learning_rate': 3.010825723677065e-05, 'epoch': 3.51} 35%|███▌ | 3514/10000 [5:31:59<9:55:44, 5.51s/it][2025-06-19 19:01:43,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:01:43,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.96 | bwd_microstep: 3341.19 | bwd_inner_microstep: 3340.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.41 [2025-06-19 19:01:43,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.96 | bwd: 3341.21 | bwd_inner: 3340.38 | bwd_allreduce: 0.78 | step: 7.41 35%|███▌ | 3515/10000 [5:32:04<9:55:04, 5.51s/it] {'loss': 0.0824, 'grad_norm': 2.5580809116363525, 'learning_rate': 3.0102667407566137e-05, 'epoch': 3.52} 35%|███▌ | 3515/10000 [5:32:04<9:55:04, 5.51s/it][2025-06-19 19:01:49,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:01:49,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.16 | bwd_microstep: 3325.37 | bwd_inner_microstep: 3324.56 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.79 [2025-06-19 19:01:49,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.16 | bwd: 3325.39 | bwd_inner: 3324.56 | bwd_allreduce: 0.78 | step: 6.80 35%|███▌ | 3516/10000 [5:32:10<9:53:59, 5.50s/it] {'loss': 0.0506, 'grad_norm': 2.8899154663085938, 'learning_rate': 3.0097076518638695e-05, 'epoch': 3.52} 35%|███▌ | 3516/10000 [5:32:10<9:53:59, 5.50s/it][2025-06-19 19:01:54,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:01:54,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.17 | bwd_microstep: 3383.01 | bwd_inner_microstep: 3382.06 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.29 [2025-06-19 19:01:54,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.17 | bwd: 3383.02 | bwd_inner: 3382.06 | bwd_allreduce: 0.92 | step: 7.30 35%|███▌ | 3517/10000 [5:32:15<9:55:40, 5.51s/it] {'loss': 0.0286, 'grad_norm': 1.7794651985168457, 'learning_rate': 3.009148457057479e-05, 'epoch': 3.52} 35%|███▌ | 3517/10000 [5:32:15<9:55:40, 5.51s/it][2025-06-19 19:02:00,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:02:00,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.26 | bwd_microstep: 3331.10 | bwd_inner_microstep: 3330.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 19:02:00,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.26 | bwd: 3331.11 | bwd_inner: 3330.32 | bwd_allreduce: 0.76 | step: 6.58 35%|███▌ | 3518/10000 [5:32:21<9:54:44, 5.51s/it] {'loss': 0.107, 'grad_norm': 3.6284871101379395, 'learning_rate': 3.008589156396099e-05, 'epoch': 3.52} 35%|███▌ | 3518/10000 [5:32:21<9:54:44, 5.51s/it][2025-06-19 19:02:05,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:02:05,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.99 | bwd_microstep: 3323.32 | bwd_inner_microstep: 3322.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 19:02:05,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.99 | bwd: 3323.33 | bwd_inner: 3322.53 | bwd_allreduce: 0.75 | step: 6.63 35%|███▌ | 3519/10000 [5:32:26<9:53:32, 5.49s/it] {'loss': 0.2191, 
'grad_norm': 5.826259136199951, 'learning_rate': 3.0080297499383967e-05, 'epoch': 3.52} 35%|███▌ | 3519/10000 [5:32:26<9:53:32, 5.49s/it][2025-06-19 19:02:11,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:02:11,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.10 | bwd_microstep: 3376.43 | bwd_inner_microstep: 3375.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 19:02:11,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.10 | bwd: 3376.44 | bwd_inner: 3375.63 | bwd_allreduce: 0.77 | step: 6.87 35%|███▌ | 3520/10000 [5:32:32<9:55:01, 5.51s/it] {'loss': 0.0747, 'grad_norm': 1.305672287940979, 'learning_rate': 3.007470237743053e-05, 'epoch': 3.52} 35%|███▌ | 3520/10000 [5:32:32<9:55:01, 5.51s/it][2025-06-19 19:02:17,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:02:17,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.66 | bwd_microstep: 3371.23 | bwd_inner_microstep: 3370.44 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 19:02:17,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.66 | bwd: 3371.24 | bwd_inner: 3370.44 | bwd_allreduce: 0.76 | step: 6.58 35%|███▌ | 3521/10000 [5:32:37<9:56:07, 5.52s/it] {'loss': 0.0776, 'grad_norm': 1.9470436573028564, 'learning_rate': 3.006910619868757e-05, 'epoch': 3.52} 35%|███▌ | 3521/10000 [5:32:37<9:56:07, 5.52s/it][2025-06-19 19:02:22,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:02:22,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.95 | bwd_microstep: 3329.51 | bwd_inner_microstep: 3328.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-19 
19:02:22,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.95 | bwd: 3329.52 | bwd_inner: 3328.69 | bwd_allreduce: 0.78 | step: 6.81 35%|███▌ | 3522/10000 [5:32:43<9:54:31, 5.51s/it] {'loss': 0.0302, 'grad_norm': 1.3222616910934448, 'learning_rate': 3.0063508963742107e-05, 'epoch': 3.52} 35%|███▌ | 3522/10000 [5:32:43<9:54:31, 5.51s/it][2025-06-19 19:02:27,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:02:27,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.53 | bwd_microstep: 3329.04 | bwd_inner_microstep: 3328.26 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.55 [2025-06-19 19:02:27,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.53 | bwd: 3329.05 | bwd_inner: 3328.26 | bwd_allreduce: 0.75 | step: 6.56 35%|███▌ | 3523/10000 [5:32:48<9:53:33, 5.50s/it] {'loss': 0.0786, 'grad_norm': 2.5762269496917725, 'learning_rate': 3.005791067318125e-05, 'epoch': 3.52} 35%|███▌ | 3523/10000 [5:32:48<9:53:33, 5.50s/it][2025-06-19 19:02:33,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 19:02:33,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.69 | bwd_microstep: 3370.05 | bwd_inner_microstep: 3369.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.57 [2025-06-19 19:02:33,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.69 | bwd: 3370.07 | bwd_inner: 3369.26 | bwd_allreduce: 0.76 | step: 6.57 35%|███▌ | 3524/10000 [5:32:54<9:54:37, 5.51s/it] {'loss': 0.0625, 'grad_norm': 3.2596099376678467, 'learning_rate': 3.0052311327592246e-05, 'epoch': 3.52} 35%|███▌ | 3524/10000 [5:32:54<9:54:37, 5.51s/it][2025-06-19 19:02:38,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 
[2025-06-19 19:02:38,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.07 | bwd_microstep: 3323.94 | bwd_inner_microstep: 3323.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 19:02:38,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.07 | bwd: 3323.96 | bwd_inner: 3323.14 | bwd_allreduce: 0.77 | step: 6.99 35%|███▌ | 3525/10000 [5:32:59<9:53:40, 5.50s/it] {'loss': 0.011, 'grad_norm': 0.7467104196548462, 'learning_rate': 3.0046710927562442e-05, 'epoch': 3.52} 35%|███▌ | 3525/10000 [5:32:59<9:53:40, 5.50s/it][2025-06-19 19:02:44,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:02:44,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.69 | bwd_microstep: 3382.01 | bwd_inner_microstep: 3381.12 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.21 [2025-06-19 19:02:44,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.68 | bwd: 3382.02 | bwd_inner: 3381.12 | bwd_allreduce: 0.86 | step: 7.22 35%|███▌ | 3526/10000 [5:33:05<9:55:29, 5.52s/it] {'loss': 0.1341, 'grad_norm': 2.9012935161590576, 'learning_rate': 3.004110947367929e-05, 'epoch': 3.53} 35%|███▌ | 3526/10000 [5:33:05<9:55:29, 5.52s/it][2025-06-19 19:02:50,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 19:02:50,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.14 | bwd_microstep: 3374.48 | bwd_inner_microstep: 3373.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 19:02:50,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.14 | bwd: 3374.49 | bwd_inner: 3373.70 | bwd_allreduce: 0.75 | step: 6.54 35%|███▌ | 3527/10000 [5:33:10<9:56:28, 5.53s/it] {'loss': 0.047, 'grad_norm': 2.1019766330718994, 'learning_rate': 3.0035506966530347e-05, 
'epoch': 3.53} 35%|███▌ | 3527/10000 [5:33:10<9:56:28, 5.53s/it][2025-06-19 19:02:55,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.78 [2025-06-19 19:02:55,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.86 | bwd_microstep: 3319.80 | bwd_inner_microstep: 3319.01 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 19:02:55,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.86 | bwd: 3319.81 | bwd_inner: 3319.01 | bwd_allreduce: 0.76 | step: 6.61 35%|███▌ | 3528/10000 [5:33:16<9:54:15, 5.51s/it] {'loss': 0.025, 'grad_norm': 1.416350245475769, 'learning_rate': 3.002990340670331e-05, 'epoch': 3.53} 35%|███▌ | 3528/10000 [5:33:16<9:54:15, 5.51s/it][2025-06-19 19:03:01,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:03:01,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.75 | bwd_microstep: 3329.06 | bwd_inner_microstep: 3328.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-19 19:03:01,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.75 | bwd: 3329.08 | bwd_inner: 3328.24 | bwd_allreduce: 0.78 | step: 6.84 35%|███▌ | 3529/10000 [5:33:21<9:53:15, 5.50s/it] {'loss': 0.0327, 'grad_norm': 0.9663100838661194, 'learning_rate': 3.0024298794785954e-05, 'epoch': 3.53} 35%|███▌ | 3529/10000 [5:33:21<9:53:15, 5.50s/it][2025-06-19 19:03:06,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:03:06,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.92 | bwd_microstep: 3373.35 | bwd_inner_microstep: 3372.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 19:03:06,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2127.92 | bwd: 3373.36 | bwd_inner: 3372.57 | bwd_allreduce: 0.76 | step: 6.54 35%|███▌ | 3530/10000 [5:33:27<9:54:22, 5.51s/it] {'loss': 0.0803, 'grad_norm': 2.194676637649536, 'learning_rate': 3.0018693131366176e-05, 'epoch': 3.53} 35%|███▌ | 3530/10000 [5:33:27<9:54:22, 5.51s/it][2025-06-19 19:03:12,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:03:12,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.41 | bwd_microstep: 3378.27 | bwd_inner_microstep: 3377.32 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.25 [2025-06-19 19:03:12,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.41 | bwd: 3378.28 | bwd_inner: 3377.32 | bwd_allreduce: 0.91 | step: 7.26 35%|███▌ | 3531/10000 [5:33:32<9:55:34, 5.52s/it] {'loss': 0.0177, 'grad_norm': 0.6806623339653015, 'learning_rate': 3.0013086417031987e-05, 'epoch': 3.53} 35%|███▌ | 3531/10000 [5:33:32<9:55:34, 5.52s/it][2025-06-19 19:03:17,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 19:03:17,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.90 | bwd_microstep: 3326.44 | bwd_inner_microstep: 3325.02 | bwd_allreduce_microstep: 1.35 | step_microstep: 7.45 [2025-06-19 19:03:17,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.90 | bwd: 3326.46 | bwd_inner: 3325.02 | bwd_allreduce: 1.38 | step: 7.45 35%|███▌ | 3532/10000 [5:33:38<9:53:59, 5.51s/it] {'loss': 0.018, 'grad_norm': 0.7354788184165955, 'learning_rate': 3.00074786523715e-05, 'epoch': 3.53} 35%|███▌ | 3532/10000 [5:33:38<9:53:59, 5.51s/it][2025-06-19 19:03:23,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:03:23,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2133.00 | bwd_microstep: 3381.14 | bwd_inner_microstep: 3380.26 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.41 [2025-06-19 19:03:23,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.00 | bwd: 3381.16 | bwd_inner: 3380.26 | bwd_allreduce: 0.84 | step: 7.41 35%|███▌ | 3533/10000 [5:33:43<9:55:24, 5.52s/it] {'loss': 0.0405, 'grad_norm': 1.7938055992126465, 'learning_rate': 3.000186983797296e-05, 'epoch': 3.53} 35%|███▌ | 3533/10000 [5:33:43<9:55:24, 5.52s/it][2025-06-19 19:03:28,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:03:28,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.62 | bwd_microstep: 3379.37 | bwd_inner_microstep: 3378.45 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.02 [2025-06-19 19:03:28,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.62 | bwd: 3379.38 | bwd_inner: 3378.45 | bwd_allreduce: 0.89 | step: 7.02 35%|███▌ | 3534/10000 [5:33:49<9:56:34, 5.54s/it] {'loss': 0.02, 'grad_norm': 0.8372328877449036, 'learning_rate': 2.9996259974424687e-05, 'epoch': 3.53} 35%|███▌ | 3534/10000 [5:33:49<9:56:34, 5.54s/it][2025-06-19 19:03:34,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:03:34,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.62 | bwd_microstep: 3320.46 | bwd_inner_microstep: 3319.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 19:03:34,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.62 | bwd: 3320.48 | bwd_inner: 3319.68 | bwd_allreduce: 0.76 | step: 6.58 35%|███▌ | 3535/10000 [5:33:54<9:54:12, 5.51s/it] {'loss': 0.0829, 'grad_norm': 3.361494779586792, 'learning_rate': 2.9990649062315138e-05, 'epoch': 3.54} 35%|███▌ | 3535/10000 [5:33:54<9:54:12, 5.51s/it][2025-06-19 
19:03:39,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:03:39,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.03 | bwd_microstep: 3367.24 | bwd_inner_microstep: 3366.33 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.94 [2025-06-19 19:03:39,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.03 | bwd: 3367.26 | bwd_inner: 3366.33 | bwd_allreduce: 0.88 | step: 6.94 35%|███▌ | 3536/10000 [5:34:00<9:54:59, 5.52s/it] {'loss': 0.1647, 'grad_norm': 2.1593520641326904, 'learning_rate': 2.998503710223287e-05, 'epoch': 3.54} 35%|███▌ | 3536/10000 [5:34:00<9:54:59, 5.52s/it][2025-06-19 19:03:45,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:03:45,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.23 | bwd_microstep: 3406.46 | bwd_inner_microstep: 3405.47 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.88 [2025-06-19 19:03:45,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.23 | bwd: 3406.47 | bwd_inner: 3405.47 | bwd_allreduce: 0.95 | step: 7.88 35%|███▌ | 3537/10000 [5:34:06<9:57:15, 5.54s/it] {'loss': 0.1376, 'grad_norm': 2.305051565170288, 'learning_rate': 2.9979424094766553e-05, 'epoch': 3.54} 35%|███▌ | 3537/10000 [5:34:06<9:57:15, 5.54s/it][2025-06-19 19:03:50,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:03:50,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.19 | bwd_microstep: 3321.67 | bwd_inner_microstep: 3320.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 19:03:50,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.19 | bwd: 3321.68 | bwd_inner: 3320.88 | bwd_allreduce: 0.75 | step: 
6.55 35%|███▌ | 3538/10000 [5:34:11<9:54:40, 5.52s/it] {'loss': 0.1302, 'grad_norm': 2.5172173976898193, 'learning_rate': 2.9973810040504965e-05, 'epoch': 3.54} 35%|███▌ | 3538/10000 [5:34:11<9:54:40, 5.52s/it][2025-06-19 19:03:56,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:03:56,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.19 | bwd_microstep: 3316.92 | bwd_inner_microstep: 3316.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 19:03:56,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.19 | bwd: 3316.93 | bwd_inner: 3316.14 | bwd_allreduce: 0.75 | step: 6.59 35%|███▌ | 3539/10000 [5:34:17<9:52:37, 5.50s/it] {'loss': 0.0107, 'grad_norm': 0.42637374997138977, 'learning_rate': 2.996819494003699e-05, 'epoch': 3.54} 35%|███▌ | 3539/10000 [5:34:17<9:52:37, 5.50s/it][2025-06-19 19:04:01,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:04:01,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.04 | bwd_microstep: 3320.23 | bwd_inner_microstep: 3319.32 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.52 [2025-06-19 19:04:01,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.04 | bwd: 3320.24 | bwd_inner: 3319.32 | bwd_allreduce: 0.87 | step: 6.52 35%|███▌ | 3540/10000 [5:34:22<9:51:32, 5.49s/it] {'loss': 0.0048, 'grad_norm': 0.2385699301958084, 'learning_rate': 2.9962578793951635e-05, 'epoch': 3.54} 35%|███▌ | 3540/10000 [5:34:22<9:51:32, 5.49s/it][2025-06-19 19:04:07,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:04:07,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.56 | bwd_microstep: 3408.02 | bwd_inner_microstep: 
3406.87 | bwd_allreduce_microstep: 1.07 | step_microstep: 8.02 [2025-06-19 19:04:07,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.56 | bwd: 3408.04 | bwd_inner: 3406.87 | bwd_allreduce: 1.10 | step: 8.02 35%|███▌ | 3541/10000 [5:34:28<9:54:37, 5.52s/it] {'loss': 0.032, 'grad_norm': 1.2426114082336426, 'learning_rate': 2.9956961602838e-05, 'epoch': 3.54} 35%|███▌ | 3541/10000 [5:34:28<9:54:37, 5.52s/it][2025-06-19 19:04:12,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:04:12,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.75 | bwd_microstep: 3326.20 | bwd_inner_microstep: 3325.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-19 19:04:12,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.75 | bwd: 3326.22 | bwd_inner: 3325.38 | bwd_allreduce: 0.79 | step: 7.31 35%|███▌ | 3542/10000 [5:34:33<9:52:59, 5.51s/it] {'loss': 0.1212, 'grad_norm': 1.9868656396865845, 'learning_rate': 2.99513433672853e-05, 'epoch': 3.54} 35%|███▌ | 3542/10000 [5:34:33<9:52:59, 5.51s/it][2025-06-19 19:04:18,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:04:18,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.76 | bwd_microstep: 3315.47 | bwd_inner_microstep: 3314.63 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.94 [2025-06-19 19:04:18,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.76 | bwd: 3315.48 | bwd_inner: 3314.63 | bwd_allreduce: 0.81 | step: 6.94 35%|███▌ | 3543/10000 [5:34:39<9:51:38, 5.50s/it] {'loss': 0.0399, 'grad_norm': 2.251190662384033, 'learning_rate': 2.9945724087882873e-05, 'epoch': 3.54} 35%|███▌ | 3543/10000 [5:34:39<9:51:38, 5.50s/it][2025-06-19 19:04:23,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:04:23,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.10 | bwd_microstep: 3367.61 | bwd_inner_microstep: 3366.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 19:04:23,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.10 | bwd: 3367.63 | bwd_inner: 3366.82 | bwd_allreduce: 0.77 | step: 6.75 35%|███▌ | 3544/10000 [5:34:44<9:52:35, 5.51s/it] {'loss': 0.0289, 'grad_norm': 0.9072744250297546, 'learning_rate': 2.9940103765220145e-05, 'epoch': 3.54} 35%|███▌ | 3544/10000 [5:34:44<9:52:35, 5.51s/it][2025-06-19 19:04:29,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:04:29,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.30 | bwd_microstep: 3326.32 | bwd_inner_microstep: 3325.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.26 [2025-06-19 19:04:29,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.30 | bwd: 3326.33 | bwd_inner: 3325.51 | bwd_allreduce: 0.78 | step: 7.27 35%|███▌ | 3545/10000 [5:34:50<9:51:23, 5.50s/it] {'loss': 0.0325, 'grad_norm': 1.5280405282974243, 'learning_rate': 2.9934482399886664e-05, 'epoch': 3.54} 35%|███▌ | 3545/10000 [5:34:50<9:51:23, 5.50s/it][2025-06-19 19:04:34,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:04:34,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.65 | bwd_microstep: 3378.75 | bwd_inner_microstep: 3377.64 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.95 [2025-06-19 19:04:34,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.65 | bwd: 3378.78 | bwd_inner: 3377.64 | bwd_allreduce: 1.07 | step: 7.96 35%|███▌ | 3546/10000 [5:34:55<9:52:57, 5.51s/it] {'loss': 
0.0959, 'grad_norm': 1.8842447996139526, 'learning_rate': 2.992885999247209e-05, 'epoch': 3.55} 35%|███▌ | 3546/10000 [5:34:55<9:52:57, 5.51s/it][2025-06-19 19:04:40,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:04:40,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.32 | bwd_microstep: 3319.80 | bwd_inner_microstep: 3319.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.58 [2025-06-19 19:04:40,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.32 | bwd: 3319.81 | bwd_inner: 3319.01 | bwd_allreduce: 0.76 | step: 6.58 35%|███▌ | 3547/10000 [5:35:01<9:51:19, 5.50s/it] {'loss': 0.0075, 'grad_norm': 0.2881394624710083, 'learning_rate': 2.9923236543566182e-05, 'epoch': 3.55} 35%|███▌ | 3547/10000 [5:35:01<9:51:19, 5.50s/it][2025-06-19 19:04:45,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:04:45,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.87 | bwd_microstep: 3320.57 | bwd_inner_microstep: 3319.69 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.32 [2025-06-19 19:04:45,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.87 | bwd: 3320.58 | bwd_inner: 3319.69 | bwd_allreduce: 0.85 | step: 7.32 35%|███▌ | 3548/10000 [5:35:06<9:50:12, 5.49s/it] {'loss': 0.0695, 'grad_norm': 1.5089362859725952, 'learning_rate': 2.9917612053758822e-05, 'epoch': 3.55} 35%|███▌ | 3548/10000 [5:35:06<9:50:12, 5.49s/it][2025-06-19 19:04:51,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:04:51,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.05 | bwd_microstep: 3374.65 | bwd_inner_microstep: 3373.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 
[2025-06-19 19:04:51,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.05 | bwd: 3374.67 | bwd_inner: 3373.87 | bwd_allreduce: 0.76 | step: 6.62 35%|███▌ | 3549/10000 [5:35:12<9:52:03, 5.51s/it] {'loss': 0.1378, 'grad_norm': 1.3533633947372437, 'learning_rate': 2.9911986523639975e-05, 'epoch': 3.55} 35%|███▌ | 3549/10000 [5:35:12<9:52:03, 5.51s/it][2025-06-19 19:04:56,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:04:56,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.55 | bwd_microstep: 3372.69 | bwd_inner_microstep: 3371.83 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.24 [2025-06-19 19:04:56,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.55 | bwd: 3372.71 | bwd_inner: 3371.83 | bwd_allreduce: 0.82 | step: 7.25 36%|███▌ | 3550/10000 [5:35:17<9:53:14, 5.52s/it] {'loss': 0.0311, 'grad_norm': 0.6638807654380798, 'learning_rate': 2.9906359953799756e-05, 'epoch': 3.55} 36%|███▌ | 3550/10000 [5:35:17<9:53:14, 5.52s/it][2025-06-19 19:05:02,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:05:02,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.64 | bwd_microstep: 3370.14 | bwd_inner_microstep: 3369.32 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.07 [2025-06-19 19:05:02,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.64 | bwd: 3370.16 | bwd_inner: 3369.32 | bwd_allreduce: 0.80 | step: 7.07 36%|███▌ | 3551/10000 [5:35:23<9:53:41, 5.52s/it] {'loss': 0.0501, 'grad_norm': 0.9899275302886963, 'learning_rate': 2.990073234482835e-05, 'epoch': 3.55} 36%|███▌ | 3551/10000 [5:35:23<9:53:41, 5.52s/it][2025-06-19 19:05:07,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 
2.73 [2025-06-19 19:05:07,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.94 | bwd_microstep: 3324.10 | bwd_inner_microstep: 3323.29 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.74 [2025-06-19 19:05:07,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.94 | bwd: 3324.12 | bwd_inner: 3323.29 | bwd_allreduce: 0.78 | step: 6.75 36%|███▌ | 3552/10000 [5:35:28<9:51:45, 5.51s/it] {'loss': 0.1107, 'grad_norm': 2.1078081130981445, 'learning_rate': 2.989510369731607e-05, 'epoch': 3.55} 36%|███▌ | 3552/10000 [5:35:28<9:51:45, 5.51s/it][2025-06-19 19:05:13,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:05:13,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.82 | bwd_microstep: 3326.95 | bwd_inner_microstep: 3326.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 19:05:13,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.82 | bwd: 3326.96 | bwd_inner: 3326.13 | bwd_allreduce: 0.78 | step: 7.09 36%|███▌ | 3553/10000 [5:35:34<9:50:29, 5.50s/it] {'loss': 0.0271, 'grad_norm': 2.092620611190796, 'learning_rate': 2.9889474011853336e-05, 'epoch': 3.55} 36%|███▌ | 3553/10000 [5:35:34<9:50:29, 5.50s/it][2025-06-19 19:05:18,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:05:18,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.56 | bwd_microstep: 3369.19 | bwd_inner_microstep: 3368.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 19:05:18,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.56 | bwd: 3369.20 | bwd_inner: 3368.39 | bwd_allreduce: 0.77 | step: 6.92 36%|███▌ | 3554/10000 [5:35:39<9:51:38, 5.51s/it] {'loss': 0.0668, 'grad_norm': 2.0036561489105225, 'learning_rate': 
2.9883843289030675e-05, 'epoch': 3.55} 36%|███▌ | 3554/10000 [5:35:39<9:51:38, 5.51s/it][2025-06-19 19:05:24,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:05:24,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.32 | bwd_microstep: 3364.04 | bwd_inner_microstep: 3363.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.70 [2025-06-19 19:05:24,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.32 | bwd: 3364.06 | bwd_inner: 3363.24 | bwd_allreduce: 0.77 | step: 6.70 36%|███▌ | 3555/10000 [5:35:45<9:52:08, 5.51s/it] {'loss': 0.0289, 'grad_norm': 0.7464356422424316, 'learning_rate': 2.9878211529438738e-05, 'epoch': 3.56} 36%|███▌ | 3555/10000 [5:35:45<9:52:08, 5.51s/it][2025-06-19 19:05:29,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:05:29,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.56 | bwd_microstep: 3327.86 | bwd_inner_microstep: 3327.04 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.78 [2025-06-19 19:05:29,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.56 | bwd: 3327.87 | bwd_inner: 3327.04 | bwd_allreduce: 0.79 | step: 6.79 36%|███▌ | 3556/10000 [5:35:50<9:50:49, 5.50s/it] {'loss': 0.0302, 'grad_norm': 1.1207153797149658, 'learning_rate': 2.9872578733668245e-05, 'epoch': 3.56} 36%|███▌ | 3556/10000 [5:35:50<9:50:49, 5.50s/it][2025-06-19 19:05:35,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:05:35,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.95 | bwd_microstep: 3313.86 | bwd_inner_microstep: 3313.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 19:05:35,298] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2102.95 | bwd: 3313.87 | bwd_inner: 3313.07 | bwd_allreduce: 0.76 | step: 6.68 36%|███▌ | 3557/10000 [5:35:56<9:49:11, 5.49s/it] {'loss': 0.1487, 'grad_norm': 2.3302276134490967, 'learning_rate': 2.9866944902310067e-05, 'epoch': 3.56} 36%|███▌ | 3557/10000 [5:35:56<9:49:11, 5.49s/it][2025-06-19 19:05:40,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:05:40,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.51 | bwd_microstep: 3328.86 | bwd_inner_microstep: 3327.81 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.51 [2025-06-19 19:05:40,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.51 | bwd: 3328.88 | bwd_inner: 3327.81 | bwd_allreduce: 1.03 | step: 7.52 36%|███▌ | 3558/10000 [5:36:01<9:48:47, 5.48s/it] {'loss': 0.107, 'grad_norm': 1.8816391229629517, 'learning_rate': 2.9861310035955165e-05, 'epoch': 3.56} 36%|███▌ | 3558/10000 [5:36:01<9:48:47, 5.48s/it][2025-06-19 19:05:46,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:05:46,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.78 | bwd_microstep: 3378.41 | bwd_inner_microstep: 3377.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 19:05:46,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.78 | bwd: 3378.43 | bwd_inner: 3377.61 | bwd_allreduce: 0.77 | step: 6.85 36%|███▌ | 3559/10000 [5:36:07<9:50:43, 5.50s/it] {'loss': 0.0218, 'grad_norm': 1.0743411779403687, 'learning_rate': 2.9855674135194602e-05, 'epoch': 3.56} 36%|███▌ | 3559/10000 [5:36:07<9:50:43, 5.50s/it][2025-06-19 19:05:51,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:05:51,845] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.06 | bwd_microstep: 3364.59 | bwd_inner_microstep: 3363.63 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.05 [2025-06-19 19:05:51,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.06 | bwd: 3364.60 | bwd_inner: 3363.63 | bwd_allreduce: 0.93 | step: 7.06 36%|███▌ | 3560/10000 [5:36:12<9:51:17, 5.51s/it] {'loss': 0.0856, 'grad_norm': 2.1636593341827393, 'learning_rate': 2.985003720061957e-05, 'epoch': 3.56} 36%|███▌ | 3560/10000 [5:36:12<9:51:17, 5.51s/it][2025-06-19 19:05:57,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:05:57,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.96 | bwd_microstep: 3319.54 | bwd_inner_microstep: 3318.72 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.87 [2025-06-19 19:05:57,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.96 | bwd: 3319.56 | bwd_inner: 3318.72 | bwd_allreduce: 0.79 | step: 6.88 36%|███▌ | 3561/10000 [5:36:18<9:49:49, 5.50s/it] {'loss': 0.0231, 'grad_norm': 0.813779354095459, 'learning_rate': 2.984439923282135e-05, 'epoch': 3.56} 36%|███▌ | 3561/10000 [5:36:18<9:49:49, 5.50s/it][2025-06-19 19:06:02,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:06:02,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.61 | bwd_microstep: 3317.42 | bwd_inner_microstep: 3316.43 | bwd_allreduce_microstep: 0.94 | step_microstep: 6.90 [2025-06-19 19:06:02,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.61 | bwd: 3317.43 | bwd_inner: 3316.43 | bwd_allreduce: 0.96 | step: 6.90 36%|███▌ | 3562/10000 [5:36:23<9:48:24, 5.48s/it] {'loss': 0.226, 'grad_norm': 3.562411069869995, 'learning_rate': 2.983876023239133e-05, 'epoch': 3.56} 36%|███▌ | 3562/10000 
[5:36:23<9:48:24, 5.48s/it][2025-06-19 19:06:08,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:06:08,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.09 | bwd_microstep: 3312.37 | bwd_inner_microstep: 3311.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 19:06:08,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.09 | bwd: 3312.38 | bwd_inner: 3311.57 | bwd_allreduce: 0.77 | step: 6.65 36%|███▌ | 3563/10000 [5:36:29<9:47:39, 5.48s/it] {'loss': 0.0729, 'grad_norm': 1.871239423751831, 'learning_rate': 2.9833120199921038e-05, 'epoch': 3.56} 36%|███▌ | 3563/10000 [5:36:29<9:47:39, 5.48s/it][2025-06-19 19:06:13,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:06:13,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.19 | bwd_microstep: 3366.31 | bwd_inner_microstep: 3365.35 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.82 [2025-06-19 19:06:13,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.19 | bwd: 3366.33 | bwd_inner: 3365.35 | bwd_allreduce: 0.93 | step: 7.82 36%|███▌ | 3564/10000 [5:36:34<9:49:31, 5.50s/it] {'loss': 0.0732, 'grad_norm': 2.345407009124756, 'learning_rate': 2.9827479136002068e-05, 'epoch': 3.56} 36%|███▌ | 3564/10000 [5:36:34<9:49:31, 5.50s/it][2025-06-19 19:06:19,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:06:19,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.14 | bwd_microstep: 3324.22 | bwd_inner_microstep: 3323.31 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.09 [2025-06-19 19:06:19,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.14 | bwd: 3324.24 | bwd_inner: 
3323.31 | bwd_allreduce: 0.88 | step: 7.10 36%|███▌ | 3565/10000 [5:36:40<9:48:56, 5.49s/it] {'loss': 0.0324, 'grad_norm': 0.8993860483169556, 'learning_rate': 2.9821837041226148e-05, 'epoch': 3.56} 36%|███▌ | 3565/10000 [5:36:40<9:48:56, 5.49s/it][2025-06-19 19:06:24,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.71 | optimizer_step: 2.73 [2025-06-19 19:06:24,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.51 | bwd_microstep: 3322.59 | bwd_inner_microstep: 3321.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.89 [2025-06-19 19:06:24,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.51 | bwd: 3322.60 | bwd_inner: 3321.81 | bwd_allreduce: 0.75 | step: 6.90 36%|███▌ | 3566/10000 [5:36:45<9:48:12, 5.49s/it] {'loss': 0.0122, 'grad_norm': 0.37883487343788147, 'learning_rate': 2.9816193916185105e-05, 'epoch': 3.57} 36%|███▌ | 3566/10000 [5:36:45<9:48:12, 5.49s/it][2025-06-19 19:06:30,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:06:30,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.77 | bwd_microstep: 3309.31 | bwd_inner_microstep: 3308.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 19:06:30,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.77 | bwd: 3309.32 | bwd_inner: 3308.52 | bwd_allreduce: 0.75 | step: 6.55 36%|███▌ | 3567/10000 [5:36:50<9:46:46, 5.47s/it] {'loss': 0.0634, 'grad_norm': 1.6078547239303589, 'learning_rate': 2.9810549761470882e-05, 'epoch': 3.57} 36%|███▌ | 3567/10000 [5:36:50<9:46:46, 5.47s/it][2025-06-19 19:06:35,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:06:35,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.11 | 
bwd_microstep: 3315.25 | bwd_inner_microstep: 3314.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-19 19:06:35,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.11 | bwd: 3315.27 | bwd_inner: 3314.46 | bwd_allreduce: 0.76 | step: 6.81 36%|███▌ | 3568/10000 [5:36:56<9:46:12, 5.47s/it] {'loss': 0.0343, 'grad_norm': 1.1869733333587646, 'learning_rate': 2.9804904577675522e-05, 'epoch': 3.57} 36%|███▌ | 3568/10000 [5:36:56<9:46:12, 5.47s/it][2025-06-19 19:06:41,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:06:41,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.65 | bwd_microstep: 3317.85 | bwd_inner_microstep: 3317.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 19:06:41,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.65 | bwd: 3317.86 | bwd_inner: 3317.06 | bwd_allreduce: 0.76 | step: 6.91 36%|███▌ | 3569/10000 [5:37:01<9:45:44, 5.46s/it] {'loss': 0.1987, 'grad_norm': 2.428105354309082, 'learning_rate': 2.9799258365391175e-05, 'epoch': 3.57} 36%|███▌ | 3569/10000 [5:37:01<9:45:44, 5.46s/it][2025-06-19 19:06:46,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 19:06:46,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.54 | bwd_microstep: 3362.96 | bwd_inner_microstep: 3362.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 19:06:46,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.54 | bwd: 3362.97 | bwd_inner: 3362.18 | bwd_allreduce: 0.76 | step: 6.58 36%|███▌ | 3570/10000 [5:37:07<9:47:42, 5.48s/it] {'loss': 0.0496, 'grad_norm': 1.1517823934555054, 'learning_rate': 2.9793611125210103e-05, 'epoch': 3.57} 36%|███▌ | 3570/10000 [5:37:07<9:47:42, 5.48s/it][2025-06-19 19:06:52,130] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:06:52,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.82 | bwd_microstep: 3360.21 | bwd_inner_microstep: 3359.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 19:06:52,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.82 | bwd: 3360.23 | bwd_inner: 3359.43 | bwd_allreduce: 0.75 | step: 6.57 36%|███▌ | 3571/10000 [5:37:12<9:48:52, 5.50s/it] {'loss': 0.0282, 'grad_norm': 0.7200421690940857, 'learning_rate': 2.978796285772468e-05, 'epoch': 3.57} 36%|███▌ | 3571/10000 [5:37:12<9:48:52, 5.50s/it][2025-06-19 19:06:57,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 19:06:57,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.29 | bwd_microstep: 3326.68 | bwd_inner_microstep: 3325.70 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.18 [2025-06-19 19:06:57,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.29 | bwd: 3326.70 | bwd_inner: 3325.70 | bwd_allreduce: 0.95 | step: 7.19 36%|███▌ | 3572/10000 [5:37:18<9:48:03, 5.49s/it] {'loss': 0.026, 'grad_norm': 1.4731229543685913, 'learning_rate': 2.9782313563527384e-05, 'epoch': 3.57} 36%|███▌ | 3572/10000 [5:37:18<9:48:03, 5.49s/it][2025-06-19 19:07:03,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:07:03,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.33 | bwd_microstep: 3317.15 | bwd_inner_microstep: 3316.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 19:07:03,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.33 | bwd: 3317.16 | bwd_inner: 3316.35 | bwd_allreduce: 0.76 | step: 6.65 36%|███▌ | 
3573/10000 [5:37:23<9:47:15, 5.48s/it] {'loss': 0.076, 'grad_norm': 1.54132878780365, 'learning_rate': 2.9776663243210794e-05, 'epoch': 3.57} 36%|███▌ | 3573/10000 [5:37:23<9:47:15, 5.48s/it][2025-06-19 19:07:08,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:07:08,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.53 | bwd_microstep: 3312.74 | bwd_inner_microstep: 3311.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 19:07:08,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.53 | bwd: 3312.76 | bwd_inner: 3311.94 | bwd_allreduce: 0.77 | step: 7.12 36%|███▌ | 3574/10000 [5:37:29<9:46:21, 5.47s/it] {'loss': 0.0668, 'grad_norm': 1.752328634262085, 'learning_rate': 2.9771011897367602e-05, 'epoch': 3.57} 36%|███▌ | 3574/10000 [5:37:29<9:46:21, 5.47s/it][2025-06-19 19:07:13,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:07:13,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.27 | bwd_microstep: 3306.00 | bwd_inner_microstep: 3305.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 19:07:13,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.27 | bwd: 3306.02 | bwd_inner: 3305.23 | bwd_allreduce: 0.75 | step: 6.55 36%|███▌ | 3575/10000 [5:37:34<9:45:08, 5.46s/it] {'loss': 0.0217, 'grad_norm': 0.49826493859291077, 'learning_rate': 2.9765359526590614e-05, 'epoch': 3.58} 36%|███▌ | 3575/10000 [5:37:34<9:45:08, 5.46s/it][2025-06-19 19:07:19,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:07:19,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.44 | bwd_microstep: 3375.68 | bwd_inner_microstep: 3374.72 | 
bwd_allreduce_microstep: 0.91 | step_microstep: 7.06 [2025-06-19 19:07:19,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.44 | bwd: 3375.69 | bwd_inner: 3374.72 | bwd_allreduce: 0.93 | step: 7.07 36%|███▌ | 3576/10000 [5:37:40<9:47:33, 5.49s/it] {'loss': 0.0143, 'grad_norm': 0.36649301648139954, 'learning_rate': 2.9759706131472733e-05, 'epoch': 3.58} 36%|███▌ | 3576/10000 [5:37:40<9:47:33, 5.49s/it][2025-06-19 19:07:25,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:07:25,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.69 | bwd_microstep: 3358.28 | bwd_inner_microstep: 3357.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 19:07:25,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.69 | bwd: 3358.30 | bwd_inner: 3357.49 | bwd_allreduce: 0.76 | step: 6.70 36%|███▌ | 3577/10000 [5:37:45<9:48:39, 5.50s/it] {'loss': 0.0373, 'grad_norm': 1.2831926345825195, 'learning_rate': 2.975405171260697e-05, 'epoch': 3.58} 36%|███▌ | 3577/10000 [5:37:45<9:48:39, 5.50s/it][2025-06-19 19:07:30,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:07:30,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.05 | bwd_microstep: 3311.25 | bwd_inner_microstep: 3310.37 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.90 [2025-06-19 19:07:30,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.05 | bwd: 3311.27 | bwd_inner: 3310.37 | bwd_allreduce: 0.86 | step: 6.90 36%|███▌ | 3578/10000 [5:37:51<9:46:54, 5.48s/it] {'loss': 0.0446, 'grad_norm': 1.4725786447525024, 'learning_rate': 2.9748396270586464e-05, 'epoch': 3.58} 36%|███▌ | 3578/10000 [5:37:51<9:46:54, 5.48s/it][2025-06-19 19:07:35,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:07:35,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.17 | bwd_microstep: 3315.60 | bwd_inner_microstep: 3314.78 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.99 [2025-06-19 19:07:35,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.17 | bwd: 3315.61 | bwd_inner: 3314.78 | bwd_allreduce: 0.79 | step: 6.99 36%|███▌ | 3579/10000 [5:37:56<9:45:51, 5.47s/it] {'loss': 0.0377, 'grad_norm': 1.067299485206604, 'learning_rate': 2.9742739806004422e-05, 'epoch': 3.58} 36%|███▌ | 3579/10000 [5:37:56<9:45:51, 5.47s/it][2025-06-19 19:07:41,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:07:41,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.48 | bwd_microstep: 3313.62 | bwd_inner_microstep: 3312.73 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.28 [2025-06-19 19:07:41,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.48 | bwd: 3313.64 | bwd_inner: 3312.73 | bwd_allreduce: 0.87 | step: 7.29 36%|███▌ | 3580/10000 [5:38:02<9:45:20, 5.47s/it] {'loss': 0.0258, 'grad_norm': 0.5459325909614563, 'learning_rate': 2.9737082319454195e-05, 'epoch': 3.58} 36%|███▌ | 3580/10000 [5:38:02<9:45:20, 5.47s/it][2025-06-19 19:07:46,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:07:46,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.41 | bwd_microstep: 3315.18 | bwd_inner_microstep: 3314.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 19:07:46,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.41 | bwd: 3315.19 | bwd_inner: 3314.39 | bwd_allreduce: 0.76 | step: 6.76 36%|███▌ | 3581/10000 [5:38:07<9:45:16, 5.47s/it] {'loss': 
0.0124, 'grad_norm': 0.3965589106082916, 'learning_rate': 2.9731423811529227e-05, 'epoch': 3.58} 36%|███▌ | 3581/10000 [5:38:07<9:45:16, 5.47s/it][2025-06-19 19:07:52,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:07:52,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.95 | bwd_microstep: 3318.15 | bwd_inner_microstep: 3317.27 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.20 [2025-06-19 19:07:52,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.95 | bwd: 3318.17 | bwd_inner: 3317.27 | bwd_allreduce: 0.84 | step: 7.20 36%|███▌ | 3582/10000 [5:38:13<9:44:58, 5.47s/it] {'loss': 0.0125, 'grad_norm': 0.38348639011383057, 'learning_rate': 2.972576428282306e-05, 'epoch': 3.58} 36%|███▌ | 3582/10000 [5:38:13<9:44:58, 5.47s/it][2025-06-19 19:07:57,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 19:07:57,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2162.40 | bwd_microstep: 3366.41 | bwd_inner_microstep: 3365.16 | bwd_allreduce_microstep: 1.19 | step_microstep: 7.50 [2025-06-19 19:07:57,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2162.40 | bwd: 3366.42 | bwd_inner: 3365.16 | bwd_allreduce: 1.22 | step: 7.51 36%|███▌ | 3583/10000 [5:38:18<9:48:13, 5.50s/it] {'loss': 0.0297, 'grad_norm': 1.1148278713226318, 'learning_rate': 2.9720103733929365e-05, 'epoch': 3.58} 36%|███▌ | 3583/10000 [5:38:18<9:48:13, 5.50s/it][2025-06-19 19:08:03,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 19:08:03,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.00 | bwd_microstep: 3325.07 | bwd_inner_microstep: 3324.08 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.62 
[2025-06-19 19:08:03,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.00 | bwd: 3325.10 | bwd_inner: 3324.08 | bwd_allreduce: 0.95 | step: 7.63 36%|███▌ | 3584/10000 [5:38:24<9:48:35, 5.50s/it] {'loss': 0.026, 'grad_norm': 0.901720404624939, 'learning_rate': 2.97144421654419e-05, 'epoch': 3.58} 36%|███▌ | 3584/10000 [5:38:24<9:48:35, 5.50s/it][2025-06-19 19:08:08,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 19:08:08,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.90 | bwd_microstep: 3314.66 | bwd_inner_microstep: 3313.20 | bwd_allreduce_microstep: 1.37 | step_microstep: 9.07 [2025-06-19 19:08:08,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.90 | bwd: 3314.68 | bwd_inner: 3313.20 | bwd_allreduce: 1.41 | step: 9.08 36%|███▌ | 3585/10000 [5:38:29<9:48:46, 5.51s/it] {'loss': 0.0431, 'grad_norm': 0.9126595854759216, 'learning_rate': 2.9708779577954532e-05, 'epoch': 3.58} 36%|███▌ | 3585/10000 [5:38:29<9:48:46, 5.51s/it][2025-06-19 19:08:14,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:08:14,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.45 | bwd_microstep: 3317.14 | bwd_inner_microstep: 3316.27 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.36 [2025-06-19 19:08:14,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.45 | bwd: 3317.17 | bwd_inner: 3316.28 | bwd_allreduce: 0.83 | step: 7.35 36%|███▌ | 3586/10000 [5:38:35<9:49:22, 5.51s/it] {'loss': 0.082, 'grad_norm': 2.1722216606140137, 'learning_rate': 2.9703115972061253e-05, 'epoch': 3.59} 36%|███▌ | 3586/10000 [5:38:35<9:49:22, 5.51s/it][2025-06-19 19:08:20,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 
2.72 [2025-06-19 19:08:20,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2156.69 | bwd_microstep: 3371.67 | bwd_inner_microstep: 3370.88 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 19:08:20,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2156.69 | bwd: 3371.68 | bwd_inner: 3370.88 | bwd_allreduce: 0.76 | step: 6.84 36%|███▌ | 3587/10000 [5:38:40<9:51:06, 5.53s/it] {'loss': 0.0135, 'grad_norm': 0.6719303727149963, 'learning_rate': 2.9697451348356138e-05, 'epoch': 3.59} 36%|███▌ | 3587/10000 [5:38:40<9:51:06, 5.53s/it][2025-06-19 19:08:25,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:08:25,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.61 | bwd_microstep: 3311.53 | bwd_inner_microstep: 3310.54 | bwd_allreduce_microstep: 0.89 | step_microstep: 8.05 [2025-06-19 19:08:25,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.61 | bwd: 3311.57 | bwd_inner: 3310.54 | bwd_allreduce: 0.93 | step: 8.06 36%|███▌ | 3588/10000 [5:38:46<9:49:00, 5.51s/it] {'loss': 0.0365, 'grad_norm': 1.144679069519043, 'learning_rate': 2.9691785707433384e-05, 'epoch': 3.59} 36%|███▌ | 3588/10000 [5:38:46<9:49:00, 5.51s/it][2025-06-19 19:08:30,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-19 19:08:30,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.81 | bwd_microstep: 3315.31 | bwd_inner_microstep: 3314.23 | bwd_allreduce_microstep: 1.00 | step_microstep: 8.37 [2025-06-19 19:08:30,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.81 | bwd: 3315.34 | bwd_inner: 3314.23 | bwd_allreduce: 1.04 | step: 8.38 36%|███▌ | 3589/10000 [5:38:51<9:48:00, 5.50s/it] {'loss': 0.0231, 'grad_norm': 1.0895127058029175, 'learning_rate': 
2.968611904988729e-05, 'epoch': 3.59} 36%|███▌ | 3589/10000 [5:38:51<9:48:00, 5.50s/it][2025-06-19 19:08:36,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:08:36,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2175.06 | bwd_microstep: 3368.81 | bwd_inner_microstep: 3367.98 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.72 [2025-06-19 19:08:36,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2175.06 | bwd: 3368.82 | bwd_inner: 3367.98 | bwd_allreduce: 0.80 | step: 6.73 36%|███▌ | 3590/10000 [5:38:57<9:50:37, 5.53s/it] {'loss': 0.0697, 'grad_norm': 2.1095972061157227, 'learning_rate': 2.9680451376312263e-05, 'epoch': 3.59} 36%|███▌ | 3590/10000 [5:38:57<9:50:37, 5.53s/it][2025-06-19 19:08:42,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:08:42,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.38 | bwd_microstep: 3309.77 | bwd_inner_microstep: 3308.79 | bwd_allreduce_microstep: 0.94 | step_microstep: 6.68 [2025-06-19 19:08:42,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.38 | bwd: 3309.79 | bwd_inner: 3308.79 | bwd_allreduce: 0.95 | step: 6.68 36%|███▌ | 3591/10000 [5:39:02<9:48:07, 5.51s/it] {'loss': 0.0509, 'grad_norm': 1.6306958198547363, 'learning_rate': 2.9674782687302818e-05, 'epoch': 3.59} 36%|███▌ | 3591/10000 [5:39:02<9:48:07, 5.51s/it][2025-06-19 19:08:47,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:08:47,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.18 | bwd_microstep: 3313.08 | bwd_inner_microstep: 3312.20 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.92 [2025-06-19 19:08:47,481] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2108.18 | bwd: 3313.11 | bwd_inner: 3312.20 | bwd_allreduce: 0.85 | step: 7.92 36%|███▌ | 3592/10000 [5:39:08<9:46:37, 5.49s/it] {'loss': 0.0483, 'grad_norm': 2.000436305999756, 'learning_rate': 2.9669112983453563e-05, 'epoch': 3.59} 36%|███▌ | 3592/10000 [5:39:08<9:46:37, 5.49s/it][2025-06-19 19:08:53,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:08:53,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2155.66 | bwd_microstep: 3375.24 | bwd_inner_microstep: 3374.35 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.21 [2025-06-19 19:08:53,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2155.66 | bwd: 3375.26 | bwd_inner: 3374.35 | bwd_allreduce: 0.84 | step: 7.20 36%|███▌ | 3593/10000 [5:39:13<9:49:16, 5.52s/it] {'loss': 0.0614, 'grad_norm': 1.533678650856018, 'learning_rate': 2.966344226535924e-05, 'epoch': 3.59} 36%|███▌ | 3593/10000 [5:39:13<9:49:16, 5.52s/it][2025-06-19 19:08:58,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:08:58,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.59 | bwd_microstep: 3327.25 | bwd_inner_microstep: 3326.38 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.16 [2025-06-19 19:08:58,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.59 | bwd: 3327.27 | bwd_inner: 3326.38 | bwd_allreduce: 0.83 | step: 7.16 36%|███▌ | 3594/10000 [5:39:19<9:49:00, 5.52s/it] {'loss': 0.0396, 'grad_norm': 1.1682997941970825, 'learning_rate': 2.965777053361467e-05, 'epoch': 3.59} 36%|███▌ | 3594/10000 [5:39:19<9:49:00, 5.52s/it][2025-06-19 19:09:04,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.73 [2025-06-19 19:09:04,089] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.90 | bwd_microstep: 3333.81 | bwd_inner_microstep: 3332.43 | bwd_allreduce_microstep: 1.26 | step_microstep: 9.12 [2025-06-19 19:09:04,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.90 | bwd: 3333.86 | bwd_inner: 3332.43 | bwd_allreduce: 1.32 | step: 9.13 36%|███▌ | 3595/10000 [5:39:24<9:48:59, 5.52s/it] {'loss': 0.0127, 'grad_norm': 0.45630982518196106, 'learning_rate': 2.9652097788814793e-05, 'epoch': 3.59} 36%|███▌ | 3595/10000 [5:39:24<9:48:59, 5.52s/it][2025-06-19 19:09:09,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:09:09,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2160.14 | bwd_microstep: 3381.12 | bwd_inner_microstep: 3380.32 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 19:09:09,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2160.14 | bwd: 3381.14 | bwd_inner: 3380.32 | bwd_allreduce: 0.77 | step: 6.83 36%|███▌ | 3596/10000 [5:39:30<9:51:05, 5.54s/it] {'loss': 0.0289, 'grad_norm': 0.9409111142158508, 'learning_rate': 2.964642403155466e-05, 'epoch': 3.6} 36%|███▌ | 3596/10000 [5:39:30<9:51:05, 5.54s/it][2025-06-19 19:09:15,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:09:15,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.84 | bwd_microstep: 3400.78 | bwd_inner_microstep: 3399.97 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.02 [2025-06-19 19:09:15,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.84 | bwd: 3400.80 | bwd_inner: 3399.97 | bwd_allreduce: 0.78 | step: 7.02 36%|███▌ | 3597/10000 [5:39:36<9:52:20, 5.55s/it] {'loss': 0.052, 'grad_norm': 2.025477886199951, 'learning_rate': 2.9640749262429408e-05, 'epoch': 3.6} 36%|███▌ | 3597/10000 
[5:39:36<9:52:20, 5.55s/it][2025-06-19 19:09:20,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.01 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 19:09:20,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.19 | bwd_microstep: 3329.71 | bwd_inner_microstep: 3328.61 | bwd_allreduce_microstep: 1.05 | step_microstep: 8.03 [2025-06-19 19:09:20,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.19 | bwd: 3329.73 | bwd_inner: 3328.61 | bwd_allreduce: 1.07 | step: 8.04 36%|███▌ | 3598/10000 [5:39:41<9:50:12, 5.53s/it] {'loss': 0.0404, 'grad_norm': 1.7121808528900146, 'learning_rate': 2.9635073482034307e-05, 'epoch': 3.6} 36%|███▌ | 3598/10000 [5:39:41<9:50:12, 5.53s/it][2025-06-19 19:09:26,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:09:26,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.35 | bwd_microstep: 3368.61 | bwd_inner_microstep: 3367.52 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.33 [2025-06-19 19:09:26,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.35 | bwd: 3368.63 | bwd_inner: 3367.52 | bwd_allreduce: 1.06 | step: 7.33 36%|███▌ | 3599/10000 [5:39:47<9:53:21, 5.56s/it] {'loss': 0.0763, 'grad_norm': 3.2304835319519043, 'learning_rate': 2.9629396690964718e-05, 'epoch': 3.6} 36%|███▌ | 3599/10000 [5:39:47<9:53:21, 5.56s/it][2025-06-19 19:09:31,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:09:31,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.38 | bwd_microstep: 3327.26 | bwd_inner_microstep: 3326.19 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.23 [2025-06-19 19:09:31,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.38 | bwd: 3327.28 | bwd_inner: 
3326.19 | bwd_allreduce: 1.04 | step: 7.24 36%|███▌ | 3600/10000 [5:39:52<9:51:01, 5.54s/it] {'loss': 0.0264, 'grad_norm': 0.7790032625198364, 'learning_rate': 2.9623718889816105e-05, 'epoch': 3.6} 36%|███▌ | 3600/10000 [5:39:52<9:51:01, 5.54s/it][2025-06-19 19:09:37,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:09:37,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.54 | bwd_microstep: 3329.82 | bwd_inner_microstep: 3329.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 19:09:37,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.54 | bwd: 3329.84 | bwd_inner: 3329.01 | bwd_allreduce: 0.78 | step: 7.23 36%|███▌ | 3601/10000 [5:39:58<9:49:31, 5.53s/it] {'loss': 0.2253, 'grad_norm': 2.6387617588043213, 'learning_rate': 2.9618040079184048e-05, 'epoch': 3.6} 36%|███▌ | 3601/10000 [5:39:58<9:49:31, 5.53s/it][2025-06-19 19:09:42,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.70 | optimizer_step: 2.73 [2025-06-19 19:09:42,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.99 | bwd_microstep: 3379.97 | bwd_inner_microstep: 3378.74 | bwd_allreduce_microstep: 1.13 | step_microstep: 9.08 [2025-06-19 19:09:42,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.99 | bwd: 3379.99 | bwd_inner: 3378.74 | bwd_allreduce: 1.18 | step: 9.09 36%|███▌ | 3602/10000 [5:40:03<9:50:24, 5.54s/it] {'loss': 0.1482, 'grad_norm': 2.445375442504883, 'learning_rate': 2.9612360259664217e-05, 'epoch': 3.6} 36%|███▌ | 3602/10000 [5:40:03<9:50:24, 5.54s/it][2025-06-19 19:09:48,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 19:09:48,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.56 | 
bwd_microstep: 3327.29 | bwd_inner_microstep: 3326.25 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.47 [2025-06-19 19:09:48,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.56 | bwd: 3327.31 | bwd_inner: 3326.25 | bwd_allreduce: 0.99 | step: 7.47 36%|███▌ | 3603/10000 [5:40:09<9:50:00, 5.53s/it] {'loss': 0.13, 'grad_norm': 2.758949041366577, 'learning_rate': 2.960667943185241e-05, 'epoch': 3.6} 36%|███▌ | 3603/10000 [5:40:09<9:50:00, 5.53s/it][2025-06-19 19:09:53,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:09:53,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.04 | bwd_microstep: 3340.58 | bwd_inner_microstep: 3339.59 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.42 [2025-06-19 19:09:53,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.04 | bwd: 3340.60 | bwd_inner: 3339.59 | bwd_allreduce: 0.96 | step: 7.43 36%|███▌ | 3604/10000 [5:40:14<9:49:53, 5.53s/it] {'loss': 0.007, 'grad_norm': 0.5050749182701111, 'learning_rate': 2.9600997596344526e-05, 'epoch': 3.6} 36%|███▌ | 3604/10000 [5:40:14<9:49:53, 5.53s/it][2025-06-19 19:09:59,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:09:59,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.27 | bwd_microstep: 3328.46 | bwd_inner_microstep: 3327.58 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.83 [2025-06-19 19:09:59,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.27 | bwd: 3328.49 | bwd_inner: 3327.58 | bwd_allreduce: 0.84 | step: 7.84 36%|███▌ | 3605/10000 [5:40:20<9:49:32, 5.53s/it] {'loss': 0.1325, 'grad_norm': 4.228185176849365, 'learning_rate': 2.9595314753736547e-05, 'epoch': 3.6} 36%|███▌ | 3605/10000 [5:40:20<9:49:32, 5.53s/it][2025-06-19 19:10:05,024] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:10:05,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.62 | bwd_microstep: 3322.95 | bwd_inner_microstep: 3322.10 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.08 [2025-06-19 19:10:05,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.62 | bwd: 3322.98 | bwd_inner: 3322.10 | bwd_allreduce: 0.81 | step: 7.09 36%|███▌ | 3606/10000 [5:40:25<9:48:49, 5.53s/it] {'loss': 0.0482, 'grad_norm': 2.6356427669525146, 'learning_rate': 2.958963090462458e-05, 'epoch': 3.61} 36%|███▌ | 3606/10000 [5:40:25<9:48:49, 5.53s/it][2025-06-19 19:10:10,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:10:10,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.89 | bwd_microstep: 3330.18 | bwd_inner_microstep: 3329.10 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.12 [2025-06-19 19:10:10,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.89 | bwd: 3330.20 | bwd_inner: 3329.10 | bwd_allreduce: 1.05 | step: 7.12 36%|███▌ | 3607/10000 [5:40:31<9:47:31, 5.51s/it] {'loss': 0.098, 'grad_norm': 2.0133986473083496, 'learning_rate': 2.9583946049604844e-05, 'epoch': 3.61} 36%|███▌ | 3607/10000 [5:40:31<9:47:31, 5.51s/it][2025-06-19 19:10:16,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.72 [2025-06-19 19:10:16,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.77 | bwd_microstep: 3344.26 | bwd_inner_microstep: 3342.56 | bwd_allreduce_microstep: 1.59 | step_microstep: 9.21 [2025-06-19 19:10:16,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.77 | bwd: 3344.29 | bwd_inner: 3342.56 | bwd_allreduce: 1.65 | step: 9.22 36%|███▌ | 
3608/10000 [5:40:36<9:47:14, 5.51s/it] {'loss': 0.0619, 'grad_norm': 2.0052454471588135, 'learning_rate': 2.957826018927364e-05, 'epoch': 3.61} 36%|███▌ | 3608/10000 [5:40:36<9:47:14, 5.51s/it][2025-06-19 19:10:21,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.72 [2025-06-19 19:10:21,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.15 | bwd_microstep: 3390.66 | bwd_inner_microstep: 3389.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 19:10:21,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.15 | bwd: 3390.67 | bwd_inner: 3389.87 | bwd_allreduce: 0.77 | step: 6.90 36%|███▌ | 3609/10000 [5:40:42<9:49:35, 5.54s/it] {'loss': 0.0577, 'grad_norm': 2.162135124206543, 'learning_rate': 2.957257332422741e-05, 'epoch': 3.61} 36%|███▌ | 3609/10000 [5:40:42<9:49:35, 5.54s/it][2025-06-19 19:10:27,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:10:27,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.98 | bwd_microstep: 3329.30 | bwd_inner_microstep: 3328.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 19:10:27,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.98 | bwd: 3329.31 | bwd_inner: 3328.51 | bwd_allreduce: 0.76 | step: 6.60 36%|███▌ | 3610/10000 [5:40:47<9:47:33, 5.52s/it] {'loss': 0.0375, 'grad_norm': 1.4700853824615479, 'learning_rate': 2.9566885455062656e-05, 'epoch': 3.61} 36%|███▌ | 3610/10000 [5:40:47<9:47:33, 5.52s/it][2025-06-19 19:10:32,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:10:32,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.92 | bwd_microstep: 3336.88 | bwd_inner_microstep: 3336.06 | 
bwd_allreduce_microstep: 0.77 | step_microstep: 7.10 [2025-06-19 19:10:32,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.92 | bwd: 3336.90 | bwd_inner: 3336.06 | bwd_allreduce: 0.79 | step: 7.13 36%|███▌ | 3611/10000 [5:40:53<9:46:37, 5.51s/it] {'loss': 0.0756, 'grad_norm': 2.071790933609009, 'learning_rate': 2.9561196582376022e-05, 'epoch': 3.61} 36%|███▌ | 3611/10000 [5:40:53<9:46:37, 5.51s/it][2025-06-19 19:10:38,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:10:38,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.78 | bwd_microstep: 3339.10 | bwd_inner_microstep: 3338.10 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.05 [2025-06-19 19:10:38,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.78 | bwd: 3339.12 | bwd_inner: 3338.10 | bwd_allreduce: 0.96 | step: 7.06 36%|███▌ | 3612/10000 [5:40:58<9:46:41, 5.51s/it] {'loss': 0.0241, 'grad_norm': 1.1773020029067993, 'learning_rate': 2.9555506706764242e-05, 'epoch': 3.61} 36%|███▌ | 3612/10000 [5:40:58<9:46:41, 5.51s/it][2025-06-19 19:10:43,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 19:10:43,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.96 | bwd_microstep: 3342.68 | bwd_inner_microstep: 3341.63 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.86 [2025-06-19 19:10:43,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.96 | bwd: 3342.70 | bwd_inner: 3341.63 | bwd_allreduce: 1.02 | step: 7.86 36%|███▌ | 3613/10000 [5:41:04<9:46:43, 5.51s/it] {'loss': 0.0571, 'grad_norm': 1.6412978172302246, 'learning_rate': 2.9549815828824152e-05, 'epoch': 3.61} 36%|███▌ | 3613/10000 [5:41:04<9:46:43, 5.51s/it][2025-06-19 19:10:49,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:10:49,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.25 | bwd_microstep: 3336.97 | bwd_inner_microstep: 3336.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 19:10:49,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.25 | bwd: 3336.98 | bwd_inner: 3336.17 | bwd_allreduce: 0.77 | step: 6.96 36%|███▌ | 3614/10000 [5:41:09<9:46:16, 5.51s/it] {'loss': 0.0997, 'grad_norm': 1.7216359376907349, 'learning_rate': 2.9544123949152707e-05, 'epoch': 3.61} 36%|███▌ | 3614/10000 [5:41:09<9:46:16, 5.51s/it][2025-06-19 19:10:54,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:10:54,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.58 | bwd_microstep: 3321.49 | bwd_inner_microstep: 3320.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 19:10:54,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.58 | bwd: 3321.50 | bwd_inner: 3320.71 | bwd_allreduce: 0.76 | step: 6.72 36%|███▌ | 3615/10000 [5:41:15<9:44:57, 5.50s/it] {'loss': 0.0133, 'grad_norm': 0.7080789804458618, 'learning_rate': 2.9538431068346953e-05, 'epoch': 3.62} 36%|███▌ | 3615/10000 [5:41:15<9:44:57, 5.50s/it][2025-06-19 19:11:00,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:11:00,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.72 | bwd_microstep: 3335.91 | bwd_inner_microstep: 3334.82 | bwd_allreduce_microstep: 1.03 | step_microstep: 8.15 [2025-06-19 19:11:00,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.72 | bwd: 3335.92 | bwd_inner: 3334.82 | bwd_allreduce: 1.05 | step: 8.17 36%|███▌ | 3616/10000 [5:41:20<9:44:40, 5.49s/it] {'loss': 
0.0326, 'grad_norm': 1.0824432373046875, 'learning_rate': 2.9532737187004055e-05, 'epoch': 3.62} 36%|███▌ | 3616/10000 [5:41:20<9:44:40, 5.49s/it][2025-06-19 19:11:05,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:11:05,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.96 | bwd_microstep: 3346.67 | bwd_inner_microstep: 3345.70 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.33 [2025-06-19 19:11:05,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.96 | bwd: 3346.70 | bwd_inner: 3345.70 | bwd_allreduce: 0.95 | step: 7.34 36%|███▌ | 3617/10000 [5:41:26<9:45:12, 5.50s/it] {'loss': 0.048, 'grad_norm': 1.3390867710113525, 'learning_rate': 2.9527042305721264e-05, 'epoch': 3.62} 36%|███▌ | 3617/10000 [5:41:26<9:45:12, 5.50s/it][2025-06-19 19:11:11,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:11:11,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.91 | bwd_microstep: 3337.70 | bwd_inner_microstep: 3336.79 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.75 [2025-06-19 19:11:11,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.91 | bwd: 3337.71 | bwd_inner: 3336.79 | bwd_allreduce: 0.88 | step: 6.75 36%|███▌ | 3618/10000 [5:41:31<9:45:45, 5.51s/it] {'loss': 0.0188, 'grad_norm': 0.39467310905456543, 'learning_rate': 2.952134642509595e-05, 'epoch': 3.62} 36%|███▌ | 3618/10000 [5:41:31<9:45:45, 5.51s/it][2025-06-19 19:11:16,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:11:16,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.56 | bwd_microstep: 3328.24 | bwd_inner_microstep: 3327.24 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.24 
[2025-06-19 19:11:16,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.56 | bwd: 3328.25 | bwd_inner: 3327.24 | bwd_allreduce: 0.97 | step: 7.24 36%|███▌ | 3619/10000 [5:41:37<9:44:34, 5.50s/it] {'loss': 0.1374, 'grad_norm': 3.342348098754883, 'learning_rate': 2.9515649545725594e-05, 'epoch': 3.62} 36%|███▌ | 3619/10000 [5:41:37<9:44:34, 5.50s/it][2025-06-19 19:11:22,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:11:22,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.51 | bwd_microstep: 3388.63 | bwd_inner_microstep: 3387.82 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.80 [2025-06-19 19:11:22,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.51 | bwd: 3388.65 | bwd_inner: 3387.82 | bwd_allreduce: 0.78 | step: 7.82 36%|███▌ | 3620/10000 [5:41:42<9:47:12, 5.52s/it] {'loss': 0.0881, 'grad_norm': 2.1934754848480225, 'learning_rate': 2.9509951668207754e-05, 'epoch': 3.62} 36%|███▌ | 3620/10000 [5:41:42<9:47:12, 5.52s/it][2025-06-19 19:11:27,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:11:27,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.40 | bwd_microstep: 3340.15 | bwd_inner_microstep: 3339.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-19 19:11:27,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.40 | bwd: 3340.16 | bwd_inner: 3339.35 | bwd_allreduce: 0.77 | step: 6.91 36%|███▌ | 3621/10000 [5:41:48<9:46:07, 5.51s/it] {'loss': 0.0445, 'grad_norm': 0.7530363202095032, 'learning_rate': 2.9504252793140127e-05, 'epoch': 3.62} 36%|███▌ | 3621/10000 [5:41:48<9:46:07, 5.51s/it][2025-06-19 19:11:33,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 
2.73 [2025-06-19 19:11:33,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.46 | bwd_microstep: 3332.52 | bwd_inner_microstep: 3331.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 19:11:33,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.46 | bwd: 3332.54 | bwd_inner: 3331.73 | bwd_allreduce: 0.76 | step: 6.71 36%|███▌ | 3622/10000 [5:41:53<9:45:05, 5.50s/it] {'loss': 0.0583, 'grad_norm': 2.135483741760254, 'learning_rate': 2.9498552921120494e-05, 'epoch': 3.62} 36%|███▌ | 3622/10000 [5:41:53<9:45:05, 5.50s/it][2025-06-19 19:11:38,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:11:38,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.55 | bwd_microstep: 3332.42 | bwd_inner_microstep: 3331.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 19:11:38,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.55 | bwd: 3332.44 | bwd_inner: 3331.64 | bwd_allreduce: 0.76 | step: 6.67 36%|███▌ | 3623/10000 [5:41:59<9:44:09, 5.50s/it] {'loss': 0.0221, 'grad_norm': 0.9902410507202148, 'learning_rate': 2.949285205274674e-05, 'epoch': 3.62} 36%|███▌ | 3623/10000 [5:41:59<9:44:09, 5.50s/it][2025-06-19 19:11:44,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:11:44,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.95 | bwd_microstep: 3317.68 | bwd_inner_microstep: 3316.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 19:11:44,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.95 | bwd: 3317.70 | bwd_inner: 3316.89 | bwd_allreduce: 0.76 | step: 6.64 36%|███▌ | 3624/10000 [5:42:04<9:42:59, 5.49s/it] {'loss': 0.0499, 'grad_norm': 1.4208258390426636, 'learning_rate': 
2.948715018861686e-05, 'epoch': 3.62} 36%|███▌ | 3624/10000 [5:42:04<9:42:59, 5.49s/it][2025-06-19 19:11:49,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.88 [2025-06-19 19:11:49,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.94 | bwd_microstep: 3381.21 | bwd_inner_microstep: 3380.42 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.16 [2025-06-19 19:11:49,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.94 | bwd: 3381.22 | bwd_inner: 3380.42 | bwd_allreduce: 0.76 | step: 7.16 36%|███▋ | 3625/10000 [5:42:10<9:44:49, 5.50s/it] {'loss': 0.0967, 'grad_norm': 3.116269588470459, 'learning_rate': 2.9481447329328965e-05, 'epoch': 3.62} 36%|███▋ | 3625/10000 [5:42:10<9:44:49, 5.50s/it][2025-06-19 19:11:55,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 19:11:55,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.95 | bwd_microstep: 3410.65 | bwd_inner_microstep: 3409.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 19:11:55,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.95 | bwd: 3410.66 | bwd_inner: 3409.86 | bwd_allreduce: 0.76 | step: 6.66 36%|███▋ | 3626/10000 [5:42:16<9:47:25, 5.53s/it] {'loss': 0.015, 'grad_norm': 0.5529835224151611, 'learning_rate': 2.947574347548124e-05, 'epoch': 3.63} 36%|███▋ | 3626/10000 [5:42:16<9:47:25, 5.53s/it][2025-06-19 19:12:00,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:12:00,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.73 | bwd_microstep: 3380.73 | bwd_inner_microstep: 3379.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.57 [2025-06-19 19:12:00,757] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2130.73 | bwd: 3380.75 | bwd_inner: 3379.94 | bwd_allreduce: 0.77 | step: 6.58 36%|███▋ | 3627/10000 [5:42:21<9:48:01, 5.54s/it] {'loss': 0.0045, 'grad_norm': 0.1987265944480896, 'learning_rate': 2.9470038627672015e-05, 'epoch': 3.63} 36%|███▋ | 3627/10000 [5:42:21<9:48:01, 5.54s/it][2025-06-19 19:12:06,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:12:06,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.99 | bwd_microstep: 3370.51 | bwd_inner_microstep: 3369.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-19 19:12:06,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.99 | bwd: 3370.52 | bwd_inner: 3369.71 | bwd_allreduce: 0.77 | step: 7.09 36%|███▋ | 3628/10000 [5:42:27<9:47:57, 5.54s/it] {'loss': 0.0132, 'grad_norm': 0.5202216506004333, 'learning_rate': 2.946433278649968e-05, 'epoch': 3.63} 36%|███▋ | 3628/10000 [5:42:27<9:47:57, 5.54s/it][2025-06-19 19:12:11,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.77 [2025-06-19 19:12:11,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.46 | bwd_microstep: 3334.61 | bwd_inner_microstep: 3333.62 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.37 [2025-06-19 19:12:11,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.46 | bwd: 3334.63 | bwd_inner: 3333.62 | bwd_allreduce: 0.95 | step: 7.38 36%|███▋ | 3629/10000 [5:42:32<9:46:13, 5.52s/it] {'loss': 0.0184, 'grad_norm': 0.6740781664848328, 'learning_rate': 2.945862595256277e-05, 'epoch': 3.63} 36%|███▋ | 3629/10000 [5:42:32<9:46:13, 5.52s/it][2025-06-19 19:12:17,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:12:17,246] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2109.03 | bwd_microstep: 3321.07 | bwd_inner_microstep: 3319.95 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.51 [2025-06-19 19:12:17,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.03 | bwd: 3321.09 | bwd_inner: 3319.95 | bwd_allreduce: 1.09 | step: 7.52 36%|███▋ | 3630/10000 [5:42:38<9:44:36, 5.51s/it] {'loss': 0.0157, 'grad_norm': 0.6204429864883423, 'learning_rate': 2.945291812645989e-05, 'epoch': 3.63} 36%|███▋ | 3630/10000 [5:42:38<9:44:36, 5.51s/it][2025-06-19 19:12:22,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:12:22,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.98 | bwd_microstep: 3366.93 | bwd_inner_microstep: 3366.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 19:12:22,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.98 | bwd: 3366.95 | bwd_inner: 3366.13 | bwd_allreduce: 0.77 | step: 6.61 36%|███▋ | 3631/10000 [5:42:43<9:45:44, 5.52s/it] {'loss': 0.0542, 'grad_norm': 1.8276666402816772, 'learning_rate': 2.944720930878977e-05, 'epoch': 3.63} 36%|███▋ | 3631/10000 [5:42:43<9:45:44, 5.52s/it][2025-06-19 19:12:28,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:12:28,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.35 | bwd_microstep: 3327.89 | bwd_inner_microstep: 3327.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 19:12:28,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.35 | bwd: 3327.90 | bwd_inner: 3327.11 | bwd_allreduce: 0.75 | step: 6.55 36%|███▋ | 3632/10000 [5:42:49<9:44:09, 5.50s/it] {'loss': 0.0334, 'grad_norm': 1.3890929222106934, 'learning_rate': 2.944149950015124e-05, 'epoch': 3.63} 36%|███▋ | 3632/10000 [5:42:49<9:44:09, 
5.50s/it][2025-06-19 19:12:33,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:12:33,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.06 | bwd_microstep: 3373.00 | bwd_inner_microstep: 3372.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.61 [2025-06-19 19:12:33,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.06 | bwd: 3373.02 | bwd_inner: 3372.21 | bwd_allreduce: 0.76 | step: 6.62 36%|███▋ | 3633/10000 [5:42:54<9:45:13, 5.51s/it] {'loss': 0.1967, 'grad_norm': 2.1437718868255615, 'learning_rate': 2.9435788701143232e-05, 'epoch': 3.63} 36%|███▋ | 3633/10000 [5:42:54<9:45:13, 5.51s/it][2025-06-19 19:12:39,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:12:39,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.43 | bwd_microstep: 3324.08 | bwd_inner_microstep: 3323.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 19:12:39,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.43 | bwd: 3324.10 | bwd_inner: 3323.30 | bwd_allreduce: 0.76 | step: 6.61 36%|███▋ | 3634/10000 [5:43:00<9:43:33, 5.50s/it] {'loss': 0.168, 'grad_norm': 2.8274035453796387, 'learning_rate': 2.9430076912364787e-05, 'epoch': 3.63} 36%|███▋ | 3634/10000 [5:43:00<9:43:33, 5.50s/it][2025-06-19 19:12:44,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:12:44,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.11 | bwd_microstep: 3375.41 | bwd_inner_microstep: 3374.61 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.61 [2025-06-19 19:12:44,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.11 | bwd: 3375.43 | bwd_inner: 3374.61 | 
bwd_allreduce: 0.77 | step: 6.62 36%|███▋ | 3635/10000 [5:43:05<9:44:44, 5.51s/it] {'loss': 0.0235, 'grad_norm': 1.2040071487426758, 'learning_rate': 2.9424364134415032e-05, 'epoch': 3.63} 36%|███▋ | 3635/10000 [5:43:05<9:44:44, 5.51s/it][2025-06-19 19:12:50,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:12:50,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.46 | bwd_microstep: 3327.83 | bwd_inner_microstep: 3326.94 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.14 [2025-06-19 19:12:50,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.46 | bwd: 3327.85 | bwd_inner: 3326.94 | bwd_allreduce: 0.85 | step: 7.14 36%|███▋ | 3636/10000 [5:43:11<9:43:18, 5.50s/it] {'loss': 0.01, 'grad_norm': 0.30254635214805603, 'learning_rate': 2.9418650367893226e-05, 'epoch': 3.64} 36%|███▋ | 3636/10000 [5:43:11<9:43:18, 5.50s/it][2025-06-19 19:12:55,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:12:55,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.16 | bwd_microstep: 3324.06 | bwd_inner_microstep: 3323.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 19:12:55,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.17 | bwd: 3324.07 | bwd_inner: 3323.27 | bwd_allreduce: 0.76 | step: 6.77 36%|███▋ | 3637/10000 [5:43:16<9:42:45, 5.50s/it] {'loss': 0.0092, 'grad_norm': 0.7676032781600952, 'learning_rate': 2.941293561339871e-05, 'epoch': 3.64} 36%|███▋ | 3637/10000 [5:43:16<9:42:45, 5.50s/it][2025-06-19 19:13:01,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:13:01,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.28 | bwd_microstep: 
3319.74 | bwd_inner_microstep: 3318.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 19:13:01,231] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.28 | bwd: 3319.76 | bwd_inner: 3318.94 | bwd_allreduce: 0.77 | step: 6.97 36%|███▋ | 3638/10000 [5:43:22<9:41:37, 5.49s/it] {'loss': 0.0825, 'grad_norm': 2.3569934368133545, 'learning_rate': 2.9407219871530926e-05, 'epoch': 3.64} 36%|███▋ | 3638/10000 [5:43:22<9:41:37, 5.49s/it][2025-06-19 19:13:06,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:13:06,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.07 | bwd_microstep: 3379.52 | bwd_inner_microstep: 3378.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 19:13:06,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.07 | bwd: 3379.53 | bwd_inner: 3378.73 | bwd_allreduce: 0.77 | step: 6.69 36%|███▋ | 3639/10000 [5:43:27<9:43:30, 5.50s/it] {'loss': 0.0835, 'grad_norm': 1.8244422674179077, 'learning_rate': 2.940150314288945e-05, 'epoch': 3.64} 36%|███▋ | 3639/10000 [5:43:27<9:43:30, 5.50s/it][2025-06-19 19:13:12,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:13:12,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.86 | bwd_microstep: 3324.79 | bwd_inner_microstep: 3323.82 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.17 [2025-06-19 19:13:12,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.86 | bwd: 3324.80 | bwd_inner: 3323.82 | bwd_allreduce: 0.93 | step: 7.18 36%|███▋ | 3640/10000 [5:43:33<9:42:32, 5.50s/it] {'loss': 0.0803, 'grad_norm': 2.6581716537475586, 'learning_rate': 2.9395785428073917e-05, 'epoch': 3.64} 36%|███▋ | 3640/10000 [5:43:33<9:42:32, 5.50s/it][2025-06-19 19:13:17,810] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 19:13:17,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.79 | bwd_microstep: 3387.06 | bwd_inner_microstep: 3386.14 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.49 [2025-06-19 19:13:17,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.79 | bwd: 3387.10 | bwd_inner: 3386.14 | bwd_allreduce: 0.88 | step: 8.49 36%|███▋ | 3641/10000 [5:43:38<9:44:34, 5.52s/it] {'loss': 0.0132, 'grad_norm': 0.6158996820449829, 'learning_rate': 2.9390066727684106e-05, 'epoch': 3.64} 36%|███▋ | 3641/10000 [5:43:38<9:44:34, 5.52s/it][2025-06-19 19:13:23,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:13:23,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.71 | bwd_microstep: 3323.21 | bwd_inner_microstep: 3322.42 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 19:13:23,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.71 | bwd: 3323.23 | bwd_inner: 3322.42 | bwd_allreduce: 0.76 | step: 6.80 36%|███▋ | 3642/10000 [5:43:44<9:44:25, 5.52s/it] {'loss': 0.0784, 'grad_norm': 1.3044682741165161, 'learning_rate': 2.938434704231987e-05, 'epoch': 3.64} 36%|███▋ | 3642/10000 [5:43:44<9:44:25, 5.52s/it][2025-06-19 19:13:28,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:13:28,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2160.50 | bwd_microstep: 3349.31 | bwd_inner_microstep: 3348.42 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.24 [2025-06-19 19:13:28,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2160.50 | bwd: 3349.33 | bwd_inner: 3348.42 | bwd_allreduce: 0.84 | step: 7.25 36%|███▋ | 
3643/10000 [5:43:49<9:45:28, 5.53s/it] {'loss': 0.0271, 'grad_norm': 0.8354964852333069, 'learning_rate': 2.937862637258119e-05, 'epoch': 3.64} 36%|███▋ | 3643/10000 [5:43:49<9:45:28, 5.53s/it][2025-06-19 19:13:34,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.73 [2025-06-19 19:13:34,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.43 | bwd_microstep: 3319.58 | bwd_inner_microstep: 3318.24 | bwd_allreduce_microstep: 1.24 | step_microstep: 8.73 [2025-06-19 19:13:34,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.43 | bwd: 3319.61 | bwd_inner: 3318.24 | bwd_allreduce: 1.28 | step: 8.73 36%|███▋ | 3644/10000 [5:43:55<9:44:02, 5.51s/it] {'loss': 0.0986, 'grad_norm': 1.7232528924942017, 'learning_rate': 2.9372904719068135e-05, 'epoch': 3.64} 36%|███▋ | 3644/10000 [5:43:55<9:44:02, 5.51s/it][2025-06-19 19:13:39,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-19 19:13:39,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2167.03 | bwd_microstep: 3369.52 | bwd_inner_microstep: 3368.16 | bwd_allreduce_microstep: 1.27 | step_microstep: 9.07 [2025-06-19 19:13:39,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2167.03 | bwd: 3369.55 | bwd_inner: 3368.16 | bwd_allreduce: 1.31 | step: 9.09 36%|███▋ | 3645/10000 [5:44:00<9:46:32, 5.54s/it] {'loss': 0.0268, 'grad_norm': 1.5103774070739746, 'learning_rate': 2.9367182082380866e-05, 'epoch': 3.65} 36%|███▋ | 3645/10000 [5:44:00<9:46:32, 5.54s/it][2025-06-19 19:13:45,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 19:13:45,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.44 | bwd_microstep: 3332.02 | bwd_inner_microstep: 3331.18 | 
bwd_allreduce_microstep: 0.79 | step_microstep: 8.32 [2025-06-19 19:13:45,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.44 | bwd: 3332.03 | bwd_inner: 3331.18 | bwd_allreduce: 0.81 | step: 8.32 36%|███▋ | 3646/10000 [5:44:06<9:45:02, 5.52s/it] {'loss': 0.0456, 'grad_norm': 2.5068626403808594, 'learning_rate': 2.936145846311967e-05, 'epoch': 3.65} 36%|███▋ | 3646/10000 [5:44:06<9:45:02, 5.52s/it][2025-06-19 19:13:51,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:13:51,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2165.57 | bwd_microstep: 3372.21 | bwd_inner_microstep: 3371.29 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.98 [2025-06-19 19:13:51,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2165.57 | bwd: 3372.24 | bwd_inner: 3371.29 | bwd_allreduce: 0.87 | step: 7.98 36%|███▋ | 3647/10000 [5:44:11<9:47:02, 5.54s/it] {'loss': 0.0022, 'grad_norm': 0.05537741631269455, 'learning_rate': 2.935573386188494e-05, 'epoch': 3.65} 36%|███▋ | 3647/10000 [5:44:11<9:47:02, 5.54s/it][2025-06-19 19:13:56,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 19:13:56,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.43 | bwd_microstep: 3324.84 | bwd_inner_microstep: 3323.67 | bwd_allreduce_microstep: 1.08 | step_microstep: 8.16 [2025-06-19 19:13:56,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.43 | bwd: 3324.87 | bwd_inner: 3323.67 | bwd_allreduce: 1.12 | step: 8.17 36%|███▋ | 3648/10000 [5:44:17<9:45:39, 5.53s/it] {'loss': 0.0702, 'grad_norm': 2.779683828353882, 'learning_rate': 2.935000827927714e-05, 'epoch': 3.65} 36%|███▋ | 3648/10000 [5:44:17<9:45:39, 5.53s/it][2025-06-19 19:14:02,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:14:02,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2179.12 | bwd_microstep: 3368.55 | bwd_inner_microstep: 3367.66 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.37 [2025-06-19 19:14:02,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2179.12 | bwd: 3368.58 | bwd_inner: 3367.66 | bwd_allreduce: 0.85 | step: 7.38 36%|███▋ | 3649/10000 [5:44:22<9:47:41, 5.55s/it] {'loss': 0.026, 'grad_norm': 1.4311538934707642, 'learning_rate': 2.9344281715896873e-05, 'epoch': 3.65} 36%|███▋ | 3649/10000 [5:44:22<9:47:41, 5.55s/it][2025-06-19 19:14:07,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:14:07,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.06 | bwd_microstep: 3339.12 | bwd_inner_microstep: 3338.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 19:14:07,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.06 | bwd: 3339.13 | bwd_inner: 3338.33 | bwd_allreduce: 0.76 | step: 6.71 36%|███▋ | 3650/10000 [5:44:28<9:45:43, 5.53s/it] {'loss': 0.0238, 'grad_norm': 0.748768150806427, 'learning_rate': 2.9338554172344813e-05, 'epoch': 3.65} 36%|███▋ | 3650/10000 [5:44:28<9:45:43, 5.53s/it][2025-06-19 19:14:13,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:14:13,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2163.70 | bwd_microstep: 3371.05 | bwd_inner_microstep: 3370.22 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.20 [2025-06-19 19:14:13,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2163.70 | bwd: 3371.07 | bwd_inner: 3370.22 | bwd_allreduce: 0.79 | step: 7.20 37%|███▋ | 3651/10000 [5:44:34<9:46:50, 5.55s/it] {'loss': 0.0088, 'grad_norm': 
0.422610878944397, 'learning_rate': 2.9332825649221765e-05, 'epoch': 3.65} 37%|███▋ | 3651/10000 [5:44:34<9:46:50, 5.55s/it][2025-06-19 19:14:18,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:14:18,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.62 | bwd_microstep: 3394.25 | bwd_inner_microstep: 3393.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 19:14:18,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.62 | bwd: 3394.26 | bwd_inner: 3393.46 | bwd_allreduce: 0.76 | step: 6.79 37%|███▋ | 3652/10000 [5:44:39<9:47:32, 5.55s/it] {'loss': 0.01, 'grad_norm': 0.2462759017944336, 'learning_rate': 2.9327096147128632e-05, 'epoch': 3.65} 37%|███▋ | 3652/10000 [5:44:39<9:47:32, 5.55s/it][2025-06-19 19:14:24,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:14:24,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.44 | bwd_microstep: 3360.20 | bwd_inner_microstep: 3359.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.03 [2025-06-19 19:14:24,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.44 | bwd: 3360.21 | bwd_inner: 3359.38 | bwd_allreduce: 0.79 | step: 7.04 37%|███▋ | 3653/10000 [5:44:45<9:46:32, 5.54s/it] {'loss': 0.0106, 'grad_norm': 0.30298903584480286, 'learning_rate': 2.932136566666639e-05, 'epoch': 3.65} 37%|███▋ | 3653/10000 [5:44:45<9:46:32, 5.54s/it][2025-06-19 19:14:29,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:14:29,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.88 | bwd_microstep: 3371.67 | bwd_inner_microstep: 3370.89 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 19:14:29,846] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.88 | bwd: 3371.68 | bwd_inner: 3370.89 | bwd_allreduce: 0.76 | step: 6.59 37%|███▋ | 3654/10000 [5:44:50<9:46:12, 5.54s/it] {'loss': 0.0094, 'grad_norm': 0.2820074260234833, 'learning_rate': 2.931563420843615e-05, 'epoch': 3.65} 37%|███▋ | 3654/10000 [5:44:50<9:46:12, 5.54s/it][2025-06-19 19:14:35,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:14:35,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.97 | bwd_microstep: 3394.21 | bwd_inner_microstep: 3393.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 19:14:35,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.97 | bwd: 3394.22 | bwd_inner: 3393.42 | bwd_allreduce: 0.76 | step: 6.57 37%|███▋ | 3655/10000 [5:44:56<9:46:46, 5.55s/it] {'loss': 0.0954, 'grad_norm': 1.4047908782958984, 'learning_rate': 2.9309901773039125e-05, 'epoch': 3.66} 37%|███▋ | 3655/10000 [5:44:56<9:46:46, 5.55s/it][2025-06-19 19:14:40,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:14:40,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.04 | bwd_microstep: 3366.09 | bwd_inner_microstep: 3365.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 19:14:40,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.04 | bwd: 3366.11 | bwd_inner: 3365.30 | bwd_allreduce: 0.77 | step: 6.93 37%|███▋ | 3656/10000 [5:45:01<9:45:54, 5.54s/it] {'loss': 0.0705, 'grad_norm': 4.978781700134277, 'learning_rate': 2.930416836107661e-05, 'epoch': 3.66} 37%|███▋ | 3656/10000 [5:45:01<9:45:54, 5.54s/it][2025-06-19 19:14:46,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 
19:14:46,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.73 | bwd_microstep: 3363.01 | bwd_inner_microstep: 3362.10 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.45 [2025-06-19 19:14:46,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.73 | bwd: 3363.03 | bwd_inner: 3362.10 | bwd_allreduce: 0.86 | step: 7.45 37%|███▋ | 3657/10000 [5:45:07<9:45:20, 5.54s/it] {'loss': 0.0377, 'grad_norm': 1.4040697813034058, 'learning_rate': 2.9298433973150015e-05, 'epoch': 3.66} 37%|███▋ | 3657/10000 [5:45:07<9:45:20, 5.54s/it][2025-06-19 19:14:52,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:14:52,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.07 | bwd_microstep: 3375.62 | bwd_inner_microstep: 3374.73 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.30 [2025-06-19 19:14:52,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.07 | bwd: 3375.65 | bwd_inner: 3374.73 | bwd_allreduce: 0.84 | step: 7.31 37%|███▋ | 3658/10000 [5:45:12<9:46:34, 5.55s/it] {'loss': 0.0486, 'grad_norm': 1.3249890804290771, 'learning_rate': 2.9292698609860852e-05, 'epoch': 3.66} 37%|███▋ | 3658/10000 [5:45:12<9:46:34, 5.55s/it][2025-06-19 19:14:57,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 19:14:57,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2150.64 | bwd_microstep: 3326.19 | bwd_inner_microstep: 3325.31 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.35 [2025-06-19 19:14:57,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2150.64 | bwd: 3326.22 | bwd_inner: 3325.31 | bwd_allreduce: 0.85 | step: 7.35 37%|███▋ | 3659/10000 [5:45:18<9:45:34, 5.54s/it] {'loss': 0.0212, 'grad_norm': 0.6610473990440369, 'learning_rate': 2.9286962271810734e-05, 'epoch': 
3.66} 37%|███▋ | 3659/10000 [5:45:18<9:45:34, 5.54s/it][2025-06-19 19:15:03,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:15:03,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2166.01 | bwd_microstep: 3371.45 | bwd_inner_microstep: 3370.60 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.16 [2025-06-19 19:15:03,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2166.01 | bwd: 3371.46 | bwd_inner: 3370.60 | bwd_allreduce: 0.80 | step: 7.17 37%|███▋ | 3660/10000 [5:45:23<9:46:45, 5.55s/it] {'loss': 0.0331, 'grad_norm': 1.5944126844406128, 'learning_rate': 2.9281224959601372e-05, 'epoch': 3.66} 37%|███▋ | 3660/10000 [5:45:23<9:46:45, 5.55s/it][2025-06-19 19:15:08,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.75 [2025-06-19 19:15:08,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.46 | bwd_microstep: 3320.00 | bwd_inner_microstep: 3319.11 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.53 [2025-06-19 19:15:08,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.46 | bwd: 3320.03 | bwd_inner: 3319.11 | bwd_allreduce: 0.84 | step: 7.54 37%|███▋ | 3661/10000 [5:45:29<9:45:22, 5.54s/it] {'loss': 0.0364, 'grad_norm': 0.9174565672874451, 'learning_rate': 2.9275486673834587e-05, 'epoch': 3.66} 37%|███▋ | 3661/10000 [5:45:29<9:45:22, 5.54s/it][2025-06-19 19:15:14,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:15:14,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.66 | bwd_microstep: 3315.70 | bwd_inner_microstep: 3314.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 19:15:14,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.66 
| bwd: 3315.71 | bwd_inner: 3314.90 | bwd_allreduce: 0.77 | step: 6.91 37%|███▋ | 3662/10000 [5:45:34<9:43:08, 5.52s/it] {'loss': 0.0805, 'grad_norm': 2.3425300121307373, 'learning_rate': 2.9269747415112308e-05, 'epoch': 3.66} 37%|███▋ | 3662/10000 [5:45:34<9:43:08, 5.52s/it][2025-06-19 19:15:19,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:15:19,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.96 | bwd_microstep: 3361.52 | bwd_inner_microstep: 3360.64 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.26 [2025-06-19 19:15:19,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.96 | bwd: 3361.55 | bwd_inner: 3360.64 | bwd_allreduce: 0.83 | step: 7.26 37%|███▋ | 3663/10000 [5:45:40<9:43:36, 5.53s/it] {'loss': 0.0546, 'grad_norm': 2.2551753520965576, 'learning_rate': 2.9264007184036538e-05, 'epoch': 3.66} 37%|███▋ | 3663/10000 [5:45:40<9:43:36, 5.53s/it][2025-06-19 19:15:25,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:15:25,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.14 | bwd_microstep: 3317.92 | bwd_inner_microstep: 3317.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 19:15:25,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.14 | bwd: 3317.94 | bwd_inner: 3317.13 | bwd_allreduce: 0.76 | step: 6.64 37%|███▋ | 3664/10000 [5:45:45<9:41:20, 5.51s/it] {'loss': 0.0535, 'grad_norm': 2.6926634311676025, 'learning_rate': 2.9258265981209412e-05, 'epoch': 3.66} 37%|███▋ | 3664/10000 [5:45:45<9:41:20, 5.51s/it][2025-06-19 19:15:30,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:15:30,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2109.77 | bwd_microstep: 3318.98 | bwd_inner_microstep: 3318.08 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.57 [2025-06-19 19:15:30,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.77 | bwd: 3319.01 | bwd_inner: 3318.08 | bwd_allreduce: 0.85 | step: 7.57 37%|███▋ | 3665/10000 [5:45:51<9:40:07, 5.49s/it] {'loss': 0.0107, 'grad_norm': 0.43907973170280457, 'learning_rate': 2.9252523807233157e-05, 'epoch': 3.67} 37%|███▋ | 3665/10000 [5:45:51<9:40:07, 5.49s/it][2025-06-19 19:15:36,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:15:36,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.47 | bwd_microstep: 3369.66 | bwd_inner_microstep: 3368.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 19:15:36,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.47 | bwd: 3369.67 | bwd_inner: 3368.86 | bwd_allreduce: 0.77 | step: 6.72 37%|███▋ | 3666/10000 [5:45:56<9:41:39, 5.51s/it] {'loss': 0.0162, 'grad_norm': 0.709622859954834, 'learning_rate': 2.9246780662710095e-05, 'epoch': 3.67} 37%|███▋ | 3666/10000 [5:45:56<9:41:39, 5.51s/it][2025-06-19 19:15:41,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:15:41,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.93 | bwd_microstep: 3320.14 | bwd_inner_microstep: 3319.26 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.12 [2025-06-19 19:15:41,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.93 | bwd: 3320.17 | bwd_inner: 3319.26 | bwd_allreduce: 0.83 | step: 7.11 37%|███▋ | 3667/10000 [5:46:02<9:40:19, 5.50s/it] {'loss': 0.006, 'grad_norm': 0.27113091945648193, 'learning_rate': 2.924103654824266e-05, 'epoch': 3.67} 37%|███▋ | 3667/10000 [5:46:02<9:40:19, 5.50s/it][2025-06-19 
19:15:47,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 19:15:47,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.11 | bwd_microstep: 3319.36 | bwd_inner_microstep: 3318.28 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.68 [2025-06-19 19:15:47,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.11 | bwd: 3319.39 | bwd_inner: 3318.28 | bwd_allreduce: 1.04 | step: 7.69 37%|███▋ | 3668/10000 [5:46:07<9:39:27, 5.49s/it] {'loss': 0.0039, 'grad_norm': 0.11185982823371887, 'learning_rate': 2.9235291464433372e-05, 'epoch': 3.67} 37%|███▋ | 3668/10000 [5:46:07<9:39:27, 5.49s/it][2025-06-19 19:15:52,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:15:52,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.06 | bwd_microstep: 3303.65 | bwd_inner_microstep: 3302.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 19:15:52,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.06 | bwd: 3303.66 | bwd_inner: 3302.83 | bwd_allreduce: 0.78 | step: 7.21 37%|███▋ | 3669/10000 [5:46:13<9:37:55, 5.48s/it] {'loss': 0.1858, 'grad_norm': 3.4747211933135986, 'learning_rate': 2.9229545411884873e-05, 'epoch': 3.67} 37%|███▋ | 3669/10000 [5:46:13<9:37:55, 5.48s/it][2025-06-19 19:15:57,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:15:57,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.01 | bwd_microstep: 3316.47 | bwd_inner_microstep: 3315.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 19:15:57,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.01 | bwd: 3316.49 | bwd_inner: 3315.68 | bwd_allreduce: 0.76 | step: 
6.82 37%|███▋ | 3670/10000 [5:46:18<9:37:17, 5.47s/it] {'loss': 0.072, 'grad_norm': 2.4881038665771484, 'learning_rate': 2.9223798391199904e-05, 'epoch': 3.67} 37%|███▋ | 3670/10000 [5:46:18<9:37:17, 5.47s/it][2025-06-19 19:16:03,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 19:16:03,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.12 | bwd_microstep: 3329.64 | bwd_inner_microstep: 3328.75 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.53 [2025-06-19 19:16:03,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.12 | bwd: 3329.67 | bwd_inner: 3328.75 | bwd_allreduce: 0.85 | step: 7.53 37%|███▋ | 3671/10000 [5:46:24<9:38:13, 5.48s/it] {'loss': 0.0346, 'grad_norm': 1.233384609222412, 'learning_rate': 2.9218050402981285e-05, 'epoch': 3.67} 37%|███▋ | 3671/10000 [5:46:24<9:38:13, 5.48s/it][2025-06-19 19:16:08,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 19:16:08,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.93 | bwd_microstep: 3333.14 | bwd_inner_microstep: 3332.30 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.26 [2025-06-19 19:16:08,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.93 | bwd: 3333.15 | bwd_inner: 3332.30 | bwd_allreduce: 0.81 | step: 7.27 37%|███▋ | 3672/10000 [5:46:29<9:37:59, 5.48s/it] {'loss': 0.0549, 'grad_norm': 2.4870901107788086, 'learning_rate': 2.9212301447831968e-05, 'epoch': 3.67} 37%|███▋ | 3672/10000 [5:46:29<9:37:59, 5.48s/it][2025-06-19 19:16:14,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:16:14,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.93 | bwd_microstep: 3313.61 | bwd_inner_microstep: 
3312.78 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.26 [2025-06-19 19:16:14,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.93 | bwd: 3313.62 | bwd_inner: 3312.78 | bwd_allreduce: 0.80 | step: 7.26 37%|███▋ | 3673/10000 [5:46:35<9:38:26, 5.49s/it] {'loss': 0.0314, 'grad_norm': 1.2603538036346436, 'learning_rate': 2.9206551526354983e-05, 'epoch': 3.67} 37%|███▋ | 3673/10000 [5:46:35<9:38:26, 5.49s/it][2025-06-19 19:16:19,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:16:19,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.38 | bwd_microstep: 3320.01 | bwd_inner_microstep: 3319.12 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.30 [2025-06-19 19:16:19,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.38 | bwd: 3320.04 | bwd_inner: 3319.12 | bwd_allreduce: 0.84 | step: 7.31 37%|███▋ | 3674/10000 [5:46:40<9:37:40, 5.48s/it] {'loss': 0.0493, 'grad_norm': 2.167295217514038, 'learning_rate': 2.9200800639153467e-05, 'epoch': 3.67} 37%|███▋ | 3674/10000 [5:46:40<9:37:40, 5.48s/it][2025-06-19 19:16:25,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:16:25,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.78 | bwd_microstep: 3316.53 | bwd_inner_microstep: 3315.64 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.54 [2025-06-19 19:16:25,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.78 | bwd: 3316.55 | bwd_inner: 3315.64 | bwd_allreduce: 0.84 | step: 7.55 37%|███▋ | 3675/10000 [5:46:46<9:38:07, 5.48s/it] {'loss': 0.2438, 'grad_norm': 4.074530124664307, 'learning_rate': 2.9195048786830672e-05, 'epoch': 3.67} 37%|███▋ | 3675/10000 [5:46:46<9:38:07, 5.48s/it][2025-06-19 19:16:30,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.72 [2025-06-19 19:16:30,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.07 | bwd_microstep: 3322.89 | bwd_inner_microstep: 3321.67 | bwd_allreduce_microstep: 1.13 | step_microstep: 8.40 [2025-06-19 19:16:30,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.07 | bwd: 3322.92 | bwd_inner: 3321.67 | bwd_allreduce: 1.17 | step: 8.40 37%|███▋ | 3676/10000 [5:46:51<9:39:00, 5.49s/it] {'loss': 0.027, 'grad_norm': 1.7095062732696533, 'learning_rate': 2.9189295969989932e-05, 'epoch': 3.68} 37%|███▋ | 3676/10000 [5:46:51<9:39:00, 5.49s/it][2025-06-19 19:16:36,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-19 19:16:36,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.06 | bwd_microstep: 3314.76 | bwd_inner_microstep: 3313.53 | bwd_allreduce_microstep: 1.14 | step_microstep: 8.52 [2025-06-19 19:16:36,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.06 | bwd: 3314.79 | bwd_inner: 3313.53 | bwd_allreduce: 1.18 | step: 8.53 37%|███▋ | 3677/10000 [5:46:57<9:39:42, 5.50s/it] {'loss': 0.0907, 'grad_norm': 3.3747546672821045, 'learning_rate': 2.918354218923469e-05, 'epoch': 3.68} 37%|███▋ | 3677/10000 [5:46:57<9:39:42, 5.50s/it][2025-06-19 19:16:42,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:16:42,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2169.22 | bwd_microstep: 3395.58 | bwd_inner_microstep: 3394.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 19:16:42,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2169.22 | bwd: 3395.60 | bwd_inner: 3394.78 | bwd_allreduce: 0.78 | step: 7.10 37%|███▋ | 3678/10000 [5:47:02<9:43:11, 5.53s/it] {'loss': 0.1483, 
'grad_norm': 4.616624355316162, 'learning_rate': 2.9177787445168494e-05, 'epoch': 3.68} 37%|███▋ | 3678/10000 [5:47:02<9:43:11, 5.53s/it][2025-06-19 19:16:47,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.73 [2025-06-19 19:16:47,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.28 | bwd_microstep: 3362.44 | bwd_inner_microstep: 3361.01 | bwd_allreduce_microstep: 1.31 | step_microstep: 8.59 [2025-06-19 19:16:47,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.28 | bwd: 3362.47 | bwd_inner: 3361.01 | bwd_allreduce: 1.37 | step: 8.59 37%|███▋ | 3679/10000 [5:47:08<9:42:57, 5.53s/it] {'loss': 0.0118, 'grad_norm': 0.776456892490387, 'learning_rate': 2.9172031738394998e-05, 'epoch': 3.68} 37%|███▋ | 3679/10000 [5:47:08<9:42:57, 5.53s/it][2025-06-19 19:16:53,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:16:53,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.70 | bwd_microstep: 3321.35 | bwd_inner_microstep: 3320.46 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.61 [2025-06-19 19:16:53,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.70 | bwd: 3321.38 | bwd_inner: 3320.46 | bwd_allreduce: 0.85 | step: 7.62 37%|███▋ | 3680/10000 [5:47:13<9:42:02, 5.53s/it] {'loss': 0.0801, 'grad_norm': 1.9754778146743774, 'learning_rate': 2.9166275069517936e-05, 'epoch': 3.68} 37%|███▋ | 3680/10000 [5:47:13<9:42:02, 5.53s/it][2025-06-19 19:16:58,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 19:16:58,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2162.26 | bwd_microstep: 3366.67 | bwd_inner_microstep: 3365.70 | bwd_allreduce_microstep: 0.88 | step_microstep: 8.13 [2025-06-19 
19:16:58,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2162.26 | bwd: 3366.70 | bwd_inner: 3365.70 | bwd_allreduce: 0.92 | step: 8.14 37%|███▋ | 3681/10000 [5:47:19<9:43:41, 5.54s/it] {'loss': 0.0205, 'grad_norm': 1.7176361083984375, 'learning_rate': 2.9160517439141168e-05, 'epoch': 3.68} 37%|███▋ | 3681/10000 [5:47:19<9:43:41, 5.54s/it][2025-06-19 19:17:04,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:17:04,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.55 | bwd_microstep: 3320.36 | bwd_inner_microstep: 3319.51 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.06 [2025-06-19 19:17:04,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.55 | bwd: 3320.39 | bwd_inner: 3319.51 | bwd_allreduce: 0.81 | step: 7.06 37%|███▋ | 3682/10000 [5:47:24<9:41:26, 5.52s/it] {'loss': 0.0343, 'grad_norm': 1.3224189281463623, 'learning_rate': 2.9154758847868626e-05, 'epoch': 3.68} 37%|███▋ | 3682/10000 [5:47:24<9:41:26, 5.52s/it][2025-06-19 19:17:09,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:17:09,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.04 | bwd_microstep: 3313.35 | bwd_inner_microstep: 3312.40 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.94 [2025-06-19 19:17:09,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.04 | bwd: 3313.38 | bwd_inner: 3312.40 | bwd_allreduce: 0.89 | step: 7.95 37%|███▋ | 3683/10000 [5:47:30<9:40:16, 5.51s/it] {'loss': 0.0312, 'grad_norm': 1.0864964723587036, 'learning_rate': 2.914899929630438e-05, 'epoch': 3.68} 37%|███▋ | 3683/10000 [5:47:30<9:40:16, 5.51s/it][2025-06-19 19:17:15,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 
[2025-06-19 19:17:15,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.13 | bwd_microstep: 3323.62 | bwd_inner_microstep: 3322.74 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.63 [2025-06-19 19:17:15,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.13 | bwd: 3323.64 | bwd_inner: 3322.74 | bwd_allreduce: 0.84 | step: 7.63 37%|███▋ | 3684/10000 [5:47:35<9:39:46, 5.51s/it] {'loss': 0.1345, 'grad_norm': 3.4952354431152344, 'learning_rate': 2.9143238785052558e-05, 'epoch': 3.68} 37%|███▋ | 3684/10000 [5:47:35<9:39:46, 5.51s/it][2025-06-19 19:17:20,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:17:20,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2169.32 | bwd_microstep: 3365.32 | bwd_inner_microstep: 3364.44 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.30 [2025-06-19 19:17:20,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2169.32 | bwd: 3365.35 | bwd_inner: 3364.44 | bwd_allreduce: 0.83 | step: 7.31 37%|███▋ | 3685/10000 [5:47:41<9:41:58, 5.53s/it] {'loss': 0.0623, 'grad_norm': 1.610215425491333, 'learning_rate': 2.913747731471743e-05, 'epoch': 3.69} 37%|███▋ | 3685/10000 [5:47:41<9:41:58, 5.53s/it][2025-06-19 19:17:26,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:17:26,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.71 | bwd_microstep: 3310.56 | bwd_inner_microstep: 3309.67 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.48 [2025-06-19 19:17:26,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.71 | bwd: 3310.58 | bwd_inner: 3309.67 | bwd_allreduce: 0.84 | step: 7.49 37%|███▋ | 3686/10000 [5:47:47<9:40:54, 5.52s/it] {'loss': 0.0387, 'grad_norm': 1.767356276512146, 'learning_rate': 2.9131714885903336e-05, 
'epoch': 3.69} 37%|███▋ | 3686/10000 [5:47:47<9:40:54, 5.52s/it][2025-06-19 19:17:31,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 19:17:31,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2150.99 | bwd_microstep: 3314.71 | bwd_inner_microstep: 3313.80 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.81 [2025-06-19 19:17:31,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2150.99 | bwd: 3314.74 | bwd_inner: 3313.80 | bwd_allreduce: 0.86 | step: 7.82 37%|███▋ | 3687/10000 [5:47:52<9:40:39, 5.52s/it] {'loss': 0.038, 'grad_norm': 1.9369401931762695, 'learning_rate': 2.9125951499214732e-05, 'epoch': 3.69} 37%|███▋ | 3687/10000 [5:47:52<9:40:39, 5.52s/it][2025-06-19 19:17:37,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:17:37,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.83 | bwd_microstep: 3365.92 | bwd_inner_microstep: 3365.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 19:17:37,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.83 | bwd: 3365.94 | bwd_inner: 3365.13 | bwd_allreduce: 0.77 | step: 6.94 37%|███▋ | 3688/10000 [5:47:58<9:41:05, 5.52s/it] {'loss': 0.059, 'grad_norm': 2.550166606903076, 'learning_rate': 2.912018715525618e-05, 'epoch': 3.69} 37%|███▋ | 3688/10000 [5:47:58<9:41:05, 5.52s/it][2025-06-19 19:17:42,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 19:17:42,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.57 | bwd_microstep: 3362.86 | bwd_inner_microstep: 3362.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 19:17:42,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2126.57 | bwd: 3362.87 | bwd_inner: 3362.08 | bwd_allreduce: 0.75 | step: 6.66 37%|███▋ | 3689/10000 [5:48:03<9:41:07, 5.52s/it] {'loss': 0.0317, 'grad_norm': 1.0287175178527832, 'learning_rate': 2.9114421854632314e-05, 'epoch': 3.69} 37%|███▋ | 3689/10000 [5:48:03<9:41:07, 5.52s/it][2025-06-19 19:17:48,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:17:48,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.43 | bwd_microstep: 3319.32 | bwd_inner_microstep: 3318.42 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.51 [2025-06-19 19:17:48,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.43 | bwd: 3319.34 | bwd_inner: 3318.42 | bwd_allreduce: 0.85 | step: 7.52 37%|███▋ | 3690/10000 [5:48:09<9:39:40, 5.51s/it] {'loss': 0.0924, 'grad_norm': 2.808806896209717, 'learning_rate': 2.91086555979479e-05, 'epoch': 3.69} 37%|███▋ | 3690/10000 [5:48:09<9:39:40, 5.51s/it][2025-06-19 19:17:53,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 19:17:53,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.47 | bwd_microstep: 3318.66 | bwd_inner_microstep: 3317.84 | bwd_allreduce_microstep: 0.77 | step_microstep: 8.14 [2025-06-19 19:17:53,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.47 | bwd: 3318.67 | bwd_inner: 3317.84 | bwd_allreduce: 0.79 | step: 8.16 37%|███▋ | 3691/10000 [5:48:14<9:39:03, 5.51s/it] {'loss': 0.0225, 'grad_norm': 1.1240957975387573, 'learning_rate': 2.910288838580779e-05, 'epoch': 3.69} 37%|███▋ | 3691/10000 [5:48:14<9:39:03, 5.51s/it][2025-06-19 19:17:59,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:17:59,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2109.31 | bwd_microstep: 3318.30 | bwd_inner_microstep: 3317.43 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.02 [2025-06-19 19:17:59,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.31 | bwd: 3318.31 | bwd_inner: 3317.43 | bwd_allreduce: 0.85 | step: 7.03 37%|███▋ | 3692/10000 [5:48:20<9:37:41, 5.49s/it] {'loss': 0.1245, 'grad_norm': 3.4064810276031494, 'learning_rate': 2.909712021881693e-05, 'epoch': 3.69} 37%|███▋ | 3692/10000 [5:48:20<9:37:41, 5.49s/it][2025-06-19 19:18:04,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:18:04,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.41 | bwd_microstep: 3365.94 | bwd_inner_microstep: 3365.09 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.97 [2025-06-19 19:18:04,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.41 | bwd: 3365.96 | bwd_inner: 3365.09 | bwd_allreduce: 0.81 | step: 6.97 37%|███▋ | 3693/10000 [5:48:25<9:39:22, 5.51s/it] {'loss': 0.0648, 'grad_norm': 3.064873456954956, 'learning_rate': 2.9091351097580384e-05, 'epoch': 3.69} 37%|███▋ | 3693/10000 [5:48:25<9:39:22, 5.51s/it][2025-06-19 19:18:10,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.87 [2025-06-19 19:18:10,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.24 | bwd_microstep: 3310.29 | bwd_inner_microstep: 3309.28 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.04 [2025-06-19 19:18:10,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.24 | bwd: 3310.31 | bwd_inner: 3309.28 | bwd_allreduce: 0.98 | step: 7.04 37%|███▋ | 3694/10000 [5:48:31<9:38:04, 5.50s/it] {'loss': 0.0676, 'grad_norm': 2.0670297145843506, 'learning_rate': 2.9085581022703305e-05, 'epoch': 3.69} 37%|███▋ | 3694/10000 [5:48:31<9:38:04, 5.50s/it][2025-06-19 
19:18:15,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:18:15,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.44 | bwd_microstep: 3325.37 | bwd_inner_microstep: 3324.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 19:18:15,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.44 | bwd: 3325.38 | bwd_inner: 3324.58 | bwd_allreduce: 0.76 | step: 6.66 37%|███▋ | 3695/10000 [5:48:36<9:37:18, 5.49s/it] {'loss': 0.0996, 'grad_norm': 2.0831472873687744, 'learning_rate': 2.9079809994790937e-05, 'epoch': 3.69} 37%|███▋ | 3695/10000 [5:48:36<9:37:18, 5.49s/it][2025-06-19 19:18:21,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 19:18:21,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.73 | bwd_microstep: 3321.85 | bwd_inner_microstep: 3320.67 | bwd_allreduce_microstep: 1.10 | step_microstep: 8.28 [2025-06-19 19:18:21,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.73 | bwd: 3321.87 | bwd_inner: 3320.67 | bwd_allreduce: 1.14 | step: 8.29 37%|███▋ | 3696/10000 [5:48:42<9:36:33, 5.49s/it] {'loss': 0.0291, 'grad_norm': 1.0941524505615234, 'learning_rate': 2.9074038014448648e-05, 'epoch': 3.7} 37%|███▋ | 3696/10000 [5:48:42<9:36:33, 5.49s/it][2025-06-19 19:18:26,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:18:26,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.54 | bwd_microstep: 3316.66 | bwd_inner_microstep: 3315.68 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.12 [2025-06-19 19:18:26,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.54 | bwd: 3316.67 | bwd_inner: 3315.68 | bwd_allreduce: 0.94 | step: 
7.12 37%|███▋ | 3697/10000 [5:48:47<9:35:53, 5.48s/it] {'loss': 0.0317, 'grad_norm': 1.1043137311935425, 'learning_rate': 2.906826508228188e-05, 'epoch': 3.7} 37%|███▋ | 3697/10000 [5:48:47<9:35:53, 5.48s/it][2025-06-19 19:18:32,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:18:32,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.64 | bwd_microstep: 3316.17 | bwd_inner_microstep: 3315.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 19:18:32,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.64 | bwd: 3316.19 | bwd_inner: 3315.38 | bwd_allreduce: 0.76 | step: 6.85 37%|███▋ | 3698/10000 [5:48:52<9:35:12, 5.48s/it] {'loss': 0.0111, 'grad_norm': 0.7249164581298828, 'learning_rate': 2.9062491198896198e-05, 'epoch': 3.7} 37%|███▋ | 3698/10000 [5:48:52<9:35:12, 5.48s/it][2025-06-19 19:18:38,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 417.39 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:18:38,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.94 | bwd_microstep: 3433.42 | bwd_inner_microstep: 3432.55 | bwd_allreduce_microstep: 0.82 | step_microstep: 425.24 [2025-06-19 19:18:38,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.94 | bwd: 3433.43 | bwd_inner: 3432.55 | bwd_allreduce: 0.84 | step: 425.25 37%|███▋ | 3699/10000 [5:48:58<9:52:04, 5.64s/it] {'loss': 0.0089, 'grad_norm': 0.5545579791069031, 'learning_rate': 2.9056716364897243e-05, 'epoch': 3.7} 37%|███▋ | 3699/10000 [5:48:58<9:52:04, 5.64s/it][2025-06-19 19:18:43,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:18:43,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.54 | bwd_microstep: 3327.12 | bwd_inner_microstep: 
3326.23 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.70 [2025-06-19 19:18:43,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.54 | bwd: 3327.14 | bwd_inner: 3326.23 | bwd_allreduce: 0.84 | step: 7.70 37%|███▋ | 3700/10000 [5:49:04<9:48:06, 5.60s/it] {'loss': 0.0354, 'grad_norm': 1.2307636737823486, 'learning_rate': 2.9050940580890783e-05, 'epoch': 3.7} 37%|███▋ | 3700/10000 [5:49:04<9:48:06, 5.60s/it][2025-06-19 19:18:49,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 19:18:49,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2183.06 | bwd_microstep: 3373.48 | bwd_inner_microstep: 3372.42 | bwd_allreduce_microstep: 0.98 | step_microstep: 8.04 [2025-06-19 19:18:49,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2183.06 | bwd: 3373.51 | bwd_inner: 3372.42 | bwd_allreduce: 1.01 | step: 8.05 37%|███▋ | 3701/10000 [5:49:10<9:48:11, 5.60s/it] {'loss': 0.0458, 'grad_norm': 1.583581566810608, 'learning_rate': 2.9045163847482657e-05, 'epoch': 3.7} 37%|███▋ | 3701/10000 [5:49:10<9:48:11, 5.60s/it][2025-06-19 19:18:54,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:18:54,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.62 | bwd_microstep: 3326.80 | bwd_inner_microstep: 3325.59 | bwd_allreduce_microstep: 1.14 | step_microstep: 8.10 [2025-06-19 19:18:54,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.62 | bwd: 3326.83 | bwd_inner: 3325.59 | bwd_allreduce: 1.17 | step: 8.13 37%|███▋ | 3702/10000 [5:49:15<9:45:06, 5.57s/it] {'loss': 0.0896, 'grad_norm': 1.4352108240127563, 'learning_rate': 2.9039386165278825e-05, 'epoch': 3.7} 37%|███▋ | 3702/10000 [5:49:15<9:45:06, 5.57s/it][2025-06-19 19:19:00,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 19:19:00,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.62 | bwd_microstep: 3321.67 | bwd_inner_microstep: 3320.27 | bwd_allreduce_microstep: 1.29 | step_microstep: 8.86 [2025-06-19 19:19:00,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.62 | bwd: 3321.71 | bwd_inner: 3320.27 | bwd_allreduce: 1.34 | step: 8.87 37%|███▋ | 3703/10000 [5:49:21<9:43:10, 5.56s/it] {'loss': 0.0186, 'grad_norm': 1.0598020553588867, 'learning_rate': 2.9033607534885333e-05, 'epoch': 3.7} 37%|███▋ | 3703/10000 [5:49:21<9:43:10, 5.56s/it][2025-06-19 19:19:05,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:19:05,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.41 | bwd_microstep: 3367.28 | bwd_inner_microstep: 3366.44 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.90 [2025-06-19 19:19:05,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.41 | bwd: 3367.30 | bwd_inner: 3366.44 | bwd_allreduce: 0.80 | step: 6.90 37%|███▋ | 3704/10000 [5:49:26<9:43:14, 5.56s/it] {'loss': 0.0601, 'grad_norm': 2.1546859741210938, 'learning_rate': 2.9027827956908337e-05, 'epoch': 3.7} 37%|███▋ | 3704/10000 [5:49:26<9:43:14, 5.56s/it][2025-06-19 19:19:11,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:19:11,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.57 | bwd_microstep: 3322.35 | bwd_inner_microstep: 3321.46 | bwd_allreduce_microstep: 0.82 | step_microstep: 8.02 [2025-06-19 19:19:11,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.57 | bwd: 3322.38 | bwd_inner: 3321.46 | bwd_allreduce: 0.85 | step: 8.02 37%|███▋ | 3705/10000 [5:49:32<9:40:59, 5.54s/it] {'loss': 0.1799, 
'grad_norm': 1.8225451707839966, 'learning_rate': 2.9022047431954097e-05, 'epoch': 3.71} 37%|███▋ | 3705/10000 [5:49:32<9:40:59, 5.54s/it][2025-06-19 19:19:16,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:19:16,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.96 | bwd_microstep: 3312.81 | bwd_inner_microstep: 3311.94 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.47 [2025-06-19 19:19:16,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.96 | bwd: 3312.84 | bwd_inner: 3311.94 | bwd_allreduce: 0.83 | step: 7.47 37%|███▋ | 3706/10000 [5:49:37<9:39:02, 5.52s/it] {'loss': 0.0629, 'grad_norm': 1.9572794437408447, 'learning_rate': 2.9016265960628945e-05, 'epoch': 3.71} 37%|███▋ | 3706/10000 [5:49:37<9:39:02, 5.52s/it][2025-06-19 19:19:22,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:19:22,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.59 | bwd_microstep: 3330.38 | bwd_inner_microstep: 3329.27 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.29 [2025-06-19 19:19:22,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.59 | bwd: 3330.41 | bwd_inner: 3329.27 | bwd_allreduce: 1.07 | step: 7.29 37%|███▋ | 3707/10000 [5:49:43<9:38:00, 5.51s/it] {'loss': 0.0697, 'grad_norm': 2.0061590671539307, 'learning_rate': 2.9010483543539344e-05, 'epoch': 3.71} 37%|███▋ | 3707/10000 [5:49:43<9:38:00, 5.51s/it][2025-06-19 19:19:27,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:19:27,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.47 | bwd_microstep: 3321.75 | bwd_inner_microstep: 3320.85 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.91 
[2025-06-19 19:19:27,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.47 | bwd: 3321.76 | bwd_inner: 3320.85 | bwd_allreduce: 0.87 | step: 6.91 37%|███▋ | 3708/10000 [5:49:48<9:37:02, 5.50s/it] {'loss': 0.1496, 'grad_norm': 2.62282657623291, 'learning_rate': 2.9004700181291838e-05, 'epoch': 3.71} 37%|███▋ | 3708/10000 [5:49:48<9:37:02, 5.50s/it][2025-06-19 19:19:33,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:19:33,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.13 | bwd_microstep: 3319.56 | bwd_inner_microstep: 3318.57 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.07 [2025-06-19 19:19:33,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.13 | bwd: 3319.58 | bwd_inner: 3318.57 | bwd_allreduce: 0.96 | step: 7.07 37%|███▋ | 3709/10000 [5:49:54<9:36:03, 5.49s/it] {'loss': 0.042, 'grad_norm': 1.1575266122817993, 'learning_rate': 2.8998915874493073e-05, 'epoch': 3.71} 37%|███▋ | 3709/10000 [5:49:54<9:36:03, 5.49s/it][2025-06-19 19:19:38,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:19:38,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.12 | bwd_microstep: 3318.79 | bwd_inner_microstep: 3317.94 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.73 [2025-06-19 19:19:38,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.12 | bwd: 3318.81 | bwd_inner: 3317.95 | bwd_allreduce: 0.81 | step: 6.72 37%|███▋ | 3710/10000 [5:49:59<9:35:18, 5.49s/it] {'loss': 0.0252, 'grad_norm': 0.976944088935852, 'learning_rate': 2.8993130623749805e-05, 'epoch': 3.71} 37%|███▋ | 3710/10000 [5:49:59<9:35:18, 5.49s/it][2025-06-19 19:19:44,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 
2.72 [2025-06-19 19:19:44,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.06 | bwd_microstep: 3372.09 | bwd_inner_microstep: 3371.16 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.02 [2025-06-19 19:19:44,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.06 | bwd: 3372.11 | bwd_inner: 3371.16 | bwd_allreduce: 0.91 | step: 7.02 37%|███▋ | 3711/10000 [5:50:05<9:37:12, 5.51s/it] {'loss': 0.0216, 'grad_norm': 0.7706316709518433, 'learning_rate': 2.8987344429668863e-05, 'epoch': 3.71} 37%|███▋ | 3711/10000 [5:50:05<9:37:12, 5.51s/it][2025-06-19 19:19:49,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 19:19:49,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.02 | bwd_microstep: 3315.06 | bwd_inner_microstep: 3313.79 | bwd_allreduce_microstep: 1.19 | step_microstep: 7.65 [2025-06-19 19:19:49,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.02 | bwd: 3315.08 | bwd_inner: 3313.79 | bwd_allreduce: 1.22 | step: 7.66 37%|███▋ | 3712/10000 [5:50:10<9:36:58, 5.51s/it] {'loss': 0.0273, 'grad_norm': 1.3092807531356812, 'learning_rate': 2.8981557292857203e-05, 'epoch': 3.71} 37%|███▋ | 3712/10000 [5:50:10<9:36:58, 5.51s/it][2025-06-19 19:19:55,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 19:19:55,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.88 | bwd_microstep: 3330.00 | bwd_inner_microstep: 3328.91 | bwd_allreduce_microstep: 1.02 | step_microstep: 8.36 [2025-06-19 19:19:55,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.88 | bwd: 3330.02 | bwd_inner: 3328.91 | bwd_allreduce: 1.06 | step: 8.37 37%|███▋ | 3713/10000 [5:50:16<9:36:55, 5.51s/it] {'loss': 0.02, 'grad_norm': 0.5415335297584534, 'learning_rate': 
2.8975769213921867e-05, 'epoch': 3.71} 37%|███▋ | 3713/10000 [5:50:16<9:36:55, 5.51s/it][2025-06-19 19:20:00,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:20:00,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.08 | bwd_microstep: 3321.80 | bwd_inner_microstep: 3321.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 19:20:00,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.08 | bwd: 3321.82 | bwd_inner: 3321.01 | bwd_allreduce: 0.77 | step: 6.78 37%|███▋ | 3714/10000 [5:50:21<9:36:07, 5.50s/it] {'loss': 0.0488, 'grad_norm': 0.9957298040390015, 'learning_rate': 2.8969980193470003e-05, 'epoch': 3.71} 37%|███▋ | 3714/10000 [5:50:21<9:36:07, 5.50s/it][2025-06-19 19:20:06,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:20:06,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.69 | bwd_microstep: 3329.95 | bwd_inner_microstep: 3329.00 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.10 [2025-06-19 19:20:06,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.70 | bwd: 3329.97 | bwd_inner: 3329.00 | bwd_allreduce: 0.92 | step: 7.10 37%|███▋ | 3715/10000 [5:50:27<9:36:11, 5.50s/it] {'loss': 0.1154, 'grad_norm': 2.13502836227417, 'learning_rate': 2.8964190232108843e-05, 'epoch': 3.71} 37%|███▋ | 3715/10000 [5:50:27<9:36:11, 5.50s/it][2025-06-19 19:20:11,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:20:11,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.39 | bwd_microstep: 3325.00 | bwd_inner_microstep: 3324.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-19 19:20:11,806] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2121.39 | bwd: 3325.02 | bwd_inner: 3324.20 | bwd_allreduce: 0.78 | step: 6.89 37%|███▋ | 3716/10000 [5:50:32<9:35:39, 5.50s/it] {'loss': 0.0057, 'grad_norm': 0.11184249818325043, 'learning_rate': 2.8958399330445738e-05, 'epoch': 3.72} 37%|███▋ | 3716/10000 [5:50:32<9:35:39, 5.50s/it][2025-06-19 19:20:17,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 19:20:17,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.00 | bwd_microstep: 3326.89 | bwd_inner_microstep: 3325.70 | bwd_allreduce_microstep: 1.13 | step_microstep: 8.37 [2025-06-19 19:20:17,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.00 | bwd: 3326.91 | bwd_inner: 3325.70 | bwd_allreduce: 1.15 | step: 8.37 37%|███▋ | 3717/10000 [5:50:38<9:36:05, 5.50s/it] {'loss': 0.0777, 'grad_norm': 2.05562424659729, 'learning_rate': 2.895260748908812e-05, 'epoch': 3.72} 37%|███▋ | 3717/10000 [5:50:38<9:36:05, 5.50s/it][2025-06-19 19:20:22,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:20:22,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.86 | bwd_microstep: 3323.13 | bwd_inner_microstep: 3322.18 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.08 [2025-06-19 19:20:22,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.86 | bwd: 3323.15 | bwd_inner: 3322.18 | bwd_allreduce: 0.92 | step: 7.09 37%|███▋ | 3718/10000 [5:50:43<9:35:33, 5.50s/it] {'loss': 0.009, 'grad_norm': 0.3015792667865753, 'learning_rate': 2.8946814708643526e-05, 'epoch': 3.72} 37%|███▋ | 3718/10000 [5:50:43<9:35:33, 5.50s/it][2025-06-19 19:20:28,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 19:20:28,359] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.96 | bwd_microstep: 3378.83 | bwd_inner_microstep: 3377.88 | bwd_allreduce_microstep: 0.86 | step_microstep: 8.04 [2025-06-19 19:20:28,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.96 | bwd: 3378.87 | bwd_inner: 3377.88 | bwd_allreduce: 0.90 | step: 8.05 37%|███▋ | 3719/10000 [5:50:49<9:37:18, 5.51s/it] {'loss': 0.0557, 'grad_norm': 1.8602267503738403, 'learning_rate': 2.8941020989719592e-05, 'epoch': 3.72} 37%|███▋ | 3719/10000 [5:50:49<9:37:18, 5.51s/it][2025-06-19 19:20:33,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:20:33,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.22 | bwd_microstep: 3333.21 | bwd_inner_microstep: 3332.26 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.40 [2025-06-19 19:20:33,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.22 | bwd: 3333.23 | bwd_inner: 3332.26 | bwd_allreduce: 0.92 | step: 7.41 37%|███▋ | 3720/10000 [5:50:54<9:36:48, 5.51s/it] {'loss': 0.0122, 'grad_norm': 0.6237514019012451, 'learning_rate': 2.893522633292405e-05, 'epoch': 3.72} 37%|███▋ | 3720/10000 [5:50:54<9:36:48, 5.51s/it][2025-06-19 19:20:39,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 19:20:39,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.27 | bwd_microstep: 3374.52 | bwd_inner_microstep: 3373.50 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.53 [2025-06-19 19:20:39,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.27 | bwd: 3374.54 | bwd_inner: 3373.50 | bwd_allreduce: 0.98 | step: 7.53 37%|███▋ | 3721/10000 [5:51:00<9:38:25, 5.53s/it] {'loss': 0.0475, 'grad_norm': 1.3835161924362183, 'learning_rate': 2.8929430738864736e-05, 'epoch': 3.72} 37%|███▋ | 
3721/10000 [5:51:00<9:38:25, 5.53s/it][2025-06-19 19:20:45,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 19:20:45,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.53 | bwd_microstep: 3382.03 | bwd_inner_microstep: 3380.86 | bwd_allreduce_microstep: 1.10 | step_microstep: 8.17 [2025-06-19 19:20:45,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.53 | bwd: 3382.05 | bwd_inner: 3380.86 | bwd_allreduce: 1.13 | step: 8.18 37%|███▋ | 3722/10000 [5:51:05<9:39:56, 5.54s/it] {'loss': 0.0966, 'grad_norm': 2.3180758953094482, 'learning_rate': 2.8923634208149577e-05, 'epoch': 3.72} 37%|███▋ | 3722/10000 [5:51:05<9:39:56, 5.54s/it][2025-06-19 19:20:50,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:20:50,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.90 | bwd_microstep: 3323.19 | bwd_inner_microstep: 3322.37 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.79 [2025-06-19 19:20:50,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.90 | bwd: 3323.21 | bwd_inner: 3322.37 | bwd_allreduce: 0.79 | step: 6.80 37%|███▋ | 3723/10000 [5:51:11<9:38:15, 5.53s/it] {'loss': 0.151, 'grad_norm': 2.2158806324005127, 'learning_rate': 2.8917836741386612e-05, 'epoch': 3.72} 37%|███▋ | 3723/10000 [5:51:11<9:38:15, 5.53s/it][2025-06-19 19:20:56,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:20:56,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.09 | bwd_microstep: 3373.45 | bwd_inner_microstep: 3372.59 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.91 [2025-06-19 19:20:56,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.09 | bwd: 3373.47 | 
bwd_inner: 3372.59 | bwd_allreduce: 0.83 | step: 6.92 37%|███▋ | 3724/10000 [5:51:16<9:38:38, 5.53s/it] {'loss': 0.0766, 'grad_norm': 2.017364501953125, 'learning_rate': 2.891203833918396e-05, 'epoch': 3.72} 37%|███▋ | 3724/10000 [5:51:16<9:38:38, 5.53s/it][2025-06-19 19:21:01,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 19:21:01,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.82 | bwd_microstep: 3371.17 | bwd_inner_microstep: 3370.04 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.53 [2025-06-19 19:21:01,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.82 | bwd: 3371.19 | bwd_inner: 3370.04 | bwd_allreduce: 1.10 | step: 7.54 37%|███▋ | 3725/10000 [5:51:22<9:39:26, 5.54s/it] {'loss': 0.0084, 'grad_norm': 0.4864974021911621, 'learning_rate': 2.890623900214985e-05, 'epoch': 3.73} 37%|███▋ | 3725/10000 [5:51:22<9:39:26, 5.54s/it][2025-06-19 19:21:07,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:21:07,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.57 | bwd_microstep: 3316.13 | bwd_inner_microstep: 3315.22 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.92 [2025-06-19 19:21:07,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.57 | bwd: 3316.14 | bwd_inner: 3315.22 | bwd_allreduce: 0.88 | step: 6.92 37%|███▋ | 3726/10000 [5:51:27<9:37:13, 5.52s/it] {'loss': 0.0403, 'grad_norm': 1.4996213912963867, 'learning_rate': 2.89004387308926e-05, 'epoch': 3.73} 37%|███▋ | 3726/10000 [5:51:27<9:37:13, 5.52s/it][2025-06-19 19:21:12,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:21:12,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.25 | 
bwd_microstep: 3326.12 | bwd_inner_microstep: 3325.22 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.57 [2025-06-19 19:21:12,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.25 | bwd: 3326.14 | bwd_inner: 3325.22 | bwd_allreduce: 0.87 | step: 7.57 37%|███▋ | 3727/10000 [5:51:33<9:36:00, 5.51s/it] {'loss': 0.0311, 'grad_norm': 1.5970462560653687, 'learning_rate': 2.8894637526020637e-05, 'epoch': 3.73} 37%|███▋ | 3727/10000 [5:51:33<9:36:00, 5.51s/it][2025-06-19 19:21:18,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:21:18,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.05 | bwd_microstep: 3329.22 | bwd_inner_microstep: 3328.44 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 19:21:18,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.05 | bwd: 3329.24 | bwd_inner: 3328.44 | bwd_allreduce: 0.76 | step: 6.65 37%|███▋ | 3728/10000 [5:51:38<9:35:16, 5.50s/it] {'loss': 0.0236, 'grad_norm': 0.718632161617279, 'learning_rate': 2.8888835388142483e-05, 'epoch': 3.73} 37%|███▋ | 3728/10000 [5:51:38<9:35:16, 5.50s/it][2025-06-19 19:21:23,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 19:21:23,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.39 | bwd_microstep: 3341.82 | bwd_inner_microstep: 3340.74 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.68 [2025-06-19 19:21:23,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.39 | bwd: 3341.84 | bwd_inner: 3340.74 | bwd_allreduce: 1.05 | step: 7.69 37%|███▋ | 3729/10000 [5:51:44<9:35:09, 5.50s/it] {'loss': 0.1854, 'grad_norm': 2.3036088943481445, 'learning_rate': 2.8883032317866747e-05, 'epoch': 3.73} 37%|███▋ | 3729/10000 [5:51:44<9:35:09, 5.50s/it][2025-06-19 19:21:29,046] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:21:29,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.06 | bwd_microstep: 3333.66 | bwd_inner_microstep: 3332.77 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.25 [2025-06-19 19:21:29,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.07 | bwd: 3333.68 | bwd_inner: 3332.77 | bwd_allreduce: 0.86 | step: 7.26 37%|███▋ | 3730/10000 [5:51:49<9:34:48, 5.50s/it] {'loss': 0.0806, 'grad_norm': 4.782853603363037, 'learning_rate': 2.8877228315802145e-05, 'epoch': 3.73} 37%|███▋ | 3730/10000 [5:51:49<9:34:48, 5.50s/it][2025-06-19 19:21:34,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:21:34,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.35 | bwd_microstep: 3334.86 | bwd_inner_microstep: 3333.88 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.24 [2025-06-19 19:21:34,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.35 | bwd: 3334.88 | bwd_inner: 3333.88 | bwd_allreduce: 0.95 | step: 7.25 37%|███▋ | 3731/10000 [5:51:55<9:34:30, 5.50s/it] {'loss': 0.0302, 'grad_norm': 1.3559273481369019, 'learning_rate': 2.8871423382557493e-05, 'epoch': 3.73} 37%|███▋ | 3731/10000 [5:51:55<9:34:30, 5.50s/it][2025-06-19 19:21:40,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:21:40,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.06 | bwd_microstep: 3326.98 | bwd_inner_microstep: 3326.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 19:21:40,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.06 | bwd: 3326.99 | bwd_inner: 3326.19 | bwd_allreduce: 0.76 | step: 6.69 37%|███▋ | 
3732/10000 [5:52:00<9:34:04, 5.50s/it] {'loss': 0.0557, 'grad_norm': 1.8346277475357056, 'learning_rate': 2.8865617518741707e-05, 'epoch': 3.73} 37%|███▋ | 3732/10000 [5:52:00<9:34:04, 5.50s/it][2025-06-19 19:21:45,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:21:45,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.35 | bwd_microstep: 3378.53 | bwd_inner_microstep: 3377.38 | bwd_allreduce_microstep: 1.07 | step_microstep: 8.06 [2025-06-19 19:21:45,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.35 | bwd: 3378.56 | bwd_inner: 3377.38 | bwd_allreduce: 1.10 | step: 8.06 37%|███▋ | 3733/10000 [5:52:06<9:35:58, 5.51s/it] {'loss': 0.0322, 'grad_norm': 1.2868931293487549, 'learning_rate': 2.8859810724963792e-05, 'epoch': 3.73} 37%|███▋ | 3733/10000 [5:52:06<9:35:58, 5.51s/it][2025-06-19 19:21:51,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:21:51,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.35 | bwd_microstep: 3372.15 | bwd_inner_microstep: 3371.05 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.34 [2025-06-19 19:21:51,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.35 | bwd: 3372.17 | bwd_inner: 3371.05 | bwd_allreduce: 1.06 | step: 7.34 37%|███▋ | 3734/10000 [5:52:11<9:37:16, 5.53s/it] {'loss': 0.2961, 'grad_norm': 4.386279106140137, 'learning_rate': 2.8854003001832844e-05, 'epoch': 3.73} 37%|███▋ | 3734/10000 [5:52:11<9:37:16, 5.53s/it][2025-06-19 19:21:56,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:21:56,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.58 | bwd_microstep: 3326.20 | bwd_inner_microstep: 3325.37 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 7.36 [2025-06-19 19:21:56,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.58 | bwd: 3326.22 | bwd_inner: 3325.37 | bwd_allreduce: 0.80 | step: 7.37 37%|███▋ | 3735/10000 [5:52:17<9:35:44, 5.51s/it] {'loss': 0.0444, 'grad_norm': 1.2301915884017944, 'learning_rate': 2.8848194349958076e-05, 'epoch': 3.73} 37%|███▋ | 3735/10000 [5:52:17<9:35:44, 5.51s/it][2025-06-19 19:22:02,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:22:02,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.80 | bwd_microstep: 3327.75 | bwd_inner_microstep: 3326.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 19:22:02,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.80 | bwd: 3327.76 | bwd_inner: 3326.95 | bwd_allreduce: 0.77 | step: 6.89 37%|███▋ | 3736/10000 [5:52:22<9:34:32, 5.50s/it] {'loss': 0.0146, 'grad_norm': 0.3992200493812561, 'learning_rate': 2.884238476994879e-05, 'epoch': 3.74} 37%|███▋ | 3736/10000 [5:52:22<9:34:32, 5.50s/it][2025-06-19 19:22:07,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:22:07,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.55 | bwd_microstep: 3321.96 | bwd_inner_microstep: 3321.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 19:22:07,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.55 | bwd: 3321.98 | bwd_inner: 3321.16 | bwd_allreduce: 0.78 | step: 7.14 37%|███▋ | 3737/10000 [5:52:28<9:33:29, 5.49s/it] {'loss': 0.0589, 'grad_norm': 1.4585034847259521, 'learning_rate': 2.8836574262414378e-05, 'epoch': 3.74} 37%|███▋ | 3737/10000 [5:52:28<9:33:29, 5.49s/it][2025-06-19 19:22:13,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:22:13,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.27 | bwd_microstep: 3324.27 | bwd_inner_microstep: 3323.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 19:22:13,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.27 | bwd: 3324.28 | bwd_inner: 3323.49 | bwd_allreduce: 0.75 | step: 6.67 37%|███▋ | 3738/10000 [5:52:33<9:32:36, 5.49s/it] {'loss': 0.0221, 'grad_norm': 0.6676081418991089, 'learning_rate': 2.8830762827964336e-05, 'epoch': 3.74} 37%|███▋ | 3738/10000 [5:52:33<9:32:36, 5.49s/it][2025-06-19 19:22:18,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:22:18,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.82 | bwd_microstep: 3330.99 | bwd_inner_microstep: 3330.18 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.25 [2025-06-19 19:22:18,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.82 | bwd: 3331.01 | bwd_inner: 3330.18 | bwd_allreduce: 0.78 | step: 7.26 37%|███▋ | 3739/10000 [5:52:39<9:32:24, 5.49s/it] {'loss': 0.1354, 'grad_norm': 2.6652534008026123, 'learning_rate': 2.882495046720826e-05, 'epoch': 3.74} 37%|███▋ | 3739/10000 [5:52:39<9:32:24, 5.49s/it][2025-06-19 19:22:24,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 19:22:24,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.93 | bwd_microstep: 3330.84 | bwd_inner_microstep: 3330.05 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 19:22:24,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.93 | bwd: 3330.85 | bwd_inner: 3330.05 | bwd_allreduce: 0.76 | step: 6.72 37%|███▋ | 3740/10000 [5:52:44<9:32:02, 5.48s/it] {'loss': 
0.1684, 'grad_norm': 2.6162126064300537, 'learning_rate': 2.8819137180755836e-05, 'epoch': 3.74} 37%|███▋ | 3740/10000 [5:52:44<9:32:02, 5.48s/it][2025-06-19 19:22:29,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:22:29,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.23 | bwd_microstep: 3315.89 | bwd_inner_microstep: 3315.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 19:22:29,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.23 | bwd: 3315.90 | bwd_inner: 3315.10 | bwd_allreduce: 0.76 | step: 6.62 37%|███▋ | 3741/10000 [5:52:50<9:31:20, 5.48s/it] {'loss': 0.0268, 'grad_norm': 1.1403733491897583, 'learning_rate': 2.8813322969216858e-05, 'epoch': 3.74} 37%|███▋ | 3741/10000 [5:52:50<9:31:20, 5.48s/it][2025-06-19 19:22:34,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:22:34,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.87 | bwd_microstep: 3327.64 | bwd_inner_microstep: 3326.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 19:22:34,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.87 | bwd: 3327.65 | bwd_inner: 3326.83 | bwd_allreduce: 0.77 | step: 7.23 37%|███▋ | 3742/10000 [5:52:55<9:31:02, 5.48s/it] {'loss': 0.0335, 'grad_norm': 0.7934399247169495, 'learning_rate': 2.88075078332012e-05, 'epoch': 3.74} 37%|███▋ | 3742/10000 [5:52:55<9:31:02, 5.48s/it][2025-06-19 19:22:40,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:22:40,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.94 | bwd_microstep: 3325.98 | bwd_inner_microstep: 3325.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 
[2025-06-19 19:22:40,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.94 | bwd: 3326.00 | bwd_inner: 3325.19 | bwd_allreduce: 0.76 | step: 6.73 37%|███▋ | 3743/10000 [5:53:01<9:31:08, 5.48s/it] {'loss': 0.0516, 'grad_norm': 1.7145463228225708, 'learning_rate': 2.8801691773318846e-05, 'epoch': 3.74} 37%|███▋ | 3743/10000 [5:53:01<9:31:08, 5.48s/it][2025-06-19 19:22:45,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:22:45,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.03 | bwd_microstep: 3375.27 | bwd_inner_microstep: 3374.46 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 19:22:45,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.03 | bwd: 3375.29 | bwd_inner: 3374.46 | bwd_allreduce: 0.78 | step: 6.96 37%|███▋ | 3744/10000 [5:53:06<9:33:12, 5.50s/it] {'loss': 0.0481, 'grad_norm': 1.6010074615478516, 'learning_rate': 2.879587479017988e-05, 'epoch': 3.74} 37%|███▋ | 3744/10000 [5:53:06<9:33:12, 5.50s/it][2025-06-19 19:22:51,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:22:51,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.64 | bwd_microstep: 3326.02 | bwd_inner_microstep: 3325.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 19:22:51,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.64 | bwd: 3326.03 | bwd_inner: 3325.21 | bwd_allreduce: 0.77 | step: 6.96 37%|███▋ | 3745/10000 [5:53:12<9:32:21, 5.49s/it] {'loss': 0.1508, 'grad_norm': 1.9240779876708984, 'learning_rate': 2.8790056884394463e-05, 'epoch': 3.75} 37%|███▋ | 3745/10000 [5:53:12<9:32:21, 5.49s/it][2025-06-19 19:22:56,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 
2.72 [2025-06-19 19:22:56,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.27 | bwd_microstep: 3323.62 | bwd_inner_microstep: 3322.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-19 19:22:56,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.27 | bwd: 3323.64 | bwd_inner: 3322.81 | bwd_allreduce: 0.78 | step: 6.98 37%|███▋ | 3746/10000 [5:53:17<9:31:49, 5.49s/it] {'loss': 0.0222, 'grad_norm': 1.0396987199783325, 'learning_rate': 2.8784238056572885e-05, 'epoch': 3.75} 37%|███▋ | 3746/10000 [5:53:17<9:31:49, 5.49s/it][2025-06-19 19:23:02,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:23:02,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.23 | bwd_microstep: 3400.00 | bwd_inner_microstep: 3399.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.86 [2025-06-19 19:23:02,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.23 | bwd: 3400.01 | bwd_inner: 3399.21 | bwd_allreduce: 0.76 | step: 6.87 37%|███▋ | 3747/10000 [5:53:23<9:34:31, 5.51s/it] {'loss': 0.0471, 'grad_norm': 1.5315595865249634, 'learning_rate': 2.8778418307325496e-05, 'epoch': 3.75} 37%|███▋ | 3747/10000 [5:53:23<9:34:31, 5.51s/it][2025-06-19 19:23:08,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:23:08,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.68 | bwd_microstep: 3365.61 | bwd_inner_microstep: 3364.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 19:23:08,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.68 | bwd: 3365.63 | bwd_inner: 3364.83 | bwd_allreduce: 0.75 | step: 6.64 37%|███▋ | 3748/10000 [5:53:28<9:35:12, 5.52s/it] {'loss': 0.0626, 'grad_norm': 1.8914577960968018, 'learning_rate': 
2.8772597637262776e-05, 'epoch': 3.75} 37%|███▋ | 3748/10000 [5:53:28<9:35:12, 5.52s/it][2025-06-19 19:23:13,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:23:13,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.03 | bwd_microstep: 3368.78 | bwd_inner_microstep: 3367.97 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.14 [2025-06-19 19:23:13,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.03 | bwd: 3368.80 | bwd_inner: 3367.97 | bwd_allreduce: 0.79 | step: 7.14 37%|███▋ | 3749/10000 [5:53:34<9:35:35, 5.52s/it] {'loss': 0.0175, 'grad_norm': 0.8299980163574219, 'learning_rate': 2.876677604699527e-05, 'epoch': 3.75} 37%|███▋ | 3749/10000 [5:53:34<9:35:35, 5.52s/it][2025-06-19 19:23:19,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:23:19,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.33 | bwd_microstep: 3368.78 | bwd_inner_microstep: 3368.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 19:23:19,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.33 | bwd: 3368.80 | bwd_inner: 3368.00 | bwd_allreduce: 0.76 | step: 6.68 38%|███▊ | 3750/10000 [5:53:39<9:35:50, 5.53s/it] {'loss': 0.0126, 'grad_norm': 0.38203102350234985, 'learning_rate': 2.876095353713365e-05, 'epoch': 3.75} 38%|███▊ | 3750/10000 [5:53:39<9:35:50, 5.53s/it][2025-06-19 19:23:24,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:23:24,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.06 | bwd_microstep: 3327.92 | bwd_inner_microstep: 3326.80 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.11 [2025-06-19 19:23:24,570] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2104.06 | bwd: 3327.73 | bwd_inner: 3326.80 | bwd_allreduce: 0.89 | step: 7.11 38%|███▊ | 3751/10000 [5:53:45<9:33:53, 5.51s/it] {'loss': 0.0188, 'grad_norm': 0.736087441444397, 'learning_rate': 2.8755130108288663e-05, 'epoch': 3.75} 38%|███▊ | 3751/10000 [5:53:45<9:33:53, 5.51s/it][2025-06-19 19:23:30,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:23:30,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.43 | bwd_microstep: 3317.28 | bwd_inner_microstep: 3316.17 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.81 [2025-06-19 19:23:30,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.43 | bwd: 3317.29 | bwd_inner: 3316.17 | bwd_allreduce: 1.07 | step: 7.82 38%|███▊ | 3752/10000 [5:53:50<9:32:35, 5.50s/it] {'loss': 0.0048, 'grad_norm': 0.2053799033164978, 'learning_rate': 2.874930576107116e-05, 'epoch': 3.75} 38%|███▊ | 3752/10000 [5:53:50<9:32:35, 5.50s/it][2025-06-19 19:23:35,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 19:23:35,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.73 | bwd_microstep: 3314.08 | bwd_inner_microstep: 3312.99 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.07 [2025-06-19 19:23:35,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.73 | bwd: 3314.11 | bwd_inner: 3312.99 | bwd_allreduce: 1.05 | step: 7.06 38%|███▊ | 3753/10000 [5:53:56<9:31:26, 5.49s/it] {'loss': 0.091, 'grad_norm': 3.633775472640991, 'learning_rate': 2.874348049609209e-05, 'epoch': 3.75} 38%|███▊ | 3753/10000 [5:53:56<9:31:26, 5.49s/it][2025-06-19 19:23:40,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-19 19:23:40,974] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.96 | bwd_microstep: 3322.44 | bwd_inner_microstep: 3321.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 19:23:40,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.96 | bwd: 3322.45 | bwd_inner: 3321.64 | bwd_allreduce: 0.77 | step: 6.80 38%|███▊ | 3754/10000 [5:54:01<9:30:42, 5.48s/it] {'loss': 0.0491, 'grad_norm': 2.130373477935791, 'learning_rate': 2.87376543139625e-05, 'epoch': 3.75} 38%|███▊ | 3754/10000 [5:54:01<9:30:42, 5.48s/it][2025-06-19 19:23:46,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:23:46,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.40 | bwd_microstep: 3370.77 | bwd_inner_microstep: 3369.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.75 [2025-06-19 19:23:46,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.40 | bwd: 3370.79 | bwd_inner: 3369.97 | bwd_allreduce: 0.78 | step: 6.76 38%|███▊ | 3755/10000 [5:54:07<9:32:12, 5.50s/it] {'loss': 0.0789, 'grad_norm': 2.811432123184204, 'learning_rate': 2.8731827215293524e-05, 'epoch': 3.75} 38%|███▊ | 3755/10000 [5:54:07<9:32:12, 5.50s/it][2025-06-19 19:23:52,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:23:52,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.97 | bwd_microstep: 3373.57 | bwd_inner_microstep: 3372.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 19:23:52,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.98 | bwd: 3373.59 | bwd_inner: 3372.77 | bwd_allreduce: 0.78 | step: 7.18 38%|███▊ | 3756/10000 [5:54:12<9:33:34, 5.51s/it] {'loss': 0.0516, 'grad_norm': 1.5341126918792725, 'learning_rate': 2.8725999200696396e-05, 'epoch': 3.76} 38%|███▊ | 3756/10000 
[5:54:12<9:33:34, 5.51s/it][2025-06-19 19:23:57,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:23:57,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.13 | bwd_microstep: 3372.01 | bwd_inner_microstep: 3371.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 19:23:57,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.13 | bwd: 3372.02 | bwd_inner: 3371.21 | bwd_allreduce: 0.77 | step: 6.73 38%|███▊ | 3757/10000 [5:54:18<9:34:20, 5.52s/it] {'loss': 0.0303, 'grad_norm': 2.312558650970459, 'learning_rate': 2.8720170270782448e-05, 'epoch': 3.76} 38%|███▊ | 3757/10000 [5:54:18<9:34:20, 5.52s/it][2025-06-19 19:24:03,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:24:03,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.04 | bwd_microstep: 3321.42 | bwd_inner_microstep: 3320.42 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.51 [2025-06-19 19:24:03,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.04 | bwd: 3321.44 | bwd_inner: 3320.42 | bwd_allreduce: 0.97 | step: 7.51 38%|███▊ | 3758/10000 [5:54:23<9:32:47, 5.51s/it] {'loss': 0.0513, 'grad_norm': 0.9503486156463623, 'learning_rate': 2.8714340426163113e-05, 'epoch': 3.76} 38%|███▊ | 3758/10000 [5:54:23<9:32:47, 5.51s/it][2025-06-19 19:24:08,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.80 [2025-06-19 19:24:08,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.31 | bwd_microstep: 3311.78 | bwd_inner_microstep: 3310.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-19 19:24:08,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.31 | bwd: 3311.80 | bwd_inner: 
3310.99 | bwd_allreduce: 0.76 | step: 6.81 38%|███▊ | 3759/10000 [5:54:29<9:31:10, 5.49s/it] {'loss': 0.049, 'grad_norm': 1.8411303758621216, 'learning_rate': 2.8708509667449917e-05, 'epoch': 3.76} 38%|███▊ | 3759/10000 [5:54:29<9:31:10, 5.49s/it][2025-06-19 19:24:14,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:24:14,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.67 | bwd_microstep: 3378.15 | bwd_inner_microstep: 3377.20 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.04 [2025-06-19 19:24:14,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.68 | bwd: 3378.17 | bwd_inner: 3377.20 | bwd_allreduce: 0.92 | step: 7.04 38%|███▊ | 3760/10000 [5:54:34<9:32:58, 5.51s/it] {'loss': 0.092, 'grad_norm': 2.622473955154419, 'learning_rate': 2.8702677995254466e-05, 'epoch': 3.76} 38%|███▊ | 3760/10000 [5:54:34<9:32:58, 5.51s/it][2025-06-19 19:24:19,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:24:19,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.11 | bwd_microstep: 3324.20 | bwd_inner_microstep: 3323.39 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.65 [2025-06-19 19:24:19,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.11 | bwd: 3324.21 | bwd_inner: 3323.39 | bwd_allreduce: 0.78 | step: 6.65 38%|███▊ | 3761/10000 [5:54:40<9:32:03, 5.50s/it] {'loss': 0.0782, 'grad_norm': 2.2297043800354004, 'learning_rate': 2.869684541018849e-05, 'epoch': 3.76} 38%|███▊ | 3761/10000 [5:54:40<9:32:03, 5.50s/it][2025-06-19 19:24:25,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:24:25,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.42 | 
bwd_microstep: 3324.10 | bwd_inner_microstep: 3323.13 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.36 [2025-06-19 19:24:25,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.42 | bwd: 3324.12 | bwd_inner: 3323.13 | bwd_allreduce: 0.93 | step: 7.37 38%|███▊ | 3762/10000 [5:54:45<9:31:18, 5.50s/it] {'loss': 0.0618, 'grad_norm': 3.305427074432373, 'learning_rate': 2.8691011912863793e-05, 'epoch': 3.76} 38%|███▊ | 3762/10000 [5:54:45<9:31:18, 5.50s/it][2025-06-19 19:24:30,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:24:30,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.90 | bwd_microstep: 3370.45 | bwd_inner_microstep: 3369.59 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.15 [2025-06-19 19:24:30,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.90 | bwd: 3370.47 | bwd_inner: 3369.59 | bwd_allreduce: 0.83 | step: 7.15 38%|███▊ | 3763/10000 [5:54:51<9:33:06, 5.51s/it] {'loss': 0.0426, 'grad_norm': 1.2841626405715942, 'learning_rate': 2.8685177503892282e-05, 'epoch': 3.76} 38%|███▊ | 3763/10000 [5:54:51<9:33:06, 5.51s/it][2025-06-19 19:24:36,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:24:36,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.02 | bwd_microstep: 3319.64 | bwd_inner_microstep: 3318.71 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.17 [2025-06-19 19:24:36,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.02 | bwd: 3319.66 | bwd_inner: 3318.71 | bwd_allreduce: 0.90 | step: 7.18 38%|███▊ | 3764/10000 [5:54:56<9:31:32, 5.50s/it] {'loss': 0.0656, 'grad_norm': 2.106139659881592, 'learning_rate': 2.867934218388596e-05, 'epoch': 3.76} 38%|███▊ | 3764/10000 [5:54:56<9:31:32, 5.50s/it][2025-06-19 19:24:41,589] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:24:41,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.58 | bwd_microstep: 3363.91 | bwd_inner_microstep: 3363.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 19:24:41,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.58 | bwd: 3363.92 | bwd_inner: 3363.10 | bwd_allreduce: 0.77 | step: 6.90 38%|███▊ | 3765/10000 [5:55:02<9:32:28, 5.51s/it] {'loss': 0.0459, 'grad_norm': 1.444252610206604, 'learning_rate': 2.867350595345692e-05, 'epoch': 3.77} 38%|███▊ | 3765/10000 [5:55:02<9:32:28, 5.51s/it][2025-06-19 19:24:47,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:24:47,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.91 | bwd_microstep: 3317.47 | bwd_inner_microstep: 3316.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 19:24:47,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.91 | bwd: 3317.49 | bwd_inner: 3316.68 | bwd_allreduce: 0.77 | step: 6.99 38%|███▊ | 3766/10000 [5:55:07<9:30:46, 5.49s/it] {'loss': 0.084, 'grad_norm': 2.731637954711914, 'learning_rate': 2.8667668813217365e-05, 'epoch': 3.77} 38%|███▊ | 3766/10000 [5:55:07<9:30:46, 5.49s/it][h264 @ 0x156edb00] Reference 5 >= 5 [h264 @ 0x156edb00] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x14197440] left block unavailable for requested intra mode [h264 @ 0x14197440] error while decoding MB 0 25, bytestream 45493 [h264 @ 0x159ac1c0] Reference 5 >= 5 [h264 @ 0x159ac1c0] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x159ac1c0] left block unavailable for requested intra mode [h264 @ 0x159ac1c0] error while decoding MB 0 25, bytestream 45493 [2025-06-19 19:24:52,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:24:52,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.89 | bwd_microstep: 3315.55 | bwd_inner_microstep: 3314.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 19:24:52,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.90 | bwd: 3315.56 | bwd_inner: 3314.76 | bwd_allreduce: 0.75 | step: 6.56 38%|███▊ | 3767/10000 [5:55:13<9:29:34, 5.48s/it] {'loss': 0.053, 'grad_norm': 1.6551833152770996, 'learning_rate': 2.8661830763779576e-05, 'epoch': 3.77} 38%|███▊ | 3767/10000 [5:55:13<9:29:34, 5.48s/it][2025-06-19 19:24:58,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:24:58,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.03 | bwd_microstep: 3367.16 | bwd_inner_microstep: 3366.17 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.68 [2025-06-19 19:24:58,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.03 | bwd: 3367.18 | bwd_inner: 3366.17 | bwd_allreduce: 0.96 | step: 7.68 38%|███▊ | 3768/10000 [5:55:18<9:30:56, 5.50s/it] {'loss': 0.0707, 'grad_norm': 2.026111364364624, 'learning_rate': 2.8655991805755944e-05, 'epoch': 3.77} 38%|███▊ | 3768/10000 [5:55:18<9:30:56, 5.50s/it][2025-06-19 19:25:03,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:25:03,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.27 | bwd_microstep: 3366.51 | bwd_inner_microstep: 3365.61 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.95 [2025-06-19 19:25:03,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.27 | bwd: 3366.53 | bwd_inner: 3365.61 | bwd_allreduce: 0.88 | step: 6.96 38%|███▊ | 3769/10000 [5:55:24<9:32:00, 5.51s/it] {'loss': 
0.2508, 'grad_norm': 3.618769884109497, 'learning_rate': 2.8650151939758947e-05, 'epoch': 3.77} 38%|███▊ | 3769/10000 [5:55:24<9:32:00, 5.51s/it][2025-06-19 19:25:09,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:25:09,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.57 | bwd_microstep: 3324.30 | bwd_inner_microstep: 3323.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 19:25:09,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.57 | bwd: 3324.31 | bwd_inner: 3323.50 | bwd_allreduce: 0.76 | step: 6.76 38%|███▊ | 3770/10000 [5:55:29<9:30:44, 5.50s/it] {'loss': 0.025, 'grad_norm': 0.6364105939865112, 'learning_rate': 2.8644311166401154e-05, 'epoch': 3.77} 38%|███▊ | 3770/10000 [5:55:29<9:30:44, 5.50s/it][2025-06-19 19:25:14,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:25:14,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.84 | bwd_microstep: 3315.39 | bwd_inner_microstep: 3314.41 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.37 [2025-06-19 19:25:14,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.84 | bwd: 3315.40 | bwd_inner: 3314.41 | bwd_allreduce: 0.95 | step: 7.38 38%|███▊ | 3771/10000 [5:55:35<9:29:25, 5.48s/it] {'loss': 0.13, 'grad_norm': 1.8457685708999634, 'learning_rate': 2.8638469486295238e-05, 'epoch': 3.77} 38%|███▊ | 3771/10000 [5:55:35<9:29:25, 5.48s/it][2025-06-19 19:25:19,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:25:19,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.41 | bwd_microstep: 3312.37 | bwd_inner_microstep: 3311.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 
[2025-06-19 19:25:19,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.41 | bwd: 3312.38 | bwd_inner: 3311.57 | bwd_allreduce: 0.77 | step: 6.95 38%|███▊ | 3772/10000 [5:55:40<9:28:37, 5.48s/it] {'loss': 0.0297, 'grad_norm': 1.1981443166732788, 'learning_rate': 2.8632626900053973e-05, 'epoch': 3.77} 38%|███▊ | 3772/10000 [5:55:40<9:28:37, 5.48s/it][2025-06-19 19:25:25,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:25:25,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.31 | bwd_microstep: 3360.15 | bwd_inner_microstep: 3359.32 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.87 [2025-06-19 19:25:25,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.31 | bwd: 3360.17 | bwd_inner: 3359.32 | bwd_allreduce: 0.80 | step: 6.87 38%|███▊ | 3773/10000 [5:55:46<9:30:10, 5.49s/it] {'loss': 0.0689, 'grad_norm': 1.2678637504577637, 'learning_rate': 2.86267834082902e-05, 'epoch': 3.77} 38%|███▊ | 3773/10000 [5:55:46<9:30:10, 5.49s/it][2025-06-19 19:25:31,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:25:31,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.09 | bwd_microstep: 3360.94 | bwd_inner_microstep: 3359.89 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.21 [2025-06-19 19:25:31,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.09 | bwd: 3360.95 | bwd_inner: 3359.89 | bwd_allreduce: 1.02 | step: 7.21 38%|███▊ | 3774/10000 [5:55:51<9:31:03, 5.50s/it] {'loss': 0.0394, 'grad_norm': 1.5196807384490967, 'learning_rate': 2.8620939011616893e-05, 'epoch': 3.77} 38%|███▊ | 3774/10000 [5:55:51<9:31:03, 5.50s/it][2025-06-19 19:25:36,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 
2.72 [2025-06-19 19:25:36,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.48 | bwd_microstep: 3314.11 | bwd_inner_microstep: 3313.22 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.45 [2025-06-19 19:25:36,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.48 | bwd: 3314.13 | bwd_inner: 3313.22 | bwd_allreduce: 0.87 | step: 7.45 38%|███▊ | 3775/10000 [5:55:57<9:29:39, 5.49s/it] {'loss': 0.2046, 'grad_norm': 2.849762439727783, 'learning_rate': 2.8615093710647098e-05, 'epoch': 3.77} 38%|███▊ | 3775/10000 [5:55:57<9:29:39, 5.49s/it][2025-06-19 19:25:41,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:25:41,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.58 | bwd_microstep: 3318.42 | bwd_inner_microstep: 3317.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 19:25:41,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.58 | bwd: 3318.43 | bwd_inner: 3317.62 | bwd_allreduce: 0.76 | step: 6.72 38%|███▊ | 3776/10000 [5:56:02<9:28:36, 5.48s/it] {'loss': 0.0648, 'grad_norm': 1.0179226398468018, 'learning_rate': 2.8609247505993948e-05, 'epoch': 3.78} 38%|███▊ | 3776/10000 [5:56:02<9:28:36, 5.48s/it][2025-06-19 19:25:47,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:25:47,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.62 | bwd_microstep: 3309.46 | bwd_inner_microstep: 3308.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-19 19:25:47,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.62 | bwd: 3309.48 | bwd_inner: 3308.65 | bwd_allreduce: 0.78 | step: 7.30 38%|███▊ | 3777/10000 [5:56:08<9:27:31, 5.47s/it] {'loss': 0.0359, 'grad_norm': 0.6265830993652344, 'learning_rate': 
2.86034003982707e-05, 'epoch': 3.78} 38%|███▊ | 3777/10000 [5:56:08<9:27:31, 5.47s/it][2025-06-19 19:25:52,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:25:52,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.22 | bwd_microstep: 3319.80 | bwd_inner_microstep: 3319.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 19:25:52,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.22 | bwd: 3319.82 | bwd_inner: 3319.01 | bwd_allreduce: 0.77 | step: 6.65 38%|███▊ | 3778/10000 [5:56:13<9:27:12, 5.47s/it] {'loss': 0.0784, 'grad_norm': 1.8460862636566162, 'learning_rate': 2.8597552388090673e-05, 'epoch': 3.78} 38%|███▊ | 3778/10000 [5:56:13<9:27:12, 5.47s/it][2025-06-19 19:25:58,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:25:58,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.28 | bwd_microstep: 3318.91 | bwd_inner_microstep: 3318.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 19:25:58,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.28 | bwd: 3318.93 | bwd_inner: 3318.12 | bwd_allreduce: 0.76 | step: 6.87 38%|███▊ | 3779/10000 [5:56:19<9:26:48, 5.47s/it] {'loss': 0.0997, 'grad_norm': 2.012561321258545, 'learning_rate': 2.859170347606731e-05, 'epoch': 3.78} 38%|███▊ | 3779/10000 [5:56:19<9:26:48, 5.47s/it][2025-06-19 19:26:03,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:26:03,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.00 | bwd_microstep: 3361.81 | bwd_inner_microstep: 3361.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 19:26:03,831] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2123.00 | bwd: 3361.83 | bwd_inner: 3361.00 | bwd_allreduce: 0.78 | step: 7.23 38%|███▊ | 3780/10000 [5:56:24<9:28:26, 5.48s/it] {'loss': 0.1061, 'grad_norm': 2.0860109329223633, 'learning_rate': 2.858585366281412e-05, 'epoch': 3.78} 38%|███▊ | 3780/10000 [5:56:24<9:28:26, 5.48s/it][2025-06-19 19:26:09,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:26:09,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.43 | bwd_microstep: 3308.98 | bwd_inner_microstep: 3308.01 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.22 [2025-06-19 19:26:09,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.43 | bwd: 3308.99 | bwd_inner: 3308.01 | bwd_allreduce: 0.93 | step: 7.23 38%|███▊ | 3781/10000 [5:56:30<9:27:16, 5.47s/it] {'loss': 0.1185, 'grad_norm': 2.431924343109131, 'learning_rate': 2.8580002948944732e-05, 'epoch': 3.78} 38%|███▊ | 3781/10000 [5:56:30<9:27:16, 5.47s/it][2025-06-19 19:26:14,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:26:14,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.29 | bwd_microstep: 3390.10 | bwd_inner_microstep: 3389.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 19:26:14,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.29 | bwd: 3390.12 | bwd_inner: 3389.32 | bwd_allreduce: 0.76 | step: 6.68 38%|███▊ | 3782/10000 [5:56:35<9:29:58, 5.50s/it] {'loss': 0.0674, 'grad_norm': 1.3888580799102783, 'learning_rate': 2.857415133507286e-05, 'epoch': 3.78} 38%|███▊ | 3782/10000 [5:56:35<9:29:58, 5.50s/it][2025-06-19 19:26:20,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:26:20,363] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2123.13 | bwd_microstep: 3359.17 | bwd_inner_microstep: 3358.35 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.97 [2025-06-19 19:26:20,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.13 | bwd: 3359.19 | bwd_inner: 3358.35 | bwd_allreduce: 0.80 | step: 6.97 38%|███▊ | 3783/10000 [5:56:41<9:30:30, 5.51s/it] {'loss': 0.0151, 'grad_norm': 0.3803204298019409, 'learning_rate': 2.85682988218123e-05, 'epoch': 3.78} 38%|███▊ | 3783/10000 [5:56:41<9:30:30, 5.51s/it][2025-06-19 19:26:25,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:26:25,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.76 | bwd_microstep: 3365.04 | bwd_inner_microstep: 3364.17 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.20 [2025-06-19 19:26:25,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.76 | bwd: 3365.05 | bwd_inner: 3364.17 | bwd_allreduce: 0.84 | step: 7.20 38%|███▊ | 3784/10000 [5:56:46<9:31:17, 5.51s/it] {'loss': 0.0677, 'grad_norm': 1.6282405853271484, 'learning_rate': 2.856244540977696e-05, 'epoch': 3.78} 38%|███▊ | 3784/10000 [5:56:46<9:31:17, 5.51s/it][2025-06-19 19:26:31,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:26:31,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.24 | bwd_microstep: 3308.49 | bwd_inner_microstep: 3307.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.20 [2025-06-19 19:26:31,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.24 | bwd: 3308.51 | bwd_inner: 3307.69 | bwd_allreduce: 0.77 | step: 7.20 38%|███▊ | 3785/10000 [5:56:52<9:29:26, 5.50s/it] {'loss': 0.0723, 'grad_norm': 0.9374363422393799, 'learning_rate': 2.8556591099580838e-05, 'epoch': 3.79} 38%|███▊ | 3785/10000 [5:56:52<9:29:26, 
5.50s/it][2025-06-19 19:26:36,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:26:36,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.96 | bwd_microstep: 3307.62 | bwd_inner_microstep: 3306.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 19:26:36,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.96 | bwd: 3307.63 | bwd_inner: 3306.84 | bwd_allreduce: 0.75 | step: 6.78 38%|███▊ | 3786/10000 [5:56:57<9:27:58, 5.48s/it] {'loss': 0.0531, 'grad_norm': 1.5472675561904907, 'learning_rate': 2.8550735891838026e-05, 'epoch': 3.79} 38%|███▊ | 3786/10000 [5:56:57<9:27:58, 5.48s/it][2025-06-19 19:26:42,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:26:42,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.28 | bwd_microstep: 3307.82 | bwd_inner_microstep: 3307.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.82 [2025-06-19 19:26:42,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.28 | bwd: 3307.84 | bwd_inner: 3307.03 | bwd_allreduce: 0.76 | step: 6.82 38%|███▊ | 3787/10000 [5:57:03<9:26:36, 5.47s/it] {'loss': 0.0706, 'grad_norm': 2.361306667327881, 'learning_rate': 2.85448797871627e-05, 'epoch': 3.79} 38%|███▊ | 3787/10000 [5:57:03<9:26:36, 5.47s/it][2025-06-19 19:26:47,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:26:47,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.02 | bwd_microstep: 3359.43 | bwd_inner_microstep: 3358.60 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.04 [2025-06-19 19:26:47,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.02 | bwd: 3359.44 | bwd_inner: 3358.60 | 
bwd_allreduce: 0.79 | step: 7.04 38%|███▊ | 3788/10000 [5:57:08<9:27:59, 5.49s/it] {'loss': 0.0594, 'grad_norm': 1.7110991477966309, 'learning_rate': 2.8539022786169143e-05, 'epoch': 3.79} 38%|███▊ | 3788/10000 [5:57:08<9:27:59, 5.49s/it][2025-06-19 19:26:53,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:26:53,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.57 | bwd_microstep: 3306.16 | bwd_inner_microstep: 3305.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 19:26:53,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.57 | bwd: 3306.17 | bwd_inner: 3305.36 | bwd_allreduce: 0.77 | step: 6.73 38%|███▊ | 3789/10000 [5:57:14<9:26:34, 5.47s/it] {'loss': 0.0241, 'grad_norm': 0.7138490676879883, 'learning_rate': 2.8533164889471732e-05, 'epoch': 3.79} 38%|███▊ | 3789/10000 [5:57:14<9:26:34, 5.47s/it][2025-06-19 19:26:58,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:26:58,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2094.92 | bwd_microstep: 3310.09 | bwd_inner_microstep: 3309.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 19:26:58,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2094.92 | bwd: 3310.10 | bwd_inner: 3309.28 | bwd_allreduce: 0.78 | step: 6.80 38%|███▊ | 3790/10000 [5:57:19<9:25:33, 5.46s/it] {'loss': 0.0439, 'grad_norm': 1.0721451044082642, 'learning_rate': 2.8527306097684922e-05, 'epoch': 3.79} 38%|███▊ | 3790/10000 [5:57:19<9:25:33, 5.46s/it][2025-06-19 19:27:04,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:27:04,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.61 | bwd_microstep: 
3311.56 | bwd_inner_microstep: 3310.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 19:27:04,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.61 | bwd: 3311.58 | bwd_inner: 3310.76 | bwd_allreduce: 0.77 | step: 6.81 38%|███▊ | 3791/10000 [5:57:24<9:24:59, 5.46s/it] {'loss': 0.0205, 'grad_norm': 0.5689105987548828, 'learning_rate': 2.852144641142328e-05, 'epoch': 3.79} 38%|███▊ | 3791/10000 [5:57:24<9:24:59, 5.46s/it][2025-06-19 19:27:09,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 19:27:09,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.70 | bwd_microstep: 3304.75 | bwd_inner_microstep: 3303.63 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.94 [2025-06-19 19:27:09,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.70 | bwd: 3304.78 | bwd_inner: 3303.63 | bwd_allreduce: 1.09 | step: 7.94 38%|███▊ | 3792/10000 [5:57:30<9:24:58, 5.46s/it] {'loss': 0.0216, 'grad_norm': 0.39713728427886963, 'learning_rate': 2.8515585831301456e-05, 'epoch': 3.79} 38%|███▊ | 3792/10000 [5:57:30<9:24:58, 5.46s/it][2025-06-19 19:27:15,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:27:15,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.44 | bwd_microstep: 3310.20 | bwd_inner_microstep: 3309.03 | bwd_allreduce_microstep: 1.11 | step_microstep: 7.54 [2025-06-19 19:27:15,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.44 | bwd: 3310.22 | bwd_inner: 3309.03 | bwd_allreduce: 1.14 | step: 7.54 38%|███▊ | 3793/10000 [5:57:35<9:24:39, 5.46s/it] {'loss': 0.0666, 'grad_norm': 1.8279986381530762, 'learning_rate': 2.8509724357934197e-05, 'epoch': 3.79} 38%|███▊ | 3793/10000 [5:57:35<9:24:39, 5.46s/it][2025-06-19 19:27:20,473] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:27:20,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.75 | bwd_microstep: 3303.45 | bwd_inner_microstep: 3302.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-19 19:27:20,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.75 | bwd: 3303.47 | bwd_inner: 3302.65 | bwd_allreduce: 0.77 | step: 6.78 38%|███▊ | 3794/10000 [5:57:41<9:24:21, 5.46s/it] {'loss': 0.0523, 'grad_norm': 1.1770422458648682, 'learning_rate': 2.8503861991936357e-05, 'epoch': 3.79} 38%|███▊ | 3794/10000 [5:57:41<9:24:21, 5.46s/it][2025-06-19 19:27:25,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.81 [2025-06-19 19:27:25,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.98 | bwd_microstep: 3356.21 | bwd_inner_microstep: 3355.34 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.92 [2025-06-19 19:27:25,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.98 | bwd: 3356.23 | bwd_inner: 3355.34 | bwd_allreduce: 0.84 | step: 6.93 38%|███▊ | 3795/10000 [5:57:46<9:26:15, 5.48s/it] {'loss': 0.0607, 'grad_norm': 1.191849946975708, 'learning_rate': 2.849799873392286e-05, 'epoch': 3.79} 38%|███▊ | 3795/10000 [5:57:46<9:26:15, 5.48s/it][2025-06-19 19:27:31,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:27:31,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.61 | bwd_microstep: 3364.66 | bwd_inner_microstep: 3363.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 19:27:31,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.61 | bwd: 3364.68 | bwd_inner: 3363.85 | bwd_allreduce: 0.78 | step: 7.23 38%|███▊ | 
3796/10000 [5:57:52<9:28:00, 5.49s/it] {'loss': 0.1706, 'grad_norm': 1.9350143671035767, 'learning_rate': 2.8492134584508734e-05, 'epoch': 3.8} 38%|███▊ | 3796/10000 [5:57:52<9:28:00, 5.49s/it][2025-06-19 19:27:36,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 19:27:36,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.77 | bwd_microstep: 3315.98 | bwd_inner_microstep: 3314.63 | bwd_allreduce_microstep: 1.28 | step_microstep: 8.04 [2025-06-19 19:27:36,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.77 | bwd: 3316.00 | bwd_inner: 3314.63 | bwd_allreduce: 1.31 | step: 8.04 38%|███▊ | 3797/10000 [5:57:57<9:27:13, 5.49s/it] {'loss': 0.1853, 'grad_norm': 2.1655685901641846, 'learning_rate': 2.848626954430911e-05, 'epoch': 3.8} 38%|███▊ | 3797/10000 [5:57:57<9:27:13, 5.49s/it][2025-06-19 19:27:42,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:27:42,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.12 | bwd_microstep: 3318.04 | bwd_inner_microstep: 3317.25 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.61 [2025-06-19 19:27:42,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.12 | bwd: 3318.05 | bwd_inner: 3317.25 | bwd_allreduce: 0.76 | step: 6.61 38%|███▊ | 3798/10000 [5:58:03<9:26:45, 5.48s/it] {'loss': 0.0312, 'grad_norm': 1.0720608234405518, 'learning_rate': 2.8480403613939188e-05, 'epoch': 3.8} 38%|███▊ | 3798/10000 [5:58:03<9:26:45, 5.48s/it][2025-06-19 19:27:48,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:27:48,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.67 | bwd_microstep: 3367.26 | bwd_inner_microstep: 3366.47 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 19:27:48,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.67 | bwd: 3367.28 | bwd_inner: 3366.47 | bwd_allreduce: 0.76 | step: 6.67 38%|███▊ | 3799/10000 [5:58:08<9:28:14, 5.50s/it] {'loss': 0.0301, 'grad_norm': 0.755178689956665, 'learning_rate': 2.847453679401429e-05, 'epoch': 3.8} 38%|███▊ | 3799/10000 [5:58:08<9:28:14, 5.50s/it][2025-06-19 19:27:53,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:27:53,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.07 | bwd_microstep: 3318.03 | bwd_inner_microstep: 3317.13 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.26 [2025-06-19 19:27:53,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.07 | bwd: 3318.04 | bwd_inner: 3317.13 | bwd_allreduce: 0.87 | step: 7.26 38%|███▊ | 3800/10000 [5:58:14<9:26:48, 5.49s/it] {'loss': 0.033, 'grad_norm': 0.7879011631011963, 'learning_rate': 2.8468669085149812e-05, 'epoch': 3.8} 38%|███▊ | 3800/10000 [5:58:14<9:26:48, 5.49s/it][2025-06-19 19:27:59,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:27:59,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.43 | bwd_microstep: 3372.44 | bwd_inner_microstep: 3371.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-19 19:27:59,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.43 | bwd: 3372.45 | bwd_inner: 3371.64 | bwd_allreduce: 0.77 | step: 7.15 38%|███▊ | 3801/10000 [5:58:19<9:28:34, 5.50s/it] {'loss': 0.0642, 'grad_norm': 1.8338083028793335, 'learning_rate': 2.8462800487961245e-05, 'epoch': 3.8} 38%|███▊ | 3801/10000 [5:58:19<9:28:34, 5.50s/it][2025-06-19 19:28:04,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:28:04,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.19 | bwd_microstep: 3374.86 | bwd_inner_microstep: 3373.97 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.91 [2025-06-19 19:28:04,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.19 | bwd: 3374.87 | bwd_inner: 3373.97 | bwd_allreduce: 0.85 | step: 6.92 38%|███▊ | 3802/10000 [5:58:25<9:29:56, 5.52s/it] {'loss': 0.0544, 'grad_norm': 1.8562299013137817, 'learning_rate': 2.8456931003064193e-05, 'epoch': 3.8} 38%|███▊ | 3802/10000 [5:58:25<9:29:56, 5.52s/it][2025-06-19 19:28:10,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:28:10,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.40 | bwd_microstep: 3370.54 | bwd_inner_microstep: 3369.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 19:28:10,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.40 | bwd: 3370.56 | bwd_inner: 3369.76 | bwd_allreduce: 0.76 | step: 6.66 38%|███▊ | 3803/10000 [5:58:30<9:30:33, 5.52s/it] {'loss': 0.0569, 'grad_norm': 1.8041088581085205, 'learning_rate': 2.845106063107432e-05, 'epoch': 3.8} 38%|███▊ | 3803/10000 [5:58:30<9:30:33, 5.52s/it][2025-06-19 19:28:15,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:28:15,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.61 | bwd_microstep: 3312.48 | bwd_inner_microstep: 3311.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 19:28:15,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.61 | bwd: 3312.50 | bwd_inner: 3311.70 | bwd_allreduce: 0.76 | step: 6.71 38%|███▊ | 3804/10000 [5:58:36<9:28:45, 5.51s/it] {'loss': 0.0214, 'grad_norm': 
0.39906156063079834, 'learning_rate': 2.8445189372607415e-05, 'epoch': 3.8} 38%|███▊ | 3804/10000 [5:58:36<9:28:45, 5.51s/it][2025-06-19 19:28:21,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:28:21,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.99 | bwd_microstep: 3326.58 | bwd_inner_microstep: 3325.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 19:28:21,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.99 | bwd: 3326.59 | bwd_inner: 3325.79 | bwd_allreduce: 0.76 | step: 6.61 38%|███▊ | 3805/10000 [5:58:41<9:27:36, 5.50s/it] {'loss': 0.0251, 'grad_norm': 0.8501609563827515, 'learning_rate': 2.8439317228279347e-05, 'epoch': 3.81} 38%|███▊ | 3805/10000 [5:58:41<9:27:36, 5.50s/it][2025-06-19 19:28:26,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:28:26,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.87 | bwd_microstep: 3317.67 | bwd_inner_microstep: 3316.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 19:28:26,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.87 | bwd: 3317.68 | bwd_inner: 3316.87 | bwd_allreduce: 0.77 | step: 7.10 38%|███▊ | 3806/10000 [5:58:47<9:26:30, 5.49s/it] {'loss': 0.0454, 'grad_norm': 1.5975418090820312, 'learning_rate': 2.843344419870606e-05, 'epoch': 3.81} 38%|███▊ | 3806/10000 [5:58:47<9:26:30, 5.49s/it][2025-06-19 19:28:31,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:28:31,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.94 | bwd_microstep: 3328.02 | bwd_inner_microstep: 3327.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 
19:28:31,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.94 | bwd: 3328.04 | bwd_inner: 3327.24 | bwd_allreduce: 0.75 | step: 6.66 38%|███▊ | 3807/10000 [5:58:52<9:26:03, 5.48s/it] {'loss': 0.0283, 'grad_norm': 1.0452371835708618, 'learning_rate': 2.8427570284503625e-05, 'epoch': 3.81} 38%|███▊ | 3807/10000 [5:58:52<9:26:03, 5.48s/it][2025-06-19 19:28:37,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:28:37,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.96 | bwd_microstep: 3369.77 | bwd_inner_microstep: 3368.95 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.97 [2025-06-19 19:28:37,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.96 | bwd: 3369.78 | bwd_inner: 3368.95 | bwd_allreduce: 0.78 | step: 6.97 38%|███▊ | 3808/10000 [5:58:58<9:27:35, 5.50s/it] {'loss': 0.0164, 'grad_norm': 0.8951809406280518, 'learning_rate': 2.8421695486288173e-05, 'epoch': 3.81} 38%|███▊ | 3808/10000 [5:58:58<9:27:35, 5.50s/it][2025-06-19 19:28:42,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:28:42,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.65 | bwd_microstep: 3319.61 | bwd_inner_microstep: 3318.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-19 19:28:42,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.65 | bwd: 3319.62 | bwd_inner: 3318.81 | bwd_allreduce: 0.77 | step: 6.86 38%|███▊ | 3809/10000 [5:59:03<9:26:30, 5.49s/it] {'loss': 0.3025, 'grad_norm': 2.10404372215271, 'learning_rate': 2.8415819804675955e-05, 'epoch': 3.81} 38%|███▊ | 3809/10000 [5:59:03<9:26:30, 5.49s/it][2025-06-19 19:28:48,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 
[2025-06-19 19:28:48,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.28 | bwd_microstep: 3368.36 | bwd_inner_microstep: 3367.34 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.73 [2025-06-19 19:28:48,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.28 | bwd: 3368.39 | bwd_inner: 3367.34 | bwd_allreduce: 0.98 | step: 7.73 38%|███▊ | 3810/10000 [5:59:09<9:27:50, 5.50s/it] {'loss': 0.0245, 'grad_norm': 0.6473464965820312, 'learning_rate': 2.8409943240283296e-05, 'epoch': 3.81} 38%|███▊ | 3810/10000 [5:59:09<9:27:50, 5.50s/it][2025-06-19 19:28:53,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 19:28:53,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.13 | bwd_microstep: 3311.77 | bwd_inner_microstep: 3310.66 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.38 [2025-06-19 19:28:53,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.13 | bwd: 3311.79 | bwd_inner: 3310.66 | bwd_allreduce: 1.08 | step: 7.38 38%|███▊ | 3811/10000 [5:59:14<9:26:30, 5.49s/it] {'loss': 0.0778, 'grad_norm': 1.6962873935699463, 'learning_rate': 2.840406579372662e-05, 'epoch': 3.81} 38%|███▊ | 3811/10000 [5:59:14<9:26:30, 5.49s/it][2025-06-19 19:28:59,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:28:59,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.73 | bwd_microstep: 3322.10 | bwd_inner_microstep: 3321.10 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.17 [2025-06-19 19:28:59,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.74 | bwd: 3322.12 | bwd_inner: 3321.10 | bwd_allreduce: 0.96 | step: 7.17 38%|███▊ | 3812/10000 [5:59:20<9:26:08, 5.49s/it] {'loss': 0.0283, 'grad_norm': 1.355141282081604, 'learning_rate': 2.8398187465622454e-05, 
'epoch': 3.81} 38%|███▊ | 3812/10000 [5:59:20<9:26:08, 5.49s/it][2025-06-19 19:29:04,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:29:04,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.66 | bwd_microstep: 3324.09 | bwd_inner_microstep: 3323.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 19:29:04,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.66 | bwd: 3324.11 | bwd_inner: 3323.29 | bwd_allreduce: 0.77 | step: 7.02 38%|███▊ | 3813/10000 [5:59:25<9:25:48, 5.49s/it] {'loss': 0.0466, 'grad_norm': 1.2782251834869385, 'learning_rate': 2.8392308256587396e-05, 'epoch': 3.81} 38%|███▊ | 3813/10000 [5:59:25<9:25:48, 5.49s/it][2025-06-19 19:29:10,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:29:10,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.04 | bwd_microstep: 3327.60 | bwd_inner_microstep: 3326.79 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.75 [2025-06-19 19:29:10,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.04 | bwd: 3327.62 | bwd_inner: 3326.79 | bwd_allreduce: 0.78 | step: 6.76 38%|███▊ | 3814/10000 [5:59:31<9:25:16, 5.48s/it] {'loss': 0.1003, 'grad_norm': 3.325275421142578, 'learning_rate': 2.838642816723815e-05, 'epoch': 3.81} 38%|███▊ | 3814/10000 [5:59:31<9:25:16, 5.48s/it][2025-06-19 19:29:15,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:29:15,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.55 | bwd_microstep: 3329.32 | bwd_inner_microstep: 3328.47 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.85 [2025-06-19 19:29:15,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2114.55 | bwd: 3329.34 | bwd_inner: 3328.47 | bwd_allreduce: 0.81 | step: 6.85 38%|███▊ | 3815/10000 [5:59:36<9:25:19, 5.48s/it] {'loss': 0.1083, 'grad_norm': 1.4758734703063965, 'learning_rate': 2.8380547198191523e-05, 'epoch': 3.81} 38%|███▊ | 3815/10000 [5:59:36<9:25:19, 5.48s/it][2025-06-19 19:29:21,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.74 [2025-06-19 19:29:21,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.68 | bwd_microstep: 3326.77 | bwd_inner_microstep: 3325.80 | bwd_allreduce_microstep: 0.91 | step_microstep: 8.49 [2025-06-19 19:29:21,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.68 | bwd: 3326.79 | bwd_inner: 3325.80 | bwd_allreduce: 0.93 | step: 8.50 38%|███▊ | 3816/10000 [5:59:42<9:25:08, 5.48s/it] {'loss': 0.0438, 'grad_norm': 1.1403874158859253, 'learning_rate': 2.8374665350064377e-05, 'epoch': 3.82} 38%|███▊ | 3816/10000 [5:59:42<9:25:08, 5.48s/it][2025-06-19 19:29:26,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:29:26,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.16 | bwd_microstep: 3378.08 | bwd_inner_microstep: 3377.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.43 [2025-06-19 19:29:26,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.16 | bwd: 3378.10 | bwd_inner: 3377.27 | bwd_allreduce: 0.78 | step: 7.44 38%|███▊ | 3817/10000 [5:59:47<9:27:12, 5.50s/it] {'loss': 0.0639, 'grad_norm': 1.285534381866455, 'learning_rate': 2.836878262347371e-05, 'epoch': 3.82} 38%|███▊ | 3817/10000 [5:59:47<9:27:12, 5.50s/it][2025-06-19 19:29:32,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:29:32,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2111.73 | bwd_microstep: 3324.95 | bwd_inner_microstep: 3324.05 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.90 [2025-06-19 19:29:32,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.73 | bwd: 3324.96 | bwd_inner: 3324.05 | bwd_allreduce: 0.86 | step: 6.90 38%|███▊ | 3818/10000 [5:59:53<9:26:21, 5.50s/it] {'loss': 0.111, 'grad_norm': 1.7350289821624756, 'learning_rate': 2.8362899019036584e-05, 'epoch': 3.82} 38%|███▊ | 3818/10000 [5:59:53<9:26:21, 5.50s/it][2025-06-19 19:29:37,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:29:37,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.25 | bwd_microstep: 3377.11 | bwd_inner_microstep: 3375.81 | bwd_allreduce_microstep: 1.23 | step_microstep: 7.22 [2025-06-19 19:29:37,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.25 | bwd: 3377.13 | bwd_inner: 3375.81 | bwd_allreduce: 1.27 | step: 7.22 38%|███▊ | 3819/10000 [5:59:58<9:27:49, 5.51s/it] {'loss': 0.0165, 'grad_norm': 0.5536887049674988, 'learning_rate': 2.8357014537370165e-05, 'epoch': 3.82} 38%|███▊ | 3819/10000 [5:59:58<9:27:49, 5.51s/it][2025-06-19 19:29:43,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:29:43,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.48 | bwd_microstep: 3380.23 | bwd_inner_microstep: 3379.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 19:29:43,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.48 | bwd: 3380.24 | bwd_inner: 3379.42 | bwd_allreduce: 0.78 | step: 7.12 38%|███▊ | 3820/10000 [6:00:04<9:28:58, 5.52s/it] {'loss': 0.0567, 'grad_norm': 1.4322807788848877, 'learning_rate': 2.8351129179091706e-05, 'epoch': 3.82} 38%|███▊ | 3820/10000 [6:00:04<9:28:58, 5.52s/it][2025-06-19 
19:29:49,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:29:49,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.93 | bwd_microstep: 3386.61 | bwd_inner_microstep: 3385.61 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.81 [2025-06-19 19:29:49,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.93 | bwd: 3386.62 | bwd_inner: 3385.61 | bwd_allreduce: 0.97 | step: 7.82 38%|███▊ | 3821/10000 [6:00:09<9:30:32, 5.54s/it] {'loss': 0.122, 'grad_norm': 1.857150912284851, 'learning_rate': 2.834524294481856e-05, 'epoch': 3.82} 38%|███▊ | 3821/10000 [6:00:09<9:30:32, 5.54s/it][2025-06-19 19:29:54,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:29:54,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.98 | bwd_microstep: 3335.83 | bwd_inner_microstep: 3334.93 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.00 [2025-06-19 19:29:54,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.98 | bwd: 3335.84 | bwd_inner: 3334.93 | bwd_allreduce: 0.87 | step: 7.00 38%|███▊ | 3822/10000 [6:00:15<9:29:20, 5.53s/it] {'loss': 0.0553, 'grad_norm': 1.2504698038101196, 'learning_rate': 2.833935583516816e-05, 'epoch': 3.82} 38%|███▊ | 3822/10000 [6:00:15<9:29:20, 5.53s/it][2025-06-19 19:30:00,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.86 | optimizer_step: 2.73 [2025-06-19 19:30:00,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.58 | bwd_microstep: 3315.60 | bwd_inner_microstep: 3314.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.35 [2025-06-19 19:30:00,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.58 | bwd: 3315.62 | bwd_inner: 3314.81 | bwd_allreduce: 0.77 | step: 7.36 
38%|███▊ | 3823/10000 [6:00:20<9:27:18, 5.51s/it] {'loss': 0.1304, 'grad_norm': 3.0942139625549316, 'learning_rate': 2.8333467850758034e-05, 'epoch': 3.82} 38%|███▊ | 3823/10000 [6:00:20<9:27:18, 5.51s/it][2025-06-19 19:30:05,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:30:05,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.67 | bwd_microstep: 3333.38 | bwd_inner_microstep: 3332.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 19:30:05,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.67 | bwd: 3333.39 | bwd_inner: 3332.57 | bwd_allreduce: 0.77 | step: 6.96 38%|███▊ | 3824/10000 [6:00:26<9:26:39, 5.51s/it] {'loss': 0.0538, 'grad_norm': 2.4919543266296387, 'learning_rate': 2.8327578992205805e-05, 'epoch': 3.82} 38%|███▊ | 3824/10000 [6:00:26<9:26:39, 5.51s/it][2025-06-19 19:30:11,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:30:11,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.66 | bwd_microstep: 3374.81 | bwd_inner_microstep: 3373.96 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.80 [2025-06-19 19:30:11,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.66 | bwd: 3374.83 | bwd_inner: 3373.96 | bwd_allreduce: 0.82 | step: 6.81 38%|███▊ | 3825/10000 [6:00:31<9:28:02, 5.52s/it] {'loss': 0.0126, 'grad_norm': 0.430854469537735, 'learning_rate': 2.8321689260129192e-05, 'epoch': 3.83} 38%|███▊ | 3825/10000 [6:00:31<9:28:02, 5.52s/it][2025-06-19 19:30:16,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:30:16,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.81 | bwd_microstep: 3341.74 | bwd_inner_microstep: 3340.94 
| bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 19:30:16,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.81 | bwd: 3341.76 | bwd_inner: 3340.94 | bwd_allreduce: 0.77 | step: 7.06 38%|███▊ | 3826/10000 [6:00:37<9:27:16, 5.51s/it] {'loss': 0.005, 'grad_norm': 0.15176722407341003, 'learning_rate': 2.8315798655145997e-05, 'epoch': 3.83} 38%|███▊ | 3826/10000 [6:00:37<9:27:16, 5.51s/it][2025-06-19 19:30:22,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:30:22,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.88 | bwd_microstep: 3332.05 | bwd_inner_microstep: 3331.13 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.91 [2025-06-19 19:30:22,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.88 | bwd: 3332.07 | bwd_inner: 3331.13 | bwd_allreduce: 0.90 | step: 6.91 38%|███▊ | 3827/10000 [6:00:42<9:26:17, 5.50s/it] {'loss': 0.0361, 'grad_norm': 0.7477315664291382, 'learning_rate': 2.8309907177874118e-05, 'epoch': 3.83} 38%|███▊ | 3827/10000 [6:00:42<9:26:17, 5.50s/it][2025-06-19 19:30:27,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:30:27,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.84 | bwd_microstep: 3377.71 | bwd_inner_microstep: 3376.74 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.99 [2025-06-19 19:30:27,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.84 | bwd: 3377.72 | bwd_inner: 3376.74 | bwd_allreduce: 0.94 | step: 6.99 38%|███▊ | 3828/10000 [6:00:48<9:27:30, 5.52s/it] {'loss': 0.0322, 'grad_norm': 0.9963383078575134, 'learning_rate': 2.8304014828931542e-05, 'epoch': 3.83} 38%|███▊ | 3828/10000 [6:00:48<9:27:30, 5.52s/it][2025-06-19 19:30:33,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:30:33,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.76 | bwd_microstep: 3386.93 | bwd_inner_microstep: 3386.12 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 19:30:33,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.76 | bwd: 3386.94 | bwd_inner: 3386.12 | bwd_allreduce: 0.77 | step: 6.97 38%|███▊ | 3829/10000 [6:00:54<9:28:38, 5.53s/it] {'loss': 0.0498, 'grad_norm': 1.4028804302215576, 'learning_rate': 2.8298121608936357e-05, 'epoch': 3.83} 38%|███▊ | 3829/10000 [6:00:54<9:28:38, 5.53s/it][2025-06-19 19:30:38,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:30:38,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.77 | bwd_microstep: 3334.45 | bwd_inner_microstep: 3333.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 19:30:38,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.77 | bwd: 3334.47 | bwd_inner: 3333.65 | bwd_allreduce: 0.78 | step: 6.80 38%|███▊ | 3830/10000 [6:00:59<9:27:25, 5.52s/it] {'loss': 0.0094, 'grad_norm': 0.3993335962295532, 'learning_rate': 2.8292227518506723e-05, 'epoch': 3.83} 38%|███▊ | 3830/10000 [6:00:59<9:27:25, 5.52s/it][2025-06-19 19:30:44,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:30:44,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.07 | bwd_microstep: 3329.28 | bwd_inner_microstep: 3328.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 19:30:44,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.07 | bwd: 3329.30 | bwd_inner: 3328.48 | bwd_allreduce: 0.77 | step: 6.94 38%|███▊ | 3831/10000 [6:01:04<9:26:11, 5.51s/it] {'loss': 
0.2144, 'grad_norm': 2.0909676551818848, 'learning_rate': 2.8286332558260908e-05, 'epoch': 3.83} 38%|███▊ | 3831/10000 [6:01:04<9:26:11, 5.51s/it][2025-06-19 19:30:49,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:30:49,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.31 | bwd_microstep: 3322.03 | bwd_inner_microstep: 3321.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 19:30:49,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.31 | bwd: 3322.05 | bwd_inner: 3321.23 | bwd_allreduce: 0.77 | step: 6.81 38%|███▊ | 3832/10000 [6:01:10<9:24:48, 5.49s/it] {'loss': 0.0736, 'grad_norm': 3.5088553428649902, 'learning_rate': 2.8280436728817268e-05, 'epoch': 3.83} 38%|███▊ | 3832/10000 [6:01:10<9:24:48, 5.49s/it][2025-06-19 19:30:55,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:30:55,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.05 | bwd_microstep: 3341.41 | bwd_inner_microstep: 3340.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 19:30:55,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.05 | bwd: 3341.42 | bwd_inner: 3340.61 | bwd_allreduce: 0.76 | step: 6.69 38%|███▊ | 3833/10000 [6:01:15<9:24:51, 5.50s/it] {'loss': 0.0217, 'grad_norm': 1.0707005262374878, 'learning_rate': 2.8274540030794243e-05, 'epoch': 3.83} 38%|███▊ | 3833/10000 [6:01:15<9:24:51, 5.50s/it][2025-06-19 19:31:00,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:31:00,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.67 | bwd_microstep: 3331.02 | bwd_inner_microstep: 3330.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 
[2025-06-19 19:31:00,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.67 | bwd: 3331.04 | bwd_inner: 3330.21 | bwd_allreduce: 0.78 | step: 6.86 38%|███▊ | 3834/10000 [6:01:21<9:24:25, 5.49s/it] {'loss': 0.1913, 'grad_norm': 2.702979326248169, 'learning_rate': 2.826864246481037e-05, 'epoch': 3.83} 38%|███▊ | 3834/10000 [6:01:21<9:24:25, 5.49s/it][2025-06-19 19:31:06,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:31:06,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.93 | bwd_microstep: 3389.63 | bwd_inner_microstep: 3388.64 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.59 [2025-06-19 19:31:06,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.93 | bwd: 3389.65 | bwd_inner: 3388.64 | bwd_allreduce: 0.96 | step: 7.60 38%|███▊ | 3835/10000 [6:01:26<9:26:47, 5.52s/it] {'loss': 0.0738, 'grad_norm': 0.8864802122116089, 'learning_rate': 2.8262744031484282e-05, 'epoch': 3.83} 38%|███▊ | 3835/10000 [6:01:26<9:26:47, 5.52s/it][2025-06-19 19:31:11,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:31:11,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.08 | bwd_microstep: 3338.34 | bwd_inner_microstep: 3337.45 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.90 [2025-06-19 19:31:11,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.08 | bwd: 3338.36 | bwd_inner: 3337.45 | bwd_allreduce: 0.86 | step: 6.90 38%|███▊ | 3836/10000 [6:01:32<9:26:17, 5.51s/it] {'loss': 0.0464, 'grad_norm': 0.8549931645393372, 'learning_rate': 2.8256844731434692e-05, 'epoch': 3.84} 38%|███▊ | 3836/10000 [6:01:32<9:26:17, 5.51s/it][2025-06-19 19:31:17,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 
2.73 [2025-06-19 19:31:17,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.63 | bwd_microstep: 3329.29 | bwd_inner_microstep: 3328.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 19:31:17,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.63 | bwd: 3329.30 | bwd_inner: 3328.50 | bwd_allreduce: 0.76 | step: 6.68 38%|███▊ | 3837/10000 [6:01:37<9:25:08, 5.50s/it] {'loss': 0.0775, 'grad_norm': 1.2185711860656738, 'learning_rate': 2.825094456528041e-05, 'epoch': 3.84} 38%|███▊ | 3837/10000 [6:01:37<9:25:08, 5.50s/it][2025-06-19 19:31:22,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:31:22,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.14 | bwd_microstep: 3387.89 | bwd_inner_microstep: 3387.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 19:31:22,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.14 | bwd: 3387.91 | bwd_inner: 3387.09 | bwd_allreduce: 0.77 | step: 6.90 38%|███▊ | 3838/10000 [6:01:43<9:27:02, 5.52s/it] {'loss': 0.0569, 'grad_norm': 1.1147723197937012, 'learning_rate': 2.8245043533640344e-05, 'epoch': 3.84} 38%|███▊ | 3838/10000 [6:01:43<9:27:02, 5.52s/it][2025-06-19 19:31:28,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:31:28,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.48 | bwd_microstep: 3375.65 | bwd_inner_microstep: 3374.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 19:31:28,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.48 | bwd: 3375.66 | bwd_inner: 3374.85 | bwd_allreduce: 0.77 | step: 7.07 38%|███▊ | 3839/10000 [6:01:49<9:27:57, 5.53s/it] {'loss': 0.0607, 'grad_norm': 1.7081942558288574, 'learning_rate': 
2.823914163713347e-05, 'epoch': 3.84} 38%|███▊ | 3839/10000 [6:01:49<9:27:57, 5.53s/it][2025-06-19 19:31:33,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:31:33,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.36 | bwd_microstep: 3331.85 | bwd_inner_microstep: 3330.80 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.53 [2025-06-19 19:31:33,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.36 | bwd: 3331.87 | bwd_inner: 3330.80 | bwd_allreduce: 1.01 | step: 7.54 38%|███▊ | 3840/10000 [6:01:54<9:26:57, 5.52s/it] {'loss': 0.0128, 'grad_norm': 0.3939504027366638, 'learning_rate': 2.823323887637888e-05, 'epoch': 3.84} 38%|███▊ | 3840/10000 [6:01:54<9:26:57, 5.52s/it][2025-06-19 19:31:39,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:31:39,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.67 | bwd_microstep: 3370.20 | bwd_inner_microstep: 3369.29 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.89 [2025-06-19 19:31:39,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.67 | bwd: 3370.21 | bwd_inner: 3369.29 | bwd_allreduce: 0.88 | step: 6.90 38%|███▊ | 3841/10000 [6:02:00<9:27:44, 5.53s/it] {'loss': 0.0803, 'grad_norm': 1.6433466672897339, 'learning_rate': 2.8227335251995743e-05, 'epoch': 3.84} 38%|███▊ | 3841/10000 [6:02:00<9:27:44, 5.53s/it][2025-06-19 19:31:44,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:31:44,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.22 | bwd_microstep: 3379.59 | bwd_inner_microstep: 3378.72 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.82 [2025-06-19 19:31:44,910] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2141.22 | bwd: 3379.61 | bwd_inner: 3378.72 | bwd_allreduce: 0.84 | step: 6.82 38%|███▊ | 3842/10000 [6:02:05<9:28:28, 5.54s/it] {'loss': 0.0812, 'grad_norm': 1.2231528759002686, 'learning_rate': 2.8221430764603322e-05, 'epoch': 3.84} 38%|███▊ | 3842/10000 [6:02:05<9:28:28, 5.54s/it][2025-06-19 19:31:50,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:31:50,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.06 | bwd_microstep: 3378.16 | bwd_inner_microstep: 3377.24 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.06 [2025-06-19 19:31:50,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.06 | bwd: 3378.17 | bwd_inner: 3377.24 | bwd_allreduce: 0.88 | step: 7.07 38%|███▊ | 3843/10000 [6:02:11<9:28:53, 5.54s/it] {'loss': 0.0562, 'grad_norm': 1.214716911315918, 'learning_rate': 2.8215525414820972e-05, 'epoch': 3.84} 38%|███▊ | 3843/10000 [6:02:11<9:28:53, 5.54s/it][2025-06-19 19:31:56,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:31:56,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.55 | bwd_microstep: 3375.90 | bwd_inner_microstep: 3375.06 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.18 [2025-06-19 19:31:56,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.55 | bwd: 3375.91 | bwd_inner: 3375.06 | bwd_allreduce: 0.80 | step: 7.18 38%|███▊ | 3844/10000 [6:02:16<9:29:12, 5.55s/it] {'loss': 0.1323, 'grad_norm': 2.5933048725128174, 'learning_rate': 2.8209619203268138e-05, 'epoch': 3.84} 38%|███▊ | 3844/10000 [6:02:16<9:29:12, 5.55s/it][2025-06-19 19:32:01,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:32:01,495] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.03 | bwd_microstep: 3324.38 | bwd_inner_microstep: 3323.54 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.80 [2025-06-19 19:32:01,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.03 | bwd: 3324.39 | bwd_inner: 3323.54 | bwd_allreduce: 0.81 | step: 6.80 38%|███▊ | 3845/10000 [6:02:22<9:26:47, 5.53s/it] {'loss': 0.0643, 'grad_norm': 1.4605404138565063, 'learning_rate': 2.8203712130564347e-05, 'epoch': 3.84} 38%|███▊ | 3845/10000 [6:02:22<9:26:47, 5.53s/it][2025-06-19 19:32:06,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:32:06,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.72 | bwd_microstep: 3331.69 | bwd_inner_microstep: 3330.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 19:32:06,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.72 | bwd: 3331.71 | bwd_inner: 3330.90 | bwd_allreduce: 0.76 | step: 6.67 38%|███▊ | 3846/10000 [6:02:27<9:25:29, 5.51s/it] {'loss': 0.0216, 'grad_norm': 0.7858083248138428, 'learning_rate': 2.8197804197329228e-05, 'epoch': 3.85} 38%|███▊ | 3846/10000 [6:02:27<9:25:29, 5.51s/it][2025-06-19 19:32:12,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:32:12,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.04 | bwd_microstep: 3375.94 | bwd_inner_microstep: 3375.00 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.38 [2025-06-19 19:32:12,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.04 | bwd: 3375.95 | bwd_inner: 3375.00 | bwd_allreduce: 0.91 | step: 7.39 38%|███▊ | 3847/10000 [6:02:33<9:26:37, 5.53s/it] {'loss': 0.0203, 'grad_norm': 0.7577469348907471, 'learning_rate': 2.8191895404182496e-05, 'epoch': 3.85} 38%|███▊ | 
3847/10000 [6:02:33<9:26:37, 5.53s/it][2025-06-19 19:32:18,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:32:18,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.75 | bwd_microstep: 3368.64 | bwd_inner_microstep: 3367.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-19 19:32:18,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.75 | bwd: 3368.65 | bwd_inner: 3367.83 | bwd_allreduce: 0.78 | step: 6.90 38%|███▊ | 3848/10000 [6:02:38<9:27:06, 5.53s/it] {'loss': 0.0438, 'grad_norm': 1.5838760137557983, 'learning_rate': 2.818598575174396e-05, 'epoch': 3.85} 38%|███▊ | 3848/10000 [6:02:38<9:27:06, 5.53s/it][2025-06-19 19:32:23,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:32:23,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.94 | bwd_microstep: 3336.30 | bwd_inner_microstep: 3335.26 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.66 [2025-06-19 19:32:23,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.94 | bwd: 3336.31 | bwd_inner: 3335.26 | bwd_allreduce: 1.00 | step: 7.66 38%|███▊ | 3849/10000 [6:02:44<9:26:08, 5.52s/it] {'loss': 0.0479, 'grad_norm': 1.5991002321243286, 'learning_rate': 2.8180075240633505e-05, 'epoch': 3.85} 38%|███▊ | 3849/10000 [6:02:44<9:26:08, 5.52s/it][2025-06-19 19:32:29,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:32:29,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.11 | bwd_microstep: 3365.22 | bwd_inner_microstep: 3364.37 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.98 [2025-06-19 19:32:29,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.11 | bwd: 3365.23 | 
bwd_inner: 3364.37 | bwd_allreduce: 0.81 | step: 6.98 38%|███▊ | 3850/10000 [6:02:49<9:26:24, 5.53s/it] {'loss': 0.0299, 'grad_norm': 0.8593732714653015, 'learning_rate': 2.817416387147113e-05, 'epoch': 3.85} 38%|███▊ | 3850/10000 [6:02:49<9:26:24, 5.53s/it][2025-06-19 19:32:34,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 19:32:34,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.01 | bwd_microstep: 3368.78 | bwd_inner_microstep: 3367.73 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.45 [2025-06-19 19:32:34,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.01 | bwd: 3368.80 | bwd_inner: 3367.73 | bwd_allreduce: 1.02 | step: 7.45 39%|███▊ | 3851/10000 [6:02:55<9:27:10, 5.53s/it] {'loss': 0.0417, 'grad_norm': 1.169663906097412, 'learning_rate': 2.8168251644876895e-05, 'epoch': 3.85} 39%|███▊ | 3851/10000 [6:02:55<9:27:10, 5.53s/it][2025-06-19 19:32:40,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:32:40,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.38 | bwd_microstep: 3321.18 | bwd_inner_microstep: 3320.14 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.41 [2025-06-19 19:32:40,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.38 | bwd: 3321.20 | bwd_inner: 3320.14 | bwd_allreduce: 1.01 | step: 7.41 39%|███▊ | 3852/10000 [6:03:00<9:25:35, 5.52s/it] {'loss': 0.0328, 'grad_norm': 0.8806277513504028, 'learning_rate': 2.8162338561470975e-05, 'epoch': 3.85} 39%|███▊ | 3852/10000 [6:03:00<9:25:35, 5.52s/it][2025-06-19 19:32:45,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:32:45,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.98 | 
bwd_microstep: 3370.90 | bwd_inner_microstep: 3369.98 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.16 [2025-06-19 19:32:45,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.98 | bwd: 3370.91 | bwd_inner: 3369.98 | bwd_allreduce: 0.89 | step: 7.16 39%|███▊ | 3853/10000 [6:03:06<9:26:15, 5.53s/it] {'loss': 0.0562, 'grad_norm': 1.4300732612609863, 'learning_rate': 2.815642462187362e-05, 'epoch': 3.85} 39%|███▊ | 3853/10000 [6:03:06<9:26:15, 5.53s/it][2025-06-19 19:32:51,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:32:51,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.65 | bwd_microstep: 3330.85 | bwd_inner_microstep: 3330.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 19:32:51,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.65 | bwd: 3330.86 | bwd_inner: 3330.04 | bwd_allreduce: 0.78 | step: 6.79 39%|███▊ | 3854/10000 [6:03:11<9:24:48, 5.51s/it] {'loss': 0.0769, 'grad_norm': 1.3320509195327759, 'learning_rate': 2.815050982670518e-05, 'epoch': 3.85} 39%|███▊ | 3854/10000 [6:03:11<9:24:48, 5.51s/it][2025-06-19 19:32:56,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:32:56,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.49 | bwd_microstep: 3317.54 | bwd_inner_microstep: 3316.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 19:32:56,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.49 | bwd: 3317.55 | bwd_inner: 3316.74 | bwd_allreduce: 0.76 | step: 6.80 39%|███▊ | 3855/10000 [6:03:17<9:23:05, 5.50s/it] {'loss': 0.0203, 'grad_norm': 0.8677151799201965, 'learning_rate': 2.8144594176586087e-05, 'epoch': 3.85} 39%|███▊ | 3855/10000 [6:03:17<9:23:05, 5.50s/it][2025-06-19 19:33:02,105] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:33:02,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.25 | bwd_microstep: 3318.39 | bwd_inner_microstep: 3317.59 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 19:33:02,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.25 | bwd: 3318.41 | bwd_inner: 3317.59 | bwd_allreduce: 0.78 | step: 7.13 39%|███▊ | 3856/10000 [6:03:22<9:21:54, 5.49s/it] {'loss': 0.0475, 'grad_norm': 1.4377306699752808, 'learning_rate': 2.813867767213687e-05, 'epoch': 3.86} 39%|███▊ | 3856/10000 [6:03:22<9:21:54, 5.49s/it][2025-06-19 19:33:07,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:33:07,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.89 | bwd_microstep: 3317.01 | bwd_inner_microstep: 3316.04 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.13 [2025-06-19 19:33:07,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.89 | bwd: 3317.03 | bwd_inner: 3316.04 | bwd_allreduce: 0.94 | step: 7.13 39%|███▊ | 3857/10000 [6:03:28<9:21:23, 5.48s/it] {'loss': 0.0187, 'grad_norm': 0.3906092643737793, 'learning_rate': 2.8132760313978125e-05, 'epoch': 3.86} 39%|███▊ | 3857/10000 [6:03:28<9:21:23, 5.48s/it][2025-06-19 19:33:13,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.72 [2025-06-19 19:33:13,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.54 | bwd_microstep: 3332.46 | bwd_inner_microstep: 3331.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-19 19:33:13,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.54 | bwd: 3332.47 | bwd_inner: 3331.67 | bwd_allreduce: 0.76 | step: 6.80 39%|███▊ | 
3858/10000 [6:03:33<9:21:02, 5.48s/it] {'loss': 0.1012, 'grad_norm': 2.282658815383911, 'learning_rate': 2.812684210273058e-05, 'epoch': 3.86} 39%|███▊ | 3858/10000 [6:03:33<9:21:02, 5.48s/it][2025-06-19 19:33:18,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:33:18,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.09 | bwd_microstep: 3323.63 | bwd_inner_microstep: 3322.68 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.37 [2025-06-19 19:33:18,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.09 | bwd: 3323.64 | bwd_inner: 3322.68 | bwd_allreduce: 0.92 | step: 7.37 39%|███▊ | 3859/10000 [6:03:39<9:20:41, 5.48s/it] {'loss': 0.049, 'grad_norm': 1.375651240348816, 'learning_rate': 2.8120923039015012e-05, 'epoch': 3.86} 39%|███▊ | 3859/10000 [6:03:39<9:20:41, 5.48s/it][2025-06-19 19:33:24,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:33:24,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.47 | bwd_microstep: 3324.86 | bwd_inner_microstep: 3324.09 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.64 [2025-06-19 19:33:24,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.47 | bwd: 3324.88 | bwd_inner: 3324.09 | bwd_allreduce: 0.75 | step: 6.65 39%|███▊ | 3860/10000 [6:03:44<9:20:40, 5.48s/it] {'loss': 0.0946, 'grad_norm': 1.6155115365982056, 'learning_rate': 2.8115003123452314e-05, 'epoch': 3.86} 39%|███▊ | 3860/10000 [6:03:44<9:20:40, 5.48s/it][2025-06-19 19:33:29,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:33:29,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.19 | bwd_microstep: 3313.56 | bwd_inner_microstep: 3312.74 | 
bwd_allreduce_microstep: 0.77 | step_microstep: 6.92 [2025-06-19 19:33:29,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.19 | bwd: 3313.58 | bwd_inner: 3312.74 | bwd_allreduce: 0.79 | step: 6.92 39%|███▊ | 3861/10000 [6:03:50<9:19:59, 5.47s/it] {'loss': 0.0595, 'grad_norm': 2.3294973373413086, 'learning_rate': 2.810908235666344e-05, 'epoch': 3.86} 39%|███▊ | 3861/10000 [6:03:50<9:19:59, 5.47s/it][2025-06-19 19:33:35,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:33:35,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.33 | bwd_microstep: 3375.59 | bwd_inner_microstep: 3374.76 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.34 [2025-06-19 19:33:35,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.33 | bwd: 3375.61 | bwd_inner: 3374.76 | bwd_allreduce: 0.80 | step: 7.34 39%|███▊ | 3862/10000 [6:03:55<9:22:16, 5.50s/it] {'loss': 0.0153, 'grad_norm': 0.5616326332092285, 'learning_rate': 2.8103160739269468e-05, 'epoch': 3.86} 39%|███▊ | 3862/10000 [6:03:55<9:22:16, 5.50s/it][2025-06-19 19:33:40,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:33:40,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.16 | bwd_microstep: 3319.76 | bwd_inner_microstep: 3318.86 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.92 [2025-06-19 19:33:40,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.16 | bwd: 3319.78 | bwd_inner: 3318.86 | bwd_allreduce: 0.87 | step: 6.92 39%|███▊ | 3863/10000 [6:04:01<9:21:05, 5.49s/it] {'loss': 0.0174, 'grad_norm': 0.6463903784751892, 'learning_rate': 2.8097238271891538e-05, 'epoch': 3.86} 39%|███▊ | 3863/10000 [6:04:01<9:21:05, 5.49s/it][2025-06-19 19:33:45,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:33:45,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.47 | bwd_microstep: 3324.00 | bwd_inner_microstep: 3323.07 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.00 [2025-06-19 19:33:45,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.47 | bwd: 3324.01 | bwd_inner: 3323.07 | bwd_allreduce: 0.90 | step: 7.01 39%|███▊ | 3864/10000 [6:04:06<9:20:28, 5.48s/it] {'loss': 0.0549, 'grad_norm': 1.3243939876556396, 'learning_rate': 2.80913149551509e-05, 'epoch': 3.86} 39%|███▊ | 3864/10000 [6:04:06<9:20:28, 5.48s/it][2025-06-19 19:33:51,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:33:51,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.07 | bwd_microstep: 3372.13 | bwd_inner_microstep: 3371.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-19 19:33:51,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.07 | bwd: 3372.14 | bwd_inner: 3371.33 | bwd_allreduce: 0.77 | step: 7.01 39%|███▊ | 3865/10000 [6:04:12<9:22:16, 5.50s/it] {'loss': 0.0863, 'grad_norm': 3.2412965297698975, 'learning_rate': 2.808539078966887e-05, 'epoch': 3.87} 39%|███▊ | 3865/10000 [6:04:12<9:22:16, 5.50s/it][2025-06-19 19:33:57,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:33:57,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.08 | bwd_microstep: 3362.08 | bwd_inner_microstep: 3361.12 | bwd_allreduce_microstep: 0.91 | step_microstep: 6.98 [2025-06-19 19:33:57,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.08 | bwd: 3362.10 | bwd_inner: 3361.12 | bwd_allreduce: 0.93 | step: 6.99 39%|███▊ | 3866/10000 [6:04:17<9:23:14, 5.51s/it] {'loss': 0.034, 
'grad_norm': 0.9925917983055115, 'learning_rate': 2.807946577606688e-05, 'epoch': 3.87} 39%|███▊ | 3866/10000 [6:04:17<9:23:14, 5.51s/it][2025-06-19 19:34:02,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.84 [2025-06-19 19:34:02,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.40 | bwd_microstep: 3318.65 | bwd_inner_microstep: 3317.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 19:34:02,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.40 | bwd: 3318.66 | bwd_inner: 3317.85 | bwd_allreduce: 0.77 | step: 6.77 39%|███▊ | 3867/10000 [6:04:23<9:21:38, 5.49s/it] {'loss': 0.0384, 'grad_norm': 0.6698286533355713, 'learning_rate': 2.8073539914966424e-05, 'epoch': 3.87} 39%|███▊ | 3867/10000 [6:04:23<9:21:38, 5.49s/it][2025-06-19 19:34:08,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:34:08,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.22 | bwd_microstep: 3393.89 | bwd_inner_microstep: 3393.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 19:34:08,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.22 | bwd: 3393.91 | bwd_inner: 3393.11 | bwd_allreduce: 0.75 | step: 6.59 39%|███▊ | 3868/10000 [6:04:28<9:23:59, 5.52s/it] {'loss': 0.0361, 'grad_norm': 1.4257375001907349, 'learning_rate': 2.80676132069891e-05, 'epoch': 3.87} 39%|███▊ | 3868/10000 [6:04:28<9:23:59, 5.52s/it][2025-06-19 19:34:13,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:34:13,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.53 | bwd_microstep: 3384.31 | bwd_inner_microstep: 3383.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 
19:34:13,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.53 | bwd: 3384.32 | bwd_inner: 3383.51 | bwd_allreduce: 0.77 | step: 6.89 39%|███▊ | 3869/10000 [6:04:34<9:25:06, 5.53s/it] {'loss': 0.0198, 'grad_norm': 0.6509943008422852, 'learning_rate': 2.8061685652756595e-05, 'epoch': 3.87} 39%|███▊ | 3869/10000 [6:04:34<9:25:06, 5.53s/it][2025-06-19 19:34:19,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:34:19,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.73 | bwd_microstep: 3327.21 | bwd_inner_microstep: 3326.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 19:34:19,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.73 | bwd: 3327.22 | bwd_inner: 3326.41 | bwd_allreduce: 0.76 | step: 6.89 39%|███▊ | 3870/10000 [6:04:39<9:23:22, 5.51s/it] {'loss': 0.0511, 'grad_norm': 0.8885103464126587, 'learning_rate': 2.8055757252890677e-05, 'epoch': 3.87} 39%|███▊ | 3870/10000 [6:04:39<9:23:22, 5.51s/it][2025-06-19 19:34:24,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:34:24,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.27 | bwd_microstep: 3376.06 | bwd_inner_microstep: 3375.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 19:34:24,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.27 | bwd: 3376.07 | bwd_inner: 3375.27 | bwd_allreduce: 0.76 | step: 6.66 39%|███▊ | 3871/10000 [6:04:45<9:24:03, 5.52s/it] {'loss': 0.0372, 'grad_norm': 0.990402102470398, 'learning_rate': 2.8049828008013213e-05, 'epoch': 3.87} 39%|███▊ | 3871/10000 [6:04:45<9:24:03, 5.52s/it][2025-06-19 19:34:30,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 
[2025-06-19 19:34:30,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.90 | bwd_microstep: 3323.96 | bwd_inner_microstep: 3323.14 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.26 [2025-06-19 19:34:30,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.90 | bwd: 3323.97 | bwd_inner: 3323.14 | bwd_allreduce: 0.79 | step: 7.26 39%|███▊ | 3872/10000 [6:04:50<9:22:18, 5.51s/it] {'loss': 0.0495, 'grad_norm': 1.2472236156463623, 'learning_rate': 2.804389791874615e-05, 'epoch': 3.87} 39%|███▊ | 3872/10000 [6:04:50<9:22:18, 5.51s/it][2025-06-19 19:34:35,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:34:35,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.24 | bwd_microstep: 3324.98 | bwd_inner_microstep: 3324.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 19:34:35,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.24 | bwd: 3325.00 | bwd_inner: 3324.18 | bwd_allreduce: 0.77 | step: 6.88 39%|███▊ | 3873/10000 [6:04:56<9:20:54, 5.49s/it] {'loss': 0.0071, 'grad_norm': 0.2146323025226593, 'learning_rate': 2.8037966985711535e-05, 'epoch': 3.87} 39%|███▊ | 3873/10000 [6:04:56<9:20:54, 5.49s/it][2025-06-19 19:34:41,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:34:41,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.04 | bwd_microstep: 3331.76 | bwd_inner_microstep: 3330.93 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.78 [2025-06-19 19:34:41,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.04 | bwd: 3331.78 | bwd_inner: 3330.93 | bwd_allreduce: 0.80 | step: 6.79 39%|███▊ | 3874/10000 [6:05:01<9:20:13, 5.49s/it] {'loss': 0.0116, 'grad_norm': 0.3218778669834137, 'learning_rate': 2.8032035209531483e-05, 
'epoch': 3.87} 39%|███▊ | 3874/10000 [6:05:01<9:20:13, 5.49s/it][2025-06-19 19:34:46,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:34:46,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.17 | bwd_microstep: 3309.30 | bwd_inner_microstep: 3308.44 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.82 [2025-06-19 19:34:46,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.17 | bwd: 3309.31 | bwd_inner: 3308.44 | bwd_allreduce: 0.83 | step: 6.83 39%|███▉ | 3875/10000 [6:05:07<9:19:02, 5.48s/it] {'loss': 0.0406, 'grad_norm': 1.5929514169692993, 'learning_rate': 2.8026102590828216e-05, 'epoch': 3.88} 39%|███▉ | 3875/10000 [6:05:07<9:19:02, 5.48s/it][2025-06-19 19:34:51,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:34:51,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.00 | bwd_microstep: 3322.86 | bwd_inner_microstep: 3322.06 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 19:34:51,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.00 | bwd: 3322.88 | bwd_inner: 3322.06 | bwd_allreduce: 0.78 | step: 6.96 39%|███▉ | 3876/10000 [6:05:12<9:18:34, 5.47s/it] {'loss': 0.1579, 'grad_norm': 2.676377534866333, 'learning_rate': 2.8020169130224036e-05, 'epoch': 3.88} 39%|███▉ | 3876/10000 [6:05:12<9:18:34, 5.47s/it][2025-06-19 19:34:57,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:34:57,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.48 | bwd_microstep: 3368.16 | bwd_inner_microstep: 3367.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 19:34:57,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2123.48 | bwd: 3368.18 | bwd_inner: 3367.35 | bwd_allreduce: 0.78 | step: 6.96 39%|███▉ | 3877/10000 [6:05:18<9:20:12, 5.49s/it] {'loss': 0.0717, 'grad_norm': 1.732718825340271, 'learning_rate': 2.801423482834134e-05, 'epoch': 3.88} 39%|███▉ | 3877/10000 [6:05:18<9:20:12, 5.49s/it][2025-06-19 19:35:03,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:35:03,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.68 | bwd_microstep: 3395.06 | bwd_inner_microstep: 3394.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 19:35:03,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.68 | bwd: 3395.07 | bwd_inner: 3394.26 | bwd_allreduce: 0.77 | step: 6.74 39%|███▉ | 3878/10000 [6:05:23<9:22:23, 5.51s/it] {'loss': 0.0104, 'grad_norm': 0.2993035614490509, 'learning_rate': 2.800829968580261e-05, 'epoch': 3.88} 39%|███▉ | 3878/10000 [6:05:23<9:22:23, 5.51s/it][2025-06-19 19:35:08,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:35:08,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.70 | bwd_microstep: 3376.77 | bwd_inner_microstep: 3375.74 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.41 [2025-06-19 19:35:08,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.70 | bwd: 3376.78 | bwd_inner: 3375.74 | bwd_allreduce: 0.99 | step: 7.41 39%|███▉ | 3879/10000 [6:05:29<9:23:23, 5.52s/it] {'loss': 0.096, 'grad_norm': 3.0154449939727783, 'learning_rate': 2.8002363703230406e-05, 'epoch': 3.88} 39%|███▉ | 3879/10000 [6:05:29<9:23:23, 5.52s/it][2025-06-19 19:35:14,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:35:14,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2104.24 | bwd_microstep: 3314.44 | bwd_inner_microstep: 3313.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 19:35:14,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.24 | bwd: 3314.45 | bwd_inner: 3313.64 | bwd_allreduce: 0.77 | step: 6.78 39%|███▉ | 3880/10000 [6:05:34<9:21:17, 5.50s/it] {'loss': 0.0852, 'grad_norm': 2.830902576446533, 'learning_rate': 2.7996426881247382e-05, 'epoch': 3.88} 39%|███▉ | 3880/10000 [6:05:34<9:21:17, 5.50s/it][2025-06-19 19:35:19,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:35:19,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.87 | bwd_microstep: 3369.90 | bwd_inner_microstep: 3369.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-19 19:35:19,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.87 | bwd: 3369.92 | bwd_inner: 3369.10 | bwd_allreduce: 0.77 | step: 6.79 39%|███▉ | 3881/10000 [6:05:40<9:22:05, 5.51s/it] {'loss': 0.0208, 'grad_norm': 1.297672986984253, 'learning_rate': 2.79904892204763e-05, 'epoch': 3.88} 39%|███▉ | 3881/10000 [6:05:40<9:22:05, 5.51s/it][2025-06-19 19:35:25,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:35:25,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.53 | bwd_microstep: 3323.03 | bwd_inner_microstep: 3322.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 19:35:25,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.53 | bwd: 3323.05 | bwd_inner: 3322.23 | bwd_allreduce: 0.77 | step: 6.84 39%|███▉ | 3882/10000 [6:05:45<9:20:31, 5.50s/it] {'loss': 0.0316, 'grad_norm': 1.5799784660339355, 'learning_rate': 2.7984550721539978e-05, 'epoch': 3.88} 39%|███▉ | 3882/10000 [6:05:45<9:20:31, 5.50s/it][2025-06-19 
19:35:30,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:35:30,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.91 | bwd_microstep: 3332.06 | bwd_inner_microstep: 3331.25 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-19 19:35:30,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.91 | bwd: 3332.07 | bwd_inner: 3331.25 | bwd_allreduce: 0.78 | step: 7.24 39%|███▉ | 3883/10000 [6:05:51<9:19:52, 5.49s/it] {'loss': 0.0197, 'grad_norm': 1.2537070512771606, 'learning_rate': 2.797861138506135e-05, 'epoch': 3.88} 39%|███▉ | 3883/10000 [6:05:51<9:19:52, 5.49s/it][2025-06-19 19:35:36,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:35:36,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.73 | bwd_microstep: 3388.59 | bwd_inner_microstep: 3387.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 19:35:36,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.73 | bwd: 3388.60 | bwd_inner: 3387.80 | bwd_allreduce: 0.76 | step: 6.77 39%|███▉ | 3884/10000 [6:05:56<9:22:04, 5.51s/it] {'loss': 0.0267, 'grad_norm': 0.8076044917106628, 'learning_rate': 2.7972671211663415e-05, 'epoch': 3.88} 39%|███▉ | 3884/10000 [6:05:56<9:22:04, 5.51s/it][2025-06-19 19:35:41,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:35:41,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.20 | bwd_microstep: 3308.63 | bwd_inner_microstep: 3307.66 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.21 [2025-06-19 19:35:41,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.20 | bwd: 3308.64 | bwd_inner: 3307.66 | bwd_allreduce: 0.94 | step: 
7.21 39%|███▉ | 3885/10000 [6:06:02<9:20:16, 5.50s/it] {'loss': 0.0267, 'grad_norm': 1.0080899000167847, 'learning_rate': 2.7966730201969267e-05, 'epoch': 3.88} 39%|███▉ | 3885/10000 [6:06:02<9:20:16, 5.50s/it][2025-06-19 19:35:47,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:35:47,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.43 | bwd_microstep: 3320.49 | bwd_inner_microstep: 3319.68 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.19 [2025-06-19 19:35:47,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.43 | bwd: 3320.51 | bwd_inner: 3319.68 | bwd_allreduce: 0.79 | step: 7.20 39%|███▉ | 3886/10000 [6:06:07<9:19:28, 5.49s/it] {'loss': 0.0187, 'grad_norm': 0.530333936214447, 'learning_rate': 2.7960788356602103e-05, 'epoch': 3.89} 39%|███▉ | 3886/10000 [6:06:07<9:19:28, 5.49s/it][2025-06-19 19:35:52,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:35:52,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.17 | bwd_microstep: 3317.01 | bwd_inner_microstep: 3315.90 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.08 [2025-06-19 19:35:52,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.17 | bwd: 3316.83 | bwd_inner: 3315.90 | bwd_allreduce: 0.88 | step: 7.09 39%|███▉ | 3887/10000 [6:06:13<9:18:37, 5.48s/it] {'loss': 0.0255, 'grad_norm': 0.8459738492965698, 'learning_rate': 2.7954845676185184e-05, 'epoch': 3.89} 39%|███▉ | 3887/10000 [6:06:13<9:18:37, 5.48s/it][2025-06-19 19:35:57,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:35:57,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.75 | bwd_microstep: 3313.74 | bwd_inner_microstep: 
3312.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 19:35:57,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.75 | bwd: 3313.76 | bwd_inner: 3312.96 | bwd_allreduce: 0.75 | step: 6.54 39%|███▉ | 3888/10000 [6:06:18<9:17:50, 5.48s/it] {'loss': 0.0241, 'grad_norm': 0.8428340554237366, 'learning_rate': 2.7948902161341868e-05, 'epoch': 3.89} 39%|███▉ | 3888/10000 [6:06:18<9:17:50, 5.48s/it][2025-06-19 19:36:03,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:36:03,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.25 | bwd_microstep: 3321.13 | bwd_inner_microstep: 3320.28 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.39 [2025-06-19 19:36:03,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.25 | bwd: 3321.14 | bwd_inner: 3320.28 | bwd_allreduce: 0.81 | step: 7.39 39%|███▉ | 3889/10000 [6:06:24<9:17:16, 5.47s/it] {'loss': 0.026, 'grad_norm': 2.4553189277648926, 'learning_rate': 2.7942957812695613e-05, 'epoch': 3.89} 39%|███▉ | 3889/10000 [6:06:24<9:17:16, 5.47s/it][2025-06-19 19:36:08,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:36:08,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.32 | bwd_microstep: 3314.32 | bwd_inner_microstep: 3313.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.93 [2025-06-19 19:36:08,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.32 | bwd: 3314.51 | bwd_inner: 3313.53 | bwd_allreduce: 0.76 | step: 6.93 39%|███▉ | 3890/10000 [6:06:29<9:16:35, 5.47s/it] {'loss': 0.1376, 'grad_norm': 2.4191858768463135, 'learning_rate': 2.7937012630869946e-05, 'epoch': 3.89} 39%|███▉ | 3890/10000 [6:06:29<9:16:35, 5.47s/it][2025-06-19 19:36:14,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:36:14,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.22 | bwd_microstep: 3310.35 | bwd_inner_microstep: 3309.38 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.14 [2025-06-19 19:36:14,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.22 | bwd: 3310.37 | bwd_inner: 3309.38 | bwd_allreduce: 0.95 | step: 7.14 39%|███▉ | 3891/10000 [6:06:35<9:16:07, 5.46s/it] {'loss': 0.0731, 'grad_norm': 2.088334321975708, 'learning_rate': 2.7931066616488497e-05, 'epoch': 3.89} 39%|███▉ | 3891/10000 [6:06:35<9:16:07, 5.46s/it][2025-06-19 19:36:19,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:36:19,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.00 | bwd_microstep: 3311.64 | bwd_inner_microstep: 3310.72 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.79 [2025-06-19 19:36:19,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.00 | bwd: 3311.65 | bwd_inner: 3310.72 | bwd_allreduce: 0.88 | step: 6.79 39%|███▉ | 3892/10000 [6:06:40<9:15:44, 5.46s/it] {'loss': 0.0233, 'grad_norm': 1.1183823347091675, 'learning_rate': 2.7925119770174963e-05, 'epoch': 3.89} 39%|███▉ | 3892/10000 [6:06:40<9:15:44, 5.46s/it][2025-06-19 19:36:25,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:36:25,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.13 | bwd_microstep: 3317.97 | bwd_inner_microstep: 3317.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 19:36:25,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.13 | bwd: 3317.99 | bwd_inner: 3317.17 | bwd_allreduce: 0.77 | step: 6.74 39%|███▉ | 3893/10000 [6:06:46<9:15:39, 5.46s/it] {'loss': 
0.0654, 'grad_norm': 1.8835418224334717, 'learning_rate': 2.7919172092553144e-05, 'epoch': 3.89} 39%|███▉ | 3893/10000 [6:06:46<9:15:39, 5.46s/it][2025-06-19 19:36:30,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:36:30,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.71 | bwd_microstep: 3312.94 | bwd_inner_microstep: 3312.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 19:36:30,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.71 | bwd: 3312.95 | bwd_inner: 3312.15 | bwd_allreduce: 0.76 | step: 6.66 39%|███▉ | 3894/10000 [6:06:51<9:15:39, 5.46s/it] {'loss': 0.024, 'grad_norm': 1.1685194969177246, 'learning_rate': 2.7913223584246937e-05, 'epoch': 3.89} 39%|███▉ | 3894/10000 [6:06:51<9:15:39, 5.46s/it][2025-06-19 19:36:36,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:36:36,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.13 | bwd_microstep: 3362.41 | bwd_inner_microstep: 3361.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 19:36:36,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.13 | bwd: 3362.42 | bwd_inner: 3361.60 | bwd_allreduce: 0.78 | step: 7.23 39%|███▉ | 3895/10000 [6:06:57<9:17:41, 5.48s/it] {'loss': 0.038, 'grad_norm': 2.2542150020599365, 'learning_rate': 2.7907274245880293e-05, 'epoch': 3.9} 39%|███▉ | 3895/10000 [6:06:57<9:17:41, 5.48s/it][2025-06-19 19:36:41,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.74 [2025-06-19 19:36:41,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.38 | bwd_microstep: 3369.70 | bwd_inner_microstep: 3368.62 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.28 
[2025-06-19 19:36:41,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.38 | bwd: 3369.72 | bwd_inner: 3368.62 | bwd_allreduce: 1.04 | step: 7.28 39%|███▉ | 3896/10000 [6:07:02<9:19:22, 5.50s/it] {'loss': 0.0391, 'grad_norm': 1.8785247802734375, 'learning_rate': 2.7901324078077287e-05, 'epoch': 3.9} 39%|███▉ | 3896/10000 [6:07:02<9:19:22, 5.50s/it][2025-06-19 19:36:47,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:36:47,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.41 | bwd_microstep: 3310.96 | bwd_inner_microstep: 3310.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 19:36:47,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.41 | bwd: 3310.98 | bwd_inner: 3310.15 | bwd_allreduce: 0.78 | step: 7.01 39%|███▉ | 3897/10000 [6:07:08<9:18:07, 5.49s/it] {'loss': 0.0216, 'grad_norm': 0.6902146339416504, 'learning_rate': 2.789537308146205e-05, 'epoch': 3.9} 39%|███▉ | 3897/10000 [6:07:08<9:18:07, 5.49s/it][2025-06-19 19:36:52,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:36:52,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.46 | bwd_microstep: 3391.30 | bwd_inner_microstep: 3390.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 19:36:52,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.46 | bwd: 3391.32 | bwd_inner: 3390.51 | bwd_allreduce: 0.76 | step: 6.94 39%|███▉ | 3898/10000 [6:07:13<9:20:34, 5.51s/it] {'loss': 0.0324, 'grad_norm': 1.0553690195083618, 'learning_rate': 2.7889421256658818e-05, 'epoch': 3.9} 39%|███▉ | 3898/10000 [6:07:13<9:20:34, 5.51s/it][2025-06-19 19:36:58,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 
2.73 [2025-06-19 19:36:58,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.91 | bwd_microstep: 3320.88 | bwd_inner_microstep: 3320.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 19:36:58,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.92 | bwd: 3320.89 | bwd_inner: 3320.08 | bwd_allreduce: 0.76 | step: 6.62 39%|███▉ | 3899/10000 [6:07:19<9:19:11, 5.50s/it] {'loss': 0.0494, 'grad_norm': 2.0288031101226807, 'learning_rate': 2.788346860429192e-05, 'epoch': 3.9} 39%|███▉ | 3899/10000 [6:07:19<9:19:11, 5.50s/it][2025-06-19 19:37:03,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:37:03,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.97 | bwd_microstep: 3311.59 | bwd_inner_microstep: 3310.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 19:37:03,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.97 | bwd: 3311.61 | bwd_inner: 3310.80 | bwd_allreduce: 0.77 | step: 6.85 39%|███▉ | 3900/10000 [6:07:24<9:17:50, 5.49s/it] {'loss': 0.0138, 'grad_norm': 0.7996427416801453, 'learning_rate': 2.7877515124985745e-05, 'epoch': 3.9} 39%|███▉ | 3900/10000 [6:07:24<9:17:50, 5.49s/it][2025-06-19 19:37:09,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:37:09,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.05 | bwd_microstep: 3317.94 | bwd_inner_microstep: 3317.12 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.27 [2025-06-19 19:37:09,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.05 | bwd: 3317.95 | bwd_inner: 3317.12 | bwd_allreduce: 0.78 | step: 7.27 39%|███▉ | 3901/10000 [6:07:29<9:16:57, 5.48s/it] {'loss': 0.0058, 'grad_norm': 0.19705937802791595, 'learning_rate': 2.78715608193648e-05, 
'epoch': 3.9} 39%|███▉ | 3901/10000 [6:07:29<9:16:57, 5.48s/it][2025-06-19 19:37:14,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:37:14,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.82 | bwd_microstep: 3362.70 | bwd_inner_microstep: 3361.91 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 19:37:14,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.82 | bwd: 3362.71 | bwd_inner: 3361.91 | bwd_allreduce: 0.76 | step: 6.67 39%|███▉ | 3902/10000 [6:07:35<9:18:17, 5.49s/it] {'loss': 0.0042, 'grad_norm': 0.186209574341774, 'learning_rate': 2.786560568805366e-05, 'epoch': 3.9} 39%|███▉ | 3902/10000 [6:07:35<9:18:17, 5.49s/it][2025-06-19 19:37:20,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:37:20,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.67 | bwd_microstep: 3365.68 | bwd_inner_microstep: 3364.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 19:37:20,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.67 | bwd: 3365.69 | bwd_inner: 3364.87 | bwd_allreduce: 0.78 | step: 7.13 39%|███▉ | 3903/10000 [6:07:41<9:19:22, 5.50s/it] {'loss': 0.0432, 'grad_norm': 2.002516508102417, 'learning_rate': 2.785964973167698e-05, 'epoch': 3.9} 39%|███▉ | 3903/10000 [6:07:41<9:19:22, 5.50s/it][2025-06-19 19:37:25,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:37:25,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.01 | bwd_microstep: 3311.82 | bwd_inner_microstep: 3310.85 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.15 [2025-06-19 19:37:25,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2101.01 | bwd: 3311.83 | bwd_inner: 3310.85 | bwd_allreduce: 0.94 | step: 7.15 39%|███▉ | 3904/10000 [6:07:46<9:17:38, 5.49s/it] {'loss': 0.0141, 'grad_norm': 0.6407009363174438, 'learning_rate': 2.7853692950859534e-05, 'epoch': 3.9} 39%|███▉ | 3904/10000 [6:07:46<9:17:38, 5.49s/it][2025-06-19 19:37:31,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:37:31,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.81 | bwd_microstep: 3361.37 | bwd_inner_microstep: 3360.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 19:37:31,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.81 | bwd: 3361.38 | bwd_inner: 3360.57 | bwd_allreduce: 0.77 | step: 7.05 39%|███▉ | 3905/10000 [6:07:52<9:18:56, 5.50s/it] {'loss': 0.2798, 'grad_norm': 3.081544876098633, 'learning_rate': 2.7847735346226134e-05, 'epoch': 3.91} 39%|███▉ | 3905/10000 [6:07:52<9:18:56, 5.50s/it][2025-06-19 19:37:36,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:37:36,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.85 | bwd_microstep: 3319.02 | bwd_inner_microstep: 3317.99 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.44 [2025-06-19 19:37:36,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.85 | bwd: 3319.04 | bwd_inner: 3317.99 | bwd_allreduce: 0.99 | step: 7.45 39%|███▉ | 3906/10000 [6:07:57<9:18:08, 5.50s/it] {'loss': 0.0532, 'grad_norm': 2.8803651332855225, 'learning_rate': 2.7841776918401727e-05, 'epoch': 3.91} 39%|███▉ | 3906/10000 [6:07:57<9:18:08, 5.50s/it][2025-06-19 19:37:42,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 19:37:42,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2112.77 | bwd_microstep: 3311.25 | bwd_inner_microstep: 3310.18 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.82 [2025-06-19 19:37:42,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.77 | bwd: 3311.27 | bwd_inner: 3310.18 | bwd_allreduce: 1.04 | step: 7.82 39%|███▉ | 3907/10000 [6:08:02<9:17:21, 5.49s/it] {'loss': 0.0468, 'grad_norm': 3.9252548217773438, 'learning_rate': 2.7835817668011313e-05, 'epoch': 3.91} 39%|███▉ | 3907/10000 [6:08:02<9:17:21, 5.49s/it][2025-06-19 19:37:47,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 19:37:47,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.11 | bwd_microstep: 3370.18 | bwd_inner_microstep: 3369.09 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.43 [2025-06-19 19:37:47,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.11 | bwd: 3370.20 | bwd_inner: 3369.09 | bwd_allreduce: 1.05 | step: 7.44 39%|███▉ | 3908/10000 [6:08:08<9:19:33, 5.51s/it] {'loss': 0.0978, 'grad_norm': 8.582655906677246, 'learning_rate': 2.782985759567999e-05, 'epoch': 3.91} 39%|███▉ | 3908/10000 [6:08:08<9:19:33, 5.51s/it][2025-06-19 19:37:53,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:37:53,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.77 | bwd_microstep: 3335.28 | bwd_inner_microstep: 3334.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 19:37:53,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.77 | bwd: 3335.30 | bwd_inner: 3334.49 | bwd_allreduce: 0.76 | step: 6.70 39%|███▉ | 3909/10000 [6:08:14<9:18:34, 5.50s/it] {'loss': 0.099, 'grad_norm': 2.1613991260528564, 'learning_rate': 2.782389670203295e-05, 'epoch': 3.91} 39%|███▉ | 3909/10000 [6:08:14<9:18:34, 5.50s/it][2025-06-19 
19:37:58,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:37:58,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.76 | bwd_microstep: 3328.26 | bwd_inner_microstep: 3327.25 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.82 [2025-06-19 19:37:58,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.76 | bwd: 3328.27 | bwd_inner: 3327.25 | bwd_allreduce: 0.97 | step: 7.83 39%|███▉ | 3910/10000 [6:08:19<9:17:54, 5.50s/it] {'loss': 0.163, 'grad_norm': 3.4557745456695557, 'learning_rate': 2.781793498769546e-05, 'epoch': 3.91} 39%|███▉ | 3910/10000 [6:08:19<9:17:54, 5.50s/it][2025-06-19 19:38:04,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.72 | optimizer_step: 2.73 [2025-06-19 19:38:04,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.94 | bwd_microstep: 3319.98 | bwd_inner_microstep: 3319.08 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.22 [2025-06-19 19:38:04,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.94 | bwd: 3320.00 | bwd_inner: 3319.08 | bwd_allreduce: 0.86 | step: 7.22 39%|███▉ | 3911/10000 [6:08:24<9:16:59, 5.49s/it] {'loss': 0.0298, 'grad_norm': 1.2554192543029785, 'learning_rate': 2.781197245329287e-05, 'epoch': 3.91} 39%|███▉ | 3911/10000 [6:08:24<9:16:59, 5.49s/it][2025-06-19 19:38:09,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:38:09,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.40 | bwd_microstep: 3329.80 | bwd_inner_microstep: 3328.73 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.25 [2025-06-19 19:38:09,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.40 | bwd: 3329.82 | bwd_inner: 3328.73 | bwd_allreduce: 1.04 | step: 7.26 
39%|███▉ | 3912/10000 [6:08:30<9:16:24, 5.48s/it] {'loss': 0.0407, 'grad_norm': 1.9265199899673462, 'learning_rate': 2.7806009099450625e-05, 'epoch': 3.91} 39%|███▉ | 3912/10000 [6:08:30<9:16:24, 5.48s/it][2025-06-19 19:38:15,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:38:15,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.70 | bwd_microstep: 3311.81 | bwd_inner_microstep: 3311.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 19:38:15,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.70 | bwd: 3311.82 | bwd_inner: 3311.01 | bwd_allreduce: 0.77 | step: 7.02 39%|███▉ | 3913/10000 [6:08:35<9:15:47, 5.48s/it] {'loss': 0.0634, 'grad_norm': 2.1396877765655518, 'learning_rate': 2.7800044926794254e-05, 'epoch': 3.91} 39%|███▉ | 3913/10000 [6:08:35<9:15:47, 5.48s/it][2025-06-19 19:38:20,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:38:20,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.03 | bwd_microstep: 3373.90 | bwd_inner_microstep: 3373.00 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.02 [2025-06-19 19:38:20,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.03 | bwd: 3373.92 | bwd_inner: 3373.00 | bwd_allreduce: 0.87 | step: 7.02 39%|███▉ | 3914/10000 [6:08:41<9:17:37, 5.50s/it] {'loss': 0.0507, 'grad_norm': 0.9326047301292419, 'learning_rate': 2.7794079935949375e-05, 'epoch': 3.91} 39%|███▉ | 3914/10000 [6:08:41<9:17:37, 5.50s/it][2025-06-19 19:38:26,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:38:26,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.94 | bwd_microstep: 3315.99 | bwd_inner_microstep: 
3315.13 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.07 [2025-06-19 19:38:26,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.94 | bwd: 3316.00 | bwd_inner: 3315.14 | bwd_allreduce: 0.82 | step: 7.07 39%|███▉ | 3915/10000 [6:08:46<9:16:29, 5.49s/it] {'loss': 0.0805, 'grad_norm': 2.35302734375, 'learning_rate': 2.7788114127541682e-05, 'epoch': 3.92} 39%|███▉ | 3915/10000 [6:08:46<9:16:29, 5.49s/it][2025-06-19 19:38:31,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:38:31,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.23 | bwd_microstep: 3380.41 | bwd_inner_microstep: 3379.61 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-19 19:38:31,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.23 | bwd: 3380.42 | bwd_inner: 3379.61 | bwd_allreduce: 0.77 | step: 6.91 39%|███▉ | 3916/10000 [6:08:52<9:18:25, 5.51s/it] {'loss': 0.2207, 'grad_norm': 3.0783374309539795, 'learning_rate': 2.7782147502196962e-05, 'epoch': 3.92} 39%|███▉ | 3916/10000 [6:08:52<9:18:25, 5.51s/it][2025-06-19 19:38:37,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:38:37,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.71 | bwd_microstep: 3374.37 | bwd_inner_microstep: 3373.41 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.72 [2025-06-19 19:38:37,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.71 | bwd: 3374.39 | bwd_inner: 3373.41 | bwd_allreduce: 0.93 | step: 7.72 39%|███▉ | 3917/10000 [6:08:58<9:19:27, 5.52s/it] {'loss': 0.1059, 'grad_norm': 4.874753475189209, 'learning_rate': 2.777618006054109e-05, 'epoch': 3.92} 39%|███▉ | 3917/10000 [6:08:58<9:19:27, 5.52s/it][2025-06-19 19:38:42,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:38:42,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.77 | bwd_microstep: 3325.48 | bwd_inner_microstep: 3324.46 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.29 [2025-06-19 19:38:42,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.77 | bwd: 3325.49 | bwd_inner: 3324.46 | bwd_allreduce: 0.99 | step: 7.29 39%|███▉ | 3918/10000 [6:09:03<9:18:06, 5.51s/it] {'loss': 0.1171, 'grad_norm': 1.8690310716629028, 'learning_rate': 2.7770211803200013e-05, 'epoch': 3.92} 39%|███▉ | 3918/10000 [6:09:03<9:18:06, 5.51s/it][2025-06-19 19:38:48,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.73 | optimizer_step: 2.73 [2025-06-19 19:38:48,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.21 | bwd_microstep: 3381.71 | bwd_inner_microstep: 3380.90 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 19:38:48,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.21 | bwd: 3381.73 | bwd_inner: 3380.90 | bwd_allreduce: 0.79 | step: 7.13 39%|███▉ | 3919/10000 [6:09:09<9:19:58, 5.53s/it] {'loss': 0.0111, 'grad_norm': 1.2647067308425903, 'learning_rate': 2.776424273079979e-05, 'epoch': 3.92} 39%|███▉ | 3919/10000 [6:09:09<9:19:58, 5.53s/it][2025-06-19 19:38:53,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:38:53,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.48 | bwd_microstep: 3326.46 | bwd_inner_microstep: 3325.50 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.36 [2025-06-19 19:38:53,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.48 | bwd: 3326.47 | bwd_inner: 3325.50 | bwd_allreduce: 0.92 | step: 7.37 39%|███▉ | 3920/10000 [6:09:14<9:18:44, 5.51s/it] {'loss': 
0.0264, 'grad_norm': 1.5374500751495361, 'learning_rate': 2.775827284396653e-05, 'epoch': 3.92} 39%|███▉ | 3920/10000 [6:09:14<9:18:44, 5.51s/it][2025-06-19 19:38:59,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:38:59,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.96 | bwd_microstep: 3342.28 | bwd_inner_microstep: 3341.47 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.21 [2025-06-19 19:38:59,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.96 | bwd: 3342.30 | bwd_inner: 3341.47 | bwd_allreduce: 0.79 | step: 7.21 39%|███▉ | 3921/10000 [6:09:20<9:18:00, 5.51s/it] {'loss': 0.0396, 'grad_norm': 0.9441166520118713, 'learning_rate': 2.7752302143326466e-05, 'epoch': 3.92} 39%|███▉ | 3921/10000 [6:09:20<9:18:00, 5.51s/it][2025-06-19 19:39:04,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:39:04,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.35 | bwd_microstep: 3337.47 | bwd_inner_microstep: 3336.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 19:39:04,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.35 | bwd: 3337.48 | bwd_inner: 3336.69 | bwd_allreduce: 0.75 | step: 6.65 39%|███▉ | 3922/10000 [6:09:25<9:17:31, 5.50s/it] {'loss': 0.0996, 'grad_norm': 2.4248528480529785, 'learning_rate': 2.7746330629505882e-05, 'epoch': 3.92} 39%|███▉ | 3922/10000 [6:09:25<9:17:31, 5.50s/it][2025-06-19 19:39:10,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:39:10,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.56 | bwd_microstep: 3381.51 | bwd_inner_microstep: 3380.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 
[2025-06-19 19:39:10,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.56 | bwd: 3381.52 | bwd_inner: 3380.73 | bwd_allreduce: 0.75 | step: 6.56 39%|███▉ | 3923/10000 [6:09:31<9:18:58, 5.52s/it] {'loss': 0.0161, 'grad_norm': 0.8320306539535522, 'learning_rate': 2.7740358303131162e-05, 'epoch': 3.92} 39%|███▉ | 3923/10000 [6:09:31<9:18:58, 5.52s/it][2025-06-19 19:39:15,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:39:15,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.77 | bwd_microstep: 3336.20 | bwd_inner_microstep: 3335.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 19:39:15,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.77 | bwd: 3336.21 | bwd_inner: 3335.41 | bwd_allreduce: 0.76 | step: 6.61 39%|███▉ | 3924/10000 [6:09:36<9:17:51, 5.51s/it] {'loss': 0.0613, 'grad_norm': 1.0661029815673828, 'learning_rate': 2.7734385164828785e-05, 'epoch': 3.92} 39%|███▉ | 3924/10000 [6:09:36<9:17:51, 5.51s/it][2025-06-19 19:39:21,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:39:21,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.08 | bwd_microstep: 3333.23 | bwd_inner_microstep: 3331.31 | bwd_allreduce_microstep: 1.87 | step_microstep: 7.54 [2025-06-19 19:39:21,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.08 | bwd: 3333.25 | bwd_inner: 3331.31 | bwd_allreduce: 1.90 | step: 7.54 39%|███▉ | 3925/10000 [6:09:42<9:17:01, 5.50s/it] {'loss': 0.0399, 'grad_norm': 2.790325164794922, 'learning_rate': 2.7728411215225296e-05, 'epoch': 3.92} 39%|███▉ | 3925/10000 [6:09:42<9:17:01, 5.50s/it][2025-06-19 19:39:26,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 
3.03 [2025-06-19 19:39:26,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.26 | bwd_microstep: 3395.52 | bwd_inner_microstep: 3394.49 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.78 [2025-06-19 19:39:26,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.26 | bwd: 3395.53 | bwd_inner: 3394.49 | bwd_allreduce: 1.00 | step: 7.79 39%|███▉ | 3926/10000 [6:09:47<9:19:28, 5.53s/it] {'loss': 0.0281, 'grad_norm': 0.9764383435249329, 'learning_rate': 2.7722436454947348e-05, 'epoch': 3.93} 39%|███▉ | 3926/10000 [6:09:47<9:19:28, 5.53s/it][2025-06-19 19:39:32,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:39:32,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.54 | bwd_microstep: 3329.90 | bwd_inner_microstep: 3329.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 19:39:32,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.54 | bwd: 3329.92 | bwd_inner: 3329.12 | bwd_allreduce: 0.75 | step: 6.56 39%|███▉ | 3927/10000 [6:09:53<9:18:14, 5.52s/it] {'loss': 0.0506, 'grad_norm': 0.72221440076828, 'learning_rate': 2.7716460884621656e-05, 'epoch': 3.93} 39%|███▉ | 3927/10000 [6:09:53<9:18:14, 5.52s/it][2025-06-19 19:39:37,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:39:37,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.88 | bwd_microstep: 3329.41 | bwd_inner_microstep: 3328.34 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.25 [2025-06-19 19:39:37,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.88 | bwd: 3329.43 | bwd_inner: 3328.34 | bwd_allreduce: 1.04 | step: 7.25 39%|███▉ | 3928/10000 [6:09:58<9:17:10, 5.51s/it] {'loss': 0.0412, 'grad_norm': 1.075242042541504, 'learning_rate': 
2.7710484504875023e-05, 'epoch': 3.93} 39%|███▉ | 3928/10000 [6:09:58<9:17:10, 5.51s/it][2025-06-19 19:39:43,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:39:43,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.50 | bwd_microstep: 3325.29 | bwd_inner_microstep: 3324.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 19:39:43,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.50 | bwd: 3325.30 | bwd_inner: 3324.49 | bwd_allreduce: 0.77 | step: 6.71 39%|███▉ | 3929/10000 [6:10:04<9:16:29, 5.50s/it] {'loss': 0.024, 'grad_norm': 0.8046867847442627, 'learning_rate': 2.7704507316334356e-05, 'epoch': 3.93} 39%|███▉ | 3929/10000 [6:10:04<9:16:29, 5.50s/it][2025-06-19 19:39:48,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:39:48,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.44 | bwd_microstep: 3337.64 | bwd_inner_microstep: 3336.55 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.62 [2025-06-19 19:39:48,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.44 | bwd: 3337.66 | bwd_inner: 3336.55 | bwd_allreduce: 1.06 | step: 7.62 39%|███▉ | 3930/10000 [6:10:09<9:16:21, 5.50s/it] {'loss': 0.0389, 'grad_norm': 2.1368582248687744, 'learning_rate': 2.7698529319626628e-05, 'epoch': 3.93} 39%|███▉ | 3930/10000 [6:10:09<9:16:21, 5.50s/it][2025-06-19 19:39:54,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:39:54,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.00 | bwd_microstep: 3384.67 | bwd_inner_microstep: 3383.72 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.17 [2025-06-19 19:39:54,364] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2140.00 | bwd: 3384.69 | bwd_inner: 3383.72 | bwd_allreduce: 0.93 | step: 7.18 39%|███▉ | 3931/10000 [6:10:15<9:18:14, 5.52s/it] {'loss': 0.0696, 'grad_norm': 1.6094354391098022, 'learning_rate': 2.76925505153789e-05, 'epoch': 3.93} 39%|███▉ | 3931/10000 [6:10:15<9:18:14, 5.52s/it][2025-06-19 19:39:59,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:39:59,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.33 | bwd_microstep: 3340.50 | bwd_inner_microstep: 3339.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 19:39:59,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.33 | bwd: 3340.51 | bwd_inner: 3339.72 | bwd_allreduce: 0.75 | step: 6.72 39%|███▉ | 3932/10000 [6:10:20<9:17:17, 5.51s/it] {'loss': 0.0662, 'grad_norm': 0.8701049089431763, 'learning_rate': 2.7686570904218332e-05, 'epoch': 3.93} 39%|███▉ | 3932/10000 [6:10:20<9:17:17, 5.51s/it][2025-06-19 19:40:05,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:40:05,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.31 | bwd_microstep: 3330.43 | bwd_inner_microstep: 3329.62 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.76 [2025-06-19 19:40:05,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.31 | bwd: 3330.45 | bwd_inner: 3329.62 | bwd_allreduce: 0.79 | step: 6.77 39%|███▉ | 3933/10000 [6:10:26<9:16:21, 5.50s/it] {'loss': 0.1742, 'grad_norm': 1.9317845106124878, 'learning_rate': 2.7680590486772145e-05, 'epoch': 3.93} 39%|███▉ | 3933/10000 [6:10:26<9:16:21, 5.50s/it][2025-06-19 19:40:10,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:40:10,835] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.21 | bwd_microstep: 3343.89 | bwd_inner_microstep: 3342.84 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.48 [2025-06-19 19:40:10,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.21 | bwd: 3343.92 | bwd_inner: 3342.84 | bwd_allreduce: 1.02 | step: 7.49 39%|███▉ | 3934/10000 [6:10:31<9:16:16, 5.50s/it] {'loss': 3.2343, 'grad_norm': 114.4150161743164, 'learning_rate': 2.7674609263667664e-05, 'epoch': 3.93} 39%|███▉ | 3934/10000 [6:10:31<9:16:16, 5.50s/it][2025-06-19 19:40:16,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:40:16,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.48 | bwd_microstep: 3387.75 | bwd_inner_microstep: 3386.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 19:40:16,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.48 | bwd: 3387.76 | bwd_inner: 3386.94 | bwd_allreduce: 0.78 | step: 7.08 39%|███▉ | 3935/10000 [6:10:37<9:18:20, 5.52s/it] {'loss': 0.0549, 'grad_norm': 1.7936185598373413, 'learning_rate': 2.7668627235532292e-05, 'epoch': 3.94} 39%|███▉ | 3935/10000 [6:10:37<9:18:20, 5.52s/it][2025-06-19 19:40:21,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:40:21,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.43 | bwd_microstep: 3374.02 | bwd_inner_microstep: 3373.14 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.02 [2025-06-19 19:40:21,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.43 | bwd: 3374.03 | bwd_inner: 3373.14 | bwd_allreduce: 0.84 | step: 7.02 39%|███▉ | 3936/10000 [6:10:42<9:18:53, 5.53s/it] {'loss': 0.1183, 'grad_norm': 2.9283337593078613, 'learning_rate': 2.7662644402993515e-05, 'epoch': 3.94} 39%|███▉ | 
3936/10000 [6:10:42<9:18:53, 5.53s/it][2025-06-19 19:40:27,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:40:27,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.94 | bwd_microstep: 3378.62 | bwd_inner_microstep: 3377.55 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.96 [2025-06-19 19:40:27,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.94 | bwd: 3378.64 | bwd_inner: 3377.55 | bwd_allreduce: 1.03 | step: 7.96 39%|███▉ | 3937/10000 [6:10:48<9:19:36, 5.54s/it] {'loss': 0.1507, 'grad_norm': 2.4227826595306396, 'learning_rate': 2.7656660766678905e-05, 'epoch': 3.94} 39%|███▉ | 3937/10000 [6:10:48<9:19:36, 5.54s/it][2025-06-19 19:40:32,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:40:32,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.22 | bwd_microstep: 3325.35 | bwd_inner_microstep: 3324.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 19:40:32,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.22 | bwd: 3325.37 | bwd_inner: 3324.56 | bwd_allreduce: 0.77 | step: 6.75 39%|███▉ | 3938/10000 [6:10:53<9:17:44, 5.52s/it] {'loss': 0.0706, 'grad_norm': 2.138252019882202, 'learning_rate': 2.765067632721611e-05, 'epoch': 3.94} 39%|███▉ | 3938/10000 [6:10:53<9:17:44, 5.52s/it][2025-06-19 19:40:38,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:40:38,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.81 | bwd_microstep: 3334.55 | bwd_inner_microstep: 3333.71 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.75 [2025-06-19 19:40:38,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.81 | bwd: 3334.57 | 
bwd_inner: 3333.71 | bwd_allreduce: 0.80 | step: 6.75 39%|███▉ | 3939/10000 [6:10:59<9:16:43, 5.51s/it] {'loss': 0.0333, 'grad_norm': 0.8448181748390198, 'learning_rate': 2.7644691085232885e-05, 'epoch': 3.94} 39%|███▉ | 3939/10000 [6:10:59<9:16:43, 5.51s/it][2025-06-19 19:40:44,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:40:44,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.29 | bwd_microstep: 3383.35 | bwd_inner_microstep: 3382.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-19 19:40:44,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.29 | bwd: 3383.36 | bwd_inner: 3382.56 | bwd_allreduce: 0.76 | step: 6.80 39%|███▉ | 3940/10000 [6:11:04<9:18:05, 5.53s/it] {'loss': 0.0332, 'grad_norm': 1.2389745712280273, 'learning_rate': 2.7638705041357038e-05, 'epoch': 3.94} 39%|███▉ | 3940/10000 [6:11:04<9:18:05, 5.53s/it][2025-06-19 19:40:49,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:40:49,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.18 | bwd_microstep: 3325.56 | bwd_inner_microstep: 3324.75 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-19 19:40:49,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.18 | bwd: 3325.58 | bwd_inner: 3324.75 | bwd_allreduce: 0.79 | step: 7.18 39%|███▉ | 3941/10000 [6:11:10<9:16:32, 5.51s/it] {'loss': 0.025, 'grad_norm': 0.92144376039505, 'learning_rate': 2.7632718196216486e-05, 'epoch': 3.94} 39%|███▉ | 3941/10000 [6:11:10<9:16:32, 5.51s/it][2025-06-19 19:40:55,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 19:40:55,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.95 | 
bwd_microstep: 3376.53 | bwd_inner_microstep: 3375.46 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.34 [2025-06-19 19:40:55,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.95 | bwd: 3376.55 | bwd_inner: 3375.46 | bwd_allreduce: 1.03 | step: 7.35 39%|███▉ | 3942/10000 [6:11:15<9:17:50, 5.53s/it] {'loss': 0.0796, 'grad_norm': 2.1890792846679688, 'learning_rate': 2.762673055043922e-05, 'epoch': 3.94} 39%|███▉ | 3942/10000 [6:11:15<9:17:50, 5.53s/it][2025-06-19 19:41:00,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:41:00,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.88 | bwd_microstep: 3329.16 | bwd_inner_microstep: 3328.08 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.45 [2025-06-19 19:41:00,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.88 | bwd: 3329.17 | bwd_inner: 3328.08 | bwd_allreduce: 1.04 | step: 7.46 39%|███▉ | 3943/10000 [6:11:21<9:16:36, 5.51s/it] {'loss': 0.1485, 'grad_norm': 2.971092462539673, 'learning_rate': 2.7620742104653312e-05, 'epoch': 3.94} 39%|███▉ | 3943/10000 [6:11:21<9:16:36, 5.51s/it][2025-06-19 19:41:06,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:41:06,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.54 | bwd_microstep: 3323.81 | bwd_inner_microstep: 3322.98 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.70 [2025-06-19 19:41:06,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.54 | bwd: 3323.83 | bwd_inner: 3322.98 | bwd_allreduce: 0.80 | step: 6.71 39%|███▉ | 3944/10000 [6:11:26<9:15:30, 5.50s/it] {'loss': 0.0346, 'grad_norm': 0.8944214582443237, 'learning_rate': 2.7614752859486934e-05, 'epoch': 3.94} 39%|███▉ | 3944/10000 [6:11:26<9:15:30, 5.50s/it][2025-06-19 19:41:11,519] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:41:11,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.15 | bwd_microstep: 3329.20 | bwd_inner_microstep: 3328.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 19:41:11,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.15 | bwd: 3329.21 | bwd_inner: 3328.40 | bwd_allreduce: 0.77 | step: 7.07 39%|███▉ | 3945/10000 [6:11:32<9:14:33, 5.50s/it] {'loss': 0.0183, 'grad_norm': 0.6845325827598572, 'learning_rate': 2.760876281556832e-05, 'epoch': 3.94} 39%|███▉ | 3945/10000 [6:11:32<9:14:33, 5.50s/it][2025-06-19 19:41:17,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:41:17,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.86 | bwd_microstep: 3328.05 | bwd_inner_microstep: 3327.25 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 19:41:17,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.86 | bwd: 3328.06 | bwd_inner: 3327.25 | bwd_allreduce: 0.77 | step: 6.88 39%|███▉ | 3946/10000 [6:11:37<9:14:06, 5.49s/it] {'loss': 0.0288, 'grad_norm': 0.8071032166481018, 'learning_rate': 2.76027719735258e-05, 'epoch': 3.95} 39%|███▉ | 3946/10000 [6:11:37<9:14:06, 5.49s/it][2025-06-19 19:41:22,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:41:22,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.46 | bwd_microstep: 3329.23 | bwd_inner_microstep: 3328.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 19:41:22,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.46 | bwd: 3329.25 | bwd_inner: 3328.44 | bwd_allreduce: 0.76 | step: 6.81 39%|███▉ | 
3947/10000 [6:11:43<9:13:31, 5.49s/it] {'loss': 0.0238, 'grad_norm': 0.8631374835968018, 'learning_rate': 2.7596780333987783e-05, 'epoch': 3.95} 39%|███▉ | 3947/10000 [6:11:43<9:13:31, 5.49s/it][2025-06-19 19:41:27,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 19:41:27,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.67 | bwd_microstep: 3325.96 | bwd_inner_microstep: 3324.84 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.86 [2025-06-19 19:41:27,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.67 | bwd: 3325.97 | bwd_inner: 3324.84 | bwd_allreduce: 1.09 | step: 7.87 39%|███▉ | 3948/10000 [6:11:48<9:13:06, 5.48s/it] {'loss': 0.0246, 'grad_norm': 0.7220205664634705, 'learning_rate': 2.7590787897582764e-05, 'epoch': 3.95} 39%|███▉ | 3948/10000 [6:11:48<9:13:06, 5.48s/it][2025-06-19 19:41:33,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:41:33,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.51 | bwd_microstep: 3325.69 | bwd_inner_microstep: 3324.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 19:41:33,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.51 | bwd: 3325.71 | bwd_inner: 3324.90 | bwd_allreduce: 0.77 | step: 6.99 39%|███▉ | 3949/10000 [6:11:54<9:13:06, 5.48s/it] {'loss': 0.0096, 'grad_norm': 0.37331515550613403, 'learning_rate': 2.7584794664939333e-05, 'epoch': 3.95} 39%|███▉ | 3949/10000 [6:11:54<9:13:06, 5.48s/it][2025-06-19 19:41:38,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:41:38,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.24 | bwd_microstep: 3329.35 | bwd_inner_microstep: 3328.53 | 
bwd_allreduce_microstep: 0.77 | step_microstep: 6.94 [2025-06-19 19:41:38,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.24 | bwd: 3329.37 | bwd_inner: 3328.53 | bwd_allreduce: 0.79 | step: 6.95 40%|███▉ | 3950/10000 [6:11:59<9:12:50, 5.48s/it] {'loss': 0.0953, 'grad_norm': 4.058095455169678, 'learning_rate': 2.757880063668614e-05, 'epoch': 3.95} 40%|███▉ | 3950/10000 [6:11:59<9:12:50, 5.48s/it][2025-06-19 19:41:44,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:41:44,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.20 | bwd_microstep: 3317.97 | bwd_inner_microstep: 3317.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 19:41:44,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.20 | bwd: 3317.98 | bwd_inner: 3317.18 | bwd_allreduce: 0.76 | step: 6.59 40%|███▉ | 3951/10000 [6:12:05<9:12:09, 5.48s/it] {'loss': 0.014, 'grad_norm': 0.42750659584999084, 'learning_rate': 2.757280581345193e-05, 'epoch': 3.95} 40%|███▉ | 3951/10000 [6:12:05<9:12:09, 5.48s/it][2025-06-19 19:41:49,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:41:49,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.61 | bwd_microstep: 3329.13 | bwd_inner_microstep: 3328.32 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 19:41:49,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.61 | bwd: 3329.15 | bwd_inner: 3328.32 | bwd_allreduce: 0.78 | step: 7.19 40%|███▉ | 3952/10000 [6:12:10<9:12:02, 5.48s/it] {'loss': 0.0442, 'grad_norm': 1.4622323513031006, 'learning_rate': 2.7566810195865543e-05, 'epoch': 3.95} 40%|███▉ | 3952/10000 [6:12:10<9:12:02, 5.48s/it][2025-06-19 19:41:55,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:41:55,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.04 | bwd_microstep: 3333.04 | bwd_inner_microstep: 3332.17 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.41 [2025-06-19 19:41:55,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.04 | bwd: 3333.07 | bwd_inner: 3332.17 | bwd_allreduce: 0.83 | step: 7.41 40%|███▉ | 3953/10000 [6:12:16<9:12:06, 5.48s/it] {'loss': 0.0704, 'grad_norm': 1.7399274110794067, 'learning_rate': 2.756081378455588e-05, 'epoch': 3.95} 40%|███▉ | 3953/10000 [6:12:16<9:12:06, 5.48s/it][2025-06-19 19:42:00,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:42:00,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.52 | bwd_microstep: 3326.95 | bwd_inner_microstep: 3326.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 19:42:00,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.52 | bwd: 3326.97 | bwd_inner: 3326.16 | bwd_allreduce: 0.76 | step: 6.68 40%|███▉ | 3954/10000 [6:12:21<9:11:55, 5.48s/it] {'loss': 0.0671, 'grad_norm': 1.896593689918518, 'learning_rate': 2.755481658015194e-05, 'epoch': 3.95} 40%|███▉ | 3954/10000 [6:12:21<9:11:55, 5.48s/it][2025-06-19 19:42:06,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.76 [2025-06-19 19:42:06,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.47 | bwd_microstep: 3380.70 | bwd_inner_microstep: 3379.76 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.27 [2025-06-19 19:42:06,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.47 | bwd: 3380.72 | bwd_inner: 3379.76 | bwd_allreduce: 0.91 | step: 7.27 40%|███▉ | 3955/10000 [6:12:27<9:13:54, 5.50s/it] {'loss': 0.018, 'grad_norm': 
1.3395851850509644, 'learning_rate': 2.7548818583282814e-05, 'epoch': 3.96} 40%|███▉ | 3955/10000 [6:12:27<9:13:54, 5.50s/it][2025-06-19 19:42:11,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:42:11,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.49 | bwd_microstep: 3375.81 | bwd_inner_microstep: 3374.85 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.47 [2025-06-19 19:42:11,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.49 | bwd: 3375.82 | bwd_inner: 3374.85 | bwd_allreduce: 0.92 | step: 7.48 40%|███▉ | 3956/10000 [6:12:32<9:15:29, 5.51s/it] {'loss': 0.0333, 'grad_norm': 1.447379231452942, 'learning_rate': 2.7542819794577646e-05, 'epoch': 3.96} 40%|███▉ | 3956/10000 [6:12:32<9:15:29, 5.51s/it][2025-06-19 19:42:17,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:42:17,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.41 | bwd_microstep: 3323.70 | bwd_inner_microstep: 3322.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-19 19:42:17,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.41 | bwd: 3323.71 | bwd_inner: 3322.89 | bwd_allreduce: 0.77 | step: 6.93 40%|███▉ | 3957/10000 [6:12:38<9:14:14, 5.50s/it] {'loss': 0.0561, 'grad_norm': 1.3330479860305786, 'learning_rate': 2.7536820214665693e-05, 'epoch': 3.96} 40%|███▉ | 3957/10000 [6:12:38<9:14:14, 5.50s/it][2025-06-19 19:42:22,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.79 [2025-06-19 19:42:22,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.38 | bwd_microstep: 3318.82 | bwd_inner_microstep: 3318.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 
19:42:22,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.38 | bwd: 3318.83 | bwd_inner: 3318.03 | bwd_allreduce: 0.76 | step: 6.75 40%|███▉ | 3958/10000 [6:12:43<9:12:51, 5.49s/it] {'loss': 0.0377, 'grad_norm': 3.548856019973755, 'learning_rate': 2.7530819844176274e-05, 'epoch': 3.96} 40%|███▉ | 3958/10000 [6:12:43<9:12:51, 5.49s/it][2025-06-19 19:42:28,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:42:28,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.75 | bwd_microstep: 3382.80 | bwd_inner_microstep: 3382.01 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 19:42:28,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.75 | bwd: 3382.81 | bwd_inner: 3382.01 | bwd_allreduce: 0.76 | step: 6.70 40%|███▉ | 3959/10000 [6:12:49<9:14:29, 5.51s/it] {'loss': 1.3306, 'grad_norm': 145.9803466796875, 'learning_rate': 2.752481868373881e-05, 'epoch': 3.96} 40%|███▉ | 3959/10000 [6:12:49<9:14:29, 5.51s/it][2025-06-19 19:42:33,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:42:33,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.31 | bwd_microstep: 3401.31 | bwd_inner_microstep: 3400.41 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.38 [2025-06-19 19:42:33,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.31 | bwd: 3401.33 | bwd_inner: 3400.41 | bwd_allreduce: 0.87 | step: 7.38 40%|███▉ | 3960/10000 [6:12:54<9:16:21, 5.53s/it] {'loss': 0.1525, 'grad_norm': 4.257388114929199, 'learning_rate': 2.7518816733982782e-05, 'epoch': 3.96} 40%|███▉ | 3960/10000 [6:12:54<9:16:21, 5.53s/it][2025-06-19 19:42:39,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 
[2025-06-19 19:42:39,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.60 | bwd_microstep: 3321.02 | bwd_inner_microstep: 3320.12 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.48 [2025-06-19 19:42:39,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.60 | bwd: 3321.04 | bwd_inner: 3320.12 | bwd_allreduce: 0.87 | step: 7.50 40%|███▉ | 3961/10000 [6:13:00<9:14:36, 5.51s/it] {'loss': 0.008, 'grad_norm': 0.39892590045928955, 'learning_rate': 2.751281399553778e-05, 'epoch': 3.96} 40%|███▉ | 3961/10000 [6:13:00<9:14:36, 5.51s/it][2025-06-19 19:42:44,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:42:44,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.17 | bwd_microstep: 3317.56 | bwd_inner_microstep: 3316.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 19:42:44,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.17 | bwd: 3317.57 | bwd_inner: 3316.76 | bwd_allreduce: 0.77 | step: 6.69 40%|███▉ | 3962/10000 [6:13:05<9:13:02, 5.50s/it] {'loss': 0.025, 'grad_norm': 1.5229206085205078, 'learning_rate': 2.750681046903346e-05, 'epoch': 3.96} 40%|███▉ | 3962/10000 [6:13:05<9:13:02, 5.50s/it][2025-06-19 19:42:50,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:42:50,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.23 | bwd_microstep: 3319.94 | bwd_inner_microstep: 3318.78 | bwd_allreduce_microstep: 1.10 | step_microstep: 7.44 [2025-06-19 19:42:50,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.23 | bwd: 3319.97 | bwd_inner: 3318.78 | bwd_allreduce: 1.13 | step: 7.44 40%|███▉ | 3963/10000 [6:13:11<9:11:57, 5.49s/it] {'loss': 0.1227, 'grad_norm': 2.785327434539795, 'learning_rate': 2.750080615509956e-05, 
'epoch': 3.96} 40%|███▉ | 3963/10000 [6:13:11<9:11:57, 5.49s/it][2025-06-19 19:42:55,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 19:42:55,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.78 | bwd_microstep: 3323.18 | bwd_inner_microstep: 3322.21 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.22 [2025-06-19 19:42:55,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.78 | bwd: 3323.21 | bwd_inner: 3322.21 | bwd_allreduce: 0.93 | step: 7.22 40%|███▉ | 3964/10000 [6:13:16<9:11:38, 5.48s/it] {'loss': 0.1192, 'grad_norm': 2.4863929748535156, 'learning_rate': 2.7494801054365904e-05, 'epoch': 3.96} 40%|███▉ | 3964/10000 [6:13:16<9:11:38, 5.48s/it][2025-06-19 19:43:01,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:43:01,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.77 | bwd_microstep: 3368.65 | bwd_inner_microstep: 3367.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 19:43:01,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.77 | bwd: 3368.66 | bwd_inner: 3367.86 | bwd_allreduce: 0.76 | step: 6.64 40%|███▉ | 3965/10000 [6:13:22<9:13:20, 5.50s/it] {'loss': 0.0746, 'grad_norm': 2.3705015182495117, 'learning_rate': 2.7488795167462408e-05, 'epoch': 3.96} 40%|███▉ | 3965/10000 [6:13:22<9:13:20, 5.50s/it][2025-06-19 19:43:06,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.77 [2025-06-19 19:43:06,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.26 | bwd_microstep: 3323.61 | bwd_inner_microstep: 3322.76 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.63 [2025-06-19 19:43:06,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2116.26 | bwd: 3323.63 | bwd_inner: 3322.76 | bwd_allreduce: 0.82 | step: 7.63 40%|███▉ | 3966/10000 [6:13:27<9:12:40, 5.50s/it] {'loss': 0.0116, 'grad_norm': 0.6910802721977234, 'learning_rate': 2.748278849501905e-05, 'epoch': 3.97} 40%|███▉ | 3966/10000 [6:13:27<9:12:40, 5.50s/it][2025-06-19 19:43:12,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:43:12,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.58 | bwd_microstep: 3375.06 | bwd_inner_microstep: 3374.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-19 19:43:12,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.58 | bwd: 3375.08 | bwd_inner: 3374.26 | bwd_allreduce: 0.78 | step: 6.86 40%|███▉ | 3967/10000 [6:13:33<9:14:05, 5.51s/it] {'loss': 0.07, 'grad_norm': 4.288180828094482, 'learning_rate': 2.747678103766591e-05, 'epoch': 3.97} 40%|███▉ | 3967/10000 [6:13:33<9:14:05, 5.51s/it][2025-06-19 19:43:17,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:43:17,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.92 | bwd_microstep: 3325.16 | bwd_inner_microstep: 3324.10 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.04 [2025-06-19 19:43:17,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.92 | bwd: 3325.18 | bwd_inner: 3324.10 | bwd_allreduce: 0.82 | step: 7.04 40%|███▉ | 3968/10000 [6:13:38<9:12:42, 5.50s/it] {'loss': 0.0576, 'grad_norm': 1.16050124168396, 'learning_rate': 2.747077279603314e-05, 'epoch': 3.97} 40%|███▉ | 3968/10000 [6:13:38<9:12:42, 5.50s/it][2025-06-19 19:43:23,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 19:43:23,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2126.83 | bwd_microstep: 3373.59 | bwd_inner_microstep: 3372.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 19:43:23,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.83 | bwd: 3373.61 | bwd_inner: 3372.81 | bwd_allreduce: 0.75 | step: 6.72 40%|███▉ | 3969/10000 [6:13:44<9:13:51, 5.51s/it] {'loss': 0.0525, 'grad_norm': 2.306509017944336, 'learning_rate': 2.7464763770750977e-05, 'epoch': 3.97} 40%|███▉ | 3969/10000 [6:13:44<9:13:51, 5.51s/it][2025-06-19 19:43:28,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:43:28,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.83 | bwd_microstep: 3315.78 | bwd_inner_microstep: 3314.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 19:43:28,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.83 | bwd: 3315.80 | bwd_inner: 3314.99 | bwd_allreduce: 0.77 | step: 6.97 40%|███▉ | 3970/10000 [6:13:49<9:12:16, 5.50s/it] {'loss': 0.0405, 'grad_norm': 1.603528380393982, 'learning_rate': 2.7458753962449738e-05, 'epoch': 3.97} 40%|███▉ | 3970/10000 [6:13:49<9:12:16, 5.50s/it][2025-06-19 19:43:34,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:43:34,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.76 | bwd_microstep: 3317.51 | bwd_inner_microstep: 3316.36 | bwd_allreduce_microstep: 1.10 | step_microstep: 7.55 [2025-06-19 19:43:34,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.76 | bwd: 3317.52 | bwd_inner: 3316.36 | bwd_allreduce: 1.12 | step: 7.55 40%|███▉ | 3971/10000 [6:13:55<9:11:30, 5.49s/it] {'loss': 0.1268, 'grad_norm': 3.1502797603607178, 'learning_rate': 2.7452743371759823e-05, 'epoch': 3.97} 40%|███▉ | 3971/10000 [6:13:55<9:11:30, 5.49s/it][2025-06-19 
19:43:39,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:43:39,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.11 | bwd_microstep: 3312.63 | bwd_inner_microstep: 3311.82 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.11 [2025-06-19 19:43:39,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.11 | bwd: 3312.65 | bwd_inner: 3311.82 | bwd_allreduce: 0.78 | step: 7.11 40%|███▉ | 3972/10000 [6:14:00<9:10:37, 5.48s/it] {'loss': 0.0274, 'grad_norm': 1.26689875125885, 'learning_rate': 2.7446731999311718e-05, 'epoch': 3.97} 40%|███▉ | 3972/10000 [6:14:00<9:10:37, 5.48s/it][2025-06-19 19:43:45,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:43:45,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.26 | bwd_microstep: 3315.20 | bwd_inner_microstep: 3314.36 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.18 [2025-06-19 19:43:45,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.26 | bwd: 3315.21 | bwd_inner: 3314.36 | bwd_allreduce: 0.81 | step: 7.18 40%|███▉ | 3973/10000 [6:14:06<9:09:51, 5.47s/it] {'loss': 0.0154, 'grad_norm': 0.521834135055542, 'learning_rate': 2.7440719845735992e-05, 'epoch': 3.97} 40%|███▉ | 3973/10000 [6:14:06<9:09:51, 5.47s/it][2025-06-19 19:43:50,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:43:50,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.95 | bwd_microstep: 3368.76 | bwd_inner_microstep: 3367.73 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.55 [2025-06-19 19:43:50,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.95 | bwd: 3368.78 | bwd_inner: 3367.73 | bwd_allreduce: 1.01 | step: 7.56 
40%|███▉ | 3974/10000 [6:14:11<9:11:44, 5.49s/it] {'loss': 0.0452, 'grad_norm': 2.1714189052581787, 'learning_rate': 2.7434706911663286e-05, 'epoch': 3.97} 40%|███▉ | 3974/10000 [6:14:11<9:11:44, 5.49s/it][2025-06-19 19:43:56,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:43:56,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.77 | bwd_microstep: 3319.94 | bwd_inner_microstep: 3319.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 19:43:56,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.77 | bwd: 3319.95 | bwd_inner: 3319.14 | bwd_allreduce: 0.77 | step: 7.03 40%|███▉ | 3975/10000 [6:14:17<9:10:54, 5.49s/it] {'loss': 0.2171, 'grad_norm': 3.8541533946990967, 'learning_rate': 2.7428693197724335e-05, 'epoch': 3.98} 40%|███▉ | 3975/10000 [6:14:17<9:10:54, 5.49s/it][2025-06-19 19:44:01,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:44:01,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.42 | bwd_microstep: 3319.35 | bwd_inner_microstep: 3318.49 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.24 [2025-06-19 19:44:01,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.42 | bwd: 3319.37 | bwd_inner: 3318.49 | bwd_allreduce: 0.82 | step: 7.24 40%|███▉ | 3976/10000 [6:14:22<9:10:10, 5.48s/it] {'loss': 0.0193, 'grad_norm': 0.9038508534431458, 'learning_rate': 2.7422678704549938e-05, 'epoch': 3.98} 40%|███▉ | 3976/10000 [6:14:22<9:10:10, 5.48s/it][2025-06-19 19:44:07,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:44:07,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.87 | bwd_microstep: 3368.72 | bwd_inner_microstep: 
3367.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 19:44:07,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.87 | bwd: 3368.73 | bwd_inner: 3367.92 | bwd_allreduce: 0.77 | step: 6.96 40%|███▉ | 3977/10000 [6:14:28<9:11:55, 5.50s/it] {'loss': 0.0474, 'grad_norm': 1.72627592086792, 'learning_rate': 2.7416663432770993e-05, 'epoch': 3.98} 40%|███▉ | 3977/10000 [6:14:28<9:11:55, 5.50s/it][2025-06-19 19:44:12,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:44:12,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.08 | bwd_microstep: 3324.05 | bwd_inner_microstep: 3323.23 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 19:44:12,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.08 | bwd: 3324.06 | bwd_inner: 3323.23 | bwd_allreduce: 0.78 | step: 7.05 40%|███▉ | 3978/10000 [6:14:33<9:10:59, 5.49s/it] {'loss': 0.1308, 'grad_norm': 3.5998411178588867, 'learning_rate': 2.741064738301848e-05, 'epoch': 3.98} 40%|███▉ | 3978/10000 [6:14:33<9:10:59, 5.49s/it][2025-06-19 19:44:18,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:44:18,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.13 | bwd_microstep: 3374.43 | bwd_inner_microstep: 3373.62 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 19:44:18,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.13 | bwd: 3374.44 | bwd_inner: 3373.62 | bwd_allreduce: 0.78 | step: 6.94 40%|███▉ | 3979/10000 [6:14:39<9:12:38, 5.51s/it] {'loss': 0.1763, 'grad_norm': 3.6345629692077637, 'learning_rate': 2.7404630555923448e-05, 'epoch': 3.98} 40%|███▉ | 3979/10000 [6:14:39<9:12:38, 5.51s/it][2025-06-19 19:44:23,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:44:23,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.69 | bwd_microstep: 3326.28 | bwd_inner_microstep: 3325.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 19:44:23,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.69 | bwd: 3326.30 | bwd_inner: 3325.49 | bwd_allreduce: 0.76 | step: 6.70 40%|███▉ | 3980/10000 [6:14:44<9:11:29, 5.50s/it] {'loss': 0.0381, 'grad_norm': 1.5444413423538208, 'learning_rate': 2.7398612952117046e-05, 'epoch': 3.98} 40%|███▉ | 3980/10000 [6:14:44<9:11:29, 5.50s/it][2025-06-19 19:44:29,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:44:29,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.17 | bwd_microstep: 3314.69 | bwd_inner_microstep: 3313.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-19 19:44:29,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.17 | bwd: 3314.70 | bwd_inner: 3313.89 | bwd_allreduce: 0.77 | step: 7.01 40%|███▉ | 3981/10000 [6:14:50<9:10:22, 5.49s/it] {'loss': 0.0081, 'grad_norm': 0.5895918607711792, 'learning_rate': 2.7392594572230475e-05, 'epoch': 3.98} 40%|███▉ | 3981/10000 [6:14:50<9:10:22, 5.49s/it][2025-06-19 19:44:34,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:44:34,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.25 | bwd_microstep: 3332.03 | bwd_inner_microstep: 3331.18 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.09 [2025-06-19 19:44:34,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.25 | bwd: 3332.06 | bwd_inner: 3331.18 | bwd_allreduce: 0.82 | step: 7.09 40%|███▉ | 3982/10000 [6:14:55<9:10:15, 5.49s/it] {'loss': 
0.1731, 'grad_norm': 3.4277236461639404, 'learning_rate': 2.7386575416895046e-05, 'epoch': 3.98} 40%|███▉ | 3982/10000 [6:14:55<9:10:15, 5.49s/it][2025-06-19 19:44:40,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:44:40,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.29 | bwd_microstep: 3315.74 | bwd_inner_microstep: 3314.96 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.54 [2025-06-19 19:44:40,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.29 | bwd: 3315.76 | bwd_inner: 3314.96 | bwd_allreduce: 0.75 | step: 6.55 40%|███▉ | 3983/10000 [6:15:01<9:10:07, 5.49s/it] {'loss': 0.0162, 'grad_norm': 0.8313466906547546, 'learning_rate': 2.738055548674213e-05, 'epoch': 3.98} 40%|███▉ | 3983/10000 [6:15:01<9:10:07, 5.49s/it][2025-06-19 19:44:45,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:44:45,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.40 | bwd_microstep: 3319.78 | bwd_inner_microstep: 3318.84 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.99 [2025-06-19 19:44:45,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.40 | bwd: 3319.79 | bwd_inner: 3318.84 | bwd_allreduce: 0.91 | step: 6.99 40%|███▉ | 3984/10000 [6:15:06<9:10:04, 5.49s/it] {'loss': 0.0579, 'grad_norm': 2.355572462081909, 'learning_rate': 2.737453478240321e-05, 'epoch': 3.98} 40%|███▉ | 3984/10000 [6:15:06<9:10:04, 5.49s/it][2025-06-19 19:44:51,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:44:51,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.35 | bwd_microstep: 3311.52 | bwd_inner_microstep: 3310.72 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 
[2025-06-19 19:44:51,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.35 | bwd: 3311.54 | bwd_inner: 3310.72 | bwd_allreduce: 0.78 | step: 7.12 40%|███▉ | 3985/10000 [6:15:11<9:08:57, 5.48s/it] {'loss': 0.1873, 'grad_norm': 3.5278360843658447, 'learning_rate': 2.736851330450981e-05, 'epoch': 3.98} 40%|███▉ | 3985/10000 [6:15:11<9:08:57, 5.48s/it][2025-06-19 19:44:56,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:44:56,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.27 | bwd_microstep: 3318.35 | bwd_inner_microstep: 3317.47 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.19 [2025-06-19 19:44:56,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.27 | bwd: 3318.37 | bwd_inner: 3317.47 | bwd_allreduce: 0.84 | step: 7.19 40%|███▉ | 3986/10000 [6:15:17<9:08:24, 5.47s/it] {'loss': 0.08, 'grad_norm': 2.03019642829895, 'learning_rate': 2.7362491053693564e-05, 'epoch': 3.99} 40%|███▉ | 3986/10000 [6:15:17<9:08:24, 5.47s/it][2025-06-19 19:45:02,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:45:02,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.28 | bwd_microstep: 3312.42 | bwd_inner_microstep: 3311.60 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.19 [2025-06-19 19:45:02,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.28 | bwd: 3312.43 | bwd_inner: 3311.60 | bwd_allreduce: 0.79 | step: 7.20 40%|███▉ | 3987/10000 [6:15:22<9:08:09, 5.47s/it] {'loss': 0.0527, 'grad_norm': 1.96412992477417, 'learning_rate': 2.7356468030586177e-05, 'epoch': 3.99} 40%|███▉ | 3987/10000 [6:15:22<9:08:09, 5.47s/it][2025-06-19 19:45:07,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 
[2025-06-19 19:45:07,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.25 | bwd_microstep: 3314.52 | bwd_inner_microstep: 3313.71 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 19:45:07,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.25 | bwd: 3314.54 | bwd_inner: 3313.71 | bwd_allreduce: 0.78 | step: 6.96 40%|███▉ | 3988/10000 [6:15:28<9:07:53, 5.47s/it] {'loss': 0.0157, 'grad_norm': 1.0866649150848389, 'learning_rate': 2.7350444235819433e-05, 'epoch': 3.99} 40%|███▉ | 3988/10000 [6:15:28<9:07:53, 5.47s/it][2025-06-19 19:45:13,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:45:13,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.44 | bwd_microstep: 3326.29 | bwd_inner_microstep: 3325.47 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 19:45:13,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.44 | bwd: 3326.30 | bwd_inner: 3325.47 | bwd_allreduce: 0.79 | step: 7.13 40%|███▉ | 3989/10000 [6:15:33<9:08:09, 5.47s/it] {'loss': 0.0079, 'grad_norm': 0.42637598514556885, 'learning_rate': 2.7344419670025203e-05, 'epoch': 3.99} 40%|███▉ | 3989/10000 [6:15:33<9:08:09, 5.47s/it][2025-06-19 19:45:18,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:45:18,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.38 | bwd_microstep: 3312.10 | bwd_inner_microstep: 3311.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 19:45:18,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.38 | bwd: 3312.11 | bwd_inner: 3311.31 | bwd_allreduce: 0.76 | step: 6.71 40%|███▉ | 3990/10000 [6:15:39<9:07:29, 5.47s/it] {'loss': 0.0284, 'grad_norm': 1.329127311706543, 'learning_rate': 2.7338394333835447e-05, 
'epoch': 3.99} 40%|███▉ | 3990/10000 [6:15:39<9:07:29, 5.47s/it][2025-06-19 19:45:23,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:45:23,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.86 | bwd_microstep: 3319.23 | bwd_inner_microstep: 3318.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.70 [2025-06-19 19:45:23,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.86 | bwd: 3319.24 | bwd_inner: 3318.43 | bwd_allreduce: 0.77 | step: 6.70 40%|███▉ | 3991/10000 [6:15:44<9:07:09, 5.46s/it] {'loss': 0.0282, 'grad_norm': 1.6524155139923096, 'learning_rate': 2.733236822788217e-05, 'epoch': 3.99} 40%|███▉ | 3991/10000 [6:15:44<9:07:09, 5.46s/it][2025-06-19 19:45:29,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:45:29,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.39 | bwd_microstep: 3311.72 | bwd_inner_microstep: 3310.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 19:45:29,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.39 | bwd: 3311.73 | bwd_inner: 3310.91 | bwd_allreduce: 0.78 | step: 7.17 40%|███▉ | 3992/10000 [6:15:50<9:06:45, 5.46s/it] {'loss': 0.0275, 'grad_norm': 0.6366029381752014, 'learning_rate': 2.73263413527975e-05, 'epoch': 3.99} 40%|███▉ | 3992/10000 [6:15:50<9:06:45, 5.46s/it][2025-06-19 19:45:34,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:45:34,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.78 | bwd_microstep: 3308.25 | bwd_inner_microstep: 3307.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 19:45:34,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2099.78 | bwd: 3308.26 | bwd_inner: 3307.46 | bwd_allreduce: 0.76 | step: 6.64 40%|███▉ | 3993/10000 [6:15:55<9:06:10, 5.46s/it] {'loss': 0.0595, 'grad_norm': 1.4036502838134766, 'learning_rate': 2.7320313709213638e-05, 'epoch': 3.99} 40%|███▉ | 3993/10000 [6:15:55<9:06:10, 5.46s/it][2025-06-19 19:45:40,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:45:40,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.80 | bwd_microstep: 3312.33 | bwd_inner_microstep: 3311.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 19:45:40,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.81 | bwd: 3312.34 | bwd_inner: 3311.53 | bwd_allreduce: 0.76 | step: 6.64 40%|███▉ | 3994/10000 [6:16:01<9:06:13, 5.46s/it] {'loss': 0.0157, 'grad_norm': 0.6521095037460327, 'learning_rate': 2.7314285297762828e-05, 'epoch': 3.99} 40%|███▉ | 3994/10000 [6:16:01<9:06:13, 5.46s/it][2025-06-19 19:45:45,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:45:45,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.17 | bwd_microstep: 3321.24 | bwd_inner_microstep: 3320.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-19 19:45:45,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.17 | bwd: 3321.25 | bwd_inner: 3320.43 | bwd_allreduce: 0.78 | step: 7.27 40%|███▉ | 3995/10000 [6:16:06<9:06:16, 5.46s/it] {'loss': 0.0195, 'grad_norm': 0.758142352104187, 'learning_rate': 2.730825611907744e-05, 'epoch': 4.0} 40%|███▉ | 3995/10000 [6:16:06<9:06:16, 5.46s/it][2025-06-19 19:45:51,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:45:51,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2100.20 | bwd_microstep: 3314.38 | bwd_inner_microstep: 3313.38 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.69 [2025-06-19 19:45:51,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.20 | bwd: 3314.39 | bwd_inner: 3313.38 | bwd_allreduce: 0.96 | step: 7.69 40%|███▉ | 3996/10000 [6:16:12<9:06:10, 5.46s/it] {'loss': 0.0117, 'grad_norm': 0.6249870657920837, 'learning_rate': 2.7302226173789902e-05, 'epoch': 4.0} 40%|███▉ | 3996/10000 [6:16:12<9:06:10, 5.46s/it][2025-06-19 19:45:56,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:45:56,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.57 | bwd_microstep: 3393.20 | bwd_inner_microstep: 3392.27 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.98 [2025-06-19 19:45:56,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.57 | bwd: 3393.21 | bwd_inner: 3392.27 | bwd_allreduce: 0.90 | step: 6.98 40%|███▉ | 3997/10000 [6:16:17<9:09:14, 5.49s/it] {'loss': 0.0588, 'grad_norm': 1.5068916082382202, 'learning_rate': 2.7296195462532737e-05, 'epoch': 4.0} 40%|███▉ | 3997/10000 [6:16:17<9:09:14, 5.49s/it][2025-06-19 19:46:02,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:46:02,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.48 | bwd_microstep: 3389.92 | bwd_inner_microstep: 3388.98 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.22 [2025-06-19 19:46:02,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.48 | bwd: 3389.94 | bwd_inner: 3388.98 | bwd_allreduce: 0.92 | step: 7.22 40%|███▉ | 3998/10000 [6:16:23<9:11:22, 5.51s/it] {'loss': 0.0067, 'grad_norm': 0.18672126531600952, 'learning_rate': 2.7290163985938525e-05, 'epoch': 4.0} 40%|███▉ | 3998/10000 [6:16:23<9:11:22, 5.51s/it][2025-06-19 
19:46:07,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:46:07,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.38 | bwd_microstep: 3357.23 | bwd_inner_microstep: 3356.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 19:46:07,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.38 | bwd: 3357.25 | bwd_inner: 3356.44 | bwd_allreduce: 0.77 | step: 6.77 40%|███▉ | 3999/10000 [6:16:28<9:11:36, 5.52s/it] {'loss': 0.0743, 'grad_norm': 2.860596179962158, 'learning_rate': 2.7284131744639942e-05, 'epoch': 4.0} 40%|███▉ | 3999/10000 [6:16:28<9:11:36, 5.52s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
[2025-06-19 19:46:15,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:46:15,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2093.73 | bwd_microstep: 3300.05 | bwd_inner_microstep: 3299.11 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.12 [2025-06-19 19:46:15,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.73 | bwd: 3300.06 | bwd_inner: 3299.11 | bwd_allreduce: 0.91 | step: 7.13 40%|████ | 4000/10000 [6:16:36<10:10:23, 6.10s/it] {'loss': 0.0223, 'grad_norm': 0.9870715141296387, 'learning_rate': 2.7278098739269757e-05, 'epoch': 4.0} 40%|████ | 4000/10000 [6:16:36<10:10:23, 6.10s/it]evaluate! [INFO|trainer.py:3910] 2025-06-19 19:46:25,476 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 19:46:25,481 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 19:46:25,481 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 19:47:19,183 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 19:47:19,187 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 19:47:19,187 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 19:47:19,188 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json evaluate! 
[INFO|trainer.py:3910] 2025-06-19 19:47:38,277 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 19:47:38,286 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 19:47:38,286 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 19:48:36,087 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 19:48:36,091 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 19:48:36,091 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 19:48:36,091 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json [2025-06-19 19:48:41,400] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 19:48:47,547] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 19:48:53,516] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto 
detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 19:48:59,751] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 19:49:18,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.73 [2025-06-19 19:49:18,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.46 | bwd_microstep: 3331.35 | bwd_inner_microstep: 3330.13 | bwd_allreduce_microstep: 1.14 | step_microstep: 8.42 [2025-06-19 19:49:18,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.41 | bwd: 3331.38 | bwd_inner: 3330.13 | bwd_allreduce: 1.18 | step: 8.42 40%|████ | 4001/10000 [6:19:39<98:40:53, 59.22s/it] {'loss': 0.0489, 'grad_norm': 4.17549991607666, 'learning_rate': 2.727206497046078e-05, 'epoch': 4.0} 40%|████ | 4001/10000 [6:19:39<98:40:53, 59.22s/it][2025-06-19 19:49:24,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:49:24,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.24 | bwd_microstep: 3341.51 | bwd_inner_microstep: 3340.28 | bwd_allreduce_microstep: 1.16 | step_microstep: 7.58 [2025-06-19 19:49:24,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.24 | bwd: 3341.53 | bwd_inner: 3340.28 | bwd_allreduce: 1.19 | step: 
7.59 40%|████ | 4002/10000 [6:19:44<71:49:37, 43.11s/it] {'loss': 0.1041, 'grad_norm': 2.6954517364501953, 'learning_rate': 2.726603043884595e-05, 'epoch': 4.0} 40%|████ | 4002/10000 [6:19:44<71:49:37, 43.11s/it][2025-06-19 19:49:29,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.84 [2025-06-19 19:49:29,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.63 | bwd_microstep: 3344.94 | bwd_inner_microstep: 3344.07 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.80 [2025-06-19 19:49:29,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.63 | bwd: 3344.96 | bwd_inner: 3344.07 | bwd_allreduce: 0.83 | step: 7.81 40%|████ | 4003/10000 [6:19:50<53:01:48, 31.83s/it] {'loss': 0.0846, 'grad_norm': 1.4214004278182983, 'learning_rate': 2.725999514505824e-05, 'epoch': 4.0} 40%|████ | 4003/10000 [6:19:50<53:01:48, 31.83s/it][2025-06-19 19:49:34,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:49:34,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.50 | bwd_microstep: 3301.03 | bwd_inner_microstep: 3300.21 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 19:49:34,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.50 | bwd: 3301.05 | bwd_inner: 3300.21 | bwd_allreduce: 0.79 | step: 7.30 40%|████ | 4004/10000 [6:19:55<39:50:17, 23.92s/it] {'loss': 0.0073, 'grad_norm': 0.27933669090270996, 'learning_rate': 2.7253959089730738e-05, 'epoch': 4.0} 40%|████ | 4004/10000 [6:19:55<39:50:17, 23.92s/it][2025-06-19 19:49:40,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:49:40,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.96 | bwd_microstep: 3353.88 | 
bwd_inner_microstep: 3353.05 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.95 [2025-06-19 19:49:40,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.96 | bwd: 3353.90 | bwd_inner: 3353.05 | bwd_allreduce: 0.81 | step: 6.95 40%|████ | 4005/10000 [6:20:01<30:38:16, 18.40s/it] {'loss': 0.0417, 'grad_norm': 1.1506085395812988, 'learning_rate': 2.7247922273496592e-05, 'epoch': 4.0} 40%|████ | 4005/10000 [6:20:01<30:38:16, 18.40s/it][2025-06-19 19:49:45,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:49:45,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2093.01 | bwd_microstep: 3307.19 | bwd_inner_microstep: 3306.22 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.87 [2025-06-19 19:49:45,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.01 | bwd: 3307.20 | bwd_inner: 3306.22 | bwd_allreduce: 0.92 | step: 6.87 40%|████ | 4006/10000 [6:20:06<24:09:31, 14.51s/it] {'loss': 0.0622, 'grad_norm': 2.5851023197174072, 'learning_rate': 2.7241884696989037e-05, 'epoch': 4.01} 40%|████ | 4006/10000 [6:20:06<24:09:31, 14.51s/it][2025-06-19 19:49:51,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:49:51,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2089.46 | bwd_microstep: 3289.17 | bwd_inner_microstep: 3288.32 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.35 [2025-06-19 19:49:51,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2089.46 | bwd: 3289.19 | bwd_inner: 3288.32 | bwd_allreduce: 0.81 | step: 7.35 40%|████ | 4007/10000 [6:20:12<19:37:05, 11.78s/it] {'loss': 0.0323, 'grad_norm': 1.1297553777694702, 'learning_rate': 2.7235846360841392e-05, 'epoch': 4.01} 40%|████ | 4007/10000 [6:20:12<19:37:05, 11.78s/it][2025-06-19 19:49:56,805] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 19:49:56,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.79 | bwd_microstep: 3293.90 | bwd_inner_microstep: 3292.84 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.47 [2025-06-19 19:49:56,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.79 | bwd: 3293.92 | bwd_inner: 3292.84 | bwd_allreduce: 1.02 | step: 7.48 40%|████ | 4008/10000 [6:20:17<16:26:52, 9.88s/it] {'loss': 0.0094, 'grad_norm': 0.5321229696273804, 'learning_rate': 2.7229807265687046e-05, 'epoch': 4.01} 40%|████ | 4008/10000 [6:20:17<16:26:52, 9.88s/it][2025-06-19 19:50:02,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:50:02,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.85 | bwd_microstep: 3304.89 | bwd_inner_microstep: 3303.88 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.09 [2025-06-19 19:50:02,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.85 | bwd: 3304.90 | bwd_inner: 3303.88 | bwd_allreduce: 0.97 | step: 7.10 40%|████ | 4009/10000 [6:20:23<14:14:59, 8.56s/it] {'loss': 0.0354, 'grad_norm': 1.6108901500701904, 'learning_rate': 2.722376741215947e-05, 'epoch': 4.01} 40%|████ | 4009/10000 [6:20:23<14:14:59, 8.56s/it][2025-06-19 19:50:07,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 19:50:07,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.34 | bwd_microstep: 3293.16 | bwd_inner_microstep: 3292.04 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.86 [2025-06-19 19:50:07,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.34 | bwd: 3293.19 | bwd_inner: 3292.04 | bwd_allreduce: 1.09 | step: 7.87 40%|████ | 
4010/10000 [6:20:28<12:41:10, 7.62s/it] {'loss': 0.0094, 'grad_norm': 0.49877092242240906, 'learning_rate': 2.7217726800892227e-05, 'epoch': 4.01} 40%|████ | 4010/10000 [6:20:28<12:41:10, 7.62s/it][2025-06-19 19:50:13,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:50:13,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.20 | bwd_microstep: 3353.73 | bwd_inner_microstep: 3352.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.90 [2025-06-19 19:50:13,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.20 | bwd: 3353.74 | bwd_inner: 3352.94 | bwd_allreduce: 0.76 | step: 6.91 40%|████ | 4011/10000 [6:20:34<11:38:07, 6.99s/it] {'loss': 0.0088, 'grad_norm': 0.4285600483417511, 'learning_rate': 2.721168543251893e-05, 'epoch': 4.01} 40%|████ | 4011/10000 [6:20:34<11:38:07, 6.99s/it][2025-06-19 19:50:18,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:50:18,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2095.04 | bwd_microstep: 3310.22 | bwd_inner_microstep: 3309.22 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.37 [2025-06-19 19:50:18,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2095.04 | bwd: 3310.24 | bwd_inner: 3309.22 | bwd_allreduce: 0.96 | step: 7.38 40%|████ | 4012/10000 [6:20:39<10:51:49, 6.53s/it] {'loss': 0.0029, 'grad_norm': 0.1226610541343689, 'learning_rate': 2.7205643307673307e-05, 'epoch': 4.01} 40%|████ | 4012/10000 [6:20:39<10:51:49, 6.53s/it][2025-06-19 19:50:24,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:50:24,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.33 | bwd_microstep: 3314.11 | bwd_inner_microstep: 3313.25 | 
bwd_allreduce_microstep: 0.80 | step_microstep: 7.34 [2025-06-19 19:50:24,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.33 | bwd: 3314.13 | bwd_inner: 3313.25 | bwd_allreduce: 0.83 | step: 7.34 40%|████ | 4013/10000 [6:20:44<10:19:38, 6.21s/it] {'loss': 0.0097, 'grad_norm': 0.3607342839241028, 'learning_rate': 2.7199600426989142e-05, 'epoch': 4.01} 40%|████ | 4013/10000 [6:20:44<10:19:38, 6.21s/it][2025-06-19 19:50:29,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:50:29,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.40 | bwd_microstep: 3311.91 | bwd_inner_microstep: 3310.88 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.51 [2025-06-19 19:50:29,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.40 | bwd: 3311.93 | bwd_inner: 3310.88 | bwd_allreduce: 0.99 | step: 7.52 40%|████ | 4014/10000 [6:20:50<9:57:18, 5.99s/it] {'loss': 0.0029, 'grad_norm': 0.12561677396297455, 'learning_rate': 2.719355679110031e-05, 'epoch': 4.01} 40%|████ | 4014/10000 [6:20:50<9:57:18, 5.99s/it][2025-06-19 19:50:35,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:50:35,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.41 | bwd_microstep: 3309.11 | bwd_inner_microstep: 3308.29 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.83 [2025-06-19 19:50:35,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.41 | bwd: 3309.13 | bwd_inner: 3308.29 | bwd_allreduce: 0.79 | step: 6.82 40%|████ | 4015/10000 [6:20:55<9:41:15, 5.83s/it] {'loss': 0.0119, 'grad_norm': 0.4177790880203247, 'learning_rate': 2.7187512400640746e-05, 'epoch': 4.01} 40%|████ | 4015/10000 [6:20:55<9:41:15, 5.83s/it][2025-06-19 19:50:40,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:50:40,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.65 | bwd_microstep: 3323.92 | bwd_inner_microstep: 3322.89 | bwd_allreduce_microstep: 0.97 | step_microstep: 6.99 [2025-06-19 19:50:40,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.65 | bwd: 3323.94 | bwd_inner: 3322.89 | bwd_allreduce: 0.99 | step: 6.99 40%|████ | 4016/10000 [6:21:01<9:30:28, 5.72s/it] {'loss': 0.1039, 'grad_norm': 1.3094041347503662, 'learning_rate': 2.7181467256244506e-05, 'epoch': 4.02} 40%|████ | 4016/10000 [6:21:01<9:30:28, 5.72s/it][2025-06-19 19:50:46,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:50:46,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.43 | bwd_microstep: 3324.82 | bwd_inner_microstep: 3323.96 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.14 [2025-06-19 19:50:46,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.43 | bwd: 3324.83 | bwd_inner: 3323.96 | bwd_allreduce: 0.83 | step: 7.14 40%|████ | 4017/10000 [6:21:06<9:23:06, 5.65s/it] {'loss': 0.0079, 'grad_norm': 0.22833997011184692, 'learning_rate': 2.7175421358545667e-05, 'epoch': 4.02} 40%|████ | 4017/10000 [6:21:06<9:23:06, 5.65s/it][2025-06-19 19:50:51,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:50:51,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.83 | bwd_microstep: 3374.07 | bwd_inner_microstep: 3373.25 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.98 [2025-06-19 19:50:51,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.83 | bwd: 3374.09 | bwd_inner: 3373.25 | bwd_allreduce: 0.79 | step: 6.98 40%|████ | 4018/10000 [6:21:12<9:20:07, 5.62s/it] {'loss': 
0.1263, 'grad_norm': 2.165870428085327, 'learning_rate': 2.7169374708178443e-05, 'epoch': 4.02} 40%|████ | 4018/10000 [6:21:12<9:20:07, 5.62s/it][2025-06-19 19:50:57,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:50:57,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.10 | bwd_microstep: 3324.53 | bwd_inner_microstep: 3323.70 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.92 [2025-06-19 19:50:57,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.10 | bwd: 3324.54 | bwd_inner: 3323.71 | bwd_allreduce: 0.80 | step: 6.92 40%|████ | 4019/10000 [6:21:17<9:15:40, 5.57s/it] {'loss': 0.0159, 'grad_norm': 0.650512158870697, 'learning_rate': 2.7163327305777073e-05, 'epoch': 4.02} 40%|████ | 4019/10000 [6:21:17<9:15:40, 5.57s/it][2025-06-19 19:51:02,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:51:02,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.19 | bwd_microstep: 3373.61 | bwd_inner_microstep: 3372.79 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.81 [2025-06-19 19:51:02,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.19 | bwd: 3373.63 | bwd_inner: 3372.79 | bwd_allreduce: 0.79 | step: 6.82 40%|████ | 4020/10000 [6:21:23<9:14:51, 5.57s/it] {'loss': 0.0486, 'grad_norm': 1.9358853101730347, 'learning_rate': 2.715727915197592e-05, 'epoch': 4.02} 40%|████ | 4020/10000 [6:21:23<9:14:51, 5.57s/it][2025-06-19 19:51:08,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 19:51:08,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.18 | bwd_microstep: 3328.49 | bwd_inner_microstep: 3327.27 | bwd_allreduce_microstep: 1.13 | step_microstep: 8.32 
[2025-06-19 19:51:08,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.18 | bwd: 3328.52 | bwd_inner: 3327.27 | bwd_allreduce: 1.17 | step: 8.33 40%|████ | 4021/10000 [6:21:28<9:12:19, 5.54s/it] {'loss': 0.0036, 'grad_norm': 0.18065577745437622, 'learning_rate': 2.7151230247409407e-05, 'epoch': 4.02} 40%|████ | 4021/10000 [6:21:28<9:12:19, 5.54s/it][2025-06-19 19:51:13,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 19:51:13,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2176.81 | bwd_microstep: 3379.79 | bwd_inner_microstep: 3378.71 | bwd_allreduce_microstep: 1.02 | step_microstep: 8.29 [2025-06-19 19:51:13,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2176.81 | bwd: 3379.81 | bwd_inner: 3378.71 | bwd_allreduce: 1.05 | step: 8.30 40%|████ | 4022/10000 [6:21:34<9:14:11, 5.56s/it] {'loss': 0.0143, 'grad_norm': 1.3561731576919556, 'learning_rate': 2.7145180592712024e-05, 'epoch': 4.02} 40%|████ | 4022/10000 [6:21:34<9:14:11, 5.56s/it][2025-06-19 19:51:19,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.71 | optimizer_step: 2.73 [2025-06-19 19:51:19,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.00 | bwd_microstep: 3384.43 | bwd_inner_microstep: 3382.94 | bwd_allreduce_microstep: 1.34 | step_microstep: 10.56 [2025-06-19 19:51:19,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.00 | bwd: 3384.47 | bwd_inner: 3382.94 | bwd_allreduce: 1.42 | step: 10.56 40%|████ | 4023/10000 [6:21:40<9:15:04, 5.57s/it] {'loss': 0.0058, 'grad_norm': 0.34735509753227234, 'learning_rate': 2.713913018851836e-05, 'epoch': 4.02} 40%|████ | 4023/10000 [6:21:40<9:15:04, 5.57s/it][2025-06-19 19:51:24,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | 
optimizer_step: 2.73 [2025-06-19 19:51:24,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2159.06 | bwd_microstep: 3345.71 | bwd_inner_microstep: 3344.76 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.93 [2025-06-19 19:51:24,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2159.06 | bwd: 3345.74 | bwd_inner: 3344.76 | bwd_allreduce: 0.90 | step: 7.94 40%|████ | 4024/10000 [6:21:45<9:14:35, 5.57s/it] {'loss': 0.0365, 'grad_norm': 1.8220767974853516, 'learning_rate': 2.7133079035463073e-05, 'epoch': 4.02} 40%|████ | 4024/10000 [6:21:45<9:14:35, 5.57s/it][2025-06-19 19:51:30,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-19 19:51:30,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.43 | bwd_microstep: 3314.16 | bwd_inner_microstep: 3313.17 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.65 [2025-06-19 19:51:30,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.43 | bwd: 3314.20 | bwd_inner: 3313.17 | bwd_allreduce: 0.93 | step: 7.63 40%|████ | 4025/10000 [6:21:51<9:12:24, 5.55s/it] {'loss': 0.0102, 'grad_norm': 0.5064238905906677, 'learning_rate': 2.71270271341809e-05, 'epoch': 4.03} 40%|████ | 4025/10000 [6:21:51<9:12:24, 5.55s/it][2025-06-19 19:51:35,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:51:35,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.30 | bwd_microstep: 3321.47 | bwd_inner_microstep: 3320.46 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.60 [2025-06-19 19:51:35,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.30 | bwd: 3321.49 | bwd_inner: 3320.46 | bwd_allreduce: 0.99 | step: 7.61 40%|████ | 4026/10000 [6:21:56<9:10:27, 5.53s/it] {'loss': 0.0065, 'grad_norm': 0.9198450446128845, 'learning_rate': 
2.7120974485306662e-05, 'epoch': 4.03} 40%|████ | 4026/10000 [6:21:56<9:10:27, 5.53s/it][2025-06-19 19:51:41,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 19:51:41,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.87 | bwd_microstep: 3321.65 | bwd_inner_microstep: 3320.79 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.32 [2025-06-19 19:51:41,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.87 | bwd: 3321.67 | bwd_inner: 3320.79 | bwd_allreduce: 0.82 | step: 7.32 40%|████ | 4027/10000 [6:22:02<9:09:15, 5.52s/it] {'loss': 0.0388, 'grad_norm': 1.6116210222244263, 'learning_rate': 2.711492108947525e-05, 'epoch': 4.03} 40%|████ | 4027/10000 [6:22:02<9:09:15, 5.52s/it][2025-06-19 19:51:46,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:51:46,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.92 | bwd_microstep: 3370.78 | bwd_inner_microstep: 3369.71 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.22 [2025-06-19 19:51:46,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.92 | bwd: 3370.80 | bwd_inner: 3369.71 | bwd_allreduce: 1.03 | step: 7.22 40%|████ | 4028/10000 [6:22:07<9:10:35, 5.53s/it] {'loss': 0.093, 'grad_norm': 2.163853168487549, 'learning_rate': 2.7108866947321636e-05, 'epoch': 4.03} 40%|████ | 4028/10000 [6:22:07<9:10:35, 5.53s/it][2025-06-19 19:51:52,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 19:51:52,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.08 | bwd_microstep: 3369.21 | bwd_inner_microstep: 3367.84 | bwd_allreduce_microstep: 1.27 | step_microstep: 8.44 [2025-06-19 19:51:52,440] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2133.08 | bwd: 3369.24 | bwd_inner: 3367.84 | bwd_allreduce: 1.32 | step: 8.44 40%|████ | 4029/10000 [6:22:13<9:11:03, 5.54s/it] {'loss': 0.0283, 'grad_norm': 1.1536577939987183, 'learning_rate': 2.710281205948087e-05, 'epoch': 4.03} 40%|████ | 4029/10000 [6:22:13<9:11:03, 5.54s/it][2025-06-19 19:51:57,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:51:57,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.78 | bwd_microstep: 3319.88 | bwd_inner_microstep: 3319.03 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.13 [2025-06-19 19:51:57,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.78 | bwd: 3319.90 | bwd_inner: 3319.03 | bwd_allreduce: 0.81 | step: 7.13 40%|████ | 4030/10000 [6:22:18<9:09:19, 5.52s/it] {'loss': 0.0463, 'grad_norm': 1.3509753942489624, 'learning_rate': 2.7096756426588084e-05, 'epoch': 4.03} 40%|████ | 4030/10000 [6:22:18<9:09:19, 5.52s/it][2025-06-19 19:52:03,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:52:03,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.01 | bwd_microstep: 3323.68 | bwd_inner_microstep: 3322.73 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.87 [2025-06-19 19:52:03,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.01 | bwd: 3323.69 | bwd_inner: 3322.73 | bwd_allreduce: 0.92 | step: 6.88 40%|████ | 4031/10000 [6:22:24<9:08:30, 5.51s/it] {'loss': 0.0132, 'grad_norm': 0.510429322719574, 'learning_rate': 2.709070004927848e-05, 'epoch': 4.03} 40%|████ | 4031/10000 [6:22:24<9:08:30, 5.51s/it][2025-06-19 19:52:08,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.72 [2025-06-19 19:52:08,912] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.64 | bwd_microstep: 3322.35 | bwd_inner_microstep: 3321.03 | bwd_allreduce_microstep: 1.20 | step_microstep: 7.62 [2025-06-19 19:52:08,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.64 | bwd: 3322.39 | bwd_inner: 3321.03 | bwd_allreduce: 1.25 | step: 7.64 40%|████ | 4032/10000 [6:22:29<9:07:55, 5.51s/it] {'loss': 0.0853, 'grad_norm': 2.8393397331237793, 'learning_rate': 2.708464292818736e-05, 'epoch': 4.03} 40%|████ | 4032/10000 [6:22:29<9:07:55, 5.51s/it][2025-06-19 19:52:14,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:52:14,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.46 | bwd_microstep: 3363.88 | bwd_inner_microstep: 3362.92 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.52 [2025-06-19 19:52:14,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.46 | bwd: 3363.90 | bwd_inner: 3362.92 | bwd_allreduce: 0.93 | step: 7.52 40%|████ | 4033/10000 [6:22:35<9:09:20, 5.52s/it] {'loss': 0.0036, 'grad_norm': 0.19785147905349731, 'learning_rate': 2.7078585063950077e-05, 'epoch': 4.03} 40%|████ | 4033/10000 [6:22:35<9:09:20, 5.52s/it][2025-06-19 19:52:19,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 19:52:19,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.09 | bwd_microstep: 3314.63 | bwd_inner_microstep: 3313.68 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.28 [2025-06-19 19:52:19,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.09 | bwd: 3314.65 | bwd_inner: 3313.68 | bwd_allreduce: 0.93 | step: 7.28 40%|████ | 4034/10000 [6:22:40<9:07:47, 5.51s/it] {'loss': 0.0617, 'grad_norm': 2.7008213996887207, 'learning_rate': 2.7072526457202073e-05, 'epoch': 4.03} 40%|████ | 
4034/10000 [6:22:40<9:07:47, 5.51s/it][2025-06-19 19:52:25,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:52:25,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.61 | bwd_microstep: 3322.07 | bwd_inner_microstep: 3320.93 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.14 [2025-06-19 19:52:25,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.61 | bwd: 3322.10 | bwd_inner: 3320.93 | bwd_allreduce: 1.11 | step: 7.14 40%|████ | 4035/10000 [6:22:46<9:06:54, 5.50s/it] {'loss': 0.0245, 'grad_norm': 1.1392120122909546, 'learning_rate': 2.7066467108578865e-05, 'epoch': 4.04} 40%|████ | 4035/10000 [6:22:46<9:06:54, 5.50s/it][2025-06-19 19:52:30,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:52:30,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.70 | bwd_microstep: 3316.23 | bwd_inner_microstep: 3315.30 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.04 [2025-06-19 19:52:30,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.70 | bwd: 3316.25 | bwd_inner: 3315.30 | bwd_allreduce: 0.90 | step: 7.04 40%|████ | 4036/10000 [6:22:51<9:06:02, 5.49s/it] {'loss': 0.0154, 'grad_norm': 0.7021244168281555, 'learning_rate': 2.706040701871606e-05, 'epoch': 4.04} 40%|████ | 4036/10000 [6:22:51<9:06:02, 5.49s/it][2025-06-19 19:52:36,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:52:36,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.67 | bwd_microstep: 3376.90 | bwd_inner_microstep: 3376.03 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.56 [2025-06-19 19:52:36,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.67 | bwd: 3376.93 | 
bwd_inner: 3376.03 | bwd_allreduce: 0.83 | step: 7.56 40%|████ | 4037/10000 [6:22:57<9:07:54, 5.51s/it] {'loss': 0.0704, 'grad_norm': 2.1297004222869873, 'learning_rate': 2.7054346188249317e-05, 'epoch': 4.04} 40%|████ | 4037/10000 [6:22:57<9:07:54, 5.51s/it][2025-06-19 19:52:41,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:52:41,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.10 | bwd_microstep: 3317.42 | bwd_inner_microstep: 3316.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 19:52:41,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.10 | bwd: 3317.43 | bwd_inner: 3316.62 | bwd_allreduce: 0.77 | step: 6.78 40%|████ | 4038/10000 [6:23:02<9:06:36, 5.50s/it] {'loss': 0.0911, 'grad_norm': 2.297414541244507, 'learning_rate': 2.704828461781441e-05, 'epoch': 4.04} 40%|████ | 4038/10000 [6:23:02<9:06:36, 5.50s/it][2025-06-19 19:52:47,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:52:47,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.68 | bwd_microstep: 3314.24 | bwd_inner_microstep: 3313.35 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.11 [2025-06-19 19:52:47,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.68 | bwd: 3314.26 | bwd_inner: 3313.36 | bwd_allreduce: 0.84 | step: 7.11 40%|████ | 4039/10000 [6:23:08<9:05:51, 5.49s/it] {'loss': 0.0176, 'grad_norm': 1.106345534324646, 'learning_rate': 2.7042222308047147e-05, 'epoch': 4.04} 40%|████ | 4039/10000 [6:23:08<9:05:51, 5.49s/it][2025-06-19 19:52:52,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:52:52,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.39 | 
bwd_microstep: 3322.50 | bwd_inner_microstep: 3321.67 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.74 [2025-06-19 19:52:52,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.39 | bwd: 3322.52 | bwd_inner: 3321.67 | bwd_allreduce: 0.80 | step: 6.74 40%|████ | 4040/10000 [6:23:13<9:06:21, 5.50s/it] {'loss': 0.0267, 'grad_norm': 1.530403971672058, 'learning_rate': 2.703615925958345e-05, 'epoch': 4.04} 40%|████ | 4040/10000 [6:23:13<9:06:21, 5.50s/it][2025-06-19 19:52:58,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.72 [2025-06-19 19:52:58,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.96 | bwd_microstep: 3314.81 | bwd_inner_microstep: 3313.53 | bwd_allreduce_microstep: 1.17 | step_microstep: 8.65 [2025-06-19 19:52:58,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.96 | bwd: 3314.85 | bwd_inner: 3313.53 | bwd_allreduce: 1.22 | step: 8.65 40%|████ | 4041/10000 [6:23:19<9:05:59, 5.50s/it] {'loss': 0.0116, 'grad_norm': 0.8748428821563721, 'learning_rate': 2.7030095473059312e-05, 'epoch': 4.04} 40%|████ | 4041/10000 [6:23:19<9:05:59, 5.50s/it][2025-06-19 19:53:03,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:53:03,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.75 | bwd_microstep: 3360.73 | bwd_inner_microstep: 3359.89 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.86 [2025-06-19 19:53:03,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.75 | bwd: 3360.76 | bwd_inner: 3359.89 | bwd_allreduce: 0.82 | step: 6.86 40%|████ | 4042/10000 [6:23:24<9:07:11, 5.51s/it] {'loss': 0.1373, 'grad_norm': 4.254879951477051, 'learning_rate': 2.7024030949110778e-05, 'epoch': 4.04} 40%|████ | 4042/10000 [6:23:24<9:07:11, 5.51s/it][2025-06-19 19:53:09,418] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:53:09,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.05 | bwd_microstep: 3308.05 | bwd_inner_microstep: 3307.14 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.98 [2025-06-19 19:53:09,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.05 | bwd: 3308.06 | bwd_inner: 3307.14 | bwd_allreduce: 0.88 | step: 6.98 40%|████ | 4043/10000 [6:23:30<9:05:18, 5.49s/it] {'loss': 0.0091, 'grad_norm': 0.39513158798217773, 'learning_rate': 2.7017965688374005e-05, 'epoch': 4.04} 40%|████ | 4043/10000 [6:23:30<9:05:18, 5.49s/it][2025-06-19 19:53:14,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:53:14,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.54 | bwd_microstep: 3317.99 | bwd_inner_microstep: 3317.18 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 19:53:14,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.54 | bwd: 3318.01 | bwd_inner: 3317.18 | bwd_allreduce: 0.78 | step: 6.96 40%|████ | 4044/10000 [6:23:35<9:04:31, 5.49s/it] {'loss': 0.0314, 'grad_norm': 1.3202060461044312, 'learning_rate': 2.7011899691485198e-05, 'epoch': 4.04} 40%|████ | 4044/10000 [6:23:35<9:04:31, 5.49s/it][2025-06-19 19:53:20,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:53:20,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.72 | bwd_microstep: 3361.75 | bwd_inner_microstep: 3360.92 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.09 [2025-06-19 19:53:20,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.72 | bwd: 3361.76 | bwd_inner: 3360.92 | bwd_allreduce: 0.79 | step: 7.09 40%|████ | 
4045/10000 [6:23:41<9:06:03, 5.50s/it] {'loss': 0.1266, 'grad_norm': 3.081583261489868, 'learning_rate': 2.7005832959080653e-05, 'epoch': 4.04} 40%|████ | 4045/10000 [6:23:41<9:06:03, 5.50s/it][2025-06-19 19:53:25,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 19:53:25,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.65 | bwd_microstep: 3315.84 | bwd_inner_microstep: 3314.80 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.08 [2025-06-19 19:53:25,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.66 | bwd: 3315.86 | bwd_inner: 3314.79 | bwd_allreduce: 1.01 | step: 8.09 40%|████ | 4046/10000 [6:23:46<9:04:56, 5.49s/it] {'loss': 0.0109, 'grad_norm': 1.0480889081954956, 'learning_rate': 2.699976549179675e-05, 'epoch': 4.05} 40%|████ | 4046/10000 [6:23:46<9:04:56, 5.49s/it][2025-06-19 19:53:31,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:53:31,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.43 | bwd_microstep: 3375.24 | bwd_inner_microstep: 3374.41 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.79 [2025-06-19 19:53:31,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.43 | bwd: 3375.26 | bwd_inner: 3374.41 | bwd_allreduce: 0.79 | step: 6.80 40%|████ | 4047/10000 [6:23:52<9:07:01, 5.51s/it] {'loss': 0.0073, 'grad_norm': 0.42537057399749756, 'learning_rate': 2.6993697290269935e-05, 'epoch': 4.05} 40%|████ | 4047/10000 [6:23:52<9:07:01, 5.51s/it][2025-06-19 19:53:36,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:53:36,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.47 | bwd_microstep: 3357.44 | bwd_inner_microstep: 3356.59 | 
bwd_allreduce_microstep: 0.80 | step_microstep: 6.91 [2025-06-19 19:53:36,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.47 | bwd: 3357.46 | bwd_inner: 3356.59 | bwd_allreduce: 0.82 | step: 6.91 40%|████ | 4048/10000 [6:23:57<9:07:35, 5.52s/it] {'loss': 0.026, 'grad_norm': 2.4005398750305176, 'learning_rate': 2.698762835513673e-05, 'epoch': 4.05} 40%|████ | 4048/10000 [6:23:57<9:07:35, 5.52s/it][2025-06-19 19:53:42,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:53:42,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.75 | bwd_microstep: 3363.26 | bwd_inner_microstep: 3362.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 19:53:42,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.75 | bwd: 3363.28 | bwd_inner: 3362.47 | bwd_allreduce: 0.77 | step: 6.73 40%|████ | 4049/10000 [6:24:03<9:08:01, 5.53s/it] {'loss': 0.0364, 'grad_norm': 1.7453727722167969, 'learning_rate': 2.698155868703374e-05, 'epoch': 4.05} 40%|████ | 4049/10000 [6:24:03<9:08:01, 5.53s/it][2025-06-19 19:53:47,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:53:47,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.15 | bwd_microstep: 3315.70 | bwd_inner_microstep: 3314.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-19 19:53:47,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.15 | bwd: 3315.72 | bwd_inner: 3314.89 | bwd_allreduce: 0.78 | step: 6.88 40%|████ | 4050/10000 [6:24:08<9:05:55, 5.51s/it] {'loss': 0.0097, 'grad_norm': 0.4278554916381836, 'learning_rate': 2.6975488286597643e-05, 'epoch': 4.05} 40%|████ | 4050/10000 [6:24:08<9:05:55, 5.51s/it][2025-06-19 19:53:53,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:53:53,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.77 | bwd_microstep: 3314.02 | bwd_inner_microstep: 3313.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 19:53:53,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.77 | bwd: 3314.04 | bwd_inner: 3313.22 | bwd_allreduce: 0.77 | step: 6.79 41%|████ | 4051/10000 [6:24:14<9:04:35, 5.49s/it] {'loss': 0.0513, 'grad_norm': 2.5529305934906006, 'learning_rate': 2.6969417154465212e-05, 'epoch': 4.05} 41%|████ | 4051/10000 [6:24:14<9:04:35, 5.49s/it][2025-06-19 19:53:58,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:53:58,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.19 | bwd_microstep: 3356.90 | bwd_inner_microstep: 3356.08 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-19 19:53:58,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.19 | bwd: 3356.91 | bwd_inner: 3356.08 | bwd_allreduce: 0.79 | step: 7.23 41%|████ | 4052/10000 [6:24:19<9:05:21, 5.50s/it] {'loss': 0.005, 'grad_norm': 0.2520199120044708, 'learning_rate': 2.6963345291273254e-05, 'epoch': 4.05} 41%|████ | 4052/10000 [6:24:19<9:05:21, 5.50s/it][2025-06-19 19:54:04,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.73 [2025-06-19 19:54:04,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.71 | bwd_microstep: 3307.78 | bwd_inner_microstep: 3306.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 19:54:04,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.71 | bwd: 3307.80 | bwd_inner: 3306.97 | bwd_allreduce: 0.78 | step: 7.17 41%|████ | 4053/10000 [6:24:25<9:03:32, 5.48s/it] {'loss': 0.0587, 'grad_norm': 
2.5071425437927246, 'learning_rate': 2.69572726976587e-05, 'epoch': 4.05} 41%|████ | 4053/10000 [6:24:25<9:03:32, 5.48s/it][2025-06-19 19:54:09,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:54:09,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.51 | bwd_microstep: 3368.42 | bwd_inner_microstep: 3367.31 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.36 [2025-06-19 19:54:09,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.51 | bwd: 3368.44 | bwd_inner: 3367.31 | bwd_allreduce: 1.07 | step: 7.36 41%|████ | 4054/10000 [6:24:30<9:04:54, 5.50s/it] {'loss': 0.1516, 'grad_norm': 2.5043487548828125, 'learning_rate': 2.695119937425853e-05, 'epoch': 4.05} 41%|████ | 4054/10000 [6:24:30<9:04:54, 5.50s/it][2025-06-19 19:54:15,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:54:15,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.35 | bwd_microstep: 3315.90 | bwd_inner_microstep: 3315.09 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.33 [2025-06-19 19:54:15,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.36 | bwd: 3315.92 | bwd_inner: 3315.09 | bwd_allreduce: 0.78 | step: 7.33 41%|████ | 4055/10000 [6:24:36<9:03:45, 5.49s/it] {'loss': 0.0379, 'grad_norm': 2.0715603828430176, 'learning_rate': 2.6945125321709804e-05, 'epoch': 4.05} 41%|████ | 4055/10000 [6:24:36<9:03:45, 5.49s/it][2025-06-19 19:54:20,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:54:20,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.80 | bwd_microstep: 3319.99 | bwd_inner_microstep: 3319.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 19:54:20,885] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.80 | bwd: 3320.00 | bwd_inner: 3319.20 | bwd_allreduce: 0.76 | step: 6.65 41%|████ | 4056/10000 [6:24:41<9:03:08, 5.48s/it] {'loss': 0.0311, 'grad_norm': 1.7211428880691528, 'learning_rate': 2.693905054064967e-05, 'epoch': 4.06} 41%|████ | 4056/10000 [6:24:41<9:03:08, 5.48s/it][2025-06-19 19:54:26,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:54:26,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.97 | bwd_microstep: 3364.49 | bwd_inner_microstep: 3363.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 19:54:26,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.97 | bwd: 3364.50 | bwd_inner: 3363.70 | bwd_allreduce: 0.76 | step: 6.63 41%|████ | 4057/10000 [6:24:47<9:04:27, 5.50s/it] {'loss': 0.0107, 'grad_norm': 0.5386976599693298, 'learning_rate': 2.6932975031715335e-05, 'epoch': 4.06} 41%|████ | 4057/10000 [6:24:47<9:04:27, 5.50s/it][2025-06-19 19:54:31,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:54:31,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.18 | bwd_microstep: 3319.69 | bwd_inner_microstep: 3318.81 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.51 [2025-06-19 19:54:31,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.18 | bwd: 3319.71 | bwd_inner: 3318.81 | bwd_allreduce: 0.84 | step: 7.52 41%|████ | 4058/10000 [6:24:52<9:03:39, 5.49s/it] {'loss': 0.0227, 'grad_norm': 0.8049699068069458, 'learning_rate': 2.692689879554409e-05, 'epoch': 4.06} 41%|████ | 4058/10000 [6:24:52<9:03:39, 5.49s/it][2025-06-19 19:54:37,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
19:54:37,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.55 | bwd_microstep: 3394.31 | bwd_inner_microstep: 3393.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 19:54:37,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.55 | bwd: 3394.32 | bwd_inner: 3393.51 | bwd_allreduce: 0.77 | step: 6.90 41%|████ | 4059/10000 [6:24:58<9:06:30, 5.52s/it] {'loss': 0.0555, 'grad_norm': 3.532032012939453, 'learning_rate': 2.6920821832773323e-05, 'epoch': 4.06} 41%|████ | 4059/10000 [6:24:58<9:06:30, 5.52s/it][2025-06-19 19:54:42,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:54:42,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.89 | bwd_microstep: 3323.94 | bwd_inner_microstep: 3323.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 19:54:42,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.89 | bwd: 3323.96 | bwd_inner: 3323.14 | bwd_allreduce: 0.78 | step: 6.87 41%|████ | 4060/10000 [6:25:03<9:04:49, 5.50s/it] {'loss': 0.0941, 'grad_norm': 1.6943039894104004, 'learning_rate': 2.691474414404046e-05, 'epoch': 4.06} 41%|████ | 4060/10000 [6:25:03<9:04:49, 5.50s/it][2025-06-19 19:54:48,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:54:48,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.72 | bwd_microstep: 3365.15 | bwd_inner_microstep: 3364.33 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 19:54:48,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.72 | bwd: 3365.16 | bwd_inner: 3364.33 | bwd_allreduce: 0.79 | step: 7.29 41%|████ | 4061/10000 [6:25:09<9:05:34, 5.51s/it] {'loss': 0.0635, 'grad_norm': 2.4383111000061035, 'learning_rate': 2.690866572998303e-05, 'epoch': 4.06} 
41%|████ | 4061/10000 [6:25:09<9:05:34, 5.51s/it][2025-06-19 19:54:53,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:54:53,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.82 | bwd_microstep: 3326.55 | bwd_inner_microstep: 3325.74 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 19:54:53,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.82 | bwd: 3326.57 | bwd_inner: 3325.74 | bwd_allreduce: 0.78 | step: 6.72 41%|████ | 4062/10000 [6:25:14<9:04:11, 5.50s/it] {'loss': 0.0034, 'grad_norm': 0.15484263002872467, 'learning_rate': 2.6902586591238623e-05, 'epoch': 4.06} 41%|████ | 4062/10000 [6:25:14<9:04:11, 5.50s/it][2025-06-19 19:54:59,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 19:54:59,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.50 | bwd_microstep: 3326.58 | bwd_inner_microstep: 3325.42 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.98 [2025-06-19 19:54:59,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.50 | bwd: 3326.61 | bwd_inner: 3325.42 | bwd_allreduce: 1.12 | step: 7.98 41%|████ | 4063/10000 [6:25:20<9:03:37, 5.49s/it] {'loss': 0.1157, 'grad_norm': 4.306854724884033, 'learning_rate': 2.6896506728444925e-05, 'epoch': 4.06} 41%|████ | 4063/10000 [6:25:20<9:03:37, 5.49s/it][2025-06-19 19:55:04,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:55:04,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.29 | bwd_microstep: 3369.16 | bwd_inner_microstep: 3368.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-19 19:55:04,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.29 | bwd: 
3369.17 | bwd_inner: 3368.35 | bwd_allreduce: 0.77 | step: 6.79 41%|████ | 4064/10000 [6:25:25<9:05:01, 5.51s/it] {'loss': 0.0044, 'grad_norm': 0.4008919596672058, 'learning_rate': 2.6890426142239682e-05, 'epoch': 4.06} 41%|████ | 4064/10000 [6:25:25<9:05:01, 5.51s/it][2025-06-19 19:55:10,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:55:10,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.45 | bwd_microstep: 3371.98 | bwd_inner_microstep: 3371.04 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.79 [2025-06-19 19:55:10,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.45 | bwd: 3371.99 | bwd_inner: 3371.04 | bwd_allreduce: 0.91 | step: 6.80 41%|████ | 4065/10000 [6:25:31<9:05:54, 5.52s/it] {'loss': 0.0413, 'grad_norm': 1.7953885793685913, 'learning_rate': 2.688434483326071e-05, 'epoch': 4.07} 41%|████ | 4065/10000 [6:25:31<9:05:54, 5.52s/it][2025-06-19 19:55:16,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:55:16,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.97 | bwd_microstep: 3360.62 | bwd_inner_microstep: 3359.78 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.77 [2025-06-19 19:55:16,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.97 | bwd: 3360.64 | bwd_inner: 3359.78 | bwd_allreduce: 0.80 | step: 6.78 41%|████ | 4066/10000 [6:25:36<9:06:09, 5.52s/it] {'loss': 0.0125, 'grad_norm': 1.0196893215179443, 'learning_rate': 2.6878262802145916e-05, 'epoch': 4.07} 41%|████ | 4066/10000 [6:25:36<9:06:09, 5.52s/it][2025-06-19 19:55:21,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:55:21,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2127.06 | bwd_microstep: 3369.06 | bwd_inner_microstep: 3368.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-19 19:55:21,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.06 | bwd: 3369.07 | bwd_inner: 3368.27 | bwd_allreduce: 0.76 | step: 7.07 41%|████ | 4067/10000 [6:25:42<9:06:26, 5.53s/it] {'loss': 0.037, 'grad_norm': 1.8347976207733154, 'learning_rate': 2.6872180049533276e-05, 'epoch': 4.07} 41%|████ | 4067/10000 [6:25:42<9:06:26, 5.53s/it][2025-06-19 19:55:27,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 19:55:27,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.92 | bwd_microstep: 3365.25 | bwd_inner_microstep: 3364.05 | bwd_allreduce_microstep: 1.13 | step_microstep: 7.88 [2025-06-19 19:55:27,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.92 | bwd: 3365.27 | bwd_inner: 3364.05 | bwd_allreduce: 1.16 | step: 7.88 41%|████ | 4068/10000 [6:25:47<9:06:43, 5.53s/it] {'loss': 0.0637, 'grad_norm': 2.316349506378174, 'learning_rate': 2.686609657606085e-05, 'epoch': 4.07} 41%|████ | 4068/10000 [6:25:47<9:06:43, 5.53s/it][2025-06-19 19:55:32,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:55:32,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.18 | bwd_microstep: 3313.81 | bwd_inner_microstep: 3313.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 19:55:32,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.18 | bwd: 3313.83 | bwd_inner: 3313.01 | bwd_allreduce: 0.78 | step: 7.16 41%|████ | 4069/10000 [6:25:53<9:05:06, 5.51s/it] {'loss': 0.0288, 'grad_norm': 1.730415940284729, 'learning_rate': 2.6860012382366756e-05, 'epoch': 4.07} 41%|████ | 4069/10000 [6:25:53<9:05:06, 5.51s/it][2025-06-19 19:55:38,055] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:55:38,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.01 | bwd_microstep: 3312.81 | bwd_inner_microstep: 3311.98 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.86 [2025-06-19 19:55:38,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.01 | bwd: 3312.83 | bwd_inner: 3311.98 | bwd_allreduce: 0.80 | step: 6.86 41%|████ | 4070/10000 [6:25:58<9:03:26, 5.50s/it] {'loss': 0.0042, 'grad_norm': 0.2820179760456085, 'learning_rate': 2.6853927469089202e-05, 'epoch': 4.07} 41%|████ | 4070/10000 [6:25:58<9:03:26, 5.50s/it][2025-06-19 19:55:43,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:55:43,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.51 | bwd_microstep: 3381.70 | bwd_inner_microstep: 3380.66 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.42 [2025-06-19 19:55:43,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.52 | bwd: 3381.72 | bwd_inner: 3380.66 | bwd_allreduce: 1.00 | step: 7.42 41%|████ | 4071/10000 [6:26:04<9:05:11, 5.52s/it] {'loss': 0.0022, 'grad_norm': 0.10305827856063843, 'learning_rate': 2.6847841836866468e-05, 'epoch': 4.07} 41%|████ | 4071/10000 [6:26:04<9:05:11, 5.52s/it][2025-06-19 19:55:49,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:55:49,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.04 | bwd_microstep: 3368.45 | bwd_inner_microstep: 3367.50 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.48 [2025-06-19 19:55:49,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.04 | bwd: 3368.47 | bwd_inner: 3367.50 | bwd_allreduce: 0.92 | step: 7.48 41%|████ | 
4072/10000 [6:26:09<9:06:02, 5.53s/it] {'loss': 0.0254, 'grad_norm': 1.1624544858932495, 'learning_rate': 2.68417554863369e-05, 'epoch': 4.07} 41%|████ | 4072/10000 [6:26:09<9:06:02, 5.53s/it][2025-06-19 19:55:54,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:55:54,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.78 | bwd_microstep: 3320.10 | bwd_inner_microstep: 3319.26 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.93 [2025-06-19 19:55:54,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.78 | bwd: 3320.12 | bwd_inner: 3319.26 | bwd_allreduce: 0.80 | step: 6.93 41%|████ | 4073/10000 [6:26:15<9:04:19, 5.51s/it] {'loss': 0.0162, 'grad_norm': 0.4760107100009918, 'learning_rate': 2.6835668418138943e-05, 'epoch': 4.07} 41%|████ | 4073/10000 [6:26:15<9:04:19, 5.51s/it][2025-06-19 19:56:00,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:56:00,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.80 | bwd_microstep: 3323.17 | bwd_inner_microstep: 3322.22 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.55 [2025-06-19 19:56:00,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.80 | bwd: 3323.19 | bwd_inner: 3322.22 | bwd_allreduce: 0.92 | step: 7.55 41%|████ | 4074/10000 [6:26:20<9:02:56, 5.50s/it] {'loss': 0.0501, 'grad_norm': 1.9217208623886108, 'learning_rate': 2.6829580632911092e-05, 'epoch': 4.07} 41%|████ | 4074/10000 [6:26:20<9:02:56, 5.50s/it][2025-06-19 19:56:05,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 19:56:05,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.28 | bwd_microstep: 3311.05 | bwd_inner_microstep: 3310.20 | 
bwd_allreduce_microstep: 0.79 | step_microstep: 7.06 [2025-06-19 19:56:05,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.28 | bwd: 3311.07 | bwd_inner: 3310.20 | bwd_allreduce: 0.82 | step: 7.07 41%|████ | 4075/10000 [6:26:26<9:01:36, 5.48s/it] {'loss': 0.0104, 'grad_norm': 0.5687869191169739, 'learning_rate': 2.6823492131291923e-05, 'epoch': 4.08} 41%|████ | 4075/10000 [6:26:26<9:01:36, 5.48s/it][2025-06-19 19:56:11,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:56:11,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.48 | bwd_microstep: 3371.22 | bwd_inner_microstep: 3370.40 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.82 [2025-06-19 19:56:11,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.49 | bwd: 3371.24 | bwd_inner: 3370.40 | bwd_allreduce: 0.79 | step: 6.83 41%|████ | 4076/10000 [6:26:31<9:03:14, 5.50s/it] {'loss': 0.0094, 'grad_norm': 0.3833673596382141, 'learning_rate': 2.6817402913920104e-05, 'epoch': 4.08} 41%|████ | 4076/10000 [6:26:31<9:03:14, 5.50s/it][2025-06-19 19:56:16,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:56:16,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.60 | bwd_microstep: 3367.97 | bwd_inner_microstep: 3367.08 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.06 [2025-06-19 19:56:16,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.59 | bwd: 3367.99 | bwd_inner: 3367.08 | bwd_allreduce: 0.85 | step: 7.06 41%|████ | 4077/10000 [6:26:37<9:04:16, 5.51s/it] {'loss': 0.0225, 'grad_norm': 0.9370264410972595, 'learning_rate': 2.681131298143436e-05, 'epoch': 4.08} 41%|████ | 4077/10000 [6:26:37<9:04:16, 5.51s/it][2025-06-19 19:56:22,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:56:22,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.64 | bwd_microstep: 3316.03 | bwd_inner_microstep: 3315.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 19:56:22,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.64 | bwd: 3316.05 | bwd_inner: 3315.22 | bwd_allreduce: 0.78 | step: 7.08 41%|████ | 4078/10000 [6:26:42<9:02:57, 5.50s/it] {'loss': 0.0034, 'grad_norm': 0.14523428678512573, 'learning_rate': 2.6805222334473496e-05, 'epoch': 4.08} 41%|████ | 4078/10000 [6:26:42<9:02:57, 5.50s/it][2025-06-19 19:56:27,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:56:27,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.93 | bwd_microstep: 3363.18 | bwd_inner_microstep: 3362.23 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.41 [2025-06-19 19:56:27,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.93 | bwd: 3363.19 | bwd_inner: 3362.23 | bwd_allreduce: 0.92 | step: 7.42 41%|████ | 4079/10000 [6:26:48<9:03:51, 5.51s/it] {'loss': 0.0447, 'grad_norm': 2.0221691131591797, 'learning_rate': 2.67991309736764e-05, 'epoch': 4.08} 41%|████ | 4079/10000 [6:26:48<9:03:51, 5.51s/it][2025-06-19 19:56:33,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:56:33,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.01 | bwd_microstep: 3316.84 | bwd_inner_microstep: 3316.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 19:56:33,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.01 | bwd: 3316.86 | bwd_inner: 3316.04 | bwd_allreduce: 0.77 | step: 6.95 41%|████ | 4080/10000 [6:26:53<9:02:21, 5.50s/it] {'loss': 
0.0131, 'grad_norm': 1.0213289260864258, 'learning_rate': 2.679303889968201e-05, 'epoch': 4.08} 41%|████ | 4080/10000 [6:26:53<9:02:21, 5.50s/it][2025-06-19 19:56:38,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:56:38,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.12 | bwd_microstep: 3307.82 | bwd_inner_microstep: 3307.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 19:56:38,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.12 | bwd: 3307.84 | bwd_inner: 3307.01 | bwd_allreduce: 0.78 | step: 7.01 41%|████ | 4081/10000 [6:26:59<9:00:42, 5.48s/it] {'loss': 0.0027, 'grad_norm': 0.11546211689710617, 'learning_rate': 2.678694611312937e-05, 'epoch': 4.08} 41%|████ | 4081/10000 [6:26:59<9:00:42, 5.48s/it][2025-06-19 19:56:44,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 19:56:44,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.06 | bwd_microstep: 3321.10 | bwd_inner_microstep: 3320.06 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.44 [2025-06-19 19:56:44,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.06 | bwd: 3321.12 | bwd_inner: 3320.06 | bwd_allreduce: 1.01 | step: 7.45 41%|████ | 4082/10000 [6:27:04<9:00:01, 5.48s/it] {'loss': 0.0041, 'grad_norm': 0.41548100113868713, 'learning_rate': 2.6780852614657588e-05, 'epoch': 4.08} 41%|████ | 4082/10000 [6:27:04<9:00:01, 5.48s/it][2025-06-19 19:56:49,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:56:49,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.65 | bwd_microstep: 3368.83 | bwd_inner_microstep: 3368.03 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 
[2025-06-19 19:56:49,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.65 | bwd: 3368.85 | bwd_inner: 3368.03 | bwd_allreduce: 0.78 | step: 7.19 41%|████ | 4083/10000 [6:27:10<9:01:39, 5.49s/it] {'loss': 0.0336, 'grad_norm': 2.133474588394165, 'learning_rate': 2.6774758404905833e-05, 'epoch': 4.08} 41%|████ | 4083/10000 [6:27:10<9:01:39, 5.49s/it][2025-06-19 19:56:54,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:56:55,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.67 | bwd_microstep: 3310.90 | bwd_inner_microstep: 3310.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-19 19:56:55,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.67 | bwd: 3310.91 | bwd_inner: 3310.11 | bwd_allreduce: 0.76 | step: 6.76 41%|████ | 4084/10000 [6:27:15<9:00:15, 5.48s/it] {'loss': 0.0141, 'grad_norm': 0.7292873859405518, 'learning_rate': 2.676866348451336e-05, 'epoch': 4.08} 41%|████ | 4084/10000 [6:27:15<9:00:15, 5.48s/it][2025-06-19 19:57:00,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:57:00,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.58 | bwd_microstep: 3315.08 | bwd_inner_microstep: 3314.11 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.59 [2025-06-19 19:57:00,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.58 | bwd: 3315.09 | bwd_inner: 3314.11 | bwd_allreduce: 0.93 | step: 7.60 41%|████ | 4085/10000 [6:27:21<8:59:29, 5.47s/it] {'loss': 0.0313, 'grad_norm': 1.5960780382156372, 'learning_rate': 2.6762567854119503e-05, 'epoch': 4.08} 41%|████ | 4085/10000 [6:27:21<8:59:29, 5.47s/it][2025-06-19 19:57:05,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 
2.72 [2025-06-19 19:57:05,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.16 | bwd_microstep: 3305.33 | bwd_inner_microstep: 3304.55 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.61 [2025-06-19 19:57:05,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.16 | bwd: 3305.34 | bwd_inner: 3304.55 | bwd_allreduce: 0.75 | step: 6.61 41%|████ | 4086/10000 [6:27:26<8:58:38, 5.46s/it] {'loss': 0.0047, 'grad_norm': 0.24188832938671112, 'learning_rate': 2.675647151436367e-05, 'epoch': 4.09} 41%|████ | 4086/10000 [6:27:26<8:58:38, 5.46s/it][2025-06-19 19:57:11,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:57:11,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.49 | bwd_microstep: 3310.22 | bwd_inner_microstep: 3309.41 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.24 [2025-06-19 19:57:11,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.49 | bwd: 3310.24 | bwd_inner: 3309.41 | bwd_allreduce: 0.78 | step: 7.25 41%|████ | 4087/10000 [6:27:32<8:58:17, 5.46s/it] {'loss': 0.0249, 'grad_norm': 1.0027061700820923, 'learning_rate': 2.6750374465885332e-05, 'epoch': 4.09} 41%|████ | 4087/10000 [6:27:32<8:58:17, 5.46s/it][2025-06-19 19:57:16,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:57:16,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.81 | bwd_microstep: 3392.75 | bwd_inner_microstep: 3391.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 19:57:16,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.81 | bwd: 3392.77 | bwd_inner: 3391.96 | bwd_allreduce: 0.76 | step: 6.77 41%|████ | 4088/10000 [6:27:37<9:01:15, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.10246427357196808, 'learning_rate': 
2.6744276709324037e-05, 'epoch': 4.09} 41%|████ | 4088/10000 [6:27:37<9:01:15, 5.49s/it][2025-06-19 19:57:22,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:57:22,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.99 | bwd_microstep: 3371.51 | bwd_inner_microstep: 3370.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-19 19:57:22,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.99 | bwd: 3371.52 | bwd_inner: 3370.70 | bwd_allreduce: 0.78 | step: 7.25 41%|████ | 4089/10000 [6:27:43<9:02:36, 5.51s/it] {'loss': 0.0396, 'grad_norm': 1.5874032974243164, 'learning_rate': 2.673817824531942e-05, 'epoch': 4.09} 41%|████ | 4089/10000 [6:27:43<9:02:36, 5.51s/it][2025-06-19 19:57:27,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:57:27,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.76 | bwd_microstep: 3322.02 | bwd_inner_microstep: 3321.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 19:57:27,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.76 | bwd: 3322.03 | bwd_inner: 3321.21 | bwd_allreduce: 0.77 | step: 6.99 41%|████ | 4090/10000 [6:27:48<9:01:12, 5.49s/it] {'loss': 0.0025, 'grad_norm': 0.13144619762897491, 'learning_rate': 2.6732079074511178e-05, 'epoch': 4.09} 41%|████ | 4090/10000 [6:27:48<9:01:12, 5.49s/it][2025-06-19 19:57:33,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 19:57:33,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.05 | bwd_microstep: 3324.75 | bwd_inner_microstep: 3323.71 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.70 [2025-06-19 19:57:33,399] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2106.05 | bwd: 3324.77 | bwd_inner: 3323.71 | bwd_allreduce: 1.01 | step: 7.69 41%|████ | 4091/10000 [6:27:54<9:00:24, 5.49s/it] {'loss': 0.0027, 'grad_norm': 0.16825856268405914, 'learning_rate': 2.6725979197539085e-05, 'epoch': 4.09} 41%|████ | 4091/10000 [6:27:54<9:00:24, 5.49s/it][2025-06-19 19:57:38,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 19:57:38,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.86 | bwd_microstep: 3314.50 | bwd_inner_microstep: 3313.55 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.12 [2025-06-19 19:57:38,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.86 | bwd: 3314.51 | bwd_inner: 3313.55 | bwd_allreduce: 0.92 | step: 7.12 41%|████ | 4092/10000 [6:27:59<8:59:36, 5.48s/it] {'loss': 0.0078, 'grad_norm': 0.7852106690406799, 'learning_rate': 2.6719878615042988e-05, 'epoch': 4.09} 41%|████ | 4092/10000 [6:27:59<8:59:36, 5.48s/it][2025-06-19 19:57:44,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:57:44,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.53 | bwd_microstep: 3366.51 | bwd_inner_microstep: 3365.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 19:57:44,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.53 | bwd: 3366.53 | bwd_inner: 3365.73 | bwd_allreduce: 0.76 | step: 6.67 41%|████ | 4093/10000 [6:28:05<9:01:02, 5.50s/it] {'loss': 0.0353, 'grad_norm': 2.8336334228515625, 'learning_rate': 2.6713777327662812e-05, 'epoch': 4.09} 41%|████ | 4093/10000 [6:28:05<9:01:02, 5.50s/it][2025-06-19 19:57:49,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:57:49,857] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.59 | bwd_microstep: 3315.41 | bwd_inner_microstep: 3314.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 19:57:49,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.59 | bwd: 3315.42 | bwd_inner: 3314.60 | bwd_allreduce: 0.78 | step: 7.13 41%|████ | 4094/10000 [6:28:10<8:59:56, 5.49s/it] {'loss': 0.0548, 'grad_norm': 3.5826034545898438, 'learning_rate': 2.670767533603856e-05, 'epoch': 4.09} 41%|████ | 4094/10000 [6:28:10<8:59:56, 5.49s/it][2025-06-19 19:57:55,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:57:55,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.50 | bwd_microstep: 3356.85 | bwd_inner_microstep: 3355.98 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.93 [2025-06-19 19:57:55,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.50 | bwd: 3356.87 | bwd_inner: 3355.98 | bwd_allreduce: 0.84 | step: 6.93 41%|████ | 4095/10000 [6:28:16<9:00:53, 5.50s/it] {'loss': 0.0667, 'grad_norm': 3.285877227783203, 'learning_rate': 2.670157264081029e-05, 'epoch': 4.09} 41%|████ | 4095/10000 [6:28:16<9:00:53, 5.50s/it][2025-06-19 19:58:00,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:58:00,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.52 | bwd_microstep: 3315.92 | bwd_inner_microstep: 3315.12 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-19 19:58:00,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.52 | bwd: 3315.94 | bwd_inner: 3315.12 | bwd_allreduce: 0.77 | step: 6.85 41%|████ | 4096/10000 [6:28:21<8:59:52, 5.49s/it] {'loss': 0.0824, 'grad_norm': 2.42718243598938, 'learning_rate': 2.669546924261815e-05, 'epoch': 4.1} 41%|████ | 4096/10000 
[6:28:21<8:59:52, 5.49s/it][2025-06-19 19:58:06,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:58:06,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.50 | bwd_microstep: 3372.15 | bwd_inner_microstep: 3371.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 19:58:06,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.50 | bwd: 3372.17 | bwd_inner: 3371.36 | bwd_allreduce: 0.77 | step: 6.93 41%|████ | 4097/10000 [6:28:27<9:01:18, 5.50s/it] {'loss': 0.0358, 'grad_norm': 3.3845317363739014, 'learning_rate': 2.6689365142102374e-05, 'epoch': 4.1} 41%|████ | 4097/10000 [6:28:27<9:01:18, 5.50s/it][2025-06-19 19:58:11,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:58:11,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.67 | bwd_microstep: 3319.82 | bwd_inner_microstep: 3319.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 19:58:11,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.67 | bwd: 3319.83 | bwd_inner: 3319.03 | bwd_allreduce: 0.76 | step: 6.69 41%|████ | 4098/10000 [6:28:32<8:59:58, 5.49s/it] {'loss': 0.0528, 'grad_norm': 2.6312899589538574, 'learning_rate': 2.6683260339903232e-05, 'epoch': 4.1} 41%|████ | 4098/10000 [6:28:32<8:59:58, 5.49s/it][2025-06-19 19:58:17,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:58:17,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.12 | bwd_microstep: 3311.79 | bwd_inner_microstep: 3310.98 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 19:58:17,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.12 | bwd: 3311.81 | bwd_inner: 
3310.98 | bwd_allreduce: 0.78 | step: 6.94 41%|████ | 4099/10000 [6:28:38<8:58:57, 5.48s/it] {'loss': 0.002, 'grad_norm': 0.09090269356966019, 'learning_rate': 2.6677154836661104e-05, 'epoch': 4.1} 41%|████ | 4099/10000 [6:28:38<8:58:57, 5.48s/it][2025-06-19 19:58:22,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:58:22,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.14 | bwd_microstep: 3372.20 | bwd_inner_microstep: 3371.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 19:58:22,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.14 | bwd: 3372.22 | bwd_inner: 3371.40 | bwd_allreduce: 0.78 | step: 7.09 41%|████ | 4100/10000 [6:28:43<9:00:50, 5.50s/it] {'loss': 0.0297, 'grad_norm': 2.1338493824005127, 'learning_rate': 2.6671048633016416e-05, 'epoch': 4.1} 41%|████ | 4100/10000 [6:28:43<9:00:50, 5.50s/it][2025-06-19 19:58:28,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:58:28,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.44 | bwd_microstep: 3319.86 | bwd_inner_microstep: 3319.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 19:58:28,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.44 | bwd: 3319.87 | bwd_inner: 3319.06 | bwd_allreduce: 0.77 | step: 6.80 41%|████ | 4101/10000 [6:28:49<8:59:39, 5.49s/it] {'loss': 0.0124, 'grad_norm': 0.7543787956237793, 'learning_rate': 2.666494172960969e-05, 'epoch': 4.1} 41%|████ | 4101/10000 [6:28:49<8:59:39, 5.49s/it][2025-06-19 19:58:33,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:58:33,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.60 | 
bwd_microstep: 3325.56 | bwd_inner_microstep: 3324.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 19:58:33,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.60 | bwd: 3325.57 | bwd_inner: 3324.76 | bwd_allreduce: 0.76 | step: 6.68 41%|████ | 4102/10000 [6:28:54<8:59:02, 5.48s/it] {'loss': 0.0029, 'grad_norm': 0.13417622447013855, 'learning_rate': 2.665883412708151e-05, 'epoch': 4.1} 41%|████ | 4102/10000 [6:28:54<8:59:02, 5.48s/it][2025-06-19 19:58:39,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:58:39,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.94 | bwd_microstep: 3370.40 | bwd_inner_microstep: 3369.59 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-19 19:58:39,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.94 | bwd: 3370.42 | bwd_inner: 3369.59 | bwd_allreduce: 0.78 | step: 7.26 41%|████ | 4103/10000 [6:29:00<9:00:38, 5.50s/it] {'loss': 0.0062, 'grad_norm': 0.9048090577125549, 'learning_rate': 2.6652725826072532e-05, 'epoch': 4.1} 41%|████ | 4103/10000 [6:29:00<9:00:38, 5.50s/it][2025-06-19 19:58:44,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:58:44,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.73 | bwd_microstep: 3324.38 | bwd_inner_microstep: 3323.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 19:58:44,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.73 | bwd: 3324.39 | bwd_inner: 3323.58 | bwd_allreduce: 0.77 | step: 6.75 41%|████ | 4104/10000 [6:29:05<8:59:30, 5.49s/it] {'loss': 0.0559, 'grad_norm': 1.7597771883010864, 'learning_rate': 2.6646616827223497e-05, 'epoch': 4.1} 41%|████ | 4104/10000 [6:29:05<8:59:30, 5.49s/it][2025-06-19 19:58:50,264] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 19:58:50,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.70 | bwd_microstep: 3323.87 | bwd_inner_microstep: 3323.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 19:58:50,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.70 | bwd: 3323.88 | bwd_inner: 3323.08 | bwd_allreduce: 0.77 | step: 6.71 41%|████ | 4105/10000 [6:29:11<8:59:02, 5.49s/it] {'loss': 0.0063, 'grad_norm': 0.6073613166809082, 'learning_rate': 2.6640507131175204e-05, 'epoch': 4.11} 41%|████ | 4105/10000 [6:29:11<8:59:02, 5.49s/it][2025-06-19 19:58:55,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:58:55,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.79 | bwd_microstep: 3321.27 | bwd_inner_microstep: 3320.46 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.70 [2025-06-19 19:58:55,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.79 | bwd: 3321.28 | bwd_inner: 3320.46 | bwd_allreduce: 0.78 | step: 7.71 41%|████ | 4106/10000 [6:29:16<8:58:33, 5.48s/it] {'loss': 0.0104, 'grad_norm': 0.7547749280929565, 'learning_rate': 2.6634396738568528e-05, 'epoch': 4.11} 41%|████ | 4106/10000 [6:29:16<8:58:33, 5.48s/it][2025-06-19 19:59:01,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 19:59:01,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.20 | bwd_microstep: 3324.97 | bwd_inner_microstep: 3324.01 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.05 [2025-06-19 19:59:01,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.20 | bwd: 3324.99 | bwd_inner: 3324.01 | bwd_allreduce: 0.93 | step: 7.05 41%|████ | 
4107/10000 [6:29:22<8:58:10, 5.48s/it] {'loss': 0.0547, 'grad_norm': 1.7953802347183228, 'learning_rate': 2.6628285650044427e-05, 'epoch': 4.11} 41%|████ | 4107/10000 [6:29:22<8:58:10, 5.48s/it][2025-06-19 19:59:06,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 19:59:06,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.24 | bwd_microstep: 3331.73 | bwd_inner_microstep: 3330.77 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.30 [2025-06-19 19:59:06,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.24 | bwd: 3331.75 | bwd_inner: 3330.77 | bwd_allreduce: 0.93 | step: 7.30 41%|████ | 4108/10000 [6:29:27<8:58:08, 5.48s/it] {'loss': 0.0413, 'grad_norm': 2.5268714427948, 'learning_rate': 2.6622173866243922e-05, 'epoch': 4.11} 41%|████ | 4108/10000 [6:29:27<8:58:08, 5.48s/it][2025-06-19 19:59:12,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:59:12,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.83 | bwd_microstep: 3325.06 | bwd_inner_microstep: 3324.25 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 19:59:12,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.83 | bwd: 3325.08 | bwd_inner: 3324.25 | bwd_allreduce: 0.78 | step: 7.00 41%|████ | 4109/10000 [6:29:32<8:57:51, 5.48s/it] {'loss': 0.0023, 'grad_norm': 0.14105729758739471, 'learning_rate': 2.661606138780812e-05, 'epoch': 4.11} 41%|████ | 4109/10000 [6:29:32<8:57:51, 5.48s/it][2025-06-19 19:59:17,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:59:17,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.99 | bwd_microstep: 3336.98 | bwd_inner_microstep: 3336.18 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-19 19:59:17,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.99 | bwd: 3336.99 | bwd_inner: 3336.18 | bwd_allreduce: 0.77 | step: 6.97 41%|████ | 4110/10000 [6:29:38<8:58:07, 5.48s/it] {'loss': 0.0227, 'grad_norm': 1.9721182584762573, 'learning_rate': 2.6609948215378177e-05, 'epoch': 4.11} 41%|████ | 4110/10000 [6:29:38<8:58:07, 5.48s/it][2025-06-19 19:59:23,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:59:23,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.77 | bwd_microstep: 3370.74 | bwd_inner_microstep: 3369.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.65 [2025-06-19 19:59:23,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.77 | bwd: 3370.76 | bwd_inner: 3369.94 | bwd_allreduce: 0.77 | step: 6.65 41%|████ | 4111/10000 [6:29:43<8:59:41, 5.50s/it] {'loss': 0.0056, 'grad_norm': 0.23603104054927826, 'learning_rate': 2.6603834349595347e-05, 'epoch': 4.11} 41%|████ | 4111/10000 [6:29:43<8:59:41, 5.50s/it][2025-06-19 19:59:28,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:59:28,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.83 | bwd_microstep: 3375.82 | bwd_inner_microstep: 3375.02 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 19:59:28,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.83 | bwd: 3375.84 | bwd_inner: 3375.02 | bwd_allreduce: 0.78 | step: 7.12 41%|████ | 4112/10000 [6:29:49<9:01:02, 5.51s/it] {'loss': 0.0007, 'grad_norm': 0.03443325310945511, 'learning_rate': 2.6597719791100946e-05, 'epoch': 4.11} 41%|████ | 4112/10000 [6:29:49<9:01:02, 5.51s/it][2025-06-19 19:59:34,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 19:59:34,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.84 | bwd_microstep: 3321.26 | bwd_inner_microstep: 3320.46 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-19 19:59:34,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.84 | bwd: 3321.28 | bwd_inner: 3320.46 | bwd_allreduce: 0.77 | step: 6.90 41%|████ | 4113/10000 [6:29:55<8:59:41, 5.50s/it] {'loss': 0.0133, 'grad_norm': 0.6942189931869507, 'learning_rate': 2.6591604540536354e-05, 'epoch': 4.11} 41%|████ | 4113/10000 [6:29:55<8:59:41, 5.50s/it][2025-06-19 19:59:39,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 19:59:39,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.48 | bwd_microstep: 3370.78 | bwd_inner_microstep: 3369.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 19:59:39,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.49 | bwd: 3370.80 | bwd_inner: 3369.98 | bwd_allreduce: 0.77 | step: 6.75 41%|████ | 4114/10000 [6:30:00<9:00:57, 5.51s/it] {'loss': 0.0026, 'grad_norm': 0.11890053004026413, 'learning_rate': 2.658548859854304e-05, 'epoch': 4.11} 41%|████ | 4114/10000 [6:30:00<9:00:57, 5.51s/it][2025-06-19 19:59:45,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 19:59:45,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.96 | bwd_microstep: 3329.53 | bwd_inner_microstep: 3328.71 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.30 [2025-06-19 19:59:45,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.96 | bwd: 3329.55 | bwd_inner: 3328.71 | bwd_allreduce: 0.79 | step: 7.30 41%|████ | 4115/10000 [6:30:06<9:00:15, 5.51s/it] {'loss': 
0.0008, 'grad_norm': 0.0427454374730587, 'learning_rate': 2.657937196576254e-05, 'epoch': 4.12} 41%|████ | 4115/10000 [6:30:06<9:00:15, 5.51s/it][2025-06-19 19:59:50,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 19:59:50,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.52 | bwd_microstep: 3326.65 | bwd_inner_microstep: 3325.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 19:59:50,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.52 | bwd: 3326.66 | bwd_inner: 3325.86 | bwd_allreduce: 0.75 | step: 6.69 41%|████ | 4116/10000 [6:30:11<8:59:24, 5.50s/it] {'loss': 0.032, 'grad_norm': 1.4156849384307861, 'learning_rate': 2.6573254642836457e-05, 'epoch': 4.12} 41%|████ | 4116/10000 [6:30:11<8:59:24, 5.50s/it][2025-06-19 19:59:56,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 19:59:56,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.77 | bwd_microstep: 3382.15 | bwd_inner_microstep: 3381.30 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.86 [2025-06-19 19:59:56,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.77 | bwd: 3382.16 | bwd_inner: 3381.30 | bwd_allreduce: 0.82 | step: 6.87 41%|████ | 4117/10000 [6:30:17<9:00:55, 5.52s/it] {'loss': 0.0379, 'grad_norm': 3.393462896347046, 'learning_rate': 2.6567136630406463e-05, 'epoch': 4.12} 41%|████ | 4117/10000 [6:30:17<9:00:55, 5.52s/it][2025-06-19 20:00:01,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 20:00:01,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.89 | bwd_microstep: 3372.54 | bwd_inner_microstep: 3371.41 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.54 
[2025-06-19 20:00:01,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.89 | bwd: 3372.56 | bwd_inner: 3371.41 | bwd_allreduce: 1.09 | step: 7.54 41%|████ | 4118/10000 [6:30:22<9:01:53, 5.53s/it] {'loss': 0.0655, 'grad_norm': 3.4110705852508545, 'learning_rate': 2.6561017929114324e-05, 'epoch': 4.12} 41%|████ | 4118/10000 [6:30:22<9:01:53, 5.53s/it][2025-06-19 20:00:07,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:00:07,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.79 | bwd_microstep: 3331.99 | bwd_inner_microstep: 3330.98 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.59 [2025-06-19 20:00:07,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.79 | bwd: 3332.01 | bwd_inner: 3330.98 | bwd_allreduce: 0.97 | step: 7.60 41%|████ | 4119/10000 [6:30:28<9:00:56, 5.52s/it] {'loss': 0.0665, 'grad_norm': 7.606621265411377, 'learning_rate': 2.6554898539601848e-05, 'epoch': 4.12} 41%|████ | 4119/10000 [6:30:28<9:00:56, 5.52s/it][2025-06-19 20:00:12,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:00:12,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.71 | bwd_microstep: 3327.73 | bwd_inner_microstep: 3326.93 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 20:00:12,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.71 | bwd: 3327.75 | bwd_inner: 3326.93 | bwd_allreduce: 0.78 | step: 6.81 41%|████ | 4120/10000 [6:30:33<8:59:59, 5.51s/it] {'loss': 0.0038, 'grad_norm': 0.30946657061576843, 'learning_rate': 2.6548778462510932e-05, 'epoch': 4.12} 41%|████ | 4120/10000 [6:30:33<8:59:59, 5.51s/it][2025-06-19 20:00:18,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.73 [2025-06-19 20:00:18,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.43 | bwd_microstep: 3330.68 | bwd_inner_microstep: 3329.76 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.06 [2025-06-19 20:00:18,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.43 | bwd: 3330.69 | bwd_inner: 3329.76 | bwd_allreduce: 0.89 | step: 7.06 41%|████ | 4121/10000 [6:30:39<8:59:14, 5.50s/it] {'loss': 0.0018, 'grad_norm': 0.11107700318098068, 'learning_rate': 2.654265769848356e-05, 'epoch': 4.12} 41%|████ | 4121/10000 [6:30:39<8:59:14, 5.50s/it][2025-06-19 20:00:23,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:00:23,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.07 | bwd_microstep: 3386.00 | bwd_inner_microstep: 3385.12 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.80 [2025-06-19 20:00:23,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.07 | bwd: 3386.01 | bwd_inner: 3385.12 | bwd_allreduce: 0.84 | step: 6.80 41%|████ | 4122/10000 [6:30:44<9:00:52, 5.52s/it] {'loss': 0.0018, 'grad_norm': 0.09103375673294067, 'learning_rate': 2.6536536248161762e-05, 'epoch': 4.12} 41%|████ | 4122/10000 [6:30:44<9:00:52, 5.52s/it][2025-06-19 20:00:29,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:00:29,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.22 | bwd_microstep: 3377.04 | bwd_inner_microstep: 3376.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 20:00:29,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.22 | bwd: 3377.05 | bwd_inner: 3376.24 | bwd_allreduce: 0.77 | step: 6.75 41%|████ | 4123/10000 [6:30:50<9:01:32, 5.53s/it] {'loss': 0.0059, 'grad_norm': 0.4235597252845764, 'learning_rate': 
2.653041411218764e-05, 'epoch': 4.12} 41%|████ | 4123/10000 [6:30:50<9:01:32, 5.53s/it][2025-06-19 20:00:34,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:00:34,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.44 | bwd_microstep: 3337.53 | bwd_inner_microstep: 3336.56 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.60 [2025-06-19 20:00:34,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.44 | bwd: 3337.55 | bwd_inner: 3336.56 | bwd_allreduce: 0.94 | step: 7.60 41%|████ | 4124/10000 [6:30:55<9:00:10, 5.52s/it] {'loss': 0.0892, 'grad_norm': 4.240018367767334, 'learning_rate': 2.6524291291203386e-05, 'epoch': 4.12} 41%|████ | 4124/10000 [6:30:55<9:00:10, 5.52s/it][2025-06-19 20:00:40,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:00:40,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.25 | bwd_microstep: 3379.17 | bwd_inner_microstep: 3378.27 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.95 [2025-06-19 20:00:40,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.25 | bwd: 3379.19 | bwd_inner: 3378.27 | bwd_allreduce: 0.87 | step: 6.95 41%|████▏ | 4125/10000 [6:31:01<9:01:10, 5.53s/it] {'loss': 0.0062, 'grad_norm': 0.2721370458602905, 'learning_rate': 2.6518167785851266e-05, 'epoch': 4.12} 41%|████▏ | 4125/10000 [6:31:01<9:01:10, 5.53s/it][2025-06-19 20:00:45,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.72 [2025-06-19 20:00:45,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.17 | bwd_microstep: 3321.66 | bwd_inner_microstep: 3320.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-19 20:00:45,933] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2109.17 | bwd: 3321.68 | bwd_inner: 3320.88 | bwd_allreduce: 0.76 | step: 6.87 41%|████▏ | 4126/10000 [6:31:06<8:59:21, 5.51s/it] {'loss': 0.0364, 'grad_norm': 3.7816524505615234, 'learning_rate': 2.651204359677359e-05, 'epoch': 4.13} 41%|████▏ | 4126/10000 [6:31:06<8:59:21, 5.51s/it][2025-06-19 20:00:51,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:00:51,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.74 | bwd_microstep: 3331.12 | bwd_inner_microstep: 3330.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-19 20:00:51,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.74 | bwd: 3331.14 | bwd_inner: 3330.33 | bwd_allreduce: 0.76 | step: 6.76 41%|████▏ | 4127/10000 [6:31:12<8:58:20, 5.50s/it] {'loss': 0.0042, 'grad_norm': 0.20124316215515137, 'learning_rate': 2.6505918724612762e-05, 'epoch': 4.13} 41%|████▏ | 4127/10000 [6:31:12<8:58:20, 5.50s/it][2025-06-19 20:00:56,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:00:56,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.21 | bwd_microstep: 3323.90 | bwd_inner_microstep: 3323.08 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.41 [2025-06-19 20:00:56,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.21 | bwd: 3323.92 | bwd_inner: 3323.08 | bwd_allreduce: 0.79 | step: 7.41 41%|████▏ | 4128/10000 [6:31:17<8:57:45, 5.49s/it] {'loss': 0.0135, 'grad_norm': 1.4168031215667725, 'learning_rate': 2.6499793170011257e-05, 'epoch': 4.13} 41%|████▏ | 4128/10000 [6:31:17<8:57:45, 5.49s/it][2025-06-19 20:01:02,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:01:02,362] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.47 | bwd_microstep: 3317.84 | bwd_inner_microstep: 3317.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 20:01:02,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.47 | bwd: 3317.86 | bwd_inner: 3317.04 | bwd_allreduce: 0.77 | step: 6.73 41%|████▏ | 4129/10000 [6:31:23<8:56:54, 5.49s/it] {'loss': 0.0206, 'grad_norm': 2.419718027114868, 'learning_rate': 2.6493666933611614e-05, 'epoch': 4.13} 41%|████▏ | 4129/10000 [6:31:23<8:56:54, 5.49s/it][2025-06-19 20:01:07,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:01:07,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.70 | bwd_microstep: 3369.64 | bwd_inner_microstep: 3368.84 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-19 20:01:07,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.70 | bwd: 3369.66 | bwd_inner: 3368.84 | bwd_allreduce: 0.77 | step: 6.79 41%|████▏ | 4130/10000 [6:31:28<8:58:31, 5.50s/it] {'loss': 0.0024, 'grad_norm': 0.1221667155623436, 'learning_rate': 2.648754001605645e-05, 'epoch': 4.13} 41%|████▏ | 4130/10000 [6:31:28<8:58:31, 5.50s/it][2025-06-19 20:01:13,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:01:13,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.75 | bwd_microstep: 3374.57 | bwd_inner_microstep: 3373.74 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.88 [2025-06-19 20:01:13,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.75 | bwd: 3374.59 | bwd_inner: 3373.74 | bwd_allreduce: 0.80 | step: 6.89 41%|████▏ | 4131/10000 [6:31:34<8:59:35, 5.52s/it] {'loss': 0.0265, 'grad_norm': 3.244213104248047, 'learning_rate': 2.6481412417988452e-05, 'epoch': 4.13} 41%|████▏ | 
4131/10000 [6:31:34<8:59:35, 5.52s/it][2025-06-19 20:01:19,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:01:19,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.97 | bwd_microstep: 3375.53 | bwd_inner_microstep: 3374.72 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.05 [2025-06-19 20:01:19,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.97 | bwd: 3375.55 | bwd_inner: 3374.72 | bwd_allreduce: 0.79 | step: 7.05 41%|████▏ | 4132/10000 [6:31:39<9:00:32, 5.53s/it] {'loss': 0.0017, 'grad_norm': 0.11852256953716278, 'learning_rate': 2.6475284140050366e-05, 'epoch': 4.13} 41%|████▏ | 4132/10000 [6:31:39<9:00:32, 5.53s/it][2025-06-19 20:01:24,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:01:24,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.09 | bwd_microstep: 3367.62 | bwd_inner_microstep: 3366.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-19 20:01:24,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.09 | bwd: 3367.63 | bwd_inner: 3366.82 | bwd_allreduce: 0.77 | step: 6.97 41%|████▏ | 4133/10000 [6:31:45<9:00:56, 5.53s/it] {'loss': 0.0104, 'grad_norm': 0.8007556796073914, 'learning_rate': 2.646915518288503e-05, 'epoch': 4.13} 41%|████▏ | 4133/10000 [6:31:45<9:00:56, 5.53s/it][2025-06-19 20:01:30,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:01:30,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.44 | bwd_microstep: 3332.12 | bwd_inner_microstep: 3331.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 20:01:30,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.44 | bwd: 
3332.14 | bwd_inner: 3331.33 | bwd_allreduce: 0.76 | step: 6.73 41%|████▏ | 4134/10000 [6:31:50<8:59:16, 5.52s/it] {'loss': 0.0195, 'grad_norm': 1.6687958240509033, 'learning_rate': 2.6463025547135334e-05, 'epoch': 4.13} 41%|████▏ | 4134/10000 [6:31:50<8:59:16, 5.52s/it][2025-06-19 20:01:35,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:01:35,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.98 | bwd_microstep: 3317.05 | bwd_inner_microstep: 3316.16 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.91 [2025-06-19 20:01:35,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.98 | bwd: 3317.06 | bwd_inner: 3316.16 | bwd_allreduce: 0.86 | step: 6.92 41%|████▏ | 4135/10000 [6:31:56<8:57:47, 5.50s/it] {'loss': 0.0025, 'grad_norm': 0.27966806292533875, 'learning_rate': 2.6456895233444256e-05, 'epoch': 4.13} 41%|████▏ | 4135/10000 [6:31:56<8:57:47, 5.50s/it][2025-06-19 20:01:40,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:01:40,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.38 | bwd_microstep: 3326.07 | bwd_inner_microstep: 3325.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 20:01:40,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.38 | bwd: 3326.09 | bwd_inner: 3325.27 | bwd_allreduce: 0.78 | step: 7.10 41%|████▏ | 4136/10000 [6:32:01<8:56:55, 5.49s/it] {'loss': 0.0251, 'grad_norm': 1.5465164184570312, 'learning_rate': 2.645076424245484e-05, 'epoch': 4.14} 41%|████▏ | 4136/10000 [6:32:01<8:56:55, 5.49s/it][2025-06-19 20:01:46,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:01:46,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2104.92 | bwd_microstep: 3321.58 | bwd_inner_microstep: 3320.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-19 20:01:46,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.92 | bwd: 3321.59 | bwd_inner: 3320.77 | bwd_allreduce: 0.77 | step: 6.92 41%|████▏ | 4137/10000 [6:32:07<8:56:00, 5.49s/it] {'loss': 0.0015, 'grad_norm': 0.14742204546928406, 'learning_rate': 2.644463257481019e-05, 'epoch': 4.14} 41%|████▏ | 4137/10000 [6:32:07<8:56:00, 5.49s/it][2025-06-19 20:01:51,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:01:51,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.54 | bwd_microstep: 3375.94 | bwd_inner_microstep: 3375.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 20:01:51,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.54 | bwd: 3375.96 | bwd_inner: 3375.15 | bwd_allreduce: 0.76 | step: 6.71 41%|████▏ | 4138/10000 [6:32:12<8:57:23, 5.50s/it] {'loss': 0.0148, 'grad_norm': 1.1441456079483032, 'learning_rate': 2.643850023115349e-05, 'epoch': 4.14} 41%|████▏ | 4138/10000 [6:32:12<8:57:23, 5.50s/it][2025-06-19 20:01:57,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 20:01:57,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.68 | bwd_microstep: 3320.32 | bwd_inner_microstep: 3319.23 | bwd_allreduce_microstep: 1.02 | step_microstep: 8.34 [2025-06-19 20:01:57,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.68 | bwd: 3320.34 | bwd_inner: 3319.23 | bwd_allreduce: 1.05 | step: 8.35 41%|████▏ | 4139/10000 [6:32:18<8:56:28, 5.49s/it] {'loss': 0.0061, 'grad_norm': 0.5948710441589355, 'learning_rate': 2.6432367212128e-05, 'epoch': 4.14} 41%|████▏ | 4139/10000 [6:32:18<8:56:28, 5.49s/it][2025-06-19 
20:02:02,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:02:02,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.95 | bwd_microstep: 3317.70 | bwd_inner_microstep: 3316.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 20:02:02,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.95 | bwd: 3317.72 | bwd_inner: 3316.92 | bwd_allreduce: 0.75 | step: 6.57 41%|████▏ | 4140/10000 [6:32:23<8:55:36, 5.48s/it] {'loss': 0.0094, 'grad_norm': 1.1384539604187012, 'learning_rate': 2.642623351837705e-05, 'epoch': 4.14} 41%|████▏ | 4140/10000 [6:32:23<8:55:36, 5.48s/it][2025-06-19 20:02:08,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:02:08,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.62 | bwd_microstep: 3375.12 | bwd_inner_microstep: 3374.32 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.11 [2025-06-19 20:02:08,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.62 | bwd: 3375.14 | bwd_inner: 3374.32 | bwd_allreduce: 0.77 | step: 7.11 41%|████▏ | 4141/10000 [6:32:29<8:57:09, 5.50s/it] {'loss': 0.1294, 'grad_norm': 2.8834755420684814, 'learning_rate': 2.642009915054402e-05, 'epoch': 4.14} 41%|████▏ | 4141/10000 [6:32:29<8:57:09, 5.50s/it][2025-06-19 20:02:13,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:02:13,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.42 | bwd_microstep: 3325.93 | bwd_inner_microstep: 3325.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 20:02:13,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.42 | bwd: 3325.94 | bwd_inner: 3325.15 | bwd_allreduce: 0.76 | step: 
6.63 41%|████▏ | 4142/10000 [6:32:34<8:56:20, 5.49s/it] {'loss': 0.0085, 'grad_norm': 0.9333972334861755, 'learning_rate': 2.641396410927239e-05, 'epoch': 4.14} 41%|████▏ | 4142/10000 [6:32:34<8:56:20, 5.49s/it][2025-06-19 20:02:19,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:02:19,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.42 | bwd_microstep: 3316.06 | bwd_inner_microstep: 3315.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 20:02:19,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.42 | bwd: 3316.08 | bwd_inner: 3315.27 | bwd_allreduce: 0.76 | step: 6.65 41%|████▏ | 4143/10000 [6:32:40<8:55:10, 5.48s/it] {'loss': 0.0363, 'grad_norm': 1.304863452911377, 'learning_rate': 2.6407828395205693e-05, 'epoch': 4.14} 41%|████▏ | 4143/10000 [6:32:40<8:55:10, 5.48s/it][2025-06-19 20:02:24,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:02:24,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.28 | bwd_microstep: 3369.58 | bwd_inner_microstep: 3368.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-19 20:02:24,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.28 | bwd: 3369.60 | bwd_inner: 3368.77 | bwd_allreduce: 0.78 | step: 7.15 41%|████▏ | 4144/10000 [6:32:45<8:56:48, 5.50s/it] {'loss': 0.0278, 'grad_norm': 1.6525607109069824, 'learning_rate': 2.640169200898754e-05, 'epoch': 4.14} 41%|████▏ | 4144/10000 [6:32:45<8:56:48, 5.50s/it][2025-06-19 20:02:30,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:02:30,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.12 | bwd_microstep: 3366.80 | 
bwd_inner_microstep: 3365.94 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.95 [2025-06-19 20:02:30,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.12 | bwd: 3366.81 | bwd_inner: 3365.94 | bwd_allreduce: 0.82 | step: 6.95 41%|████▏ | 4145/10000 [6:32:51<8:57:48, 5.51s/it] {'loss': 0.0072, 'grad_norm': 0.5560489892959595, 'learning_rate': 2.63955549512616e-05, 'epoch': 4.14} 41%|████▏ | 4145/10000 [6:32:51<8:57:48, 5.51s/it][2025-06-19 20:02:35,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:02:35,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.93 | bwd_microstep: 3317.29 | bwd_inner_microstep: 3316.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 20:02:35,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.93 | bwd: 3317.31 | bwd_inner: 3316.50 | bwd_allreduce: 0.76 | step: 6.61 41%|████▏ | 4146/10000 [6:32:56<8:56:15, 5.50s/it] {'loss': 0.0508, 'grad_norm': 2.1202619075775146, 'learning_rate': 2.638941722267163e-05, 'epoch': 4.15} 41%|████▏ | 4146/10000 [6:32:56<8:56:15, 5.50s/it][2025-06-19 20:02:41,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:02:41,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.03 | bwd_microstep: 3399.66 | bwd_inner_microstep: 3398.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 20:02:41,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.03 | bwd: 3399.67 | bwd_inner: 3398.85 | bwd_allreduce: 0.78 | step: 7.06 41%|████▏ | 4147/10000 [6:33:02<8:58:22, 5.52s/it] {'loss': 0.0006, 'grad_norm': 0.026618104428052902, 'learning_rate': 2.638327882386145e-05, 'epoch': 4.15} 41%|████▏ | 4147/10000 [6:33:02<8:58:22, 5.52s/it][2025-06-19 20:02:46,967] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 20:02:46,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.98 | bwd_microstep: 3326.29 | bwd_inner_microstep: 3325.22 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.82 [2025-06-19 20:02:46,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.98 | bwd: 3326.31 | bwd_inner: 3325.22 | bwd_allreduce: 1.04 | step: 7.83 41%|████▏ | 4148/10000 [6:33:07<8:57:01, 5.51s/it] {'loss': 0.0052, 'grad_norm': 0.31197571754455566, 'learning_rate': 2.637713975547495e-05, 'epoch': 4.15} 41%|████▏ | 4148/10000 [6:33:07<8:57:01, 5.51s/it][2025-06-19 20:02:52,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:02:52,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.51 | bwd_microstep: 3361.26 | bwd_inner_microstep: 3360.44 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.09 [2025-06-19 20:02:52,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.51 | bwd: 3361.27 | bwd_inner: 3360.44 | bwd_allreduce: 0.79 | step: 7.09 41%|████▏ | 4149/10000 [6:33:13<8:57:44, 5.51s/it] {'loss': 0.0024, 'grad_norm': 0.1717372089624405, 'learning_rate': 2.6371000018156075e-05, 'epoch': 4.15} 41%|████▏ | 4149/10000 [6:33:13<8:57:44, 5.51s/it][2025-06-19 20:02:58,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:02:58,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.07 | bwd_microstep: 3368.96 | bwd_inner_microstep: 3368.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 20:02:58,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.07 | bwd: 3368.98 | bwd_inner: 3368.16 | bwd_allreduce: 0.77 | step: 7.04 42%|████▏ | 4150/10000 
[6:33:18<8:58:16, 5.52s/it] {'loss': 0.0013, 'grad_norm': 0.07551243901252747, 'learning_rate': 2.6364859612548884e-05, 'epoch': 4.15} 42%|████▏ | 4150/10000 [6:33:18<8:58:16, 5.52s/it][2025-06-19 20:03:03,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:03:03,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.96 | bwd_microstep: 3359.09 | bwd_inner_microstep: 3358.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 20:03:03,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.96 | bwd: 3359.11 | bwd_inner: 3358.30 | bwd_allreduce: 0.77 | step: 6.82 42%|████▏ | 4151/10000 [6:33:24<8:58:21, 5.52s/it] {'loss': 0.1512, 'grad_norm': 4.032063007354736, 'learning_rate': 2.6358718539297446e-05, 'epoch': 4.15} 42%|████▏ | 4151/10000 [6:33:24<8:58:21, 5.52s/it][2025-06-19 20:03:09,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:03:09,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.37 | bwd_microstep: 3318.44 | bwd_inner_microstep: 3317.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 20:03:09,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.37 | bwd: 3318.45 | bwd_inner: 3317.64 | bwd_allreduce: 0.77 | step: 6.67 42%|████▏ | 4152/10000 [6:33:29<8:56:37, 5.51s/it] {'loss': 0.0315, 'grad_norm': 2.4770021438598633, 'learning_rate': 2.635257679904595e-05, 'epoch': 4.15} 42%|████▏ | 4152/10000 [6:33:29<8:56:37, 5.51s/it][2025-06-19 20:03:14,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:03:14,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.30 | bwd_microstep: 3315.39 | bwd_inner_microstep: 3314.56 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 7.31 [2025-06-19 20:03:14,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.30 | bwd: 3315.41 | bwd_inner: 3314.56 | bwd_allreduce: 0.80 | step: 7.31 42%|████▏ | 4153/10000 [6:33:35<8:55:13, 5.49s/it] {'loss': 0.0039, 'grad_norm': 0.27675876021385193, 'learning_rate': 2.6346434392438634e-05, 'epoch': 4.15} 42%|████▏ | 4153/10000 [6:33:35<8:55:13, 5.49s/it][2025-06-19 20:03:20,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:03:20,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.83 | bwd_microstep: 3363.50 | bwd_inner_microstep: 3362.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 20:03:20,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.83 | bwd: 3363.52 | bwd_inner: 3362.71 | bwd_allreduce: 0.76 | step: 6.65 42%|████▏ | 4154/10000 [6:33:40<8:56:10, 5.50s/it] {'loss': 0.0679, 'grad_norm': 3.2959954738616943, 'learning_rate': 2.6340291320119797e-05, 'epoch': 4.15} 42%|████▏ | 4154/10000 [6:33:40<8:56:10, 5.50s/it][2025-06-19 20:03:25,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:03:25,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.09 | bwd_microstep: 3366.37 | bwd_inner_microstep: 3365.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 20:03:25,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.09 | bwd: 3366.39 | bwd_inner: 3365.57 | bwd_allreduce: 0.77 | step: 6.78 42%|████▏ | 4155/10000 [6:33:46<8:56:55, 5.51s/it] {'loss': 0.0091, 'grad_norm': 0.740608811378479, 'learning_rate': 2.633414758273383e-05, 'epoch': 4.16} 42%|████▏ | 4155/10000 [6:33:46<8:56:55, 5.51s/it][2025-06-19 20:03:31,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:03:31,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.46 | bwd_microstep: 3320.54 | bwd_inner_microstep: 3319.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 20:03:31,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.46 | bwd: 3320.55 | bwd_inner: 3319.73 | bwd_allreduce: 0.77 | step: 6.81 42%|████▏ | 4156/10000 [6:33:51<8:55:18, 5.50s/it] {'loss': 0.0791, 'grad_norm': 4.6097869873046875, 'learning_rate': 2.6328003180925185e-05, 'epoch': 4.16} 42%|████▏ | 4156/10000 [6:33:51<8:55:18, 5.50s/it][2025-06-19 20:03:36,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:03:36,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.74 | bwd_microstep: 3371.18 | bwd_inner_microstep: 3370.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 20:03:36,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.74 | bwd: 3371.20 | bwd_inner: 3370.38 | bwd_allreduce: 0.77 | step: 6.80 42%|████▏ | 4157/10000 [6:33:57<8:56:20, 5.51s/it] {'loss': 0.0112, 'grad_norm': 0.7034572958946228, 'learning_rate': 2.6321858115338367e-05, 'epoch': 4.16} 42%|████▏ | 4157/10000 [6:33:57<8:56:20, 5.51s/it][2025-06-19 20:03:42,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:03:42,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.39 | bwd_microstep: 3327.24 | bwd_inner_microstep: 3326.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.17 [2025-06-19 20:03:42,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.39 | bwd: 3327.26 | bwd_inner: 3326.44 | bwd_allreduce: 0.77 | step: 7.18 42%|████▏ | 4158/10000 [6:34:02<8:55:11, 5.50s/it] {'loss': 
0.2175, 'grad_norm': 5.806079864501953, 'learning_rate': 2.6315712386617976e-05, 'epoch': 4.16} 42%|████▏ | 4158/10000 [6:34:02<8:55:11, 5.50s/it][2025-06-19 20:03:47,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:03:47,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.72 | bwd_microstep: 3316.99 | bwd_inner_microstep: 3315.91 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.38 [2025-06-19 20:03:47,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.73 | bwd: 3317.00 | bwd_inner: 3315.91 | bwd_allreduce: 1.04 | step: 7.38 42%|████▏ | 4159/10000 [6:34:08<8:54:09, 5.49s/it] {'loss': 0.0198, 'grad_norm': 1.2158271074295044, 'learning_rate': 2.630956599540867e-05, 'epoch': 4.16} 42%|████▏ | 4159/10000 [6:34:08<8:54:09, 5.49s/it][2025-06-19 20:03:52,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:03:52,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.60 | bwd_microstep: 3316.70 | bwd_inner_microstep: 3315.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 20:03:52,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.60 | bwd: 3316.71 | bwd_inner: 3315.89 | bwd_allreduce: 0.78 | step: 7.11 42%|████▏ | 4160/10000 [6:34:13<8:53:35, 5.48s/it] {'loss': 0.0509, 'grad_norm': 2.597597122192383, 'learning_rate': 2.6303418942355178e-05, 'epoch': 4.16} 42%|████▏ | 4160/10000 [6:34:13<8:53:35, 5.48s/it][2025-06-19 20:03:58,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:03:58,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.08 | bwd_microstep: 3313.99 | bwd_inner_microstep: 3313.18 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 
[2025-06-19 20:03:58,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.08 | bwd: 3314.00 | bwd_inner: 3313.18 | bwd_allreduce: 0.78 | step: 7.12 42%|████▏ | 4161/10000 [6:34:19<8:52:43, 5.47s/it] {'loss': 0.0166, 'grad_norm': 1.2251144647598267, 'learning_rate': 2.6297271228102287e-05, 'epoch': 4.16} 42%|████▏ | 4161/10000 [6:34:19<8:52:43, 5.47s/it][2025-06-19 20:04:03,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:04:03,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.25 | bwd_microstep: 3317.66 | bwd_inner_microstep: 3316.70 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.93 [2025-06-19 20:04:03,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.25 | bwd: 3317.68 | bwd_inner: 3316.70 | bwd_allreduce: 0.93 | step: 6.93 42%|████▏ | 4162/10000 [6:34:24<8:52:12, 5.47s/it] {'loss': 0.0009, 'grad_norm': 0.04006841406226158, 'learning_rate': 2.6291122853294875e-05, 'epoch': 4.16} 42%|████▏ | 4162/10000 [6:34:24<8:52:12, 5.47s/it][2025-06-19 20:04:09,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:04:09,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.35 | bwd_microstep: 3316.46 | bwd_inner_microstep: 3315.63 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.14 [2025-06-19 20:04:09,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.35 | bwd: 3316.47 | bwd_inner: 3315.63 | bwd_allreduce: 0.79 | step: 7.14 42%|████▏ | 4163/10000 [6:34:30<8:51:43, 5.47s/it] {'loss': 0.009, 'grad_norm': 0.4543800354003906, 'learning_rate': 2.6284973818577867e-05, 'epoch': 4.16} 42%|████▏ | 4163/10000 [6:34:30<8:51:43, 5.47s/it][2025-06-19 20:04:14,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-19 20:04:14,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.39 | bwd_microstep: 3321.73 | bwd_inner_microstep: 3320.93 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-19 20:04:14,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.39 | bwd: 3321.75 | bwd_inner: 3320.93 | bwd_allreduce: 0.77 | step: 7.25 42%|████▏ | 4164/10000 [6:34:35<8:51:38, 5.47s/it] {'loss': 0.002, 'grad_norm': 0.13475282490253448, 'learning_rate': 2.6278824124596282e-05, 'epoch': 4.16} 42%|████▏ | 4164/10000 [6:34:35<8:51:38, 5.47s/it][2025-06-19 20:04:20,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:04:20,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.46 | bwd_microstep: 3322.91 | bwd_inner_microstep: 3322.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 20:04:20,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.46 | bwd: 3322.92 | bwd_inner: 3322.10 | bwd_allreduce: 0.78 | step: 7.00 42%|████▏ | 4165/10000 [6:34:41<8:51:37, 5.47s/it] {'loss': 0.1067, 'grad_norm': 2.87416410446167, 'learning_rate': 2.6272673771995182e-05, 'epoch': 4.17} 42%|████▏ | 4165/10000 [6:34:41<8:51:37, 5.47s/it][2025-06-19 20:04:25,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:04:25,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.14 | bwd_microstep: 3308.97 | bwd_inner_microstep: 3308.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 20:04:25,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.14 | bwd: 3308.98 | bwd_inner: 3308.16 | bwd_allreduce: 0.78 | step: 7.22 42%|████▏ | 4166/10000 [6:34:46<8:51:32, 5.47s/it] {'loss': 0.0101, 'grad_norm': 0.6276589632034302, 'learning_rate': 
2.6266522761419722e-05, 'epoch': 4.17} 42%|████▏ | 4166/10000 [6:34:46<8:51:32, 5.47s/it][2025-06-19 20:04:31,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:04:31,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.99 | bwd_microstep: 3360.25 | bwd_inner_microstep: 3359.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 20:04:31,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.99 | bwd: 3360.26 | bwd_inner: 3359.45 | bwd_allreduce: 0.76 | step: 6.70 42%|████▏ | 4167/10000 [6:34:52<8:53:09, 5.48s/it] {'loss': 0.0332, 'grad_norm': 2.2177517414093018, 'learning_rate': 2.6260371093515105e-05, 'epoch': 4.17} 42%|████▏ | 4167/10000 [6:34:52<8:53:09, 5.48s/it][2025-06-19 20:04:36,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:04:36,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.98 | bwd_microstep: 3313.46 | bwd_inner_microstep: 3312.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 20:04:36,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.98 | bwd: 3313.48 | bwd_inner: 3312.65 | bwd_allreduce: 0.77 | step: 6.98 42%|████▏ | 4168/10000 [6:34:57<8:52:07, 5.47s/it] {'loss': 0.0214, 'grad_norm': 1.76121985912323, 'learning_rate': 2.625421876892662e-05, 'epoch': 4.17} 42%|████▏ | 4168/10000 [6:34:57<8:52:07, 5.47s/it][2025-06-19 20:04:42,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:04:42,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.80 | bwd_microstep: 3312.60 | bwd_inner_microstep: 3311.77 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.94 [2025-06-19 20:04:42,157] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2099.80 | bwd: 3312.62 | bwd_inner: 3311.77 | bwd_allreduce: 0.81 | step: 6.94 42%|████▏ | 4169/10000 [6:35:02<8:51:26, 5.47s/it] {'loss': 0.0008, 'grad_norm': 0.032045263797044754, 'learning_rate': 2.6248065788299606e-05, 'epoch': 4.17} 42%|████▏ | 4169/10000 [6:35:02<8:51:26, 5.47s/it][2025-06-19 20:04:47,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:04:47,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.26 | bwd_microstep: 3392.69 | bwd_inner_microstep: 3391.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-19 20:04:47,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.26 | bwd: 3392.70 | bwd_inner: 3391.89 | bwd_allreduce: 0.77 | step: 6.97 42%|████▏ | 4170/10000 [6:35:08<8:54:13, 5.50s/it] {'loss': 0.0027, 'grad_norm': 0.19575181603431702, 'learning_rate': 2.624191215227949e-05, 'epoch': 4.17} 42%|████▏ | 4170/10000 [6:35:08<8:54:13, 5.50s/it][2025-06-19 20:04:53,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:04:53,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.71 | bwd_microstep: 3317.94 | bwd_inner_microstep: 3317.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-19 20:04:53,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.71 | bwd: 3317.95 | bwd_inner: 3317.14 | bwd_allreduce: 0.77 | step: 7.01 42%|████▏ | 4171/10000 [6:35:13<8:52:56, 5.49s/it] {'loss': 0.0036, 'grad_norm': 0.32083189487457275, 'learning_rate': 2.6235757861511762e-05, 'epoch': 4.17} 42%|████▏ | 4171/10000 [6:35:13<8:52:56, 5.49s/it][2025-06-19 20:04:58,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:04:58,638] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.64 | bwd_microstep: 3312.21 | bwd_inner_microstep: 3311.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.21 [2025-06-19 20:04:58,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.64 | bwd: 3312.23 | bwd_inner: 3311.41 | bwd_allreduce: 0.77 | step: 7.21 42%|████▏ | 4172/10000 [6:35:19<8:52:01, 5.48s/it] {'loss': 0.0021, 'grad_norm': 0.16522590816020966, 'learning_rate': 2.6229602916641975e-05, 'epoch': 4.17} 42%|████▏ | 4172/10000 [6:35:19<8:52:01, 5.48s/it][2025-06-19 20:05:04,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:05:04,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.55 | bwd_microstep: 3316.07 | bwd_inner_microstep: 3315.25 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.27 [2025-06-19 20:05:04,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.55 | bwd: 3316.09 | bwd_inner: 3315.25 | bwd_allreduce: 0.79 | step: 7.27 42%|████▏ | 4173/10000 [6:35:24<8:51:17, 5.47s/it] {'loss': 0.0013, 'grad_norm': 0.10098131746053696, 'learning_rate': 2.622344731831575e-05, 'epoch': 4.17} 42%|████▏ | 4173/10000 [6:35:24<8:51:17, 5.47s/it][2025-06-19 20:05:09,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:05:09,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.94 | bwd_microstep: 3363.76 | bwd_inner_microstep: 3362.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 20:05:09,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.94 | bwd: 3363.77 | bwd_inner: 3362.95 | bwd_allreduce: 0.77 | step: 7.19 42%|████▏ | 4174/10000 [6:35:30<8:52:55, 5.49s/it] {'loss': 0.0185, 'grad_norm': 3.2453205585479736, 'learning_rate': 2.621729106717879e-05, 'epoch': 4.17} 42%|████▏ | 
4174/10000 [6:35:30<8:52:55, 5.49s/it][2025-06-19 20:05:15,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:05:15,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.88 | bwd_microstep: 3355.66 | bwd_inner_microstep: 3354.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-19 20:05:15,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.88 | bwd: 3355.68 | bwd_inner: 3354.86 | bwd_allreduce: 0.77 | step: 7.13 42%|████▏ | 4175/10000 [6:35:35<8:53:39, 5.50s/it] {'loss': 0.1165, 'grad_norm': 2.577364683151245, 'learning_rate': 2.6211134163876847e-05, 'epoch': 4.17} 42%|████▏ | 4175/10000 [6:35:35<8:53:39, 5.50s/it][2025-06-19 20:05:20,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:05:20,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.13 | bwd_microstep: 3369.41 | bwd_inner_microstep: 3368.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 20:05:20,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.13 | bwd: 3369.43 | bwd_inner: 3368.60 | bwd_allreduce: 0.78 | step: 7.23 42%|████▏ | 4176/10000 [6:35:41<8:54:49, 5.51s/it] {'loss': 0.0337, 'grad_norm': 1.7905699014663696, 'learning_rate': 2.6204976609055766e-05, 'epoch': 4.18} 42%|████▏ | 4176/10000 [6:35:41<8:54:49, 5.51s/it][2025-06-19 20:05:26,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:05:26,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.21 | bwd_microstep: 3365.86 | bwd_inner_microstep: 3365.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 20:05:26,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.21 | bwd: 3365.87 
| bwd_inner: 3365.05 | bwd_allreduce: 0.78 | step: 7.12 42%|████▏ | 4177/10000 [6:35:47<8:55:17, 5.52s/it] {'loss': 0.0864, 'grad_norm': 2.302187442779541, 'learning_rate': 2.619881840336143e-05, 'epoch': 4.18} 42%|████▏ | 4177/10000 [6:35:47<8:55:17, 5.52s/it][2025-06-19 20:05:31,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.89 [2025-06-19 20:05:31,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.73 | bwd_microstep: 3320.57 | bwd_inner_microstep: 3319.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-19 20:05:31,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.73 | bwd: 3320.59 | bwd_inner: 3319.77 | bwd_allreduce: 0.77 | step: 7.08 42%|████▏ | 4178/10000 [6:35:52<8:53:29, 5.50s/it] {'loss': 0.0542, 'grad_norm': 5.066305160522461, 'learning_rate': 2.619265954743982e-05, 'epoch': 4.18} 42%|████▏ | 4178/10000 [6:35:52<8:53:29, 5.50s/it][2025-06-19 20:05:37,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:05:37,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.29 | bwd_microstep: 3312.48 | bwd_inner_microstep: 3311.64 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.39 [2025-06-19 20:05:37,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.29 | bwd: 3312.49 | bwd_inner: 3311.64 | bwd_allreduce: 0.80 | step: 7.40 42%|████▏ | 4179/10000 [6:35:57<8:52:00, 5.48s/it] {'loss': 0.0991, 'grad_norm': 4.180408954620361, 'learning_rate': 2.6186500041936954e-05, 'epoch': 4.18} 42%|████▏ | 4179/10000 [6:35:57<8:52:00, 5.48s/it][2025-06-19 20:05:42,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:05:42,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2131.53 | bwd_microstep: 3365.65 | bwd_inner_microstep: 3364.84 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-19 20:05:42,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.53 | bwd: 3365.66 | bwd_inner: 3364.84 | bwd_allreduce: 0.78 | step: 6.88 42%|████▏ | 4180/10000 [6:36:03<8:53:25, 5.50s/it] {'loss': 0.0106, 'grad_norm': 0.5725746154785156, 'learning_rate': 2.618033988749895e-05, 'epoch': 4.18} 42%|████▏ | 4180/10000 [6:36:03<8:53:25, 5.50s/it][2025-06-19 20:05:48,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:05:48,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.02 | bwd_microstep: 3395.78 | bwd_inner_microstep: 3394.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 20:05:48,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.02 | bwd: 3395.80 | bwd_inner: 3394.97 | bwd_allreduce: 0.78 | step: 7.09 42%|████▏ | 4181/10000 [6:36:09<8:55:30, 5.52s/it] {'loss': 0.0268, 'grad_norm': 3.388845443725586, 'learning_rate': 2.617417908477198e-05, 'epoch': 4.18} 42%|████▏ | 4181/10000 [6:36:09<8:55:30, 5.52s/it][2025-06-19 20:05:53,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:05:53,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.97 | bwd_microstep: 3313.85 | bwd_inner_microstep: 3313.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 20:05:53,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.97 | bwd: 3313.87 | bwd_inner: 3313.06 | bwd_allreduce: 0.77 | step: 6.96 42%|████▏ | 4182/10000 [6:36:14<8:53:28, 5.50s/it] {'loss': 0.0341, 'grad_norm': 1.470838189125061, 'learning_rate': 2.6168017634402274e-05, 'epoch': 4.18} 42%|████▏ | 4182/10000 [6:36:14<8:53:28, 5.50s/it][2025-06-19 20:05:59,138] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:05:59,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.60 | bwd_microstep: 3315.10 | bwd_inner_microstep: 3314.05 | bwd_allreduce_microstep: 0.98 | step_microstep: 8.04 [2025-06-19 20:05:59,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.60 | bwd: 3315.11 | bwd_inner: 3314.05 | bwd_allreduce: 1.01 | step: 8.04 42%|████▏ | 4183/10000 [6:36:19<8:52:11, 5.49s/it] {'loss': 0.0083, 'grad_norm': 1.0283517837524414, 'learning_rate': 2.616185553703615e-05, 'epoch': 4.18} 42%|████▏ | 4183/10000 [6:36:19<8:52:11, 5.49s/it][2025-06-19 20:06:04,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:06:04,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.23 | bwd_microstep: 3318.77 | bwd_inner_microstep: 3317.96 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 20:06:04,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.23 | bwd: 3318.79 | bwd_inner: 3317.96 | bwd_allreduce: 0.78 | step: 7.09 42%|████▏ | 4184/10000 [6:36:25<8:51:28, 5.48s/it] {'loss': 0.0382, 'grad_norm': 2.0550429821014404, 'learning_rate': 2.615569279331997e-05, 'epoch': 4.18} 42%|████▏ | 4184/10000 [6:36:25<8:51:28, 5.48s/it][2025-06-19 20:06:10,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:06:10,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.84 | bwd_microstep: 3363.41 | bwd_inner_microstep: 3362.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 20:06:10,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.84 | bwd: 3363.42 | bwd_inner: 3362.60 | bwd_allreduce: 0.78 | step: 6.94 
42%|████▏ | 4185/10000 [6:36:30<8:52:41, 5.50s/it] {'loss': 0.0036, 'grad_norm': 0.4506494998931885, 'learning_rate': 2.614952940390019e-05, 'epoch': 4.18} 42%|████▏ | 4185/10000 [6:36:30<8:52:41, 5.50s/it][2025-06-19 20:06:15,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:06:15,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.17 | bwd_microstep: 3315.65 | bwd_inner_microstep: 3314.84 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.75 [2025-06-19 20:06:15,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.17 | bwd: 3315.66 | bwd_inner: 3314.84 | bwd_allreduce: 0.78 | step: 6.75 42%|████▏ | 4186/10000 [6:36:36<8:51:36, 5.49s/it] {'loss': 0.0137, 'grad_norm': 0.6716204285621643, 'learning_rate': 2.6143365369423322e-05, 'epoch': 4.19} 42%|████▏ | 4186/10000 [6:36:36<8:51:36, 5.49s/it][2025-06-19 20:06:21,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:06:21,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.72 | bwd_microstep: 3358.99 | bwd_inner_microstep: 3358.18 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 20:06:21,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.72 | bwd: 3359.00 | bwd_inner: 3358.18 | bwd_allreduce: 0.77 | step: 7.14 42%|████▏ | 4187/10000 [6:36:41<8:52:30, 5.50s/it] {'loss': 0.0184, 'grad_norm': 1.6821638345718384, 'learning_rate': 2.613720069053593e-05, 'epoch': 4.19} 42%|████▏ | 4187/10000 [6:36:41<8:52:30, 5.50s/it][2025-06-19 20:06:26,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:06:26,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.69 | bwd_microstep: 3367.85 | bwd_inner_microstep: 
3366.90 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.29 [2025-06-19 20:06:26,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.69 | bwd: 3367.86 | bwd_inner: 3366.90 | bwd_allreduce: 0.92 | step: 7.30 42%|████▏ | 4188/10000 [6:36:47<8:53:28, 5.51s/it] {'loss': 0.0084, 'grad_norm': 0.7701065540313721, 'learning_rate': 2.6131035367884678e-05, 'epoch': 4.19} 42%|████▏ | 4188/10000 [6:36:47<8:53:28, 5.51s/it][2025-06-19 20:06:32,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:06:32,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.57 | bwd_microstep: 3305.81 | bwd_inner_microstep: 3305.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 20:06:32,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.57 | bwd: 3305.83 | bwd_inner: 3305.01 | bwd_allreduce: 0.77 | step: 6.91 42%|████▏ | 4189/10000 [6:36:52<8:51:51, 5.49s/it] {'loss': 0.1681, 'grad_norm': 4.470649242401123, 'learning_rate': 2.6124869402116267e-05, 'epoch': 4.19} 42%|████▏ | 4189/10000 [6:36:52<8:51:51, 5.49s/it][2025-06-19 20:06:37,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:06:37,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.83 | bwd_microstep: 3312.65 | bwd_inner_microstep: 3311.82 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.13 [2025-06-19 20:06:37,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.83 | bwd: 3312.66 | bwd_inner: 3311.83 | bwd_allreduce: 0.79 | step: 7.13 42%|████▏ | 4190/10000 [6:36:58<8:50:33, 5.48s/it] {'loss': 0.0542, 'grad_norm': 2.446255922317505, 'learning_rate': 2.611870279387748e-05, 'epoch': 4.19} 42%|████▏ | 4190/10000 [6:36:58<8:50:33, 5.48s/it][2025-06-19 20:06:43,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:06:43,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.17 | bwd_microstep: 3307.66 | bwd_inner_microstep: 3306.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 20:06:43,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.17 | bwd: 3307.67 | bwd_inner: 3306.86 | bwd_allreduce: 0.77 | step: 6.68 42%|████▏ | 4191/10000 [6:37:03<8:49:29, 5.47s/it] {'loss': 0.0037, 'grad_norm': 0.4490150213241577, 'learning_rate': 2.6112535543815166e-05, 'epoch': 4.19} 42%|████▏ | 4191/10000 [6:37:03<8:49:29, 5.47s/it][2025-06-19 20:06:48,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:06:48,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.47 | bwd_microstep: 3311.82 | bwd_inner_microstep: 3311.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 20:06:48,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.47 | bwd: 3311.84 | bwd_inner: 3311.03 | bwd_allreduce: 0.77 | step: 6.76 42%|████▏ | 4192/10000 [6:37:09<8:48:44, 5.46s/it] {'loss': 0.0038, 'grad_norm': 0.19848889112472534, 'learning_rate': 2.6106367652576247e-05, 'epoch': 4.19} 42%|████▏ | 4192/10000 [6:37:09<8:48:44, 5.46s/it][2025-06-19 20:06:53,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:06:53,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.20 | bwd_microstep: 3365.72 | bwd_inner_microstep: 3364.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.51 [2025-06-19 20:06:53,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.20 | bwd: 3365.74 | bwd_inner: 3364.91 | bwd_allreduce: 0.78 | step: 7.51 42%|████▏ | 4193/10000 [6:37:14<8:50:36, 5.48s/it] {'loss': 
0.0017, 'grad_norm': 0.20816177129745483, 'learning_rate': 2.61001991208077e-05, 'epoch': 4.19} 42%|████▏ | 4193/10000 [6:37:14<8:50:36, 5.48s/it][2025-06-19 20:06:59,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:06:59,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.44 | bwd_microstep: 3313.80 | bwd_inner_microstep: 3313.01 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 20:06:59,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.44 | bwd: 3313.82 | bwd_inner: 3313.01 | bwd_allreduce: 0.76 | step: 6.77 42%|████▏ | 4194/10000 [6:37:20<8:49:47, 5.48s/it] {'loss': 0.0024, 'grad_norm': 0.2542567551136017, 'learning_rate': 2.609402994915658e-05, 'epoch': 4.19} 42%|████▏ | 4194/10000 [6:37:20<8:49:47, 5.48s/it][2025-06-19 20:07:04,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:07:04,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.39 | bwd_microstep: 3308.24 | bwd_inner_microstep: 3307.37 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.97 [2025-06-19 20:07:04,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.39 | bwd: 3308.26 | bwd_inner: 3307.37 | bwd_allreduce: 0.85 | step: 6.98 42%|████▏ | 4195/10000 [6:37:25<8:48:54, 5.47s/it] {'loss': 0.0031, 'grad_norm': 0.1997031569480896, 'learning_rate': 2.6087860138269996e-05, 'epoch': 4.2} 42%|████▏ | 4195/10000 [6:37:25<8:48:54, 5.47s/it][2025-06-19 20:07:10,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:07:10,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.33 | bwd_microstep: 3308.48 | bwd_inner_microstep: 3307.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 
[2025-06-19 20:07:10,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.33 | bwd: 3308.50 | bwd_inner: 3307.68 | bwd_allreduce: 0.77 | step: 7.00 42%|████▏ | 4196/10000 [6:37:31<8:48:15, 5.46s/it] {'loss': 0.0114, 'grad_norm': 0.7884377837181091, 'learning_rate': 2.6081689688795146e-05, 'epoch': 4.2} 42%|████▏ | 4196/10000 [6:37:31<8:48:15, 5.46s/it][2025-06-19 20:07:15,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:07:15,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.44 | bwd_microstep: 3372.97 | bwd_inner_microstep: 3372.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 20:07:15,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.44 | bwd: 3372.98 | bwd_inner: 3372.18 | bwd_allreduce: 0.76 | step: 6.70 42%|████▏ | 4197/10000 [6:37:36<8:50:20, 5.48s/it] {'loss': 0.0731, 'grad_norm': 4.26752233505249, 'learning_rate': 2.6075518601379265e-05, 'epoch': 4.2} 42%|████▏ | 4197/10000 [6:37:36<8:50:20, 5.48s/it][2025-06-19 20:07:21,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:07:21,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.42 | bwd_microstep: 3326.71 | bwd_inner_microstep: 3325.87 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.76 [2025-06-19 20:07:21,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.42 | bwd: 3326.73 | bwd_inner: 3325.87 | bwd_allreduce: 0.82 | step: 6.77 42%|████▏ | 4198/10000 [6:37:42<8:49:54, 5.48s/it] {'loss': 0.0911, 'grad_norm': 2.814361095428467, 'learning_rate': 2.6069346876669682e-05, 'epoch': 4.2} 42%|████▏ | 4198/10000 [6:37:42<8:49:54, 5.48s/it][2025-06-19 20:07:26,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | 
optimizer_step: 2.73 [2025-06-19 20:07:26,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.34 | bwd_microstep: 3317.09 | bwd_inner_microstep: 3316.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 20:07:26,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.34 | bwd: 3317.10 | bwd_inner: 3316.29 | bwd_allreduce: 0.77 | step: 7.07 42%|████▏ | 4199/10000 [6:37:47<8:49:19, 5.47s/it] {'loss': 0.104, 'grad_norm': 4.792835712432861, 'learning_rate': 2.6063174515313785e-05, 'epoch': 4.2} 42%|████▏ | 4199/10000 [6:37:47<8:49:19, 5.47s/it][2025-06-19 20:07:32,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:07:32,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.90 | bwd_microstep: 3319.91 | bwd_inner_microstep: 3319.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-19 20:07:32,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.90 | bwd: 3319.93 | bwd_inner: 3319.13 | bwd_allreduce: 0.76 | step: 6.73 42%|████▏ | 4200/10000 [6:37:53<8:48:58, 5.47s/it] {'loss': 0.0006, 'grad_norm': 0.034424543380737305, 'learning_rate': 2.6057001517959015e-05, 'epoch': 4.2} 42%|████▏ | 4200/10000 [6:37:53<8:48:58, 5.47s/it][2025-06-19 20:07:37,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:07:37,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.18 | bwd_microstep: 3371.78 | bwd_inner_microstep: 3370.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 20:07:37,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.18 | bwd: 3371.80 | bwd_inner: 3370.99 | bwd_allreduce: 0.76 | step: 6.70 42%|████▏ | 4201/10000 [6:37:58<8:50:55, 5.49s/it] {'loss': 0.0303, 'grad_norm': 3.7011992931365967, 'learning_rate': 
2.6050827885252905e-05, 'epoch': 4.2} 42%|████▏ | 4201/10000 [6:37:58<8:50:55, 5.49s/it][2025-06-19 20:07:43,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:07:43,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.70 | bwd_microstep: 3318.67 | bwd_inner_microstep: 3317.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 20:07:43,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.70 | bwd: 3318.69 | bwd_inner: 3317.87 | bwd_allreduce: 0.77 | step: 6.87 42%|████▏ | 4202/10000 [6:38:04<8:50:07, 5.49s/it] {'loss': 0.1404, 'grad_norm': 2.456878662109375, 'learning_rate': 2.6044653617843026e-05, 'epoch': 4.2} 42%|████▏ | 4202/10000 [6:38:04<8:50:07, 5.49s/it][2025-06-19 20:07:48,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:07:48,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.31 | bwd_microstep: 3311.77 | bwd_inner_microstep: 3310.93 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.26 [2025-06-19 20:07:48,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.31 | bwd: 3311.78 | bwd_inner: 3310.93 | bwd_allreduce: 0.81 | step: 7.27 42%|████▏ | 4203/10000 [6:38:09<8:49:07, 5.48s/it] {'loss': 0.0655, 'grad_norm': 1.8148857355117798, 'learning_rate': 2.603847871637704e-05, 'epoch': 4.2} 42%|████▏ | 4203/10000 [6:38:09<8:49:07, 5.48s/it][2025-06-19 20:07:54,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:07:54,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.67 | bwd_microstep: 3319.48 | bwd_inner_microstep: 3318.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 20:07:54,199] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2104.67 | bwd: 3319.50 | bwd_inner: 3318.69 | bwd_allreduce: 0.77 | step: 6.73 42%|████▏ | 4204/10000 [6:38:14<8:48:41, 5.47s/it] {'loss': 0.0085, 'grad_norm': 0.7098146080970764, 'learning_rate': 2.6032303181502663e-05, 'epoch': 4.2} 42%|████▏ | 4204/10000 [6:38:14<8:48:41, 5.47s/it][2025-06-19 20:07:59,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:07:59,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.82 | bwd_microstep: 3314.53 | bwd_inner_microstep: 3313.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.71 [2025-06-19 20:07:59,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.82 | bwd: 3314.55 | bwd_inner: 3313.73 | bwd_allreduce: 0.77 | step: 6.71 42%|████▏ | 4205/10000 [6:38:20<8:48:01, 5.47s/it] {'loss': 0.006, 'grad_norm': 0.7806726098060608, 'learning_rate': 2.6026127013867676e-05, 'epoch': 4.21} 42%|████▏ | 4205/10000 [6:38:20<8:48:01, 5.47s/it][2025-06-19 20:08:05,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:08:05,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.42 | bwd_microstep: 3320.22 | bwd_inner_microstep: 3319.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 20:08:05,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.42 | bwd: 3320.24 | bwd_inner: 3319.41 | bwd_allreduce: 0.78 | step: 7.08 42%|████▏ | 4206/10000 [6:38:25<8:47:53, 5.47s/it] {'loss': 0.0307, 'grad_norm': 2.2486608028411865, 'learning_rate': 2.601995021411994e-05, 'epoch': 4.21} 42%|████▏ | 4206/10000 [6:38:25<8:47:53, 5.47s/it][2025-06-19 20:08:10,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:08:10,657] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.98 | bwd_microstep: 3374.05 | bwd_inner_microstep: 3373.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-19 20:08:10,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.98 | bwd: 3374.06 | bwd_inner: 3373.24 | bwd_allreduce: 0.78 | step: 6.92 42%|████▏ | 4207/10000 [6:38:31<8:49:54, 5.49s/it] {'loss': 0.0177, 'grad_norm': 1.5074269771575928, 'learning_rate': 2.6013772782907357e-05, 'epoch': 4.21} 42%|████▏ | 4207/10000 [6:38:31<8:49:54, 5.49s/it][2025-06-19 20:08:16,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:08:16,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.91 | bwd_microstep: 3367.49 | bwd_inner_microstep: 3366.55 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.02 [2025-06-19 20:08:16,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.91 | bwd: 3367.50 | bwd_inner: 3366.55 | bwd_allreduce: 0.91 | step: 7.04 42%|████▏ | 4208/10000 [6:38:36<8:51:24, 5.50s/it] {'loss': 0.0226, 'grad_norm': 3.7297165393829346, 'learning_rate': 2.6007594720877922e-05, 'epoch': 4.21} 42%|████▏ | 4208/10000 [6:38:36<8:51:24, 5.50s/it][2025-06-19 20:08:21,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:08:21,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.84 | bwd_microstep: 3371.49 | bwd_inner_microstep: 3370.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 20:08:21,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.84 | bwd: 3371.50 | bwd_inner: 3370.69 | bwd_allreduce: 0.77 | step: 6.90 42%|████▏ | 4209/10000 [6:38:42<8:52:26, 5.52s/it] {'loss': 0.0081, 'grad_norm': 0.9814993739128113, 'learning_rate': 2.6001416028679684e-05, 'epoch': 4.21} 42%|████▏ | 
4209/10000 [6:38:42<8:52:26, 5.52s/it][2025-06-19 20:08:27,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:08:27,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.83 | bwd_microstep: 3367.25 | bwd_inner_microstep: 3366.39 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.42 [2025-06-19 20:08:27,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.83 | bwd: 3367.26 | bwd_inner: 3366.39 | bwd_allreduce: 0.83 | step: 7.42 42%|████▏ | 4210/10000 [6:38:48<8:52:50, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.011611179448664188, 'learning_rate': 2.599523670696076e-05, 'epoch': 4.21} 42%|████▏ | 4210/10000 [6:38:48<8:52:50, 5.52s/it][2025-06-19 20:08:32,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:08:32,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.76 | bwd_microstep: 3373.70 | bwd_inner_microstep: 3372.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 20:08:32,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.76 | bwd: 3373.72 | bwd_inner: 3372.92 | bwd_allreduce: 0.76 | step: 6.70 42%|████▏ | 4211/10000 [6:38:53<8:53:38, 5.53s/it] {'loss': 0.0181, 'grad_norm': 0.9710188508033752, 'learning_rate': 2.5989056756369327e-05, 'epoch': 4.21} 42%|████▏ | 4211/10000 [6:38:53<8:53:38, 5.53s/it][2025-06-19 20:08:38,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:08:38,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.35 | bwd_microstep: 3325.29 | bwd_inner_microstep: 3324.21 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.49 [2025-06-19 20:08:38,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.35 | bwd: 
3325.30 | bwd_inner: 3324.21 | bwd_allreduce: 1.05 | step: 7.50 42%|████▏ | 4212/10000 [6:38:59<8:51:55, 5.51s/it] {'loss': 0.0049, 'grad_norm': 0.34838464856147766, 'learning_rate': 2.5982876177553636e-05, 'epoch': 4.21} 42%|████▏ | 4212/10000 [6:38:59<8:51:55, 5.51s/it][2025-06-19 20:08:43,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:08:43,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.22 | bwd_microstep: 3372.07 | bwd_inner_microstep: 3371.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.75 [2025-06-19 20:08:43,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.22 | bwd: 3372.09 | bwd_inner: 3371.27 | bwd_allreduce: 0.77 | step: 6.76 42%|████▏ | 4213/10000 [6:39:04<8:52:46, 5.52s/it] {'loss': 0.0978, 'grad_norm': 2.71588397026062, 'learning_rate': 2.5976694971162006e-05, 'epoch': 4.21} 42%|████▏ | 4213/10000 [6:39:04<8:52:46, 5.52s/it][2025-06-19 20:08:49,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:08:49,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.57 | bwd_microstep: 3329.14 | bwd_inner_microstep: 3328.32 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.80 [2025-06-19 20:08:49,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.57 | bwd: 3329.16 | bwd_inner: 3328.32 | bwd_allreduce: 0.79 | step: 6.80 42%|████▏ | 4214/10000 [6:39:10<8:51:10, 5.51s/it] {'loss': 0.0065, 'grad_norm': 0.45618051290512085, 'learning_rate': 2.5970513137842808e-05, 'epoch': 4.21} 42%|████▏ | 4214/10000 [6:39:10<8:51:10, 5.51s/it][2025-06-19 20:08:54,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:08:54,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2129.83 | bwd_microstep: 3370.10 | bwd_inner_microstep: 3369.30 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 20:08:54,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.83 | bwd: 3370.11 | bwd_inner: 3369.30 | bwd_allreduce: 0.77 | step: 6.80 42%|████▏ | 4215/10000 [6:39:15<8:51:57, 5.52s/it] {'loss': 0.0163, 'grad_norm': 0.9527328014373779, 'learning_rate': 2.5964330678244495e-05, 'epoch': 4.21} 42%|████▏ | 4215/10000 [6:39:15<8:51:57, 5.52s/it][2025-06-19 20:09:00,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:09:00,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.41 | bwd_microstep: 3372.93 | bwd_inner_microstep: 3372.02 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.33 [2025-06-19 20:09:00,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.41 | bwd: 3372.95 | bwd_inner: 3372.02 | bwd_allreduce: 0.88 | step: 7.34 42%|████▏ | 4216/10000 [6:39:21<8:52:30, 5.52s/it] {'loss': 0.0179, 'grad_norm': 1.3641847372055054, 'learning_rate': 2.5958147593015573e-05, 'epoch': 4.22} 42%|████▏ | 4216/10000 [6:39:21<8:52:30, 5.52s/it][2025-06-19 20:09:05,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:09:05,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.39 | bwd_microstep: 3321.27 | bwd_inner_microstep: 3320.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 20:09:05,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.39 | bwd: 3321.29 | bwd_inner: 3320.48 | bwd_allreduce: 0.76 | step: 6.75 42%|████▏ | 4217/10000 [6:39:26<8:50:51, 5.51s/it] {'loss': 0.0004, 'grad_norm': 0.018407655879855156, 'learning_rate': 2.595196388280462e-05, 'epoch': 4.22} 42%|████▏ | 4217/10000 [6:39:26<8:50:51, 
5.51s/it][2025-06-19 20:09:11,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:09:11,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.20 | bwd_microstep: 3335.51 | bwd_inner_microstep: 3334.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 20:09:11,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.20 | bwd: 3335.52 | bwd_inner: 3334.72 | bwd_allreduce: 0.76 | step: 6.60 42%|████▏ | 4218/10000 [6:39:32<8:49:58, 5.50s/it] {'loss': 0.0285, 'grad_norm': 1.8412609100341797, 'learning_rate': 2.594577954826028e-05, 'epoch': 4.22} 42%|████▏ | 4218/10000 [6:39:32<8:49:58, 5.50s/it][2025-06-19 20:09:16,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:09:16,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.42 | bwd_microstep: 3330.20 | bwd_inner_microstep: 3329.39 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 20:09:16,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.42 | bwd: 3330.21 | bwd_inner: 3329.39 | bwd_allreduce: 0.78 | step: 7.14 42%|████▏ | 4219/10000 [6:39:37<8:49:10, 5.49s/it] {'loss': 0.0059, 'grad_norm': 0.8990017175674438, 'learning_rate': 2.5939594590031264e-05, 'epoch': 4.22} 42%|████▏ | 4219/10000 [6:39:37<8:49:10, 5.49s/it][2025-06-19 20:09:22,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:09:22,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.12 | bwd_microstep: 3372.00 | bwd_inner_microstep: 3371.04 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.07 [2025-06-19 20:09:22,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.12 | bwd: 3372.02 | bwd_inner: 3371.04 | 
bwd_allreduce: 0.93 | step: 7.08 42%|████▏ | 4220/10000 [6:39:43<8:50:50, 5.51s/it] {'loss': 0.0125, 'grad_norm': 0.8932253122329712, 'learning_rate': 2.593340900876634e-05, 'epoch': 4.22} 42%|████▏ | 4220/10000 [6:39:43<8:50:50, 5.51s/it][2025-06-19 20:09:27,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:09:27,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.01 | bwd_microstep: 3329.91 | bwd_inner_microstep: 3329.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 20:09:27,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.01 | bwd: 3329.92 | bwd_inner: 3329.12 | bwd_allreduce: 0.76 | step: 6.69 42%|████▏ | 4221/10000 [6:39:48<8:50:14, 5.51s/it] {'loss': 0.0056, 'grad_norm': 0.4910619258880615, 'learning_rate': 2.592722280511435e-05, 'epoch': 4.22} 42%|████▏ | 4221/10000 [6:39:48<8:50:14, 5.51s/it][2025-06-19 20:09:33,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:09:33,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.58 | bwd_microstep: 3328.16 | bwd_inner_microstep: 3327.11 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.15 [2025-06-19 20:09:33,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.58 | bwd: 3328.17 | bwd_inner: 3327.11 | bwd_allreduce: 1.01 | step: 7.15 42%|████▏ | 4222/10000 [6:39:54<8:49:16, 5.50s/it] {'loss': 0.0063, 'grad_norm': 0.6121565103530884, 'learning_rate': 2.5921035979724193e-05, 'epoch': 4.22} 42%|████▏ | 4222/10000 [6:39:54<8:49:16, 5.50s/it][2025-06-19 20:09:38,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:09:38,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.09 | bwd_microstep: 
3391.43 | bwd_inner_microstep: 3390.63 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.39 [2025-06-19 20:09:38,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.09 | bwd: 3391.45 | bwd_inner: 3390.63 | bwd_allreduce: 0.78 | step: 7.40 42%|████▏ | 4223/10000 [6:39:59<8:51:32, 5.52s/it] {'loss': 0.2301, 'grad_norm': 3.276959180831909, 'learning_rate': 2.5914848533244844e-05, 'epoch': 4.22} 42%|████▏ | 4223/10000 [6:39:59<8:51:32, 5.52s/it][2025-06-19 20:09:44,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:09:44,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.23 | bwd_microstep: 3339.24 | bwd_inner_microstep: 3338.41 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.78 [2025-06-19 20:09:44,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.23 | bwd: 3339.25 | bwd_inner: 3338.41 | bwd_allreduce: 0.79 | step: 6.78 42%|████▏ | 4224/10000 [6:40:05<8:50:44, 5.51s/it] {'loss': 0.002, 'grad_norm': 0.14311957359313965, 'learning_rate': 2.5908660466325338e-05, 'epoch': 4.22} 42%|████▏ | 4224/10000 [6:40:05<8:50:44, 5.51s/it][2025-06-19 20:09:49,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:09:49,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.83 | bwd_microstep: 3323.81 | bwd_inner_microstep: 3323.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 20:09:49,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.83 | bwd: 3323.82 | bwd_inner: 3323.03 | bwd_allreduce: 0.75 | step: 6.62 42%|████▏ | 4225/10000 [6:40:10<8:49:57, 5.51s/it] {'loss': 0.006, 'grad_norm': 0.46492472290992737, 'learning_rate': 2.590247177961477e-05, 'epoch': 4.22} 42%|████▏ | 4225/10000 [6:40:10<8:49:57, 5.51s/it][2025-06-19 20:09:55,383] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:09:55,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.12 | bwd_microstep: 3326.35 | bwd_inner_microstep: 3325.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 20:09:55,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.12 | bwd: 3326.36 | bwd_inner: 3325.55 | bwd_allreduce: 0.77 | step: 6.99 42%|████▏ | 4226/10000 [6:40:16<8:48:54, 5.50s/it] {'loss': 0.0974, 'grad_norm': 4.798129558563232, 'learning_rate': 2.5896282473762308e-05, 'epoch': 4.23} 42%|████▏ | 4226/10000 [6:40:16<8:48:54, 5.50s/it][2025-06-19 20:10:00,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.85 [2025-06-19 20:10:00,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.75 | bwd_microstep: 3375.64 | bwd_inner_microstep: 3374.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.02 [2025-06-19 20:10:00,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.75 | bwd: 3375.65 | bwd_inner: 3374.86 | bwd_allreduce: 0.75 | step: 7.03 42%|████▏ | 4227/10000 [6:40:21<8:50:40, 5.52s/it] {'loss': 0.0011, 'grad_norm': 0.11367803066968918, 'learning_rate': 2.5890092549417177e-05, 'epoch': 4.23} 42%|████▏ | 4227/10000 [6:40:21<8:50:40, 5.52s/it][2025-06-19 20:10:06,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:10:06,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.98 | bwd_microstep: 3376.49 | bwd_inner_microstep: 3375.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 20:10:06,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.98 | bwd: 3376.50 | bwd_inner: 3375.71 | bwd_allreduce: 0.75 | step: 6.59 42%|████▏ | 
4228/10000 [6:40:27<8:51:36, 5.53s/it] {'loss': 0.0391, 'grad_norm': 2.6698639392852783, 'learning_rate': 2.5883902007228674e-05, 'epoch': 4.23} 42%|████▏ | 4228/10000 [6:40:27<8:51:36, 5.53s/it][2025-06-19 20:10:11,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:10:11,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.95 | bwd_microstep: 3332.16 | bwd_inner_microstep: 3331.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 20:10:11,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.95 | bwd: 3332.17 | bwd_inner: 3331.38 | bwd_allreduce: 0.75 | step: 6.68 42%|████▏ | 4229/10000 [6:40:32<8:50:04, 5.51s/it] {'loss': 0.0007, 'grad_norm': 0.0576309897005558, 'learning_rate': 2.5877710847846162e-05, 'epoch': 4.23} 42%|████▏ | 4229/10000 [6:40:32<8:50:04, 5.51s/it][2025-06-19 20:10:17,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:10:17,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.10 | bwd_microstep: 3378.89 | bwd_inner_microstep: 3377.77 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.28 [2025-06-19 20:10:17,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.10 | bwd: 3378.91 | bwd_inner: 3377.77 | bwd_allreduce: 1.08 | step: 7.28 42%|████▏ | 4230/10000 [6:40:38<8:51:12, 5.52s/it] {'loss': 0.0198, 'grad_norm': 2.125490665435791, 'learning_rate': 2.5871519071919057e-05, 'epoch': 4.23} 42%|████▏ | 4230/10000 [6:40:38<8:51:12, 5.52s/it][2025-06-19 20:10:23,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:10:23,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.59 | bwd_microstep: 3378.70 | bwd_inner_microstep: 3377.91 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 7.09 [2025-06-19 20:10:23,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.59 | bwd: 3378.72 | bwd_inner: 3377.91 | bwd_allreduce: 0.77 | step: 7.10 42%|████▏ | 4231/10000 [6:40:43<8:52:10, 5.53s/it] {'loss': 0.0642, 'grad_norm': 2.103337287902832, 'learning_rate': 2.5865326680096857e-05, 'epoch': 4.23} 42%|████▏ | 4231/10000 [6:40:43<8:52:10, 5.53s/it][2025-06-19 20:10:28,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:10:28,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.04 | bwd_microstep: 3333.31 | bwd_inner_microstep: 3332.44 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.88 [2025-06-19 20:10:28,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.04 | bwd: 3333.33 | bwd_inner: 3332.44 | bwd_allreduce: 0.85 | step: 6.88 42%|████▏ | 4232/10000 [6:40:49<8:50:25, 5.52s/it] {'loss': 0.009, 'grad_norm': 1.3356046676635742, 'learning_rate': 2.585913367302911e-05, 'epoch': 4.23} 42%|████▏ | 4232/10000 [6:40:49<8:50:25, 5.52s/it][2025-06-19 20:10:34,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:10:34,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.45 | bwd_microstep: 3342.68 | bwd_inner_microstep: 3341.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 20:10:34,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.45 | bwd: 3342.69 | bwd_inner: 3341.71 | bwd_allreduce: 0.94 | step: 6.53 42%|████▏ | 4233/10000 [6:40:54<8:49:50, 5.51s/it] {'loss': 0.006, 'grad_norm': 0.7550819516181946, 'learning_rate': 2.585294005136543e-05, 'epoch': 4.23} 42%|████▏ | 4233/10000 [6:40:54<8:49:50, 5.51s/it][2025-06-19 20:10:39,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:10:39,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.12 | bwd_microstep: 3333.61 | bwd_inner_microstep: 3332.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 20:10:39,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.12 | bwd: 3333.62 | bwd_inner: 3332.82 | bwd_allreduce: 0.76 | step: 6.58 42%|████▏ | 4234/10000 [6:41:00<8:49:14, 5.51s/it] {'loss': 0.0259, 'grad_norm': 2.1766836643218994, 'learning_rate': 2.5846745815755504e-05, 'epoch': 4.23} 42%|████▏ | 4234/10000 [6:41:00<8:49:14, 5.51s/it][2025-06-19 20:10:45,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:10:45,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.95 | bwd_microstep: 3378.61 | bwd_inner_microstep: 3377.70 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.97 [2025-06-19 20:10:45,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.95 | bwd: 3378.62 | bwd_inner: 3377.70 | bwd_allreduce: 0.88 | step: 6.97 42%|████▏ | 4235/10000 [6:41:05<8:50:32, 5.52s/it] {'loss': 0.0165, 'grad_norm': 1.30997633934021, 'learning_rate': 2.5840550966849076e-05, 'epoch': 4.24} 42%|████▏ | 4235/10000 [6:41:05<8:50:32, 5.52s/it][2025-06-19 20:10:50,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 20:10:50,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.13 | bwd_microstep: 3328.19 | bwd_inner_microstep: 3327.11 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.80 [2025-06-19 20:10:50,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.13 | bwd: 3328.21 | bwd_inner: 3327.11 | bwd_allreduce: 1.05 | step: 7.81 42%|████▏ | 4236/10000 [6:41:11<8:49:34, 5.51s/it] {'loss': 
0.002, 'grad_norm': 0.19223663210868835, 'learning_rate': 2.583435550529596e-05, 'epoch': 4.24} 42%|████▏ | 4236/10000 [6:41:11<8:49:34, 5.51s/it][2025-06-19 20:10:56,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:10:56,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.51 | bwd_microstep: 3400.74 | bwd_inner_microstep: 3399.82 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.13 [2025-06-19 20:10:56,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.51 | bwd: 3400.76 | bwd_inner: 3399.82 | bwd_allreduce: 0.90 | step: 7.14 42%|████▏ | 4237/10000 [6:41:16<8:51:31, 5.53s/it] {'loss': 0.1548, 'grad_norm': 7.164421081542969, 'learning_rate': 2.582815943174603e-05, 'epoch': 4.24} 42%|████▏ | 4237/10000 [6:41:16<8:51:31, 5.53s/it][2025-06-19 20:11:01,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:11:01,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2178.81 | bwd_microstep: 3324.27 | bwd_inner_microstep: 3323.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 20:11:01,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2178.81 | bwd: 3324.29 | bwd_inner: 3323.47 | bwd_allreduce: 0.77 | step: 6.78 42%|████▏ | 4238/10000 [6:41:22<8:51:48, 5.54s/it] {'loss': 0.0015, 'grad_norm': 0.1429872363805771, 'learning_rate': 2.5821962746849228e-05, 'epoch': 4.24} 42%|████▏ | 4238/10000 [6:41:22<8:51:48, 5.54s/it][2025-06-19 20:11:07,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 20:11:07,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.99 | bwd_microstep: 3374.53 | bwd_inner_microstep: 3373.63 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.83 
[2025-06-19 20:11:07,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.99 | bwd: 3374.55 | bwd_inner: 3373.63 | bwd_allreduce: 0.87 | step: 6.83 42%|████▏ | 4239/10000 [6:41:28<8:52:02, 5.54s/it] {'loss': 0.0021, 'grad_norm': 0.17418774962425232, 'learning_rate': 2.581576545125555e-05, 'epoch': 4.24} 42%|████▏ | 4239/10000 [6:41:28<8:52:02, 5.54s/it][2025-06-19 20:11:12,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:11:12,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.74 | bwd_microstep: 3374.90 | bwd_inner_microstep: 3374.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 20:11:12,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.74 | bwd: 3374.91 | bwd_inner: 3374.11 | bwd_allreduce: 0.76 | step: 6.69 42%|████▏ | 4240/10000 [6:41:33<8:52:07, 5.54s/it] {'loss': 0.0031, 'grad_norm': 0.16401056945323944, 'learning_rate': 2.5809567545615068e-05, 'epoch': 4.24} 42%|████▏ | 4240/10000 [6:41:33<8:52:07, 5.54s/it][2025-06-19 20:11:18,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:11:18,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.44 | bwd_microstep: 3318.42 | bwd_inner_microstep: 3317.60 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.39 [2025-06-19 20:11:18,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.44 | bwd: 3318.44 | bwd_inner: 3317.60 | bwd_allreduce: 0.79 | step: 7.39 42%|████▏ | 4241/10000 [6:41:39<8:49:53, 5.52s/it] {'loss': 0.0775, 'grad_norm': 1.8728578090667725, 'learning_rate': 2.5803369030577913e-05, 'epoch': 4.24} 42%|████▏ | 4241/10000 [6:41:39<8:49:53, 5.52s/it][2025-06-19 20:11:23,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | 
optimizer_step: 2.73 [2025-06-19 20:11:23,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.66 | bwd_microstep: 3326.64 | bwd_inner_microstep: 3325.79 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.98 [2025-06-19 20:11:23,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.66 | bwd: 3326.66 | bwd_inner: 3325.79 | bwd_allreduce: 0.80 | step: 6.99 42%|████▏ | 4242/10000 [6:41:44<8:48:46, 5.51s/it] {'loss': 0.0793, 'grad_norm': 1.8402150869369507, 'learning_rate': 2.579716990679429e-05, 'epoch': 4.24} 42%|████▏ | 4242/10000 [6:41:44<8:48:46, 5.51s/it][2025-06-19 20:11:29,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:11:29,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.23 | bwd_microstep: 3407.47 | bwd_inner_microstep: 3406.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 20:11:29,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.23 | bwd: 3407.49 | bwd_inner: 3406.67 | bwd_allreduce: 0.78 | step: 7.20 42%|████▏ | 4243/10000 [6:41:50<8:50:52, 5.53s/it] {'loss': 0.0607, 'grad_norm': 3.727757453918457, 'learning_rate': 2.579097017491444e-05, 'epoch': 4.24} 42%|████▏ | 4243/10000 [6:41:50<8:50:52, 5.53s/it][2025-06-19 20:11:34,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:11:34,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.41 | bwd_microstep: 3326.34 | bwd_inner_microstep: 3325.40 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.51 [2025-06-19 20:11:34,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.41 | bwd: 3326.35 | bwd_inner: 3325.40 | bwd_allreduce: 0.91 | step: 7.52 42%|████▏ | 4244/10000 [6:41:55<8:49:19, 5.52s/it] {'loss': 0.1462, 'grad_norm': 5.404234409332275, 'learning_rate': 
2.5784769835588695e-05, 'epoch': 4.24} 42%|████▏ | 4244/10000 [6:41:55<8:49:19, 5.52s/it][2025-06-19 20:11:40,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:11:40,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.60 | bwd_microstep: 3381.86 | bwd_inner_microstep: 3381.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 20:11:40,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.60 | bwd: 3381.87 | bwd_inner: 3381.06 | bwd_allreduce: 0.76 | step: 6.65 42%|████▏ | 4245/10000 [6:42:01<8:50:29, 5.53s/it] {'loss': 0.0085, 'grad_norm': 0.9680315256118774, 'learning_rate': 2.5778568889467445e-05, 'epoch': 4.25} 42%|████▏ | 4245/10000 [6:42:01<8:50:29, 5.53s/it][2025-06-19 20:11:45,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:11:45,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.10 | bwd_microstep: 3383.13 | bwd_inner_microstep: 3382.21 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.14 [2025-06-19 20:11:45,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.10 | bwd: 3383.15 | bwd_inner: 3382.21 | bwd_allreduce: 0.89 | step: 7.14 42%|████▏ | 4246/10000 [6:42:06<8:51:03, 5.54s/it] {'loss': 0.1279, 'grad_norm': 3.974083185195923, 'learning_rate': 2.5772367337201133e-05, 'epoch': 4.25} 42%|████▏ | 4246/10000 [6:42:06<8:51:03, 5.54s/it][2025-06-19 20:11:51,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-19 20:11:51,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.45 | bwd_microstep: 3329.03 | bwd_inner_microstep: 3327.98 | bwd_allreduce_microstep: 0.98 | step_microstep: 8.04 [2025-06-19 20:11:51,450] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.45 | bwd: 3329.05 | bwd_inner: 3327.98 | bwd_allreduce: 1.01 | step: 8.05 42%|████▏ | 4247/10000 [6:42:12<8:49:28, 5.52s/it] {'loss': 0.0069, 'grad_norm': 0.6831266283988953, 'learning_rate': 2.576616517944029e-05, 'epoch': 4.25} 42%|████▏ | 4247/10000 [6:42:12<8:49:28, 5.52s/it][2025-06-19 20:11:56,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:11:56,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.01 | bwd_microstep: 3321.31 | bwd_inner_microstep: 3320.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 20:11:56,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.01 | bwd: 3321.33 | bwd_inner: 3320.50 | bwd_allreduce: 0.78 | step: 7.17 42%|████▏ | 4248/10000 [6:42:17<8:48:19, 5.51s/it] {'loss': 0.0055, 'grad_norm': 0.4036288857460022, 'learning_rate': 2.575996241683547e-05, 'epoch': 4.25} 42%|████▏ | 4248/10000 [6:42:17<8:48:19, 5.51s/it][2025-06-19 20:12:02,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:12:02,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.83 | bwd_microstep: 3324.48 | bwd_inner_microstep: 3323.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-19 20:12:02,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.83 | bwd: 3324.49 | bwd_inner: 3323.69 | bwd_allreduce: 0.76 | step: 6.74 42%|████▏ | 4249/10000 [6:42:23<8:47:23, 5.50s/it] {'loss': 0.0036, 'grad_norm': 0.19659213721752167, 'learning_rate': 2.5753759050037328e-05, 'epoch': 4.25} 42%|████▏ | 4249/10000 [6:42:23<8:47:23, 5.50s/it][2025-06-19 20:12:07,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
20:12:07,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.86 | bwd_microstep: 3316.37 | bwd_inner_microstep: 3315.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 20:12:07,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.86 | bwd: 3316.39 | bwd_inner: 3315.56 | bwd_allreduce: 0.78 | step: 6.94 42%|████▎ | 4250/10000 [6:42:28<8:46:07, 5.49s/it] {'loss': 0.0607, 'grad_norm': 2.8040804862976074, 'learning_rate': 2.574755507969657e-05, 'epoch': 4.25} 42%|████▎ | 4250/10000 [6:42:28<8:46:07, 5.49s/it][2025-06-19 20:12:13,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:12:13,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.95 | bwd_microstep: 3319.05 | bwd_inner_microstep: 3318.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 20:12:13,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.95 | bwd: 3319.07 | bwd_inner: 3318.26 | bwd_allreduce: 0.77 | step: 6.98 43%|████▎ | 4251/10000 [6:42:34<8:45:14, 5.48s/it] {'loss': 0.003, 'grad_norm': 0.1942429393529892, 'learning_rate': 2.5741350506463954e-05, 'epoch': 4.25} 43%|████▎ | 4251/10000 [6:42:34<8:45:14, 5.48s/it][2025-06-19 20:12:18,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 20:12:18,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.75 | bwd_microstep: 3317.65 | bwd_inner_microstep: 3316.54 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.80 [2025-06-19 20:12:18,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.75 | bwd: 3317.67 | bwd_inner: 3316.54 | bwd_allreduce: 1.08 | step: 7.81 43%|████▎ | 4252/10000 [6:42:39<8:44:49, 5.48s/it] {'loss': 0.1014, 'grad_norm': 4.502519607543945, 'learning_rate': 2.573514533099031e-05, 'epoch': 
4.25} 43%|████▎ | 4252/10000 [6:42:39<8:44:49, 5.48s/it][2025-06-19 20:12:24,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:12:24,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.65 | bwd_microstep: 3330.36 | bwd_inner_microstep: 3329.51 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.35 [2025-06-19 20:12:24,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.65 | bwd: 3330.38 | bwd_inner: 3329.51 | bwd_allreduce: 0.81 | step: 7.35 43%|████▎ | 4253/10000 [6:42:45<8:44:58, 5.48s/it] {'loss': 0.0037, 'grad_norm': 0.41981270909309387, 'learning_rate': 2.572893955392655e-05, 'epoch': 4.25} 43%|████▎ | 4253/10000 [6:42:45<8:44:58, 5.48s/it][2025-06-19 20:12:29,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.72 [2025-06-19 20:12:29,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.47 | bwd_microstep: 3320.65 | bwd_inner_microstep: 3319.83 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.25 [2025-06-19 20:12:29,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.47 | bwd: 3320.67 | bwd_inner: 3319.83 | bwd_allreduce: 0.79 | step: 7.25 43%|████▎ | 4254/10000 [6:42:50<8:44:32, 5.48s/it] {'loss': 0.0058, 'grad_norm': 0.46537864208221436, 'learning_rate': 2.572273317592362e-05, 'epoch': 4.25} 43%|████▎ | 4254/10000 [6:42:50<8:44:32, 5.48s/it][2025-06-19 20:12:35,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:12:35,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.59 | bwd_microstep: 3322.16 | bwd_inner_microstep: 3321.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 20:12:35,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2114.59 | bwd: 3322.17 | bwd_inner: 3321.37 | bwd_allreduce: 0.76 | step: 6.72 43%|████▎ | 4255/10000 [6:42:56<8:44:27, 5.48s/it] {'loss': 0.025, 'grad_norm': 1.2611652612686157, 'learning_rate': 2.571652619763253e-05, 'epoch': 4.25} 43%|████▎ | 4255/10000 [6:42:56<8:44:27, 5.48s/it][2025-06-19 20:12:40,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:12:40,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.33 | bwd_microstep: 3312.48 | bwd_inner_microstep: 3311.58 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.83 [2025-06-19 20:12:40,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.33 | bwd: 3312.50 | bwd_inner: 3311.58 | bwd_allreduce: 0.87 | step: 6.84 43%|████▎ | 4256/10000 [6:43:01<8:43:45, 5.47s/it] {'loss': 0.0479, 'grad_norm': 2.401190996170044, 'learning_rate': 2.571031861970438e-05, 'epoch': 4.26} 43%|████▎ | 4256/10000 [6:43:01<8:43:45, 5.47s/it][2025-06-19 20:12:46,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:12:46,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.13 | bwd_microstep: 3372.17 | bwd_inner_microstep: 3371.24 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.17 [2025-06-19 20:12:46,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.13 | bwd: 3372.19 | bwd_inner: 3371.24 | bwd_allreduce: 0.90 | step: 7.17 43%|████▎ | 4257/10000 [6:43:07<8:45:40, 5.49s/it] {'loss': 0.0034, 'grad_norm': 0.8076600432395935, 'learning_rate': 2.5704110442790313e-05, 'epoch': 4.26} 43%|████▎ | 4257/10000 [6:43:07<8:45:40, 5.49s/it][2025-06-19 20:12:51,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:12:51,723] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2109.73 | bwd_microstep: 3323.76 | bwd_inner_microstep: 3322.88 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.05 [2025-06-19 20:12:51,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.73 | bwd: 3323.78 | bwd_inner: 3322.88 | bwd_allreduce: 0.86 | step: 7.05 43%|████▎ | 4258/10000 [6:43:12<8:45:08, 5.49s/it] {'loss': 0.001, 'grad_norm': 0.16928748786449432, 'learning_rate': 2.569790166754153e-05, 'epoch': 4.26} 43%|████▎ | 4258/10000 [6:43:12<8:45:08, 5.49s/it][2025-06-19 20:12:57,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:12:57,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.82 | bwd_microstep: 3323.42 | bwd_inner_microstep: 3322.29 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.77 [2025-06-19 20:12:57,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.82 | bwd: 3323.44 | bwd_inner: 3322.29 | bwd_allreduce: 1.09 | step: 7.78 43%|████▎ | 4259/10000 [6:43:17<8:44:42, 5.48s/it] {'loss': 0.0186, 'grad_norm': 2.831345796585083, 'learning_rate': 2.56916922946093e-05, 'epoch': 4.26} 43%|████▎ | 4259/10000 [6:43:17<8:44:42, 5.48s/it][2025-06-19 20:13:02,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:13:02,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.99 | bwd_microstep: 3317.03 | bwd_inner_microstep: 3316.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-19 20:13:02,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.99 | bwd: 3317.05 | bwd_inner: 3316.21 | bwd_allreduce: 0.78 | step: 6.79 43%|████▎ | 4260/10000 [6:43:23<8:43:59, 5.48s/it] {'loss': 0.002, 'grad_norm': 0.19495025277137756, 'learning_rate': 2.5685482324644975e-05, 'epoch': 4.26} 43%|████▎ | 4260/10000 [6:43:23<8:43:59, 
5.48s/it][2025-06-19 20:13:08,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 20:13:08,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.58 | bwd_microstep: 3320.05 | bwd_inner_microstep: 3319.03 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.34 [2025-06-19 20:13:08,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.58 | bwd: 3320.07 | bwd_inner: 3319.03 | bwd_allreduce: 0.99 | step: 7.35 43%|████▎ | 4261/10000 [6:43:28<8:43:38, 5.47s/it] {'loss': 0.091, 'grad_norm': 3.9085798263549805, 'learning_rate': 2.5679271758299935e-05, 'epoch': 4.26} 43%|████▎ | 4261/10000 [6:43:28<8:43:38, 5.47s/it][2025-06-19 20:13:13,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:13:13,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.93 | bwd_microstep: 3367.69 | bwd_inner_microstep: 3366.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 20:13:13,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.93 | bwd: 3367.71 | bwd_inner: 3366.90 | bwd_allreduce: 0.77 | step: 6.70 43%|████▎ | 4262/10000 [6:43:34<8:45:15, 5.49s/it] {'loss': 0.0157, 'grad_norm': 1.0396981239318848, 'learning_rate': 2.567306059622565e-05, 'epoch': 4.26} 43%|████▎ | 4262/10000 [6:43:34<8:45:15, 5.49s/it][2025-06-19 20:13:19,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:13:19,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.18 | bwd_microstep: 3366.42 | bwd_inner_microstep: 3365.58 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.28 [2025-06-19 20:13:19,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.18 | bwd: 3366.44 | bwd_inner: 3365.58 | 
bwd_allreduce: 0.81 | step: 7.29 43%|████▎ | 4263/10000 [6:43:40<8:46:35, 5.51s/it] {'loss': 0.0256, 'grad_norm': 1.6881041526794434, 'learning_rate': 2.566684883907363e-05, 'epoch': 4.26} 43%|████▎ | 4263/10000 [6:43:40<8:46:35, 5.51s/it][2025-06-19 20:13:24,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.88 [2025-06-19 20:13:24,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.00 | bwd_microstep: 3312.91 | bwd_inner_microstep: 3312.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.98 [2025-06-19 20:13:24,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.00 | bwd: 3312.93 | bwd_inner: 3312.13 | bwd_allreduce: 0.76 | step: 6.99 43%|████▎ | 4264/10000 [6:43:45<8:45:13, 5.49s/it] {'loss': 0.0018, 'grad_norm': 0.1924436092376709, 'learning_rate': 2.5660636487495472e-05, 'epoch': 4.26} 43%|████▎ | 4264/10000 [6:43:45<8:45:13, 5.49s/it][2025-06-19 20:13:30,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:13:30,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.86 | bwd_microstep: 3314.03 | bwd_inner_microstep: 3313.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 20:13:30,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.86 | bwd: 3314.04 | bwd_inner: 3313.25 | bwd_allreduce: 0.75 | step: 6.63 43%|████▎ | 4265/10000 [6:43:50<8:44:10, 5.48s/it] {'loss': 0.0455, 'grad_norm': 1.80768620967865, 'learning_rate': 2.565442354214282e-05, 'epoch': 4.26} 43%|████▎ | 4265/10000 [6:43:50<8:44:10, 5.48s/it][2025-06-19 20:13:35,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:13:35,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.55 | bwd_microstep: 
3310.43 | bwd_inner_microstep: 3309.33 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.61 [2025-06-19 20:13:35,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.55 | bwd: 3310.45 | bwd_inner: 3309.33 | bwd_allreduce: 1.06 | step: 7.62 43%|████▎ | 4266/10000 [6:43:56<8:43:24, 5.48s/it] {'loss': 0.0565, 'grad_norm': 3.9732308387756348, 'learning_rate': 2.5648210003667382e-05, 'epoch': 4.27} 43%|████▎ | 4266/10000 [6:43:56<8:43:24, 5.48s/it][2025-06-19 20:13:41,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:13:41,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.51 | bwd_microstep: 3321.29 | bwd_inner_microstep: 3320.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 20:13:41,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.52 | bwd: 3321.30 | bwd_inner: 3320.51 | bwd_allreduce: 0.76 | step: 6.64 43%|████▎ | 4267/10000 [6:44:01<8:43:10, 5.48s/it] {'loss': 0.0085, 'grad_norm': 1.1402686834335327, 'learning_rate': 2.5641995872720928e-05, 'epoch': 4.27} 43%|████▎ | 4267/10000 [6:44:01<8:43:10, 5.48s/it][2025-06-19 20:13:46,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.85 [2025-06-19 20:13:46,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.49 | bwd_microstep: 3363.96 | bwd_inner_microstep: 3362.90 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.82 [2025-06-19 20:13:46,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.49 | bwd: 3363.98 | bwd_inner: 3362.90 | bwd_allreduce: 1.02 | step: 7.83 43%|████▎ | 4268/10000 [6:44:07<8:44:51, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.021854231134057045, 'learning_rate': 2.5635781149955295e-05, 'epoch': 4.27} 43%|████▎ | 4268/10000 [6:44:07<8:44:51, 5.49s/it][2025-06-19 20:13:52,052] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:13:52,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.40 | bwd_microstep: 3311.39 | bwd_inner_microstep: 3310.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 20:13:52,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.40 | bwd: 3311.41 | bwd_inner: 3310.61 | bwd_allreduce: 0.75 | step: 6.56 43%|████▎ | 4269/10000 [6:44:12<8:43:37, 5.48s/it] {'loss': 0.0014, 'grad_norm': 0.20652426779270172, 'learning_rate': 2.5629565836022366e-05, 'epoch': 4.27} 43%|████▎ | 4269/10000 [6:44:12<8:43:37, 5.48s/it][2025-06-19 20:13:57,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:13:57,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.00 | bwd_microstep: 3394.61 | bwd_inner_microstep: 3393.64 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.46 [2025-06-19 20:13:57,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.00 | bwd: 3394.62 | bwd_inner: 3393.64 | bwd_allreduce: 0.93 | step: 7.46 43%|████▎ | 4270/10000 [6:44:18<8:46:10, 5.51s/it] {'loss': 0.0016, 'grad_norm': 0.07498835027217865, 'learning_rate': 2.5623349931574116e-05, 'epoch': 4.27} 43%|████▎ | 4270/10000 [6:44:18<8:46:10, 5.51s/it][2025-06-19 20:14:03,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:14:03,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.10 | bwd_microstep: 3316.57 | bwd_inner_microstep: 3315.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 20:14:03,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.10 | bwd: 3316.59 | bwd_inner: 3315.79 | bwd_allreduce: 0.76 | step: 6.69 43%|████▎ | 
4271/10000 [6:44:23<8:44:56, 5.50s/it] {'loss': 0.0084, 'grad_norm': 0.903359591960907, 'learning_rate': 2.5617133437262557e-05, 'epoch': 4.27} 43%|████▎ | 4271/10000 [6:44:23<8:44:56, 5.50s/it][2025-06-19 20:14:08,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:14:08,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.85 | bwd_microstep: 3360.15 | bwd_inner_microstep: 3359.36 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 20:14:08,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.85 | bwd: 3360.16 | bwd_inner: 3359.36 | bwd_allreduce: 0.76 | step: 6.60 43%|████▎ | 4272/10000 [6:44:29<8:45:40, 5.51s/it] {'loss': 0.0128, 'grad_norm': 0.6212696433067322, 'learning_rate': 2.561091635373977e-05, 'epoch': 4.27} 43%|████▎ | 4272/10000 [6:44:29<8:45:40, 5.51s/it][2025-06-19 20:14:14,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:14:14,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.29 | bwd_microstep: 3309.77 | bwd_inner_microstep: 3308.84 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.95 [2025-06-19 20:14:14,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.29 | bwd: 3309.78 | bwd_inner: 3308.84 | bwd_allreduce: 0.90 | step: 6.95 43%|████▎ | 4273/10000 [6:44:34<8:43:58, 5.49s/it] {'loss': 0.0275, 'grad_norm': 1.1400331258773804, 'learning_rate': 2.5604698681657896e-05, 'epoch': 4.27} 43%|████▎ | 4273/10000 [6:44:34<8:43:58, 5.49s/it][2025-06-19 20:14:19,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:14:19,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.24 | bwd_microstep: 3311.63 | bwd_inner_microstep: 3310.77 | 
bwd_allreduce_microstep: 0.80 | step_microstep: 7.23 [2025-06-19 20:14:19,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.24 | bwd: 3311.65 | bwd_inner: 3310.77 | bwd_allreduce: 0.83 | step: 7.23 43%|████▎ | 4274/10000 [6:44:40<8:43:00, 5.48s/it] {'loss': 0.0178, 'grad_norm': 1.4182212352752686, 'learning_rate': 2.5598480421669143e-05, 'epoch': 4.27} 43%|████▎ | 4274/10000 [6:44:40<8:43:00, 5.48s/it][2025-06-19 20:14:24,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:14:24,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.79 | bwd_microstep: 3312.93 | bwd_inner_microstep: 3312.11 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.00 [2025-06-19 20:14:24,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.79 | bwd: 3312.94 | bwd_inner: 3312.11 | bwd_allreduce: 0.79 | step: 7.00 43%|████▎ | 4275/10000 [6:44:45<8:42:05, 5.47s/it] {'loss': 0.0135, 'grad_norm': 0.6713475584983826, 'learning_rate': 2.559226157442578e-05, 'epoch': 4.28} 43%|████▎ | 4275/10000 [6:44:45<8:42:05, 5.47s/it][2025-06-19 20:14:30,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:14:30,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.97 | bwd_microstep: 3308.05 | bwd_inner_microstep: 3306.88 | bwd_allreduce_microstep: 1.10 | step_microstep: 7.93 [2025-06-19 20:14:30,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.97 | bwd: 3308.07 | bwd_inner: 3306.88 | bwd_allreduce: 1.13 | step: 7.93 43%|████▎ | 4276/10000 [6:44:51<8:41:19, 5.46s/it] {'loss': 0.0536, 'grad_norm': 2.3255271911621094, 'learning_rate': 2.558604214058013e-05, 'epoch': 4.28} 43%|████▎ | 4276/10000 [6:44:51<8:41:19, 5.46s/it][2025-06-19 20:14:35,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:14:35,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.96 | bwd_microstep: 3364.01 | bwd_inner_microstep: 3363.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 20:14:35,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.96 | bwd: 3364.03 | bwd_inner: 3363.23 | bwd_allreduce: 0.76 | step: 6.61 43%|████▎ | 4277/10000 [6:44:56<8:43:14, 5.49s/it] {'loss': 0.0082, 'grad_norm': 0.532551109790802, 'learning_rate': 2.557982212078459e-05, 'epoch': 4.28} 43%|████▎ | 4277/10000 [6:44:56<8:43:14, 5.49s/it][2025-06-19 20:14:41,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 20:14:41,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.69 | bwd_microstep: 3394.01 | bwd_inner_microstep: 3393.22 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 20:14:41,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.69 | bwd: 3394.03 | bwd_inner: 3393.23 | bwd_allreduce: 0.76 | step: 6.57 43%|████▎ | 4278/10000 [6:45:02<8:45:22, 5.51s/it] {'loss': 0.0429, 'grad_norm': 1.897371768951416, 'learning_rate': 2.5573601515691603e-05, 'epoch': 4.28} 43%|████▎ | 4278/10000 [6:45:02<8:45:22, 5.51s/it][2025-06-19 20:14:46,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:14:46,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.00 | bwd_microstep: 3312.40 | bwd_inner_microstep: 3311.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 20:14:46,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.00 | bwd: 3312.41 | bwd_inner: 3311.61 | bwd_allreduce: 0.76 | step: 6.65 43%|████▎ | 4279/10000 [6:45:07<8:43:32, 5.49s/it] {'loss': 
0.0063, 'grad_norm': 0.3336658179759979, 'learning_rate': 2.5567380325953678e-05, 'epoch': 4.28} 43%|████▎ | 4279/10000 [6:45:07<8:43:32, 5.49s/it][2025-06-19 20:14:52,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:14:52,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.07 | bwd_microstep: 3311.74 | bwd_inner_microstep: 3310.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 20:14:52,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.07 | bwd: 3311.76 | bwd_inner: 3310.96 | bwd_allreduce: 0.75 | step: 6.56 43%|████▎ | 4280/10000 [6:45:13<8:42:35, 5.48s/it] {'loss': 0.0903, 'grad_norm': 3.684678792953491, 'learning_rate': 2.5561158552223405e-05, 'epoch': 4.28} 43%|████▎ | 4280/10000 [6:45:13<8:42:35, 5.48s/it][2025-06-19 20:14:57,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:14:57,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.75 | bwd_microstep: 3317.03 | bwd_inner_microstep: 3316.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 20:14:57,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.75 | bwd: 3317.04 | bwd_inner: 3316.24 | bwd_allreduce: 0.76 | step: 6.68 43%|████▎ | 4281/10000 [6:45:18<8:41:42, 5.47s/it] {'loss': 0.0921, 'grad_norm': 4.32958459854126, 'learning_rate': 2.5554936195153403e-05, 'epoch': 4.28} 43%|████▎ | 4281/10000 [6:45:18<8:41:42, 5.47s/it][2025-06-19 20:15:03,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:15:03,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.05 | bwd_microstep: 3362.81 | bwd_inner_microstep: 3362.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 
[2025-06-19 20:15:03,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.05 | bwd: 3362.82 | bwd_inner: 3362.00 | bwd_allreduce: 0.78 | step: 7.16 43%|████▎ | 4282/10000 [6:45:24<8:43:06, 5.49s/it] {'loss': 0.0148, 'grad_norm': 1.6218746900558472, 'learning_rate': 2.5548713255396385e-05, 'epoch': 4.28} 43%|████▎ | 4282/10000 [6:45:24<8:43:06, 5.49s/it][2025-06-19 20:15:08,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:15:08,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.58 | bwd_microstep: 3363.02 | bwd_inner_microstep: 3362.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 20:15:08,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.58 | bwd: 3363.04 | bwd_inner: 3362.23 | bwd_allreduce: 0.76 | step: 6.60 43%|████▎ | 4283/10000 [6:45:29<8:44:02, 5.50s/it] {'loss': 0.0289, 'grad_norm': 1.7312806844711304, 'learning_rate': 2.5542489733605093e-05, 'epoch': 4.28} 43%|████▎ | 4283/10000 [6:45:29<8:44:02, 5.50s/it][2025-06-19 20:15:14,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:15:14,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.08 | bwd_microstep: 3307.96 | bwd_inner_microstep: 3307.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 20:15:14,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.08 | bwd: 3307.98 | bwd_inner: 3307.18 | bwd_allreduce: 0.76 | step: 6.65 43%|████▎ | 4284/10000 [6:45:35<8:42:35, 5.49s/it] {'loss': 0.0662, 'grad_norm': 2.0651235580444336, 'learning_rate': 2.5536265630432347e-05, 'epoch': 4.28} 43%|████▎ | 4284/10000 [6:45:35<8:42:35, 5.49s/it][2025-06-19 20:15:19,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.73 [2025-06-19 20:15:19,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.45 | bwd_microstep: 3361.87 | bwd_inner_microstep: 3361.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-19 20:15:19,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.45 | bwd: 3361.88 | bwd_inner: 3361.08 | bwd_allreduce: 0.76 | step: 6.74 43%|████▎ | 4285/10000 [6:45:40<8:43:30, 5.50s/it] {'loss': 0.0083, 'grad_norm': 0.9186444282531738, 'learning_rate': 2.553004094653104e-05, 'epoch': 4.29} 43%|████▎ | 4285/10000 [6:45:40<8:43:30, 5.50s/it][2025-06-19 20:15:25,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:15:25,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.49 | bwd_microstep: 3312.80 | bwd_inner_microstep: 3311.80 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.19 [2025-06-19 20:15:25,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.49 | bwd: 3312.82 | bwd_inner: 3311.80 | bwd_allreduce: 0.97 | step: 7.20 43%|████▎ | 4286/10000 [6:45:46<8:42:00, 5.48s/it] {'loss': 0.0012, 'grad_norm': 0.13637565076351166, 'learning_rate': 2.5523815682554096e-05, 'epoch': 4.29} 43%|████▎ | 4286/10000 [6:45:46<8:42:00, 5.48s/it][2025-06-19 20:15:30,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:15:30,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.72 | bwd_microstep: 3316.96 | bwd_inner_microstep: 3316.13 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.45 [2025-06-19 20:15:30,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.72 | bwd: 3316.98 | bwd_inner: 3316.13 | bwd_allreduce: 0.81 | step: 7.45 43%|████▎ | 4287/10000 [6:45:51<8:41:15, 5.47s/it] {'loss': 0.0057, 'grad_norm': 0.3522907793521881, 
'learning_rate': 2.551758983915453e-05, 'epoch': 4.29} 43%|████▎ | 4287/10000 [6:45:51<8:41:15, 5.47s/it][2025-06-19 20:15:36,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:15:36,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.28 | bwd_microstep: 3318.40 | bwd_inner_microstep: 3317.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 20:15:36,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.28 | bwd: 3318.41 | bwd_inner: 3317.61 | bwd_allreduce: 0.76 | step: 6.64 43%|████▎ | 4288/10000 [6:45:57<8:40:50, 5.47s/it] {'loss': 0.0581, 'grad_norm': 7.844520092010498, 'learning_rate': 2.5511363416985392e-05, 'epoch': 4.29} 43%|████▎ | 4288/10000 [6:45:57<8:40:50, 5.47s/it][2025-06-19 20:15:41,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:15:41,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.82 | bwd_microstep: 3372.73 | bwd_inner_microstep: 3371.72 | bwd_allreduce_microstep: 0.96 | step_microstep: 6.78 [2025-06-19 20:15:41,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.82 | bwd: 3372.74 | bwd_inner: 3371.72 | bwd_allreduce: 0.98 | step: 6.79 43%|████▎ | 4289/10000 [6:46:02<8:42:41, 5.49s/it] {'loss': 0.0087, 'grad_norm': 0.7369465827941895, 'learning_rate': 2.550513641669981e-05, 'epoch': 4.29} 43%|████▎ | 4289/10000 [6:46:02<8:42:41, 5.49s/it][2025-06-19 20:15:47,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:15:47,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.80 | bwd_microstep: 3373.65 | bwd_inner_microstep: 3372.81 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.86 [2025-06-19 20:15:47,362] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.80 | bwd: 3373.66 | bwd_inner: 3372.81 | bwd_allreduce: 0.80 | step: 6.86 43%|████▎ | 4290/10000 [6:46:08<8:43:57, 5.51s/it] {'loss': 0.0027, 'grad_norm': 0.16740289330482483, 'learning_rate': 2.549890883895097e-05, 'epoch': 4.29} 43%|████▎ | 4290/10000 [6:46:08<8:43:57, 5.51s/it][2025-06-19 20:15:52,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:15:52,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.11 | bwd_microstep: 3362.19 | bwd_inner_microstep: 3361.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 20:15:52,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.11 | bwd: 3362.20 | bwd_inner: 3361.37 | bwd_allreduce: 0.78 | step: 7.05 43%|████▎ | 4291/10000 [6:46:13<8:44:31, 5.51s/it] {'loss': 0.0867, 'grad_norm': 2.959009885787964, 'learning_rate': 2.549268068439212e-05, 'epoch': 4.29} 43%|████▎ | 4291/10000 [6:46:13<8:44:31, 5.51s/it][2025-06-19 20:15:58,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:15:58,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.33 | bwd_microstep: 3316.60 | bwd_inner_microstep: 3315.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 20:15:58,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.33 | bwd: 3316.62 | bwd_inner: 3315.82 | bwd_allreduce: 0.76 | step: 6.72 43%|████▎ | 4292/10000 [6:46:19<8:42:52, 5.50s/it] {'loss': 0.0025, 'grad_norm': 0.22694271802902222, 'learning_rate': 2.5486451953676554e-05, 'epoch': 4.29} 43%|████▎ | 4292/10000 [6:46:19<8:42:52, 5.50s/it][2025-06-19 20:16:03,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 
20:16:03,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.62 | bwd_microstep: 3320.53 | bwd_inner_microstep: 3319.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 20:16:03,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.62 | bwd: 3320.54 | bwd_inner: 3319.73 | bwd_allreduce: 0.77 | step: 6.69 43%|████▎ | 4293/10000 [6:46:24<8:41:53, 5.49s/it] {'loss': 0.0307, 'grad_norm': 2.674778461456299, 'learning_rate': 2.5480222647457637e-05, 'epoch': 4.29} 43%|████▎ | 4293/10000 [6:46:24<8:41:53, 5.49s/it][2025-06-19 20:16:09,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:16:09,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.88 | bwd_microstep: 3323.90 | bwd_inner_microstep: 3322.98 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.12 [2025-06-19 20:16:09,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.88 | bwd: 3323.92 | bwd_inner: 3322.98 | bwd_allreduce: 0.89 | step: 7.13 43%|████▎ | 4294/10000 [6:46:30<8:41:12, 5.48s/it] {'loss': 0.0069, 'grad_norm': 0.390165239572525, 'learning_rate': 2.5473992766388806e-05, 'epoch': 4.29} 43%|████▎ | 4294/10000 [6:46:30<8:41:12, 5.48s/it][2025-06-19 20:16:14,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:16:14,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.20 | bwd_microstep: 3321.27 | bwd_inner_microstep: 3320.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 20:16:14,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.20 | bwd: 3321.28 | bwd_inner: 3320.49 | bwd_allreduce: 0.75 | step: 6.59 43%|████▎ | 4295/10000 [6:46:35<8:40:50, 5.48s/it] {'loss': 0.0204, 'grad_norm': 2.6818735599517822, 'learning_rate': 2.546776231112354e-05, 'epoch': 
4.29} 43%|████▎ | 4295/10000 [6:46:35<8:40:50, 5.48s/it][2025-06-19 20:16:20,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:16:20,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.79 | bwd_microstep: 3368.68 | bwd_inner_microstep: 3367.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 20:16:20,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.79 | bwd: 3368.69 | bwd_inner: 3367.90 | bwd_allreduce: 0.75 | step: 6.60 43%|████▎ | 4296/10000 [6:46:41<8:42:31, 5.50s/it] {'loss': 0.0304, 'grad_norm': 2.2247815132141113, 'learning_rate': 2.5461531282315378e-05, 'epoch': 4.3} 43%|████▎ | 4296/10000 [6:46:41<8:42:31, 5.50s/it][2025-06-19 20:16:25,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:16:25,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.85 | bwd_microstep: 3365.40 | bwd_inner_microstep: 3364.55 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.09 [2025-06-19 20:16:25,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.85 | bwd: 3365.43 | bwd_inner: 3364.55 | bwd_allreduce: 0.81 | step: 7.10 43%|████▎ | 4297/10000 [6:46:46<8:43:25, 5.51s/it] {'loss': 0.0087, 'grad_norm': 0.31214794516563416, 'learning_rate': 2.545529968061793e-05, 'epoch': 4.3} 43%|████▎ | 4297/10000 [6:46:46<8:43:25, 5.51s/it][2025-06-19 20:16:31,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:16:31,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.23 | bwd_microstep: 3312.89 | bwd_inner_microstep: 3312.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 20:16:31,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2110.23 | bwd: 3312.91 | bwd_inner: 3312.10 | bwd_allreduce: 0.76 | step: 6.88 43%|████▎ | 4298/10000 [6:46:52<8:42:04, 5.49s/it] {'loss': 0.0462, 'grad_norm': 1.9022401571273804, 'learning_rate': 2.5449067506684866e-05, 'epoch': 4.3} 43%|████▎ | 4298/10000 [6:46:52<8:42:04, 5.49s/it][2025-06-19 20:16:36,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:16:36,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.45 | bwd_microstep: 3309.82 | bwd_inner_microstep: 3308.89 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.15 [2025-06-19 20:16:36,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.45 | bwd: 3309.84 | bwd_inner: 3308.89 | bwd_allreduce: 0.91 | step: 7.15 43%|████▎ | 4299/10000 [6:46:57<8:41:01, 5.48s/it] {'loss': 0.0685, 'grad_norm': 4.369675159454346, 'learning_rate': 2.5442834761169915e-05, 'epoch': 4.3} 43%|████▎ | 4299/10000 [6:46:57<8:41:01, 5.48s/it][2025-06-19 20:16:42,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.97 [2025-06-19 20:16:42,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.99 | bwd_microstep: 3361.42 | bwd_inner_microstep: 3360.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 20:16:42,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.99 | bwd: 3361.43 | bwd_inner: 3360.63 | bwd_allreduce: 0.76 | step: 6.95 43%|████▎ | 4300/10000 [6:47:03<8:42:16, 5.50s/it] {'loss': 0.0618, 'grad_norm': 2.7478880882263184, 'learning_rate': 2.5436601444726862e-05, 'epoch': 4.3} 43%|████▎ | 4300/10000 [6:47:03<8:42:16, 5.50s/it][2025-06-19 20:16:47,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.82 [2025-06-19 20:16:47,742] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2106.49 | bwd_microstep: 3322.79 | bwd_inner_microstep: 3322.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 20:16:47,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.49 | bwd: 3322.81 | bwd_inner: 3322.00 | bwd_allreduce: 0.76 | step: 6.81 43%|████▎ | 4301/10000 [6:47:08<8:41:16, 5.49s/it] {'loss': 0.0029, 'grad_norm': 0.26043975353240967, 'learning_rate': 2.5430367558009543e-05, 'epoch': 4.3} 43%|████▎ | 4301/10000 [6:47:08<8:41:16, 5.49s/it][2025-06-19 20:16:53,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:16:53,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.47 | bwd_microstep: 3318.34 | bwd_inner_microstep: 3317.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 20:16:53,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.47 | bwd: 3318.35 | bwd_inner: 3317.53 | bwd_allreduce: 0.77 | step: 7.19 43%|████▎ | 4302/10000 [6:47:14<8:40:30, 5.48s/it] {'loss': 0.0096, 'grad_norm': 0.48642498254776, 'learning_rate': 2.542413310167187e-05, 'epoch': 4.3} 43%|████▎ | 4302/10000 [6:47:14<8:40:30, 5.48s/it][2025-06-19 20:16:58,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-19 20:16:58,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.76 | bwd_microstep: 3360.07 | bwd_inner_microstep: 3359.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 20:16:58,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.76 | bwd: 3360.08 | bwd_inner: 3359.29 | bwd_allreduce: 0.75 | step: 6.77 43%|████▎ | 4303/10000 [6:47:19<8:41:35, 5.49s/it] {'loss': 0.0206, 'grad_norm': 0.756777286529541, 'learning_rate': 2.5417898076367813e-05, 'epoch': 4.3} 43%|████▎ | 4303/10000 [6:47:19<8:41:35, 
5.49s/it][2025-06-19 20:17:04,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:17:04,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.76 | bwd_microstep: 3324.90 | bwd_inner_microstep: 3324.08 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.73 [2025-06-19 20:17:04,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.76 | bwd: 3324.91 | bwd_inner: 3324.08 | bwd_allreduce: 0.79 | step: 6.73 43%|████▎ | 4304/10000 [6:47:24<8:40:46, 5.49s/it] {'loss': 0.1288, 'grad_norm': 5.881938457489014, 'learning_rate': 2.541166248275139e-05, 'epoch': 4.3} 43%|████▎ | 4304/10000 [6:47:24<8:40:46, 5.49s/it][2025-06-19 20:17:09,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:17:09,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.82 | bwd_microstep: 3372.07 | bwd_inner_microstep: 3371.13 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.35 [2025-06-19 20:17:09,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.82 | bwd: 3372.08 | bwd_inner: 3371.13 | bwd_allreduce: 0.91 | step: 7.35 43%|████▎ | 4305/10000 [6:47:30<8:42:24, 5.50s/it] {'loss': 0.031, 'grad_norm': 1.7108310461044312, 'learning_rate': 2.5405426321476688e-05, 'epoch': 4.3} 43%|████▎ | 4305/10000 [6:47:30<8:42:24, 5.50s/it][2025-06-19 20:17:15,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 20:17:15,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.70 | bwd_microstep: 3330.58 | bwd_inner_microstep: 3329.44 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.02 [2025-06-19 20:17:15,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.70 | bwd: 3330.59 | bwd_inner: 3329.44 | 
bwd_allreduce: 1.09 | step: 7.02 43%|████▎ | 4306/10000 [6:47:36<8:41:49, 5.50s/it] {'loss': 0.1701, 'grad_norm': 2.9382121562957764, 'learning_rate': 2.5399189593197853e-05, 'epoch': 4.31} 43%|████▎ | 4306/10000 [6:47:36<8:41:49, 5.50s/it][2025-06-19 20:17:20,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:17:20,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.21 | bwd_microstep: 3365.55 | bwd_inner_microstep: 3364.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 20:17:20,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.21 | bwd: 3365.57 | bwd_inner: 3364.75 | bwd_allreduce: 0.77 | step: 6.79 43%|████▎ | 4307/10000 [6:47:41<8:42:44, 5.51s/it] {'loss': 0.0543, 'grad_norm': 2.3571715354919434, 'learning_rate': 2.539295229856909e-05, 'epoch': 4.31} 43%|████▎ | 4307/10000 [6:47:41<8:42:44, 5.51s/it][2025-06-19 20:17:26,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:17:26,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.69 | bwd_microstep: 3326.95 | bwd_inner_microstep: 3326.07 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.32 [2025-06-19 20:17:26,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.69 | bwd: 3326.97 | bwd_inner: 3326.07 | bwd_allreduce: 0.85 | step: 7.33 43%|████▎ | 4308/10000 [6:47:47<8:41:42, 5.50s/it] {'loss': 0.0193, 'grad_norm': 0.9068168997764587, 'learning_rate': 2.5386714438244663e-05, 'epoch': 4.31} 43%|████▎ | 4308/10000 [6:47:47<8:41:42, 5.50s/it][2025-06-19 20:17:31,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:17:31,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.77 | bwd_microstep: 
3328.18 | bwd_inner_microstep: 3327.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 20:17:31,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.77 | bwd: 3328.19 | bwd_inner: 3327.39 | bwd_allreduce: 0.76 | step: 6.73 43%|████▎ | 4309/10000 [6:47:52<8:40:56, 5.49s/it] {'loss': 0.1535, 'grad_norm': 2.8670103549957275, 'learning_rate': 2.538047601287888e-05, 'epoch': 4.31} 43%|████▎ | 4309/10000 [6:47:52<8:40:56, 5.49s/it][2025-06-19 20:17:37,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:17:37,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.46 | bwd_microstep: 3384.07 | bwd_inner_microstep: 3383.13 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.07 [2025-06-19 20:17:37,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.46 | bwd: 3384.09 | bwd_inner: 3383.13 | bwd_allreduce: 0.91 | step: 7.07 43%|████▎ | 4310/10000 [6:47:58<8:42:40, 5.51s/it] {'loss': 0.0036, 'grad_norm': 0.27969473600387573, 'learning_rate': 2.537423702312615e-05, 'epoch': 4.31} 43%|████▎ | 4310/10000 [6:47:58<8:42:40, 5.51s/it][2025-06-19 20:17:42,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:17:42,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.92 | bwd_microstep: 3323.60 | bwd_inner_microstep: 3322.76 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.00 [2025-06-19 20:17:42,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.92 | bwd: 3323.62 | bwd_inner: 3322.76 | bwd_allreduce: 0.80 | step: 7.01 43%|████▎ | 4311/10000 [6:48:03<8:41:35, 5.50s/it] {'loss': 0.015, 'grad_norm': 1.8546079397201538, 'learning_rate': 2.5367997469640887e-05, 'epoch': 4.31} 43%|████▎ | 4311/10000 [6:48:03<8:41:35, 5.50s/it][2025-06-19 20:17:48,206] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:17:48,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.92 | bwd_microstep: 3316.46 | bwd_inner_microstep: 3315.49 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.07 [2025-06-19 20:17:48,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.92 | bwd: 3316.48 | bwd_inner: 3315.49 | bwd_allreduce: 0.94 | step: 7.08 43%|████▎ | 4312/10000 [6:48:09<8:40:18, 5.49s/it] {'loss': 0.0129, 'grad_norm': 0.5678694844245911, 'learning_rate': 2.5361757353077607e-05, 'epoch': 4.31} 43%|████▎ | 4312/10000 [6:48:09<8:40:18, 5.49s/it][2025-06-19 20:17:53,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:17:53,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.21 | bwd_microstep: 3332.89 | bwd_inner_microstep: 3331.95 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.36 [2025-06-19 20:17:53,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.21 | bwd: 3332.91 | bwd_inner: 3331.95 | bwd_allreduce: 0.92 | step: 7.36 43%|████▎ | 4313/10000 [6:48:14<8:39:59, 5.49s/it] {'loss': 0.0077, 'grad_norm': 0.5946428179740906, 'learning_rate': 2.535551667409087e-05, 'epoch': 4.31} 43%|████▎ | 4313/10000 [6:48:14<8:39:59, 5.49s/it][2025-06-19 20:17:59,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 3.13 [2025-06-19 20:17:59,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.74 | bwd_microstep: 3325.91 | bwd_inner_microstep: 3324.93 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.86 [2025-06-19 20:17:59,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.74 | bwd: 3325.92 | bwd_inner: 3324.93 | bwd_allreduce: 0.94 | step: 7.86 43%|████▎ | 
4314/10000 [6:48:19<8:39:29, 5.48s/it] {'loss': 0.0122, 'grad_norm': 0.9279585480690002, 'learning_rate': 2.5349275433335283e-05, 'epoch': 4.31} 43%|████▎ | 4314/10000 [6:48:19<8:39:29, 5.48s/it][2025-06-19 20:18:04,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:18:04,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.08 | bwd_microstep: 3338.57 | bwd_inner_microstep: 3337.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 20:18:04,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.08 | bwd: 3338.58 | bwd_inner: 3337.78 | bwd_allreduce: 0.76 | step: 6.67 43%|████▎ | 4315/10000 [6:48:25<8:39:40, 5.48s/it] {'loss': 0.0068, 'grad_norm': 0.9327638149261475, 'learning_rate': 2.5343033631465534e-05, 'epoch': 4.32} 43%|████▎ | 4315/10000 [6:48:25<8:39:40, 5.48s/it][2025-06-19 20:18:10,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.88 [2025-06-19 20:18:10,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.55 | bwd_microstep: 3331.87 | bwd_inner_microstep: 3331.07 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-19 20:18:10,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.55 | bwd: 3331.89 | bwd_inner: 3331.07 | bwd_allreduce: 0.78 | step: 6.91 43%|████▎ | 4316/10000 [6:48:30<8:39:30, 5.48s/it] {'loss': 0.0478, 'grad_norm': 4.589723110198975, 'learning_rate': 2.533679126913635e-05, 'epoch': 4.32} 43%|████▎ | 4316/10000 [6:48:30<8:39:30, 5.48s/it][2025-06-19 20:18:15,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:18:15,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.06 | bwd_microstep: 3332.72 | bwd_inner_microstep: 3331.90 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 6.82 [2025-06-19 20:18:15,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.06 | bwd: 3332.74 | bwd_inner: 3331.90 | bwd_allreduce: 0.80 | step: 6.84 43%|████▎ | 4317/10000 [6:48:36<8:39:16, 5.48s/it] {'loss': 0.0427, 'grad_norm': 3.376887798309326, 'learning_rate': 2.5330548347002536e-05, 'epoch': 4.32} 43%|████▎ | 4317/10000 [6:48:36<8:39:16, 5.48s/it][2025-06-19 20:18:21,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:18:21,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.63 | bwd_microstep: 3382.86 | bwd_inner_microstep: 3382.06 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 20:18:21,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.63 | bwd: 3382.87 | bwd_inner: 3382.06 | bwd_allreduce: 0.77 | step: 6.74 43%|████▎ | 4318/10000 [6:48:41<8:41:09, 5.50s/it] {'loss': 0.0054, 'grad_norm': 0.3562782108783722, 'learning_rate': 2.532430486571894e-05, 'epoch': 4.32} 43%|████▎ | 4318/10000 [6:48:41<8:41:09, 5.50s/it][2025-06-19 20:18:26,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:18:26,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.97 | bwd_microstep: 3327.61 | bwd_inner_microstep: 3326.80 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 20:18:26,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.97 | bwd: 3327.63 | bwd_inner: 3326.80 | bwd_allreduce: 0.78 | step: 6.97 43%|████▎ | 4319/10000 [6:48:47<8:40:29, 5.50s/it] {'loss': 0.0362, 'grad_norm': 1.064923882484436, 'learning_rate': 2.5318060825940473e-05, 'epoch': 4.32} 43%|████▎ | 4319/10000 [6:48:47<8:40:29, 5.50s/it][2025-06-19 20:18:32,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:18:32,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.12 | bwd_microstep: 3330.40 | bwd_inner_microstep: 3329.37 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.37 [2025-06-19 20:18:32,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.12 | bwd: 3330.41 | bwd_inner: 3329.37 | bwd_allreduce: 1.00 | step: 7.38 43%|████▎ | 4320/10000 [6:48:52<8:39:55, 5.49s/it] {'loss': 0.0514, 'grad_norm': 3.0084099769592285, 'learning_rate': 2.5311816228322106e-05, 'epoch': 4.32} 43%|████▎ | 4320/10000 [6:48:52<8:39:55, 5.49s/it][2025-06-19 20:18:37,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:18:37,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.26 | bwd_microstep: 3406.12 | bwd_inner_microstep: 3405.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 20:18:37,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.26 | bwd: 3406.14 | bwd_inner: 3405.33 | bwd_allreduce: 0.76 | step: 6.67 43%|████▎ | 4321/10000 [6:48:58<8:42:27, 5.52s/it] {'loss': 0.0916, 'grad_norm': 1.4425901174545288, 'learning_rate': 2.5305571073518876e-05, 'epoch': 4.32} 43%|████▎ | 4321/10000 [6:48:58<8:42:27, 5.52s/it][2025-06-19 20:18:43,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:18:43,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.31 | bwd_microstep: 3378.63 | bwd_inner_microstep: 3377.82 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 20:18:43,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.31 | bwd: 3378.65 | bwd_inner: 3377.82 | bwd_allreduce: 0.78 | step: 6.76 43%|████▎ | 4322/10000 [6:49:04<8:43:06, 5.53s/it] {'loss': 
0.0314, 'grad_norm': 1.2320587635040283, 'learning_rate': 2.5299325362185852e-05, 'epoch': 4.32} 43%|████▎ | 4322/10000 [6:49:04<8:43:06, 5.53s/it][2025-06-19 20:18:48,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:18:48,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.38 | bwd_microstep: 3378.39 | bwd_inner_microstep: 3377.31 | bwd_allreduce_microstep: 1.03 | step_microstep: 6.93 [2025-06-19 20:18:48,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.38 | bwd: 3378.40 | bwd_inner: 3377.31 | bwd_allreduce: 1.05 | step: 6.93 43%|████▎ | 4323/10000 [6:49:09<8:43:42, 5.53s/it] {'loss': 0.0472, 'grad_norm': 1.4419972896575928, 'learning_rate': 2.52930790949782e-05, 'epoch': 4.32} 43%|████▎ | 4323/10000 [6:49:09<8:43:42, 5.53s/it][2025-06-19 20:18:54,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:18:54,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.62 | bwd_microstep: 3328.35 | bwd_inner_microstep: 3327.54 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 20:18:54,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.62 | bwd: 3328.37 | bwd_inner: 3327.54 | bwd_allreduce: 0.78 | step: 6.72 43%|████▎ | 4324/10000 [6:49:15<8:42:03, 5.52s/it] {'loss': 0.0032, 'grad_norm': 0.18212972581386566, 'learning_rate': 2.5286832272551114e-05, 'epoch': 4.32} 43%|████▎ | 4324/10000 [6:49:15<8:42:03, 5.52s/it][2025-06-19 20:18:59,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:18:59,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.27 | bwd_microstep: 3331.68 | bwd_inner_microstep: 3330.88 | bwd_allreduce_microstep: 0.75 | step_microstep: 
6.96 [2025-06-19 20:18:59,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.27 | bwd: 3331.70 | bwd_inner: 3330.88 | bwd_allreduce: 0.77 | step: 6.96 43%|████▎ | 4325/10000 [6:49:20<8:41:04, 5.51s/it] {'loss': 0.0036, 'grad_norm': 0.19964610040187836, 'learning_rate': 2.5280584895559864e-05, 'epoch': 4.33} 43%|████▎ | 4325/10000 [6:49:20<8:41:04, 5.51s/it][2025-06-19 20:19:05,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:19:05,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.05 | bwd_microstep: 3380.22 | bwd_inner_microstep: 3379.26 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.42 [2025-06-19 20:19:05,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.05 | bwd: 3380.23 | bwd_inner: 3379.26 | bwd_allreduce: 0.92 | step: 7.42 43%|████▎ | 4326/10000 [6:49:26<8:42:29, 5.53s/it] {'loss': 0.0281, 'grad_norm': 1.8014624118804932, 'learning_rate': 2.5274336964659764e-05, 'epoch': 4.33} 43%|████▎ | 4326/10000 [6:49:26<8:42:29, 5.53s/it][2025-06-19 20:19:10,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:19:10,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.44 | bwd_microstep: 3372.33 | bwd_inner_microstep: 3371.35 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.09 [2025-06-19 20:19:10,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.44 | bwd: 3372.35 | bwd_inner: 3371.35 | bwd_allreduce: 0.95 | step: 7.09 43%|████▎ | 4327/10000 [6:49:31<8:42:52, 5.53s/it] {'loss': 0.0466, 'grad_norm': 1.8832406997680664, 'learning_rate': 2.5268088480506195e-05, 'epoch': 4.33} 43%|████▎ | 4327/10000 [6:49:31<8:42:52, 5.53s/it][2025-06-19 20:19:16,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | 
optimizer_step: 2.73 [2025-06-19 20:19:16,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.64 | bwd_microstep: 3333.46 | bwd_inner_microstep: 3332.66 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 20:19:16,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.64 | bwd: 3333.47 | bwd_inner: 3332.66 | bwd_allreduce: 0.77 | step: 7.25 43%|████▎ | 4328/10000 [6:49:37<8:41:33, 5.52s/it] {'loss': 0.0308, 'grad_norm': 1.435837745666504, 'learning_rate': 2.5261839443754597e-05, 'epoch': 4.33} 43%|████▎ | 4328/10000 [6:49:37<8:41:33, 5.52s/it][2025-06-19 20:19:21,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:19:21,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.09 | bwd_microstep: 3326.06 | bwd_inner_microstep: 3325.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 20:19:21,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.09 | bwd: 3326.08 | bwd_inner: 3325.27 | bwd_allreduce: 0.76 | step: 6.66 43%|████▎ | 4329/10000 [6:49:42<8:40:19, 5.51s/it] {'loss': 0.0675, 'grad_norm': 2.915722131729126, 'learning_rate': 2.525558985506046e-05, 'epoch': 4.33} 43%|████▎ | 4329/10000 [6:49:42<8:40:19, 5.51s/it][2025-06-19 20:19:27,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:19:27,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.17 | bwd_microstep: 3328.56 | bwd_inner_microstep: 3327.77 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 20:19:27,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.18 | bwd: 3328.57 | bwd_inner: 3327.77 | bwd_allreduce: 0.76 | step: 6.71 43%|████▎ | 4330/10000 [6:49:48<8:39:24, 5.50s/it] {'loss': 0.0249, 'grad_norm': 2.6458797454833984, 'learning_rate': 
2.5249339715079338e-05, 'epoch': 4.33} 43%|████▎ | 4330/10000 [6:49:48<8:39:24, 5.50s/it][2025-06-19 20:19:32,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:19:32,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.58 | bwd_microstep: 3343.53 | bwd_inner_microstep: 3342.60 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.30 [2025-06-19 20:19:32,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.58 | bwd: 3343.55 | bwd_inner: 3342.60 | bwd_allreduce: 0.90 | step: 7.31 43%|████▎ | 4331/10000 [6:49:53<8:39:34, 5.50s/it] {'loss': 0.0068, 'grad_norm': 0.3917871415615082, 'learning_rate': 2.5243089024466844e-05, 'epoch': 4.33} 43%|████▎ | 4331/10000 [6:49:53<8:39:34, 5.50s/it][2025-06-19 20:19:38,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:19:38,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.24 | bwd_microstep: 3328.89 | bwd_inner_microstep: 3328.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 20:19:38,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.24 | bwd: 3328.90 | bwd_inner: 3328.10 | bwd_allreduce: 0.76 | step: 6.65 43%|████▎ | 4332/10000 [6:49:59<8:38:51, 5.49s/it] {'loss': 0.0151, 'grad_norm': 1.5927847623825073, 'learning_rate': 2.523683778387864e-05, 'epoch': 4.33} 43%|████▎ | 4332/10000 [6:49:59<8:38:51, 5.49s/it][2025-06-19 20:19:43,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:19:43,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.82 | bwd_microstep: 3403.72 | bwd_inner_microstep: 3402.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 20:19:43,886] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.82 | bwd: 3403.74 | bwd_inner: 3402.93 | bwd_allreduce: 0.76 | step: 6.64 43%|████▎ | 4333/10000 [6:50:04<8:41:18, 5.52s/it] {'loss': 0.0033, 'grad_norm': 0.18702799081802368, 'learning_rate': 2.5230585993970467e-05, 'epoch': 4.33} 43%|████▎ | 4333/10000 [6:50:04<8:41:18, 5.52s/it][2025-06-19 20:19:49,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:19:49,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.94 | bwd_microstep: 3325.57 | bwd_inner_microstep: 3324.71 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.50 [2025-06-19 20:19:49,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.94 | bwd: 3325.58 | bwd_inner: 3324.71 | bwd_allreduce: 0.83 | step: 7.50 43%|████▎ | 4334/10000 [6:50:10<8:39:58, 5.51s/it] {'loss': 0.0052, 'grad_norm': 0.5636943578720093, 'learning_rate': 2.5224333655398098e-05, 'epoch': 4.33} 43%|████▎ | 4334/10000 [6:50:10<8:39:58, 5.51s/it][2025-06-19 20:19:54,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:19:54,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.39 | bwd_microstep: 3377.84 | bwd_inner_microstep: 3376.93 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.95 [2025-06-19 20:19:54,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.39 | bwd: 3377.85 | bwd_inner: 3376.93 | bwd_allreduce: 0.88 | step: 6.95 43%|████▎ | 4335/10000 [6:50:15<8:41:05, 5.52s/it] {'loss': 0.1185, 'grad_norm': 5.062640190124512, 'learning_rate': 2.521808076881737e-05, 'epoch': 4.33} 43%|████▎ | 4335/10000 [6:50:15<8:41:05, 5.52s/it][2025-06-19 20:20:00,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
20:20:00,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.92 | bwd_microstep: 3326.53 | bwd_inner_microstep: 3325.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 20:20:00,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.92 | bwd: 3326.54 | bwd_inner: 3325.75 | bwd_allreduce: 0.75 | step: 6.54 43%|████▎ | 4336/10000 [6:50:21<8:39:55, 5.51s/it] {'loss': 0.0433, 'grad_norm': 2.9445137977600098, 'learning_rate': 2.521182733488419e-05, 'epoch': 4.34} 43%|████▎ | 4336/10000 [6:50:21<8:39:55, 5.51s/it][2025-06-19 20:20:05,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:20:05,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.25 | bwd_microstep: 3328.53 | bwd_inner_microstep: 3327.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 20:20:05,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.25 | bwd: 3328.55 | bwd_inner: 3327.75 | bwd_allreduce: 0.75 | step: 6.60 43%|████▎ | 4337/10000 [6:50:26<8:39:04, 5.50s/it] {'loss': 0.0562, 'grad_norm': 20.81888771057129, 'learning_rate': 2.5205573354254512e-05, 'epoch': 4.34} 43%|████▎ | 4337/10000 [6:50:26<8:39:04, 5.50s/it][2025-06-19 20:20:11,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:20:11,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.68 | bwd_microstep: 3337.68 | bwd_inner_microstep: 3336.71 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.13 [2025-06-19 20:20:11,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.68 | bwd: 3337.70 | bwd_inner: 3336.71 | bwd_allreduce: 0.93 | step: 7.13 43%|████▎ | 4338/10000 [6:50:32<8:38:58, 5.50s/it] {'loss': 0.0036, 'grad_norm': 0.24802827835083008, 'learning_rate': 2.5199318827584353e-05, 'epoch': 
4.34} 43%|████▎ | 4338/10000 [6:50:32<8:38:58, 5.50s/it][2025-06-19 20:20:16,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:20:16,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.22 | bwd_microstep: 3376.67 | bwd_inner_microstep: 3375.73 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.00 [2025-06-19 20:20:16,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.22 | bwd: 3376.69 | bwd_inner: 3375.73 | bwd_allreduce: 0.91 | step: 7.01 43%|████▎ | 4339/10000 [6:50:37<8:40:26, 5.52s/it] {'loss': 0.0095, 'grad_norm': 0.809121310710907, 'learning_rate': 2.5193063755529774e-05, 'epoch': 4.34} 43%|████▎ | 4339/10000 [6:50:37<8:40:26, 5.52s/it][2025-06-19 20:20:22,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:20:22,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.09 | bwd_microstep: 3384.44 | bwd_inner_microstep: 3383.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 20:20:22,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.09 | bwd: 3384.46 | bwd_inner: 3383.66 | bwd_allreduce: 0.75 | step: 6.59 43%|████▎ | 4340/10000 [6:50:43<8:41:34, 5.53s/it] {'loss': 0.0031, 'grad_norm': 0.1444113403558731, 'learning_rate': 2.5186808138746903e-05, 'epoch': 4.34} 43%|████▎ | 4340/10000 [6:50:43<8:41:34, 5.53s/it][2025-06-19 20:20:28,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:20:28,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.96 | bwd_microstep: 3377.56 | bwd_inner_microstep: 3376.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 20:20:28,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2136.96 | bwd: 3377.58 | bwd_inner: 3376.78 | bwd_allreduce: 0.75 | step: 6.64 43%|████▎ | 4341/10000 [6:50:48<8:42:03, 5.54s/it] {'loss': 0.0182, 'grad_norm': 1.8958441019058228, 'learning_rate': 2.5180551977891935e-05, 'epoch': 4.34} 43%|████▎ | 4341/10000 [6:50:48<8:42:03, 5.54s/it][2025-06-19 20:20:33,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:20:33,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.64 | bwd_microstep: 3330.34 | bwd_inner_microstep: 3329.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 20:20:33,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.64 | bwd: 3330.35 | bwd_inner: 3329.53 | bwd_allreduce: 0.78 | step: 6.95 43%|████▎ | 4342/10000 [6:50:54<8:40:33, 5.52s/it] {'loss': 0.1318, 'grad_norm': 2.8112168312072754, 'learning_rate': 2.5174295273621116e-05, 'epoch': 4.34} 43%|████▎ | 4342/10000 [6:50:54<8:40:33, 5.52s/it][2025-06-19 20:20:39,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:20:39,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.14 | bwd_microstep: 3384.25 | bwd_inner_microstep: 3383.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 20:20:39,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.14 | bwd: 3384.26 | bwd_inner: 3383.45 | bwd_allreduce: 0.77 | step: 6.90 43%|████▎ | 4343/10000 [6:50:59<8:41:42, 5.53s/it] {'loss': 0.0113, 'grad_norm': 0.6355083584785461, 'learning_rate': 2.516803802659073e-05, 'epoch': 4.34} 43%|████▎ | 4343/10000 [6:50:59<8:41:42, 5.53s/it][2025-06-19 20:20:44,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:20:44,564] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2108.07 | bwd_microstep: 3329.26 | bwd_inner_microstep: 3328.37 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.03 [2025-06-19 20:20:44,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.07 | bwd: 3329.27 | bwd_inner: 3328.37 | bwd_allreduce: 0.86 | step: 7.03 43%|████▎ | 4344/10000 [6:51:05<8:40:04, 5.52s/it] {'loss': 0.079, 'grad_norm': 6.112697601318359, 'learning_rate': 2.5161780237457144e-05, 'epoch': 4.34} 43%|████▎ | 4344/10000 [6:51:05<8:40:04, 5.52s/it][2025-06-19 20:20:50,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:20:50,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.00 | bwd_microstep: 3374.96 | bwd_inner_microstep: 3374.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 20:20:50,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.00 | bwd: 3374.98 | bwd_inner: 3374.18 | bwd_allreduce: 0.75 | step: 6.55 43%|████▎ | 4345/10000 [6:51:10<8:40:52, 5.53s/it] {'loss': 0.1398, 'grad_norm': 2.511793613433838, 'learning_rate': 2.5155521906876764e-05, 'epoch': 4.34} 43%|████▎ | 4345/10000 [6:51:10<8:40:52, 5.53s/it][2025-06-19 20:20:55,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:20:55,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.87 | bwd_microstep: 3338.58 | bwd_inner_microstep: 3337.74 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.10 [2025-06-19 20:20:55,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.87 | bwd: 3338.60 | bwd_inner: 3337.74 | bwd_allreduce: 0.80 | step: 7.10 43%|████▎ | 4346/10000 [6:51:16<8:39:45, 5.52s/it] {'loss': 0.0346, 'grad_norm': 3.210223913192749, 'learning_rate': 2.514926303550607e-05, 'epoch': 4.35} 43%|████▎ | 4346/10000 [6:51:16<8:39:45, 
5.52s/it][2025-06-19 20:21:01,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:21:01,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.48 | bwd_microstep: 3323.45 | bwd_inner_microstep: 3322.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 20:21:01,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.48 | bwd: 3323.47 | bwd_inner: 3322.67 | bwd_allreduce: 0.75 | step: 6.56 43%|████▎ | 4347/10000 [6:51:21<8:38:37, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.19800032675266266, 'learning_rate': 2.5143003624001577e-05, 'epoch': 4.35} 43%|████▎ | 4347/10000 [6:51:21<8:38:37, 5.50s/it][2025-06-19 20:21:06,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 20:21:06,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.60 | bwd_microstep: 3327.56 | bwd_inner_microstep: 3326.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 20:21:06,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.60 | bwd: 3327.57 | bwd_inner: 3326.78 | bwd_allreduce: 0.75 | step: 6.54 43%|████▎ | 4348/10000 [6:51:27<8:37:37, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.18016666173934937, 'learning_rate': 2.513674367301987e-05, 'epoch': 4.35} 43%|████▎ | 4348/10000 [6:51:27<8:37:37, 5.50s/it][2025-06-19 20:21:12,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:21:12,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.59 | bwd_microstep: 3322.41 | bwd_inner_microstep: 3321.51 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.12 [2025-06-19 20:21:12,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.59 | bwd: 3322.42 | bwd_inner: 3321.51 | 
bwd_allreduce: 0.87 | step: 7.13 43%|████▎ | 4349/10000 [6:51:32<8:36:42, 5.49s/it] {'loss': 0.0134, 'grad_norm': 0.8920131921768188, 'learning_rate': 2.5130483183217597e-05, 'epoch': 4.35} 43%|████▎ | 4349/10000 [6:51:32<8:36:42, 5.49s/it][2025-06-19 20:21:17,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:21:17,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.02 | bwd_microstep: 3376.17 | bwd_inner_microstep: 3375.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 20:21:17,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.03 | bwd: 3376.18 | bwd_inner: 3375.37 | bwd_allreduce: 0.77 | step: 6.99 44%|████▎ | 4350/10000 [6:51:38<8:38:26, 5.51s/it] {'loss': 0.0211, 'grad_norm': 1.4474989175796509, 'learning_rate': 2.5124222155251445e-05, 'epoch': 4.35} 44%|████▎ | 4350/10000 [6:51:38<8:38:26, 5.51s/it][2025-06-19 20:21:23,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:21:23,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.12 | bwd_microstep: 3327.04 | bwd_inner_microstep: 3325.93 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.66 [2025-06-19 20:21:23,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.12 | bwd: 3327.07 | bwd_inner: 3325.93 | bwd_allreduce: 1.08 | step: 7.66 44%|████▎ | 4351/10000 [6:51:43<8:37:55, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.01437084935605526, 'learning_rate': 2.511796058977818e-05, 'epoch': 4.35} 44%|████▎ | 4351/10000 [6:51:43<8:37:55, 5.50s/it][2025-06-19 20:21:28,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:21:28,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.42 | 
bwd_microstep: 3333.75 | bwd_inner_microstep: 3332.79 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.29 [2025-06-19 20:21:28,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.42 | bwd: 3333.77 | bwd_inner: 3332.79 | bwd_allreduce: 0.94 | step: 7.30 44%|████▎ | 4352/10000 [6:51:49<8:37:45, 5.50s/it] {'loss': 0.0039, 'grad_norm': 0.23766258358955383, 'learning_rate': 2.5111698487454596e-05, 'epoch': 4.35} 44%|████▎ | 4352/10000 [6:51:49<8:37:45, 5.50s/it][2025-06-19 20:21:34,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:21:34,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.22 | bwd_microstep: 3368.22 | bwd_inner_microstep: 3367.28 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.96 [2025-06-19 20:21:34,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.22 | bwd: 3368.23 | bwd_inner: 3367.28 | bwd_allreduce: 0.91 | step: 6.96 44%|████▎ | 4353/10000 [6:51:54<8:39:04, 5.52s/it] {'loss': 0.003, 'grad_norm': 0.14665472507476807, 'learning_rate': 2.5105435848937566e-05, 'epoch': 4.35} 44%|████▎ | 4353/10000 [6:51:54<8:39:04, 5.52s/it][2025-06-19 20:21:39,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:21:39,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.18 | bwd_microstep: 3313.85 | bwd_inner_microstep: 3312.92 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.05 [2025-06-19 20:21:39,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.18 | bwd: 3313.87 | bwd_inner: 3312.92 | bwd_allreduce: 0.89 | step: 7.05 44%|████▎ | 4354/10000 [6:52:00<8:37:41, 5.50s/it] {'loss': 0.144, 'grad_norm': 4.257233142852783, 'learning_rate': 2.5099172674884013e-05, 'epoch': 4.35} 44%|████▎ | 4354/10000 [6:52:00<8:37:41, 5.50s/it][2025-06-19 20:21:45,116] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 20:21:45,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.09 | bwd_microstep: 3364.25 | bwd_inner_microstep: 3363.13 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.86 [2025-06-19 20:21:45,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.09 | bwd: 3364.27 | bwd_inner: 3363.13 | bwd_allreduce: 1.09 | step: 7.87 44%|████▎ | 4355/10000 [6:52:05<8:38:44, 5.51s/it] {'loss': 0.0543, 'grad_norm': 1.832753300666809, 'learning_rate': 2.5092908965950908e-05, 'epoch': 4.36} 44%|████▎ | 4355/10000 [6:52:05<8:38:44, 5.51s/it][2025-06-19 20:21:50,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:21:50,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.30 | bwd_microstep: 3319.76 | bwd_inner_microstep: 3318.90 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.24 [2025-06-19 20:21:50,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.30 | bwd: 3319.77 | bwd_inner: 3318.90 | bwd_allreduce: 0.83 | step: 7.25 44%|████▎ | 4356/10000 [6:52:11<8:37:25, 5.50s/it] {'loss': 0.1022, 'grad_norm': 5.000797748565674, 'learning_rate': 2.5086644722795296e-05, 'epoch': 4.36} 44%|████▎ | 4356/10000 [6:52:11<8:37:25, 5.50s/it][2025-06-19 20:21:56,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:21:56,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.69 | bwd_microstep: 3367.54 | bwd_inner_microstep: 3366.62 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.09 [2025-06-19 20:21:56,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.69 | bwd: 3367.55 | bwd_inner: 3366.62 | bwd_allreduce: 0.89 | step: 7.10 44%|████▎ | 
4357/10000 [6:52:16<8:38:20, 5.51s/it] {'loss': 0.0541, 'grad_norm': 6.041257381439209, 'learning_rate': 2.5080379946074254e-05, 'epoch': 4.36} 44%|████▎ | 4357/10000 [6:52:16<8:38:20, 5.51s/it][2025-06-19 20:22:01,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:22:01,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.75 | bwd_microstep: 3363.51 | bwd_inner_microstep: 3362.65 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.30 [2025-06-19 20:22:01,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.75 | bwd: 3363.53 | bwd_inner: 3362.65 | bwd_allreduce: 0.83 | step: 7.31 44%|████▎ | 4358/10000 [6:52:22<8:38:51, 5.52s/it] {'loss': 0.0381, 'grad_norm': 3.9797544479370117, 'learning_rate': 2.5074114636444937e-05, 'epoch': 4.36} 44%|████▎ | 4358/10000 [6:52:22<8:38:51, 5.52s/it][2025-06-19 20:22:07,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:22:07,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.52 | bwd_microstep: 3317.11 | bwd_inner_microstep: 3316.07 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.52 [2025-06-19 20:22:07,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.52 | bwd: 3317.13 | bwd_inner: 3316.07 | bwd_allreduce: 1.01 | step: 7.52 44%|████▎ | 4359/10000 [6:52:27<8:37:24, 5.50s/it] {'loss': 0.0221, 'grad_norm': 1.069562554359436, 'learning_rate': 2.5067848794564537e-05, 'epoch': 4.36} 44%|████▎ | 4359/10000 [6:52:27<8:37:24, 5.50s/it][2025-06-19 20:22:12,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:22:12,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.79 | bwd_microstep: 3321.74 | bwd_inner_microstep: 3320.91 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 7.41 [2025-06-19 20:22:12,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.79 | bwd: 3321.75 | bwd_inner: 3320.91 | bwd_allreduce: 0.79 | step: 7.42 44%|████▎ | 4360/10000 [6:52:33<8:36:29, 5.49s/it] {'loss': 0.0603, 'grad_norm': 2.8241348266601562, 'learning_rate': 2.5061582421090325e-05, 'epoch': 4.36} 44%|████▎ | 4360/10000 [6:52:33<8:36:29, 5.49s/it][2025-06-19 20:22:18,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:22:18,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.23 | bwd_microstep: 3327.96 | bwd_inner_microstep: 3327.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 20:22:18,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.23 | bwd: 3327.98 | bwd_inner: 3327.16 | bwd_allreduce: 0.77 | step: 6.84 44%|████▎ | 4361/10000 [6:52:38<8:35:44, 5.49s/it] {'loss': 0.0037, 'grad_norm': 0.20722587406635284, 'learning_rate': 2.5055315516679613e-05, 'epoch': 4.36} 44%|████▎ | 4361/10000 [6:52:38<8:35:44, 5.49s/it][2025-06-19 20:22:23,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:22:23,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.69 | bwd_microstep: 3330.30 | bwd_inner_microstep: 3329.48 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.84 [2025-06-19 20:22:23,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.69 | bwd: 3330.32 | bwd_inner: 3329.48 | bwd_allreduce: 0.79 | step: 6.85 44%|████▎ | 4362/10000 [6:52:44<8:35:05, 5.48s/it] {'loss': 0.0259, 'grad_norm': 2.7298424243927, 'learning_rate': 2.5049048081989762e-05, 'epoch': 4.36} 44%|████▎ | 4362/10000 [6:52:44<8:35:05, 5.48s/it][2025-06-19 20:22:28,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:22:28,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.75 | bwd_microstep: 3313.72 | bwd_inner_microstep: 3312.90 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.13 [2025-06-19 20:22:28,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.75 | bwd: 3313.74 | bwd_inner: 3312.90 | bwd_allreduce: 0.79 | step: 7.13 44%|████▎ | 4363/10000 [6:52:49<8:34:13, 5.47s/it] {'loss': 0.013, 'grad_norm': 1.5381451845169067, 'learning_rate': 2.5042780117678195e-05, 'epoch': 4.36} 44%|████▎ | 4363/10000 [6:52:49<8:34:13, 5.47s/it][2025-06-19 20:22:34,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:22:34,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.65 | bwd_microstep: 3313.72 | bwd_inner_microstep: 3312.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 20:22:34,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.65 | bwd: 3313.73 | bwd_inner: 3312.93 | bwd_allreduce: 0.76 | step: 6.80 44%|████▎ | 4364/10000 [6:52:55<8:33:34, 5.47s/it] {'loss': 0.0534, 'grad_norm': 3.997204542160034, 'learning_rate': 2.5036511624402406e-05, 'epoch': 4.36} 44%|████▎ | 4364/10000 [6:52:55<8:33:34, 5.47s/it][2025-06-19 20:22:39,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:22:39,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.95 | bwd_microstep: 3327.66 | bwd_inner_microstep: 3326.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 20:22:39,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.95 | bwd: 3327.68 | bwd_inner: 3326.88 | bwd_allreduce: 0.76 | step: 6.70 44%|████▎ | 4365/10000 [6:53:00<8:33:48, 5.47s/it] {'loss': 
0.0192, 'grad_norm': 1.5818883180618286, 'learning_rate': 2.503024260281992e-05, 'epoch': 4.37} 44%|████▎ | 4365/10000 [6:53:00<8:33:48, 5.47s/it][2025-06-19 20:22:45,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:22:45,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.01 | bwd_microstep: 3378.16 | bwd_inner_microstep: 3377.21 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.39 [2025-06-19 20:22:45,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.01 | bwd: 3378.18 | bwd_inner: 3377.21 | bwd_allreduce: 0.91 | step: 7.39 44%|████▎ | 4366/10000 [6:53:06<8:35:47, 5.49s/it] {'loss': 0.018, 'grad_norm': 1.343126893043518, 'learning_rate': 2.502397305358833e-05, 'epoch': 4.37} 44%|████▎ | 4366/10000 [6:53:06<8:35:47, 5.49s/it][2025-06-19 20:22:50,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:22:50,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.03 | bwd_microstep: 3322.05 | bwd_inner_microstep: 3321.25 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 20:22:50,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.03 | bwd: 3322.07 | bwd_inner: 3321.25 | bwd_allreduce: 0.78 | step: 7.06 44%|████▎ | 4367/10000 [6:53:11<8:35:00, 5.49s/it] {'loss': 0.0028, 'grad_norm': 0.2355882227420807, 'learning_rate': 2.5017702977365285e-05, 'epoch': 4.37} 44%|████▎ | 4367/10000 [6:53:11<8:35:00, 5.49s/it][2025-06-19 20:22:56,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:22:56,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.61 | bwd_microstep: 3370.89 | bwd_inner_microstep: 3370.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 
[2025-06-19 20:22:56,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.61 | bwd: 3370.91 | bwd_inner: 3370.09 | bwd_allreduce: 0.77 | step: 6.77 44%|████▎ | 4368/10000 [6:53:17<8:36:21, 5.50s/it] {'loss': 0.1588, 'grad_norm': 3.68792462348938, 'learning_rate': 2.5011432374808484e-05, 'epoch': 4.37} 44%|████▎ | 4368/10000 [6:53:17<8:36:21, 5.50s/it][2025-06-19 20:23:02,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:23:02,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.38 | bwd_microstep: 3372.77 | bwd_inner_microstep: 3371.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.49 [2025-06-19 20:23:02,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.38 | bwd: 3372.79 | bwd_inner: 3371.97 | bwd_allreduce: 0.78 | step: 7.49 44%|████▎ | 4369/10000 [6:53:22<8:37:26, 5.51s/it] {'loss': 0.0552, 'grad_norm': 3.7508432865142822, 'learning_rate': 2.500516124657569e-05, 'epoch': 4.37} 44%|████▎ | 4369/10000 [6:53:22<8:37:26, 5.51s/it][2025-06-19 20:23:07,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:23:07,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.92 | bwd_microstep: 3320.86 | bwd_inner_microstep: 3319.94 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.28 [2025-06-19 20:23:07,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.92 | bwd: 3320.87 | bwd_inner: 3319.94 | bwd_allreduce: 0.88 | step: 7.28 44%|████▎ | 4370/10000 [6:53:28<8:35:52, 5.50s/it] {'loss': 0.0683, 'grad_norm': 3.425248622894287, 'learning_rate': 2.4998889593324708e-05, 'epoch': 4.37} 44%|████▎ | 4370/10000 [6:53:28<8:35:52, 5.50s/it][2025-06-19 20:23:13,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.72 [2025-06-19 20:23:13,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.70 | bwd_microstep: 3367.33 | bwd_inner_microstep: 3366.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.71 [2025-06-19 20:23:13,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.70 | bwd: 3367.35 | bwd_inner: 3366.52 | bwd_allreduce: 0.78 | step: 6.71 44%|████▎ | 4371/10000 [6:53:33<8:36:57, 5.51s/it] {'loss': 0.025, 'grad_norm': 2.0760488510131836, 'learning_rate': 2.499261741571341e-05, 'epoch': 4.37} 44%|████▎ | 4371/10000 [6:53:33<8:36:57, 5.51s/it][2025-06-19 20:23:18,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:23:18,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.90 | bwd_microstep: 3358.67 | bwd_inner_microstep: 3357.65 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.79 [2025-06-19 20:23:18,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.90 | bwd: 3358.69 | bwd_inner: 3357.66 | bwd_allreduce: 0.99 | step: 7.80 44%|████▎ | 4372/10000 [6:53:39<8:37:21, 5.52s/it] {'loss': 0.0023, 'grad_norm': 0.14997684955596924, 'learning_rate': 2.4986344714399712e-05, 'epoch': 4.37} 44%|████▎ | 4372/10000 [6:53:39<8:37:21, 5.52s/it][2025-06-19 20:23:24,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:23:24,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.20 | bwd_microstep: 3365.34 | bwd_inner_microstep: 3364.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 20:23:24,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.20 | bwd: 3365.35 | bwd_inner: 3364.55 | bwd_allreduce: 0.76 | step: 6.72 44%|████▎ | 4373/10000 [6:53:44<8:37:43, 5.52s/it] {'loss': 0.0441, 'grad_norm': 3.7246205806732178, 'learning_rate': 
2.4980071490041597e-05, 'epoch': 4.37} 44%|████▎ | 4373/10000 [6:53:44<8:37:43, 5.52s/it][2025-06-19 20:23:29,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:23:29,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.73 | bwd_microstep: 3395.71 | bwd_inner_microstep: 3394.78 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.21 [2025-06-19 20:23:29,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.74 | bwd: 3395.72 | bwd_inner: 3394.78 | bwd_allreduce: 0.90 | step: 7.22 44%|████▎ | 4374/10000 [6:53:50<8:38:57, 5.53s/it] {'loss': 0.0341, 'grad_norm': 3.5288875102996826, 'learning_rate': 2.4973797743297103e-05, 'epoch': 4.37} 44%|████▎ | 4374/10000 [6:53:50<8:38:57, 5.53s/it][2025-06-19 20:23:35,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:23:35,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.97 | bwd_microstep: 3323.29 | bwd_inner_microstep: 3322.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 20:23:35,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.97 | bwd: 3323.30 | bwd_inner: 3322.50 | bwd_allreduce: 0.76 | step: 6.65 44%|████▍ | 4375/10000 [6:53:55<8:36:56, 5.51s/it] {'loss': 0.0087, 'grad_norm': 0.4645363390445709, 'learning_rate': 2.49675234748243e-05, 'epoch': 4.38} 44%|████▍ | 4375/10000 [6:53:55<8:36:56, 5.51s/it][2025-06-19 20:23:40,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:23:40,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.65 | bwd_microstep: 3318.59 | bwd_inner_microstep: 3317.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 20:23:40,578] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.65 | bwd: 3318.60 | bwd_inner: 3317.77 | bwd_allreduce: 0.78 | step: 7.17 44%|████▍ | 4376/10000 [6:54:01<8:35:23, 5.50s/it] {'loss': 0.0029, 'grad_norm': 0.224195659160614, 'learning_rate': 2.4961248685281346e-05, 'epoch': 4.38} 44%|████▍ | 4376/10000 [6:54:01<8:35:23, 5.50s/it][2025-06-19 20:23:46,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:23:46,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.78 | bwd_microstep: 3314.23 | bwd_inner_microstep: 3313.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.66 [2025-06-19 20:23:46,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.78 | bwd: 3314.24 | bwd_inner: 3313.42 | bwd_allreduce: 0.78 | step: 6.66 44%|████▍ | 4377/10000 [6:54:06<8:33:56, 5.48s/it] {'loss': 0.0178, 'grad_norm': 1.0507776737213135, 'learning_rate': 2.4954973375326425e-05, 'epoch': 4.38} 44%|████▍ | 4377/10000 [6:54:06<8:33:56, 5.48s/it][2025-06-19 20:23:51,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:23:51,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.72 | bwd_microstep: 3372.89 | bwd_inner_microstep: 3372.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 20:23:51,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.72 | bwd: 3372.90 | bwd_inner: 3372.10 | bwd_allreduce: 0.76 | step: 6.61 44%|████▍ | 4378/10000 [6:54:12<8:35:13, 5.50s/it] {'loss': 0.0771, 'grad_norm': 1.4246853590011597, 'learning_rate': 2.4948697545617793e-05, 'epoch': 4.38} 44%|████▍ | 4378/10000 [6:54:12<8:35:13, 5.50s/it][2025-06-19 20:23:57,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
20:23:57,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.87 | bwd_microstep: 3314.32 | bwd_inner_microstep: 3313.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 20:23:57,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.87 | bwd: 3314.34 | bwd_inner: 3313.53 | bwd_allreduce: 0.76 | step: 6.70 44%|████▍ | 4379/10000 [6:54:17<8:33:48, 5.48s/it] {'loss': 0.0094, 'grad_norm': 0.42849114537239075, 'learning_rate': 2.4942421196813765e-05, 'epoch': 4.38} 44%|████▍ | 4379/10000 [6:54:17<8:33:48, 5.48s/it][2025-06-19 20:24:02,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:24:02,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.71 | bwd_microstep: 3311.59 | bwd_inner_microstep: 3310.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-19 20:24:02,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.72 | bwd: 3311.61 | bwd_inner: 3310.78 | bwd_allreduce: 0.78 | step: 6.90 44%|████▍ | 4380/10000 [6:54:23<8:32:49, 5.47s/it] {'loss': 0.0811, 'grad_norm': 3.9417903423309326, 'learning_rate': 2.4936144329572683e-05, 'epoch': 4.38} 44%|████▍ | 4380/10000 [6:54:23<8:32:49, 5.47s/it][2025-06-19 20:24:07,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:24:07,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.94 | bwd_microstep: 3363.87 | bwd_inner_microstep: 3362.87 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.95 [2025-06-19 20:24:07,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.94 | bwd: 3363.89 | bwd_inner: 3362.87 | bwd_allreduce: 0.97 | step: 7.96 44%|████▍ | 4381/10000 [6:54:28<8:34:08, 5.49s/it] {'loss': 0.0385, 'grad_norm': 2.448349714279175, 'learning_rate': 2.4929866944552976e-05, 
'epoch': 4.38} 44%|████▍ | 4381/10000 [6:54:28<8:34:08, 5.49s/it][2025-06-19 20:24:13,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:24:13,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.00 | bwd_microstep: 3391.73 | bwd_inner_microstep: 3390.68 | bwd_allreduce_microstep: 1.00 | step_microstep: 6.71 [2025-06-19 20:24:13,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.00 | bwd: 3391.75 | bwd_inner: 3390.68 | bwd_allreduce: 1.02 | step: 6.71 44%|████▍ | 4382/10000 [6:54:34<8:36:05, 5.51s/it] {'loss': 0.0059, 'grad_norm': 0.4213241636753082, 'learning_rate': 2.4923589042413108e-05, 'epoch': 4.38} 44%|████▍ | 4382/10000 [6:54:34<8:36:05, 5.51s/it][2025-06-19 20:24:19,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:24:19,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.09 | bwd_microstep: 3312.87 | bwd_inner_microstep: 3311.97 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.89 [2025-06-19 20:24:19,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.09 | bwd: 3312.88 | bwd_inner: 3311.97 | bwd_allreduce: 0.87 | step: 6.89 44%|████▍ | 4383/10000 [6:54:39<8:34:30, 5.50s/it] {'loss': 0.0177, 'grad_norm': 0.9588481187820435, 'learning_rate': 2.4917310623811597e-05, 'epoch': 4.38} 44%|████▍ | 4383/10000 [6:54:39<8:34:30, 5.50s/it][2025-06-19 20:24:24,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:24:24,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.16 | bwd_microstep: 3314.65 | bwd_inner_microstep: 3313.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 20:24:24,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd: 2110.16 | bwd: 3314.66 | bwd_inner: 3313.87 | bwd_allreduce: 0.75 | step: 6.56 44%|████▍ | 4384/10000 [6:54:45<8:33:24, 5.49s/it] {'loss': 0.0111, 'grad_norm': 0.6684809327125549, 'learning_rate': 2.4911031689407025e-05, 'epoch': 4.38} 44%|████▍ | 4384/10000 [6:54:45<8:33:24, 5.49s/it][2025-06-19 20:24:29,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:24:29,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.03 | bwd_microstep: 3318.83 | bwd_inner_microstep: 3318.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 20:24:29,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.03 | bwd: 3318.85 | bwd_inner: 3318.04 | bwd_allreduce: 0.77 | step: 6.84 44%|████▍ | 4385/10000 [6:54:50<8:32:25, 5.48s/it] {'loss': 0.0675, 'grad_norm': 3.1388771533966064, 'learning_rate': 2.4904752239858017e-05, 'epoch': 4.38} 44%|████▍ | 4385/10000 [6:54:50<8:32:25, 5.48s/it][2025-06-19 20:24:35,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:24:35,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.75 | bwd_microstep: 3309.89 | bwd_inner_microstep: 3309.04 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.33 [2025-06-19 20:24:35,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.75 | bwd: 3309.91 | bwd_inner: 3309.04 | bwd_allreduce: 0.81 | step: 7.33 44%|████▍ | 4386/10000 [6:54:56<8:31:35, 5.47s/it] {'loss': 0.0063, 'grad_norm': 0.4840436577796936, 'learning_rate': 2.4898472275823268e-05, 'epoch': 4.39} 44%|████▍ | 4386/10000 [6:54:56<8:31:35, 5.47s/it][2025-06-19 20:24:40,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:24:40,828] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2100.74 | bwd_microstep: 3314.54 | bwd_inner_microstep: 3313.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 20:24:40,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.74 | bwd: 3314.56 | bwd_inner: 3313.75 | bwd_allreduce: 0.76 | step: 6.72 44%|████▍ | 4387/10000 [6:55:01<8:31:05, 5.46s/it] {'loss': 0.0717, 'grad_norm': 2.027184247970581, 'learning_rate': 2.489219179796151e-05, 'epoch': 4.39} 44%|████▍ | 4387/10000 [6:55:01<8:31:05, 5.46s/it][2025-06-19 20:24:46,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:24:46,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.37 | bwd_microstep: 3311.99 | bwd_inner_microstep: 3311.12 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.56 [2025-06-19 20:24:46,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.38 | bwd: 3312.00 | bwd_inner: 3311.12 | bwd_allreduce: 0.83 | step: 7.56 44%|████▍ | 4388/10000 [6:55:07<8:31:03, 5.46s/it] {'loss': 0.0068, 'grad_norm': 0.4735129773616791, 'learning_rate': 2.4885910806931537e-05, 'epoch': 4.39} 44%|████▍ | 4388/10000 [6:55:07<8:31:03, 5.46s/it][2025-06-19 20:24:51,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 20:24:51,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.08 | bwd_microstep: 3357.77 | bwd_inner_microstep: 3356.79 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.69 [2025-06-19 20:24:51,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.08 | bwd: 3357.79 | bwd_inner: 3356.79 | bwd_allreduce: 0.96 | step: 7.69 44%|████▍ | 4389/10000 [6:55:12<8:32:54, 5.48s/it] {'loss': 0.0175, 'grad_norm': 0.8900653123855591, 'learning_rate': 2.48796293033922e-05, 'epoch': 4.39} 44%|████▍ | 4389/10000 [6:55:12<8:32:54, 
5.48s/it][2025-06-19 20:24:57,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:24:57,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.68 | bwd_microstep: 3311.45 | bwd_inner_microstep: 3310.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-19 20:24:57,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.69 | bwd: 3311.47 | bwd_inner: 3310.65 | bwd_allreduce: 0.77 | step: 6.89 44%|████▍ | 4390/10000 [6:55:18<8:32:08, 5.48s/it] {'loss': 0.0169, 'grad_norm': 0.9628908038139343, 'learning_rate': 2.4873347288002393e-05, 'epoch': 4.39} 44%|████▍ | 4390/10000 [6:55:18<8:32:08, 5.48s/it][2025-06-19 20:25:02,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:25:02,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.36 | bwd_microstep: 3311.55 | bwd_inner_microstep: 3310.54 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.60 [2025-06-19 20:25:02,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.36 | bwd: 3311.57 | bwd_inner: 3310.54 | bwd_allreduce: 0.98 | step: 7.61 44%|████▍ | 4391/10000 [6:55:23<8:31:35, 5.47s/it] {'loss': 0.0042, 'grad_norm': 0.3527607321739197, 'learning_rate': 2.486706476142107e-05, 'epoch': 4.39} 44%|████▍ | 4391/10000 [6:55:23<8:31:35, 5.47s/it][2025-06-19 20:25:08,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:25:08,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.99 | bwd_microstep: 3315.94 | bwd_inner_microstep: 3314.98 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.05 [2025-06-19 20:25:08,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.99 | bwd: 3315.95 | bwd_inner: 3314.98 | 
bwd_allreduce: 0.93 | step: 7.06 44%|████▍ | 4392/10000 [6:55:29<8:31:14, 5.47s/it] {'loss': 0.0191, 'grad_norm': 0.6361743211746216, 'learning_rate': 2.486078172430725e-05, 'epoch': 4.39} 44%|████▍ | 4392/10000 [6:55:29<8:31:14, 5.47s/it][2025-06-19 20:25:13,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:25:13,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.64 | bwd_microstep: 3312.33 | bwd_inner_microstep: 3311.45 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.02 [2025-06-19 20:25:13,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.64 | bwd: 3312.35 | bwd_inner: 3311.45 | bwd_allreduce: 0.84 | step: 7.01 44%|████▍ | 4393/10000 [6:55:34<8:30:55, 5.47s/it] {'loss': 0.0666, 'grad_norm': 2.502786874771118, 'learning_rate': 2.4854498177319983e-05, 'epoch': 4.39} 44%|████▍ | 4393/10000 [6:55:34<8:30:55, 5.47s/it][2025-06-19 20:25:19,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:25:19,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.14 | bwd_microstep: 3372.47 | bwd_inner_microstep: 3371.59 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.18 [2025-06-19 20:25:19,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.14 | bwd: 3372.48 | bwd_inner: 3371.59 | bwd_allreduce: 0.85 | step: 7.19 44%|████▍ | 4394/10000 [6:55:40<8:32:49, 5.49s/it] {'loss': 0.0058, 'grad_norm': 0.5225049257278442, 'learning_rate': 2.484821412111839e-05, 'epoch': 4.39} 44%|████▍ | 4394/10000 [6:55:40<8:32:49, 5.49s/it][2025-06-19 20:25:24,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:25:24,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.97 | bwd_microstep: 
3303.93 | bwd_inner_microstep: 3303.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 20:25:24,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.97 | bwd: 3303.95 | bwd_inner: 3303.13 | bwd_allreduce: 0.77 | step: 6.78 44%|████▍ | 4395/10000 [6:55:45<8:31:39, 5.48s/it] {'loss': 0.0027, 'grad_norm': 0.1673360913991928, 'learning_rate': 2.484192955636163e-05, 'epoch': 4.39} 44%|████▍ | 4395/10000 [6:55:45<8:31:39, 5.48s/it][2025-06-19 20:25:30,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:25:30,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.92 | bwd_microstep: 3385.48 | bwd_inner_microstep: 3384.55 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.03 [2025-06-19 20:25:30,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.92 | bwd: 3385.49 | bwd_inner: 3384.55 | bwd_allreduce: 0.90 | step: 7.04 44%|████▍ | 4396/10000 [6:55:51<8:33:58, 5.50s/it] {'loss': 0.1274, 'grad_norm': 4.464972972869873, 'learning_rate': 2.4835644483708932e-05, 'epoch': 4.4} 44%|████▍ | 4396/10000 [6:55:51<8:33:58, 5.50s/it][2025-06-19 20:25:35,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 20:25:35,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.36 | bwd_microstep: 3372.93 | bwd_inner_microstep: 3371.90 | bwd_allreduce_microstep: 0.96 | step_microstep: 8.17 [2025-06-19 20:25:35,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.36 | bwd: 3372.95 | bwd_inner: 3371.90 | bwd_allreduce: 0.99 | step: 8.17 44%|████▍ | 4397/10000 [6:55:56<8:35:13, 5.52s/it] {'loss': 0.0735, 'grad_norm': 2.0208890438079834, 'learning_rate': 2.482935890381958e-05, 'epoch': 4.4} 44%|████▍ | 4397/10000 [6:55:56<8:35:13, 5.52s/it][2025-06-19 20:25:41,363] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:25:41,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.50 | bwd_microstep: 3398.42 | bwd_inner_microstep: 3397.51 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.02 [2025-06-19 20:25:41,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.50 | bwd: 3398.43 | bwd_inner: 3397.51 | bwd_allreduce: 0.88 | step: 7.02 44%|████▍ | 4398/10000 [6:56:02<8:37:05, 5.54s/it] {'loss': 0.015, 'grad_norm': 0.619114100933075, 'learning_rate': 2.4823072817352885e-05, 'epoch': 4.4} 44%|████▍ | 4398/10000 [6:56:02<8:37:05, 5.54s/it][2025-06-19 20:25:46,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:25:46,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.49 | bwd_microstep: 3362.24 | bwd_inner_microstep: 3361.41 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.05 [2025-06-19 20:25:46,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.49 | bwd: 3362.25 | bwd_inner: 3361.41 | bwd_allreduce: 0.80 | step: 7.06 44%|████▍ | 4399/10000 [6:56:07<8:37:00, 5.54s/it] {'loss': 0.0035, 'grad_norm': 0.2875206470489502, 'learning_rate': 2.4816786224968245e-05, 'epoch': 4.4} 44%|████▍ | 4399/10000 [6:56:07<8:37:00, 5.54s/it][2025-06-19 20:25:52,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:25:52,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.58 | bwd_microstep: 3361.90 | bwd_inner_microstep: 3361.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.11 [2025-06-19 20:25:52,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.58 | bwd: 3361.91 | bwd_inner: 3361.10 | bwd_allreduce: 0.77 | step: 7.11 44%|████▍ | 
4400/10000 [6:56:13<8:36:38, 5.54s/it] {'loss': 0.0071, 'grad_norm': 0.6519185900688171, 'learning_rate': 2.4810499127325077e-05, 'epoch': 4.4} 44%|████▍ | 4400/10000 [6:56:13<8:36:38, 5.54s/it][2025-06-19 20:25:57,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:25:57,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.41 | bwd_microstep: 3371.21 | bwd_inner_microstep: 3370.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 20:25:57,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.41 | bwd: 3371.23 | bwd_inner: 3370.41 | bwd_allreduce: 0.77 | step: 6.95 44%|████▍ | 4401/10000 [6:56:18<8:36:33, 5.54s/it] {'loss': 0.0039, 'grad_norm': 0.28630053997039795, 'learning_rate': 2.480421152508288e-05, 'epoch': 4.4} 44%|████▍ | 4401/10000 [6:56:18<8:36:33, 5.54s/it][2025-06-19 20:26:03,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:26:03,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.48 | bwd_microstep: 3319.48 | bwd_inner_microstep: 3318.66 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.03 [2025-06-19 20:26:03,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.48 | bwd: 3319.50 | bwd_inner: 3318.66 | bwd_allreduce: 0.79 | step: 7.04 44%|████▍ | 4402/10000 [6:56:24<8:34:31, 5.51s/it] {'loss': 0.1758, 'grad_norm': 3.0477817058563232, 'learning_rate': 2.4797923418901186e-05, 'epoch': 4.4} 44%|████▍ | 4402/10000 [6:56:24<8:34:31, 5.51s/it][2025-06-19 20:26:08,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:26:08,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.38 | bwd_microstep: 3369.18 | bwd_inner_microstep: 3368.19 | 
bwd_allreduce_microstep: 0.93 | step_microstep: 7.71 [2025-06-19 20:26:08,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.38 | bwd: 3369.20 | bwd_inner: 3368.19 | bwd_allreduce: 0.96 | step: 7.71 44%|████▍ | 4403/10000 [6:56:29<8:35:11, 5.52s/it] {'loss': 0.1893, 'grad_norm': 4.579155921936035, 'learning_rate': 2.4791634809439593e-05, 'epoch': 4.4} 44%|████▍ | 4403/10000 [6:56:29<8:35:11, 5.52s/it][2025-06-19 20:26:14,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:26:14,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.69 | bwd_microstep: 3359.55 | bwd_inner_microstep: 3358.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 20:26:14,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.69 | bwd: 3359.56 | bwd_inner: 3358.75 | bwd_allreduce: 0.77 | step: 6.87 44%|████▍ | 4404/10000 [6:56:35<8:35:26, 5.53s/it] {'loss': 0.003, 'grad_norm': 0.2636982798576355, 'learning_rate': 2.4785345697357754e-05, 'epoch': 4.4} 44%|████▍ | 4404/10000 [6:56:35<8:35:26, 5.53s/it][2025-06-19 20:26:19,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:26:19,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.60 | bwd_microstep: 3318.65 | bwd_inner_microstep: 3317.60 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.57 [2025-06-19 20:26:19,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.60 | bwd: 3318.67 | bwd_inner: 3317.60 | bwd_allreduce: 1.01 | step: 7.57 44%|████▍ | 4405/10000 [6:56:40<8:33:44, 5.51s/it] {'loss': 0.0406, 'grad_norm': 3.0990982055664062, 'learning_rate': 2.4779056083315358e-05, 'epoch': 4.41} 44%|████▍ | 4405/10000 [6:56:40<8:33:44, 5.51s/it][2025-06-19 20:26:25,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:26:25,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.97 | bwd_microstep: 3363.45 | bwd_inner_microstep: 3362.45 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.33 [2025-06-19 20:26:25,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.97 | bwd: 3363.46 | bwd_inner: 3362.45 | bwd_allreduce: 0.96 | step: 7.34 44%|████▍ | 4406/10000 [6:56:46<8:34:06, 5.51s/it] {'loss': 0.1059, 'grad_norm': 3.894766330718994, 'learning_rate': 2.4772765967972158e-05, 'epoch': 4.41} 44%|████▍ | 4406/10000 [6:56:46<8:34:06, 5.51s/it][2025-06-19 20:26:30,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:26:30,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.47 | bwd_microstep: 3313.83 | bwd_inner_microstep: 3313.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 20:26:30,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.47 | bwd: 3313.85 | bwd_inner: 3313.03 | bwd_allreduce: 0.77 | step: 6.79 44%|████▍ | 4407/10000 [6:56:51<8:32:38, 5.50s/it] {'loss': 0.0045, 'grad_norm': 0.40338334441185, 'learning_rate': 2.4766475351987966e-05, 'epoch': 4.41} 44%|████▍ | 4407/10000 [6:56:51<8:32:38, 5.50s/it][2025-06-19 20:26:36,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:26:36,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.61 | bwd_microstep: 3306.56 | bwd_inner_microstep: 3305.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.97 [2025-06-19 20:26:36,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.61 | bwd: 3306.58 | bwd_inner: 3305.78 | bwd_allreduce: 0.76 | step: 6.98 44%|████▍ | 4408/10000 [6:56:57<8:31:18, 5.49s/it] {'loss': 
0.0049, 'grad_norm': 0.4551472067832947, 'learning_rate': 2.476018423602262e-05, 'epoch': 4.41} 44%|████▍ | 4408/10000 [6:56:57<8:31:18, 5.49s/it][2025-06-19 20:26:41,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:26:41,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.73 | bwd_microstep: 3305.00 | bwd_inner_microstep: 3304.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 20:26:41,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.72 | bwd: 3305.01 | bwd_inner: 3304.21 | bwd_allreduce: 0.75 | step: 6.61 44%|████▍ | 4409/10000 [6:57:02<8:30:17, 5.48s/it] {'loss': 0.012, 'grad_norm': 0.9119789600372314, 'learning_rate': 2.4753892620736046e-05, 'epoch': 4.41} 44%|████▍ | 4409/10000 [6:57:02<8:30:17, 5.48s/it][2025-06-19 20:26:47,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:26:47,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.85 | bwd_microstep: 3358.05 | bwd_inner_microstep: 3357.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 20:26:47,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.85 | bwd: 3358.06 | bwd_inner: 3357.26 | bwd_allreduce: 0.76 | step: 6.61 44%|████▍ | 4410/10000 [6:57:08<8:31:26, 5.49s/it] {'loss': 0.0281, 'grad_norm': 2.391805410385132, 'learning_rate': 2.47476005067882e-05, 'epoch': 4.41} 44%|████▍ | 4410/10000 [6:57:08<8:31:26, 5.49s/it][2025-06-19 20:26:52,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:26:52,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.20 | bwd_microstep: 3318.67 | bwd_inner_microstep: 3317.88 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 
[2025-06-19 20:26:52,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.20 | bwd: 3318.68 | bwd_inner: 3317.88 | bwd_allreduce: 0.76 | step: 6.69 44%|████▍ | 4411/10000 [6:57:13<8:30:36, 5.48s/it] {'loss': 0.0087, 'grad_norm': 0.6123877167701721, 'learning_rate': 2.4741307894839096e-05, 'epoch': 4.41} 44%|████▍ | 4411/10000 [6:57:13<8:30:36, 5.48s/it][2025-06-19 20:26:58,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:26:58,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.86 | bwd_microstep: 3320.35 | bwd_inner_microstep: 3319.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 20:26:58,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.86 | bwd: 3320.36 | bwd_inner: 3319.55 | bwd_allreduce: 0.77 | step: 6.98 44%|████▍ | 4412/10000 [6:57:19<8:29:54, 5.48s/it] {'loss': 0.0095, 'grad_norm': 1.0229411125183105, 'learning_rate': 2.4735014785548795e-05, 'epoch': 4.41} 44%|████▍ | 4412/10000 [6:57:19<8:29:54, 5.48s/it][2025-06-19 20:27:03,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:27:03,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.16 | bwd_microstep: 3361.59 | bwd_inner_microstep: 3360.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 20:27:03,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.16 | bwd: 3361.61 | bwd_inner: 3360.81 | bwd_allreduce: 0.76 | step: 6.67 44%|████▍ | 4413/10000 [6:57:24<8:31:14, 5.49s/it] {'loss': 0.0142, 'grad_norm': 0.7291278839111328, 'learning_rate': 2.4728721179577422e-05, 'epoch': 4.41} 44%|████▍ | 4413/10000 [6:57:24<8:31:14, 5.49s/it][2025-06-19 20:27:09,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | 
optimizer_step: 2.73 [2025-06-19 20:27:09,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.37 | bwd_microstep: 3362.83 | bwd_inner_microstep: 3361.72 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.38 [2025-06-19 20:27:09,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.37 | bwd: 3362.86 | bwd_inner: 3361.72 | bwd_allreduce: 1.08 | step: 7.39 44%|████▍ | 4414/10000 [6:57:30<8:32:08, 5.50s/it] {'loss': 0.0023, 'grad_norm': 0.1958664357662201, 'learning_rate': 2.472242707758514e-05, 'epoch': 4.41} 44%|████▍ | 4414/10000 [6:57:30<8:32:08, 5.50s/it][2025-06-19 20:27:14,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:27:14,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.81 | bwd_microstep: 3319.11 | bwd_inner_microstep: 3318.28 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.81 [2025-06-19 20:27:14,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.81 | bwd: 3319.12 | bwd_inner: 3318.28 | bwd_allreduce: 0.80 | step: 6.81 44%|████▍ | 4415/10000 [6:57:35<8:31:01, 5.49s/it] {'loss': 0.0555, 'grad_norm': 2.457768201828003, 'learning_rate': 2.4716132480232183e-05, 'epoch': 4.42} 44%|████▍ | 4415/10000 [6:57:35<8:31:01, 5.49s/it][2025-06-19 20:27:20,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:27:20,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.95 | bwd_microstep: 3324.66 | bwd_inner_microstep: 3323.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 20:27:20,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.95 | bwd: 3324.67 | bwd_inner: 3323.87 | bwd_allreduce: 0.76 | step: 6.65 44%|████▍ | 4416/10000 [6:57:41<8:30:28, 5.49s/it] {'loss': 0.0191, 'grad_norm': 2.0249199867248535, 'learning_rate': 
2.470983738817881e-05, 'epoch': 4.42} 44%|████▍ | 4416/10000 [6:57:41<8:30:28, 5.49s/it][2025-06-19 20:27:25,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:27:25,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.70 | bwd_microstep: 3325.41 | bwd_inner_microstep: 3324.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 20:27:25,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.70 | bwd: 3325.42 | bwd_inner: 3324.62 | bwd_allreduce: 0.76 | step: 6.64 44%|████▍ | 4417/10000 [6:57:46<8:30:06, 5.48s/it] {'loss': 0.0082, 'grad_norm': 0.9168206453323364, 'learning_rate': 2.470354180208536e-05, 'epoch': 4.42} 44%|████▍ | 4417/10000 [6:57:46<8:30:06, 5.48s/it][2025-06-19 20:27:31,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:27:31,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.64 | bwd_microstep: 3317.74 | bwd_inner_microstep: 3316.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.95 [2025-06-19 20:27:31,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.64 | bwd: 3317.76 | bwd_inner: 3316.95 | bwd_allreduce: 0.76 | step: 6.96 44%|████▍ | 4418/10000 [6:57:52<8:29:18, 5.47s/it] {'loss': 0.0542, 'grad_norm': 3.6847176551818848, 'learning_rate': 2.46972457226122e-05, 'epoch': 4.42} 44%|████▍ | 4418/10000 [6:57:52<8:29:18, 5.47s/it][2025-06-19 20:27:36,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 20:27:36,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.75 | bwd_microstep: 3367.06 | bwd_inner_microstep: 3366.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 20:27:36,777] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2126.75 | bwd: 3367.07 | bwd_inner: 3366.27 | bwd_allreduce: 0.76 | step: 6.67 44%|████▍ | 4419/10000 [6:57:57<8:30:52, 5.49s/it] {'loss': 0.0027, 'grad_norm': 0.522943913936615, 'learning_rate': 2.469094915041977e-05, 'epoch': 4.42} 44%|████▍ | 4419/10000 [6:57:57<8:30:52, 5.49s/it][2025-06-19 20:27:42,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:27:42,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.85 | bwd_microstep: 3325.21 | bwd_inner_microstep: 3324.26 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.07 [2025-06-19 20:27:42,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.85 | bwd: 3325.24 | bwd_inner: 3324.26 | bwd_allreduce: 0.92 | step: 7.07 44%|████▍ | 4420/10000 [6:58:03<8:30:09, 5.49s/it] {'loss': 0.0212, 'grad_norm': 0.9863564968109131, 'learning_rate': 2.4684652086168536e-05, 'epoch': 4.42} 44%|████▍ | 4420/10000 [6:58:03<8:30:09, 5.49s/it][2025-06-19 20:27:47,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:27:47,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.35 | bwd_microstep: 3376.06 | bwd_inner_microstep: 3375.20 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.95 [2025-06-19 20:27:47,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.35 | bwd: 3376.07 | bwd_inner: 3375.20 | bwd_allreduce: 0.83 | step: 6.96 44%|████▍ | 4421/10000 [6:58:08<8:31:43, 5.50s/it] {'loss': 0.0157, 'grad_norm': 3.1914172172546387, 'learning_rate': 2.467835453051904e-05, 'epoch': 4.42} 44%|████▍ | 4421/10000 [6:58:08<8:31:43, 5.50s/it][2025-06-19 20:27:53,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:27:53,249] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.47 | bwd_microstep: 3312.41 | bwd_inner_microstep: 3311.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 20:27:53,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.47 | bwd: 3312.43 | bwd_inner: 3311.62 | bwd_allreduce: 0.76 | step: 6.64 44%|████▍ | 4422/10000 [6:58:14<8:30:20, 5.49s/it] {'loss': 0.0056, 'grad_norm': 0.4019506871700287, 'learning_rate': 2.4672056484131876e-05, 'epoch': 4.42} 44%|████▍ | 4422/10000 [6:58:14<8:30:20, 5.49s/it][2025-06-19 20:27:58,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:27:58,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.25 | bwd_microstep: 3371.80 | bwd_inner_microstep: 3371.01 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 20:27:58,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.25 | bwd: 3371.81 | bwd_inner: 3371.01 | bwd_allreduce: 0.76 | step: 6.60 44%|████▍ | 4423/10000 [6:58:19<8:31:35, 5.50s/it] {'loss': 0.0546, 'grad_norm': 3.3687195777893066, 'learning_rate': 2.4665757947667667e-05, 'epoch': 4.42} 44%|████▍ | 4423/10000 [6:58:19<8:31:35, 5.50s/it][2025-06-19 20:28:04,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:28:04,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.27 | bwd_microstep: 3321.73 | bwd_inner_microstep: 3320.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 20:28:04,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.27 | bwd: 3321.74 | bwd_inner: 3320.92 | bwd_allreduce: 0.78 | step: 6.96 44%|████▍ | 4424/10000 [6:58:25<8:30:24, 5.49s/it] {'loss': 0.0184, 'grad_norm': 1.736481785774231, 'learning_rate': 2.4659458921787103e-05, 'epoch': 4.42} 44%|████▍ | 
4424/10000 [6:58:25<8:30:24, 5.49s/it][2025-06-19 20:28:09,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:28:09,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.44 | bwd_microstep: 3330.84 | bwd_inner_microstep: 3329.82 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.42 [2025-06-19 20:28:09,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.44 | bwd: 3330.86 | bwd_inner: 3329.82 | bwd_allreduce: 0.99 | step: 7.42 44%|████▍ | 4425/10000 [6:58:30<8:29:54, 5.49s/it] {'loss': 0.0687, 'grad_norm': 5.7543721199035645, 'learning_rate': 2.4653159407150927e-05, 'epoch': 4.42} 44%|████▍ | 4425/10000 [6:58:30<8:29:54, 5.49s/it][2025-06-19 20:28:15,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:28:15,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.28 | bwd_microstep: 3323.81 | bwd_inner_microstep: 3323.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-19 20:28:15,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.28 | bwd: 3323.82 | bwd_inner: 3323.03 | bwd_allreduce: 0.75 | step: 6.74 44%|████▍ | 4426/10000 [6:58:36<8:29:47, 5.49s/it] {'loss': 0.0038, 'grad_norm': 0.44419533014297485, 'learning_rate': 2.464685940441992e-05, 'epoch': 4.43} 44%|████▍ | 4426/10000 [6:58:36<8:29:47, 5.49s/it][2025-06-19 20:28:20,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:28:20,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.52 | bwd_microstep: 3396.09 | bwd_inner_microstep: 3395.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 20:28:20,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.52 | bwd: 
3396.10 | bwd_inner: 3395.30 | bwd_allreduce: 0.76 | step: 6.63 44%|████▍ | 4427/10000 [6:58:41<8:32:00, 5.51s/it] {'loss': 0.0694, 'grad_norm': 3.6514387130737305, 'learning_rate': 2.4640558914254933e-05, 'epoch': 4.43} 44%|████▍ | 4427/10000 [6:58:41<8:32:00, 5.51s/it][2025-06-19 20:28:26,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:28:26,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.85 | bwd_microstep: 3373.13 | bwd_inner_microstep: 3372.33 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 20:28:26,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.85 | bwd: 3373.15 | bwd_inner: 3372.33 | bwd_allreduce: 0.77 | step: 6.94 44%|████▍ | 4428/10000 [6:58:47<8:32:45, 5.52s/it] {'loss': 0.0203, 'grad_norm': 1.2293356657028198, 'learning_rate': 2.463425793731685e-05, 'epoch': 4.43} 44%|████▍ | 4428/10000 [6:58:47<8:32:45, 5.52s/it][2025-06-19 20:28:31,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:28:31,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.01 | bwd_microstep: 3328.40 | bwd_inner_microstep: 3327.43 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.45 [2025-06-19 20:28:31,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.01 | bwd: 3328.42 | bwd_inner: 3327.43 | bwd_allreduce: 0.95 | step: 7.46 44%|████▍ | 4429/10000 [6:58:52<8:31:31, 5.51s/it] {'loss': 0.0098, 'grad_norm': 0.6962620615959167, 'learning_rate': 2.4627956474266617e-05, 'epoch': 4.43} 44%|████▍ | 4429/10000 [6:58:52<8:31:31, 5.51s/it][2025-06-19 20:28:37,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:28:37,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2112.40 | bwd_microstep: 3328.85 | bwd_inner_microstep: 3327.94 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.99 [2025-06-19 20:28:37,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.40 | bwd: 3328.87 | bwd_inner: 3327.94 | bwd_allreduce: 0.88 | step: 6.99 44%|████▍ | 4430/10000 [6:58:58<8:30:42, 5.50s/it] {'loss': 0.0666, 'grad_norm': 2.8832669258117676, 'learning_rate': 2.462165452576523e-05, 'epoch': 4.43} 44%|████▍ | 4430/10000 [6:58:58<8:30:42, 5.50s/it][2025-06-19 20:28:42,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:28:42,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.69 | bwd_microstep: 3323.51 | bwd_inner_microstep: 3322.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-19 20:28:42,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.69 | bwd: 3323.53 | bwd_inner: 3322.70 | bwd_allreduce: 0.78 | step: 6.93 44%|████▍ | 4431/10000 [6:59:03<8:29:53, 5.49s/it] {'loss': 0.0201, 'grad_norm': 2.248735189437866, 'learning_rate': 2.461535209247373e-05, 'epoch': 4.43} 44%|████▍ | 4431/10000 [6:59:03<8:29:53, 5.49s/it][2025-06-19 20:28:48,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:28:48,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.59 | bwd_microstep: 3329.35 | bwd_inner_microstep: 3328.47 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.29 [2025-06-19 20:28:48,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.59 | bwd: 3329.36 | bwd_inner: 3328.47 | bwd_allreduce: 0.85 | step: 7.29 44%|████▍ | 4432/10000 [6:59:09<8:29:19, 5.49s/it] {'loss': 0.0206, 'grad_norm': 2.1329195499420166, 'learning_rate': 2.4609049175053218e-05, 'epoch': 4.43} 44%|████▍ | 4432/10000 [6:59:09<8:29:19, 5.49s/it][2025-06-19 
20:28:53,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:28:53,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.57 | bwd_microstep: 3385.75 | bwd_inner_microstep: 3384.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 20:28:53,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.57 | bwd: 3385.76 | bwd_inner: 3384.96 | bwd_allreduce: 0.76 | step: 6.67 44%|████▍ | 4433/10000 [6:59:14<8:31:12, 5.51s/it] {'loss': 0.0444, 'grad_norm': 2.44354248046875, 'learning_rate': 2.460274577416484e-05, 'epoch': 4.43} 44%|████▍ | 4433/10000 [6:59:14<8:31:12, 5.51s/it][2025-06-19 20:28:59,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:28:59,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.21 | bwd_microstep: 3330.43 | bwd_inner_microstep: 3329.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 20:28:59,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.21 | bwd: 3330.44 | bwd_inner: 3329.63 | bwd_allreduce: 0.77 | step: 6.69 44%|████▍ | 4434/10000 [6:59:20<8:30:19, 5.50s/it] {'loss': 0.3247, 'grad_norm': 5.215985298156738, 'learning_rate': 2.4596441890469792e-05, 'epoch': 4.43} 44%|████▍ | 4434/10000 [6:59:20<8:30:19, 5.50s/it][2025-06-19 20:29:04,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:29:04,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.34 | bwd_microstep: 3328.70 | bwd_inner_microstep: 3327.91 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 20:29:04,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.34 | bwd: 3328.72 | bwd_inner: 3327.91 | bwd_allreduce: 0.76 | step: 
6.71 44%|████▍ | 4435/10000 [6:59:25<8:29:31, 5.49s/it] {'loss': 0.0105, 'grad_norm': 0.7382643222808838, 'learning_rate': 2.4590137524629322e-05, 'epoch': 4.43} 44%|████▍ | 4435/10000 [6:59:25<8:29:31, 5.49s/it][2025-06-19 20:29:10,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:29:10,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.37 | bwd_microstep: 3373.93 | bwd_inner_microstep: 3373.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 20:29:10,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.37 | bwd: 3373.94 | bwd_inner: 3373.13 | bwd_allreduce: 0.77 | step: 7.02 44%|████▍ | 4436/10000 [6:59:31<8:30:55, 5.51s/it] {'loss': 0.0062, 'grad_norm': 0.6370433568954468, 'learning_rate': 2.458383267730473e-05, 'epoch': 4.44} 44%|████▍ | 4436/10000 [6:59:31<8:30:55, 5.51s/it][2025-06-19 20:29:15,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 20:29:15,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.94 | bwd_microstep: 3404.70 | bwd_inner_microstep: 3403.59 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.96 [2025-06-19 20:29:15,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.94 | bwd: 3404.72 | bwd_inner: 3403.59 | bwd_allreduce: 1.08 | step: 7.98 44%|████▍ | 4437/10000 [6:59:36<8:33:05, 5.53s/it] {'loss': 0.0298, 'grad_norm': 1.4873329401016235, 'learning_rate': 2.4577527349157362e-05, 'epoch': 4.44} 44%|████▍ | 4437/10000 [6:59:36<8:33:05, 5.53s/it][2025-06-19 20:29:21,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:29:21,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.10 | bwd_microstep: 3332.32 | 
bwd_inner_microstep: 3331.33 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.24 [2025-06-19 20:29:21,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.10 | bwd: 3332.33 | bwd_inner: 3331.33 | bwd_allreduce: 0.96 | step: 7.24 44%|████▍ | 4438/10000 [6:59:42<8:32:06, 5.52s/it] {'loss': 0.1086, 'grad_norm': 2.754286050796509, 'learning_rate': 2.4571221540848623e-05, 'epoch': 4.44} 44%|████▍ | 4438/10000 [6:59:42<8:32:06, 5.52s/it][2025-06-19 20:29:26,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:29:26,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.30 | bwd_microstep: 3342.81 | bwd_inner_microstep: 3342.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 20:29:26,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.30 | bwd: 3342.82 | bwd_inner: 3342.02 | bwd_allreduce: 0.76 | step: 6.68 44%|████▍ | 4439/10000 [6:59:47<8:31:12, 5.52s/it] {'loss': 0.0179, 'grad_norm': 3.151017427444458, 'learning_rate': 2.456491525303996e-05, 'epoch': 4.44} 44%|████▍ | 4439/10000 [6:59:47<8:31:12, 5.52s/it][2025-06-19 20:29:32,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:29:32,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.77 | bwd_microstep: 3379.32 | bwd_inner_microstep: 3378.29 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.11 [2025-06-19 20:29:32,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.77 | bwd: 3379.33 | bwd_inner: 3378.29 | bwd_allreduce: 0.99 | step: 7.12 44%|████▍ | 4440/10000 [6:59:53<8:32:20, 5.53s/it] {'loss': 0.2001, 'grad_norm': 3.1398985385894775, 'learning_rate': 2.4558608486392875e-05, 'epoch': 4.44} 44%|████▍ | 4440/10000 [6:59:53<8:32:20, 5.53s/it][2025-06-19 20:29:38,014] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:29:38,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.23 | bwd_microstep: 3380.79 | bwd_inner_microstep: 3379.99 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.68 [2025-06-19 20:29:38,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.23 | bwd: 3380.80 | bwd_inner: 3379.99 | bwd_allreduce: 0.77 | step: 6.69 44%|████▍ | 4441/10000 [6:59:58<8:33:05, 5.54s/it] {'loss': 0.0017, 'grad_norm': 0.17431022226810455, 'learning_rate': 2.455230124156891e-05, 'epoch': 4.44} 44%|████▍ | 4441/10000 [6:59:58<8:33:05, 5.54s/it][2025-06-19 20:29:43,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 20:29:43,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.00 | bwd_microstep: 3390.56 | bwd_inner_microstep: 3389.55 | bwd_allreduce_microstep: 0.95 | step_microstep: 8.09 [2025-06-19 20:29:43,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.00 | bwd: 3390.57 | bwd_inner: 3389.55 | bwd_allreduce: 0.98 | step: 8.09 44%|████▍ | 4442/10000 [7:00:04<8:33:58, 5.55s/it] {'loss': 0.0609, 'grad_norm': 3.569199323654175, 'learning_rate': 2.4545993519229675e-05, 'epoch': 4.44} 44%|████▍ | 4442/10000 [7:00:04<8:33:58, 5.55s/it][2025-06-19 20:29:49,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:29:49,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.30 | bwd_microstep: 3340.16 | bwd_inner_microstep: 3339.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 20:29:49,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.30 | bwd: 3340.18 | bwd_inner: 3339.38 | bwd_allreduce: 0.76 | step: 6.73 44%|████▍ | 4443/10000 [7:00:09<8:32:33, 
5.53s/it] {'loss': 0.0618, 'grad_norm': 4.538810729980469, 'learning_rate': 2.4539685320036824e-05, 'epoch': 4.44} 44%|████▍ | 4443/10000 [7:00:09<8:32:33, 5.53s/it][2025-06-19 20:29:54,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:29:54,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.06 | bwd_microstep: 3332.73 | bwd_inner_microstep: 3331.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 20:29:54,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.05 | bwd: 3332.74 | bwd_inner: 3331.95 | bwd_allreduce: 0.75 | step: 6.54 44%|████▍ | 4444/10000 [7:00:15<8:31:23, 5.52s/it] {'loss': 0.0029, 'grad_norm': 0.24500344693660736, 'learning_rate': 2.453337664465205e-05, 'epoch': 4.44} 44%|████▍ | 4444/10000 [7:00:15<8:31:23, 5.52s/it][2025-06-19 20:30:00,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:30:00,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.25 | bwd_microstep: 3374.81 | bwd_inner_microstep: 3373.78 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.53 [2025-06-19 20:30:00,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.25 | bwd: 3374.83 | bwd_inner: 3373.78 | bwd_allreduce: 1.00 | step: 7.53 44%|████▍ | 4445/10000 [7:00:20<8:32:11, 5.53s/it] {'loss': 0.0012, 'grad_norm': 0.09696237742900848, 'learning_rate': 2.4527067493737105e-05, 'epoch': 4.45} 44%|████▍ | 4445/10000 [7:00:20<8:32:11, 5.53s/it][2025-06-19 20:30:05,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 20:30:05,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.64 | bwd_microstep: 3327.98 | bwd_inner_microstep: 3326.89 | bwd_allreduce_microstep: 1.03 | 
step_microstep: 7.87 [2025-06-19 20:30:05,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.64 | bwd: 3328.00 | bwd_inner: 3326.89 | bwd_allreduce: 1.05 | step: 7.88 44%|████▍ | 4446/10000 [7:00:26<8:31:07, 5.52s/it] {'loss': 0.0023, 'grad_norm': 0.22119271755218506, 'learning_rate': 2.452075786795379e-05, 'epoch': 4.45} 44%|████▍ | 4446/10000 [7:00:26<8:31:07, 5.52s/it][2025-06-19 20:30:11,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:30:11,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.54 | bwd_microstep: 3379.58 | bwd_inner_microstep: 3378.56 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.22 [2025-06-19 20:30:11,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.54 | bwd: 3379.60 | bwd_inner: 3378.56 | bwd_allreduce: 0.99 | step: 7.23 44%|████▍ | 4447/10000 [7:00:31<8:31:55, 5.53s/it] {'loss': 0.0099, 'grad_norm': 0.6230782866477966, 'learning_rate': 2.4514447767963955e-05, 'epoch': 4.45} 44%|████▍ | 4447/10000 [7:00:31<8:31:55, 5.53s/it][2025-06-19 20:30:16,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:30:16,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.66 | bwd_microstep: 3320.16 | bwd_inner_microstep: 3319.19 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.82 [2025-06-19 20:30:16,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.66 | bwd: 3320.18 | bwd_inner: 3319.19 | bwd_allreduce: 0.94 | step: 6.82 44%|████▍ | 4448/10000 [7:00:37<8:30:07, 5.51s/it] {'loss': 0.0439, 'grad_norm': 1.4846361875534058, 'learning_rate': 2.45081371944295e-05, 'epoch': 4.45} 44%|████▍ | 4448/10000 [7:00:37<8:30:07, 5.51s/it][2025-06-19 20:30:22,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:30:22,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.84 | bwd_microstep: 3374.86 | bwd_inner_microstep: 3374.04 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.86 [2025-06-19 20:30:22,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.84 | bwd: 3374.87 | bwd_inner: 3374.04 | bwd_allreduce: 0.78 | step: 6.87 44%|████▍ | 4449/10000 [7:00:43<8:30:57, 5.52s/it] {'loss': 0.053, 'grad_norm': 2.8947854042053223, 'learning_rate': 2.450182614801238e-05, 'epoch': 4.45} 44%|████▍ | 4449/10000 [7:00:43<8:30:57, 5.52s/it][2025-06-19 20:30:27,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:30:27,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.62 | bwd_microstep: 3370.81 | bwd_inner_microstep: 3369.99 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.16 [2025-06-19 20:30:27,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.62 | bwd: 3370.82 | bwd_inner: 3369.99 | bwd_allreduce: 0.79 | step: 7.16 44%|████▍ | 4450/10000 [7:00:48<8:31:24, 5.53s/it] {'loss': 0.0071, 'grad_norm': 0.8999518752098083, 'learning_rate': 2.4495514629374592e-05, 'epoch': 4.45} 44%|████▍ | 4450/10000 [7:00:48<8:31:24, 5.53s/it][2025-06-19 20:30:33,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:30:33,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.93 | bwd_microstep: 3375.63 | bwd_inner_microstep: 3374.80 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.16 [2025-06-19 20:30:33,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.93 | bwd: 3375.64 | bwd_inner: 3374.80 | bwd_allreduce: 0.80 | step: 7.16 45%|████▍ | 4451/10000 [7:00:54<8:31:53, 5.54s/it] {'loss': 0.0728, 'grad_norm': 
2.095503330230713, 'learning_rate': 2.448920263917818e-05, 'epoch': 4.45} 45%|████▍ | 4451/10000 [7:00:54<8:31:53, 5.54s/it][2025-06-19 20:30:38,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:30:38,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.62 | bwd_microstep: 3373.27 | bwd_inner_microstep: 3372.38 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.97 [2025-06-19 20:30:38,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.62 | bwd: 3373.28 | bwd_inner: 3372.38 | bwd_allreduce: 0.86 | step: 6.97 45%|████▍ | 4452/10000 [7:00:59<8:32:06, 5.54s/it] {'loss': 0.0129, 'grad_norm': 1.4304205179214478, 'learning_rate': 2.4482890178085247e-05, 'epoch': 4.45} 45%|████▍ | 4452/10000 [7:00:59<8:32:06, 5.54s/it][2025-06-19 20:30:44,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:30:44,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.49 | bwd_microstep: 3321.88 | bwd_inner_microstep: 3320.99 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.95 [2025-06-19 20:30:44,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.49 | bwd: 3321.90 | bwd_inner: 3320.99 | bwd_allreduce: 0.86 | step: 6.95 45%|████▍ | 4453/10000 [7:01:05<8:30:00, 5.52s/it] {'loss': 0.0004, 'grad_norm': 0.015948502346873283, 'learning_rate': 2.4476577246757942e-05, 'epoch': 4.45} 45%|████▍ | 4453/10000 [7:01:05<8:30:00, 5.52s/it][2025-06-19 20:30:49,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:30:49,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.06 | bwd_microstep: 3330.10 | bwd_inner_microstep: 3329.28 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-19 
20:30:49,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.06 | bwd: 3330.11 | bwd_inner: 3329.28 | bwd_allreduce: 0.78 | step: 7.19 45%|████▍ | 4454/10000 [7:01:10<8:28:54, 5.51s/it] {'loss': 0.1134, 'grad_norm': 3.498741388320923, 'learning_rate': 2.447026384585846e-05, 'epoch': 4.45} 45%|████▍ | 4454/10000 [7:01:10<8:28:54, 5.51s/it][2025-06-19 20:30:55,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:30:55,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.21 | bwd_microstep: 3369.19 | bwd_inner_microstep: 3368.35 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.00 [2025-06-19 20:30:55,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.21 | bwd: 3369.21 | bwd_inner: 3368.35 | bwd_allreduce: 0.80 | step: 7.00 45%|████▍ | 4455/10000 [7:01:16<8:30:03, 5.52s/it] {'loss': 0.0663, 'grad_norm': 4.5904541015625, 'learning_rate': 2.4463949976049044e-05, 'epoch': 4.46} 45%|████▍ | 4455/10000 [7:01:16<8:30:03, 5.52s/it][2025-06-19 20:31:00,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:31:00,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.83 | bwd_microstep: 3332.05 | bwd_inner_microstep: 3331.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 20:31:00,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.83 | bwd: 3332.07 | bwd_inner: 3331.27 | bwd_allreduce: 0.76 | step: 6.72 45%|████▍ | 4456/10000 [7:01:21<8:28:51, 5.51s/it] {'loss': 0.067, 'grad_norm': 1.7670328617095947, 'learning_rate': 2.4457635637991996e-05, 'epoch': 4.46} 45%|████▍ | 4456/10000 [7:01:21<8:28:51, 5.51s/it][2025-06-19 20:31:06,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 
[2025-06-19 20:31:06,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.82 | bwd_microstep: 3320.89 | bwd_inner_microstep: 3320.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-19 20:31:06,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.82 | bwd: 3320.91 | bwd_inner: 3320.08 | bwd_allreduce: 0.78 | step: 6.98 45%|████▍ | 4457/10000 [7:01:27<8:27:46, 5.50s/it] {'loss': 0.0136, 'grad_norm': 1.1295289993286133, 'learning_rate': 2.4451320832349654e-05, 'epoch': 4.46} 45%|████▍ | 4457/10000 [7:01:27<8:27:46, 5.50s/it][2025-06-19 20:31:11,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:31:11,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.53 | bwd_microstep: 3365.31 | bwd_inner_microstep: 3364.48 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.05 [2025-06-19 20:31:11,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.53 | bwd: 3365.32 | bwd_inner: 3364.48 | bwd_allreduce: 0.80 | step: 7.05 45%|████▍ | 4458/10000 [7:01:32<8:28:44, 5.51s/it] {'loss': 0.0107, 'grad_norm': 0.9655203819274902, 'learning_rate': 2.444500555978442e-05, 'epoch': 4.46} 45%|████▍ | 4458/10000 [7:01:32<8:28:44, 5.51s/it][2025-06-19 20:31:17,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:31:17,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.71 | bwd_microstep: 3377.11 | bwd_inner_microstep: 3376.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-19 20:31:17,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.71 | bwd: 3377.13 | bwd_inner: 3376.31 | bwd_allreduce: 0.77 | step: 6.90 45%|████▍ | 4459/10000 [7:01:38<8:30:03, 5.52s/it] {'loss': 0.019, 'grad_norm': 1.4767448902130127, 'learning_rate': 
2.4438689820958736e-05, 'epoch': 4.46} 45%|████▍ | 4459/10000 [7:01:38<8:30:03, 5.52s/it][2025-06-19 20:31:22,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:31:22,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.05 | bwd_microstep: 3364.19 | bwd_inner_microstep: 3363.37 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.87 [2025-06-19 20:31:22,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.05 | bwd: 3364.20 | bwd_inner: 3363.37 | bwd_allreduce: 0.79 | step: 6.87 45%|████▍ | 4460/10000 [7:01:43<8:30:11, 5.53s/it] {'loss': 0.1316, 'grad_norm': 3.490330457687378, 'learning_rate': 2.443237361653509e-05, 'epoch': 4.46} 45%|████▍ | 4460/10000 [7:01:43<8:30:11, 5.53s/it][2025-06-19 20:31:28,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:31:28,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.45 | bwd_microstep: 3380.51 | bwd_inner_microstep: 3379.49 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.79 [2025-06-19 20:31:28,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.45 | bwd: 3380.52 | bwd_inner: 3379.49 | bwd_allreduce: 0.98 | step: 7.79 45%|████▍ | 4461/10000 [7:01:49<8:30:57, 5.53s/it] {'loss': 0.0411, 'grad_norm': 1.6714805364608765, 'learning_rate': 2.442605694717602e-05, 'epoch': 4.46} 45%|████▍ | 4461/10000 [7:01:49<8:30:57, 5.53s/it][2025-06-19 20:31:34,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:31:34,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.89 | bwd_microstep: 3381.50 | bwd_inner_microstep: 3380.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 20:31:34,024] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2129.89 | bwd: 3381.51 | bwd_inner: 3380.72 | bwd_allreduce: 0.75 | step: 6.64 45%|████▍ | 4462/10000 [7:01:54<8:31:20, 5.54s/it] {'loss': 0.0704, 'grad_norm': 1.7011743783950806, 'learning_rate': 2.4419739813544117e-05, 'epoch': 4.46} 45%|████▍ | 4462/10000 [7:01:54<8:31:20, 5.54s/it][2025-06-19 20:31:39,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 20:31:39,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.56 | bwd_microstep: 3373.98 | bwd_inner_microstep: 3372.84 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.94 [2025-06-19 20:31:39,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.56 | bwd: 3373.99 | bwd_inner: 3372.84 | bwd_allreduce: 1.11 | step: 7.95 45%|████▍ | 4463/10000 [7:02:00<8:31:23, 5.54s/it] {'loss': 0.0084, 'grad_norm': 0.6032674312591553, 'learning_rate': 2.4413422216302022e-05, 'epoch': 4.46} 45%|████▍ | 4463/10000 [7:02:00<8:31:23, 5.54s/it][2025-06-19 20:31:45,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:31:45,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.41 | bwd_microstep: 3315.77 | bwd_inner_microstep: 3314.78 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.50 [2025-06-19 20:31:45,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.41 | bwd: 3315.79 | bwd_inner: 3314.78 | bwd_allreduce: 0.97 | step: 7.50 45%|████▍ | 4464/10000 [7:02:05<8:29:07, 5.52s/it] {'loss': 0.1806, 'grad_norm': 3.4547572135925293, 'learning_rate': 2.4407104156112417e-05, 'epoch': 4.46} 45%|████▍ | 4464/10000 [7:02:05<8:29:07, 5.52s/it][2025-06-19 20:31:50,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:31:50,608] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.41 | bwd_microstep: 3395.56 | bwd_inner_microstep: 3394.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 20:31:50,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.41 | bwd: 3395.57 | bwd_inner: 3394.76 | bwd_allreduce: 0.77 | step: 6.74 45%|████▍ | 4465/10000 [7:02:11<8:30:38, 5.54s/it] {'loss': 0.0782, 'grad_norm': 3.413529396057129, 'learning_rate': 2.4400785633638044e-05, 'epoch': 4.46} 45%|████▍ | 4465/10000 [7:02:11<8:30:38, 5.54s/it][2025-06-19 20:31:56,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:31:56,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.47 | bwd_microstep: 3400.06 | bwd_inner_microstep: 3399.23 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.03 [2025-06-19 20:31:56,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.47 | bwd: 3400.07 | bwd_inner: 3399.23 | bwd_allreduce: 0.80 | step: 7.03 45%|████▍ | 4466/10000 [7:02:16<8:31:36, 5.55s/it] {'loss': 0.038, 'grad_norm': 1.9834798574447632, 'learning_rate': 2.4394466649541678e-05, 'epoch': 4.47} 45%|████▍ | 4466/10000 [7:02:16<8:31:36, 5.55s/it][2025-06-19 20:32:01,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:32:01,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.12 | bwd_microstep: 3321.06 | bwd_inner_microstep: 3320.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 20:32:01,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.12 | bwd: 3321.08 | bwd_inner: 3320.27 | bwd_allreduce: 0.76 | step: 6.95 45%|████▍ | 4467/10000 [7:02:22<8:29:27, 5.52s/it] {'loss': 0.0041, 'grad_norm': 0.18176288902759552, 'learning_rate': 2.438814720448616e-05, 'epoch': 4.47} 45%|████▍ | 
4467/10000 [7:02:22<8:29:27, 5.52s/it][2025-06-19 20:32:07,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:32:07,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.76 | bwd_microstep: 3321.74 | bwd_inner_microstep: 3320.55 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.99 [2025-06-19 20:32:07,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.76 | bwd: 3321.77 | bwd_inner: 3320.55 | bwd_allreduce: 1.16 | step: 7.99 45%|████▍ | 4468/10000 [7:02:27<8:27:56, 5.51s/it] {'loss': 0.016, 'grad_norm': 1.1654492616653442, 'learning_rate': 2.438182729913437e-05, 'epoch': 4.47} 45%|████▍ | 4468/10000 [7:02:27<8:27:56, 5.51s/it][2025-06-19 20:32:12,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:32:12,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.61 | bwd_microstep: 3373.11 | bwd_inner_microstep: 3372.14 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.62 [2025-06-19 20:32:12,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.61 | bwd: 3373.13 | bwd_inner: 3372.14 | bwd_allreduce: 0.94 | step: 7.63 45%|████▍ | 4469/10000 [7:02:33<8:28:56, 5.52s/it] {'loss': 0.0103, 'grad_norm': 0.5758776068687439, 'learning_rate': 2.4375506934149226e-05, 'epoch': 4.47} 45%|████▍ | 4469/10000 [7:02:33<8:28:56, 5.52s/it][2025-06-19 20:32:18,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:32:18,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.49 | bwd_microstep: 3331.79 | bwd_inner_microstep: 3331.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 20:32:18,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.49 | bwd: 3331.81 
| bwd_inner: 3331.00 | bwd_allreduce: 0.77 | step: 6.68 45%|████▍ | 4470/10000 [7:02:38<8:27:42, 5.51s/it] {'loss': 0.0107, 'grad_norm': 0.8880674242973328, 'learning_rate': 2.4369186110193706e-05, 'epoch': 4.47} 45%|████▍ | 4470/10000 [7:02:38<8:27:42, 5.51s/it][2025-06-19 20:32:23,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:32:23,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.73 | bwd_microstep: 3375.25 | bwd_inner_microstep: 3374.27 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.44 [2025-06-19 20:32:23,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.73 | bwd: 3375.26 | bwd_inner: 3374.27 | bwd_allreduce: 0.95 | step: 7.45 45%|████▍ | 4471/10000 [7:02:44<8:28:38, 5.52s/it] {'loss': 0.0375, 'grad_norm': 2.8906784057617188, 'learning_rate': 2.4362864827930855e-05, 'epoch': 4.47} 45%|████▍ | 4471/10000 [7:02:44<8:28:38, 5.52s/it][2025-06-19 20:32:29,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:32:29,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.52 | bwd_microstep: 3312.97 | bwd_inner_microstep: 3312.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 20:32:29,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.52 | bwd: 3312.98 | bwd_inner: 3312.19 | bwd_allreduce: 0.75 | step: 6.76 45%|████▍ | 4472/10000 [7:02:49<8:27:23, 5.51s/it] {'loss': 0.0044, 'grad_norm': 0.2391679286956787, 'learning_rate': 2.435654308802373e-05, 'epoch': 4.47} 45%|████▍ | 4472/10000 [7:02:49<8:27:23, 5.51s/it][2025-06-19 20:32:34,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:32:34,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2106.10 | bwd_microstep: 3320.55 | bwd_inner_microstep: 3319.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-19 20:32:34,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.10 | bwd: 3320.56 | bwd_inner: 3319.76 | bwd_allreduce: 0.76 | step: 6.75 45%|████▍ | 4473/10000 [7:02:55<8:26:10, 5.49s/it] {'loss': 0.0049, 'grad_norm': 0.17800822854042053, 'learning_rate': 2.4350220891135443e-05, 'epoch': 4.47} 45%|████▍ | 4473/10000 [7:02:55<8:26:10, 5.49s/it][2025-06-19 20:32:40,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:32:40,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.82 | bwd_microstep: 3327.31 | bwd_inner_microstep: 3326.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 20:32:40,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.82 | bwd: 3327.32 | bwd_inner: 3326.50 | bwd_allreduce: 0.78 | step: 6.84 45%|████▍ | 4474/10000 [7:03:00<8:25:15, 5.49s/it] {'loss': 0.0085, 'grad_norm': 0.9376851916313171, 'learning_rate': 2.4343898237929184e-05, 'epoch': 4.47} 45%|████▍ | 4474/10000 [7:03:00<8:25:15, 5.49s/it][2025-06-19 20:32:45,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:32:45,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.70 | bwd_microstep: 3381.35 | bwd_inner_microstep: 3380.55 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 20:32:45,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.70 | bwd: 3381.36 | bwd_inner: 3380.55 | bwd_allreduce: 0.77 | step: 6.77 45%|████▍ | 4475/10000 [7:03:06<8:26:52, 5.50s/it] {'loss': 0.0141, 'grad_norm': 0.6059772968292236, 'learning_rate': 2.4337575129068157e-05, 'epoch': 4.47} 45%|████▍ | 4475/10000 [7:03:06<8:26:52, 5.50s/it][2025-06-19 
20:32:51,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:32:51,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.52 | bwd_microstep: 3321.88 | bwd_inner_microstep: 3321.06 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.04 [2025-06-19 20:32:51,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.53 | bwd: 3321.90 | bwd_inner: 3321.06 | bwd_allreduce: 0.79 | step: 7.05 45%|████▍ | 4476/10000 [7:03:11<8:25:51, 5.49s/it] {'loss': 0.0529, 'grad_norm': 2.072014570236206, 'learning_rate': 2.4331251565215634e-05, 'epoch': 4.48} 45%|████▍ | 4476/10000 [7:03:11<8:25:51, 5.49s/it][2025-06-19 20:32:56,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:32:56,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.04 | bwd_microstep: 3369.31 | bwd_inner_microstep: 3368.43 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.34 [2025-06-19 20:32:56,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.04 | bwd: 3369.32 | bwd_inner: 3368.43 | bwd_allreduce: 0.86 | step: 7.34 45%|████▍ | 4477/10000 [7:03:17<8:26:56, 5.51s/it] {'loss': 0.0051, 'grad_norm': 0.2765474319458008, 'learning_rate': 2.4324927547034923e-05, 'epoch': 4.48} 45%|████▍ | 4477/10000 [7:03:17<8:26:56, 5.51s/it][2025-06-19 20:33:02,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:33:02,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.14 | bwd_microstep: 3323.19 | bwd_inner_microstep: 3322.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.15 [2025-06-19 20:33:02,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.14 | bwd: 3323.21 | bwd_inner: 3322.39 | bwd_allreduce: 0.77 | 
step: 7.15 45%|████▍ | 4478/10000 [7:03:22<8:25:44, 5.50s/it] {'loss': 0.055, 'grad_norm': 1.9867252111434937, 'learning_rate': 2.4318603075189383e-05, 'epoch': 4.48} 45%|████▍ | 4478/10000 [7:03:22<8:25:44, 5.50s/it][2025-06-19 20:33:07,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:33:07,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.49 | bwd_microstep: 3363.06 | bwd_inner_microstep: 3362.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 20:33:07,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.49 | bwd: 3363.07 | bwd_inner: 3362.24 | bwd_allreduce: 0.79 | step: 7.04 45%|████▍ | 4479/10000 [7:03:28<8:26:41, 5.51s/it] {'loss': 0.0728, 'grad_norm': 2.1311542987823486, 'learning_rate': 2.4312278150342423e-05, 'epoch': 4.48} 45%|████▍ | 4479/10000 [7:03:28<8:26:41, 5.51s/it][2025-06-19 20:33:13,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:33:13,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.44 | bwd_microstep: 3313.12 | bwd_inner_microstep: 3312.17 | bwd_allreduce_microstep: 0.91 | step_microstep: 6.96 [2025-06-19 20:33:13,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.44 | bwd: 3313.14 | bwd_inner: 3312.17 | bwd_allreduce: 0.92 | step: 6.97 45%|████▍ | 4480/10000 [7:03:33<8:25:16, 5.49s/it] {'loss': 0.008, 'grad_norm': 0.27814677357673645, 'learning_rate': 2.43059527731575e-05, 'epoch': 4.48} 45%|████▍ | 4480/10000 [7:03:33<8:25:16, 5.49s/it][2025-06-19 20:33:18,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 20:33:18,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.56 | bwd_microstep: 3312.18 | 
bwd_inner_microstep: 3311.17 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.33 [2025-06-19 20:33:18,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.57 | bwd: 3312.19 | bwd_inner: 3311.17 | bwd_allreduce: 0.97 | step: 7.33 45%|████▍ | 4481/10000 [7:03:39<8:24:07, 5.48s/it] {'loss': 0.0084, 'grad_norm': 0.4129580557346344, 'learning_rate': 2.4299626944298118e-05, 'epoch': 4.48} 45%|████▍ | 4481/10000 [7:03:39<8:24:07, 5.48s/it][2025-06-19 20:33:24,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:33:24,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.05 | bwd_microstep: 3363.10 | bwd_inner_microstep: 3362.30 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-19 20:33:24,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.05 | bwd: 3363.12 | bwd_inner: 3362.30 | bwd_allreduce: 0.78 | step: 7.15 45%|████▍ | 4482/10000 [7:03:44<8:25:24, 5.50s/it] {'loss': 0.007, 'grad_norm': 0.33295246958732605, 'learning_rate': 2.4293300664427823e-05, 'epoch': 4.48} 45%|████▍ | 4482/10000 [7:03:44<8:25:24, 5.50s/it][2025-06-19 20:33:29,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:33:29,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.81 | bwd_microstep: 3314.12 | bwd_inner_microstep: 3313.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 20:33:29,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.81 | bwd: 3314.14 | bwd_inner: 3313.33 | bwd_allreduce: 0.76 | step: 6.64 45%|████▍ | 4483/10000 [7:03:50<8:24:07, 5.48s/it] {'loss': 0.0356, 'grad_norm': 2.0733988285064697, 'learning_rate': 2.4286973934210214e-05, 'epoch': 4.48} 45%|████▍ | 4483/10000 [7:03:50<8:24:07, 5.48s/it][2025-06-19 20:33:35,099] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:33:35,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.10 | bwd_microstep: 3375.12 | bwd_inner_microstep: 3374.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 20:33:35,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.10 | bwd: 3375.14 | bwd_inner: 3374.34 | bwd_allreduce: 0.75 | step: 6.60 45%|████▍ | 4484/10000 [7:03:55<8:25:31, 5.50s/it] {'loss': 0.0472, 'grad_norm': 3.0330896377563477, 'learning_rate': 2.4280646754308933e-05, 'epoch': 4.48} 45%|████▍ | 4484/10000 [7:03:55<8:25:31, 5.50s/it][2025-06-19 20:33:40,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:33:40,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.97 | bwd_microstep: 3322.14 | bwd_inner_microstep: 3321.29 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.43 [2025-06-19 20:33:40,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.97 | bwd: 3322.15 | bwd_inner: 3321.29 | bwd_allreduce: 0.81 | step: 7.43 45%|████▍ | 4485/10000 [7:04:01<8:24:30, 5.49s/it] {'loss': 0.0557, 'grad_norm': 3.3777973651885986, 'learning_rate': 2.4274319125387677e-05, 'epoch': 4.49} 45%|████▍ | 4485/10000 [7:04:01<8:24:30, 5.49s/it][2025-06-19 20:33:46,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:33:46,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.12 | bwd_microstep: 3320.39 | bwd_inner_microstep: 3319.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 20:33:46,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.12 | bwd: 3320.40 | bwd_inner: 3319.60 | bwd_allreduce: 0.76 | step: 6.64 45%|████▍ | 4486/10000 
[7:04:06<8:24:01, 5.48s/it] {'loss': 0.0083, 'grad_norm': 0.842046320438385, 'learning_rate': 2.4267991048110188e-05, 'epoch': 4.49} 45%|████▍ | 4486/10000 [7:04:06<8:24:01, 5.48s/it][2025-06-19 20:33:51,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:33:51,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.42 | bwd_microstep: 3359.83 | bwd_inner_microstep: 3359.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 20:33:51,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.42 | bwd: 3359.84 | bwd_inner: 3359.04 | bwd_allreduce: 0.76 | step: 6.66 45%|████▍ | 4487/10000 [7:04:12<8:25:06, 5.50s/it] {'loss': 0.0337, 'grad_norm': 2.29325270652771, 'learning_rate': 2.4261662523140235e-05, 'epoch': 4.49} 45%|████▍ | 4487/10000 [7:04:12<8:25:06, 5.50s/it][2025-06-19 20:33:57,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:33:57,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.77 | bwd_microstep: 3318.42 | bwd_inner_microstep: 3317.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 20:33:57,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.77 | bwd: 3318.43 | bwd_inner: 3317.63 | bwd_allreduce: 0.76 | step: 6.67 45%|████▍ | 4488/10000 [7:04:17<8:23:53, 5.49s/it] {'loss': 0.1332, 'grad_norm': 2.2724602222442627, 'learning_rate': 2.4255333551141674e-05, 'epoch': 4.49} 45%|████▍ | 4488/10000 [7:04:17<8:23:53, 5.49s/it][2025-06-19 20:34:02,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:34:02,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.76 | bwd_microstep: 3380.63 | bwd_inner_microstep: 3379.57 | 
bwd_allreduce_microstep: 1.01 | step_microstep: 7.44 [2025-06-19 20:34:02,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.76 | bwd: 3380.65 | bwd_inner: 3379.57 | bwd_allreduce: 1.04 | step: 7.45 45%|████▍ | 4489/10000 [7:04:23<8:25:37, 5.50s/it] {'loss': 0.0876, 'grad_norm': 3.3909051418304443, 'learning_rate': 2.4249004132778363e-05, 'epoch': 4.49} 45%|████▍ | 4489/10000 [7:04:23<8:25:37, 5.50s/it][2025-06-19 20:34:08,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:34:08,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.27 | bwd_microstep: 3401.32 | bwd_inner_microstep: 3400.32 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.02 [2025-06-19 20:34:08,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.27 | bwd: 3401.33 | bwd_inner: 3400.32 | bwd_allreduce: 0.97 | step: 7.03 45%|████▍ | 4490/10000 [7:04:28<8:27:30, 5.53s/it] {'loss': 0.0295, 'grad_norm': 1.3420777320861816, 'learning_rate': 2.4242674268714243e-05, 'epoch': 4.49} 45%|████▍ | 4490/10000 [7:04:28<8:27:30, 5.53s/it][2025-06-19 20:34:13,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:34:13,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.16 | bwd_microstep: 3317.84 | bwd_inner_microstep: 3317.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 20:34:13,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.16 | bwd: 3317.85 | bwd_inner: 3317.06 | bwd_allreduce: 0.75 | step: 6.61 45%|████▍ | 4491/10000 [7:04:34<8:25:45, 5.51s/it] {'loss': 0.0299, 'grad_norm': 2.6243293285369873, 'learning_rate': 2.4236343959613283e-05, 'epoch': 4.49} 45%|████▍ | 4491/10000 [7:04:34<8:25:45, 5.51s/it][2025-06-19 20:34:19,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:34:19,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.77 | bwd_microstep: 3309.64 | bwd_inner_microstep: 3308.82 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 20:34:19,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.77 | bwd: 3309.65 | bwd_inner: 3308.82 | bwd_allreduce: 0.78 | step: 7.22 45%|████▍ | 4492/10000 [7:04:39<8:23:59, 5.49s/it] {'loss': 0.0624, 'grad_norm': 2.0427968502044678, 'learning_rate': 2.42300132061395e-05, 'epoch': 4.49} 45%|████▍ | 4492/10000 [7:04:39<8:23:59, 5.49s/it][2025-06-19 20:34:24,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:34:24,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.25 | bwd_microstep: 3320.17 | bwd_inner_microstep: 3319.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 20:34:24,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.25 | bwd: 3320.18 | bwd_inner: 3319.38 | bwd_allreduce: 0.76 | step: 6.72 45%|████▍ | 4493/10000 [7:04:45<8:23:03, 5.48s/it] {'loss': 0.0264, 'grad_norm': 1.1671721935272217, 'learning_rate': 2.422368200895697e-05, 'epoch': 4.49} 45%|████▍ | 4493/10000 [7:04:45<8:23:03, 5.48s/it][2025-06-19 20:34:29,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:34:29,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.55 | bwd_microstep: 3320.86 | bwd_inner_microstep: 3320.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 20:34:29,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.55 | bwd: 3320.87 | bwd_inner: 3320.07 | bwd_allreduce: 0.76 | step: 6.67 45%|████▍ | 4494/10000 [7:04:50<8:22:22, 5.47s/it] {'loss': 
0.0463, 'grad_norm': 2.677708148956299, 'learning_rate': 2.4217350368729796e-05, 'epoch': 4.49} 45%|████▍ | 4494/10000 [7:04:50<8:22:22, 5.47s/it][2025-06-19 20:34:35,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:34:35,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.55 | bwd_microstep: 3369.42 | bwd_inner_microstep: 3368.39 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.47 [2025-06-19 20:34:35,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.55 | bwd: 3369.44 | bwd_inner: 3368.39 | bwd_allreduce: 0.99 | step: 7.47 45%|████▍ | 4495/10000 [7:04:56<8:24:05, 5.49s/it] {'loss': 0.0422, 'grad_norm': 2.02404522895813, 'learning_rate': 2.4211018286122144e-05, 'epoch': 4.5} 45%|████▍ | 4495/10000 [7:04:56<8:24:05, 5.49s/it][2025-06-19 20:34:40,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:34:40,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.66 | bwd_microstep: 3309.45 | bwd_inner_microstep: 3308.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 20:34:40,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.66 | bwd: 3309.46 | bwd_inner: 3308.66 | bwd_allreduce: 0.76 | step: 6.64 45%|████▍ | 4496/10000 [7:05:01<8:22:50, 5.48s/it] {'loss': 0.0371, 'grad_norm': 1.6202329397201538, 'learning_rate': 2.420468576179822e-05, 'epoch': 4.5} 45%|████▍ | 4496/10000 [7:05:01<8:22:50, 5.48s/it][2025-06-19 20:34:46,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:34:46,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.67 | bwd_microstep: 3319.78 | bwd_inner_microstep: 3318.80 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.42 
[2025-06-19 20:34:46,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.67 | bwd: 3319.79 | bwd_inner: 3318.80 | bwd_allreduce: 0.95 | step: 7.43 45%|████▍ | 4497/10000 [7:05:07<8:22:14, 5.48s/it] {'loss': 0.0164, 'grad_norm': 2.079805612564087, 'learning_rate': 2.419835279642227e-05, 'epoch': 4.5} 45%|████▍ | 4497/10000 [7:05:07<8:22:14, 5.48s/it][2025-06-19 20:34:51,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:34:51,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.82 | bwd_microstep: 3361.54 | bwd_inner_microstep: 3360.67 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.88 [2025-06-19 20:34:51,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.82 | bwd: 3361.55 | bwd_inner: 3360.67 | bwd_allreduce: 0.83 | step: 6.88 45%|████▍ | 4498/10000 [7:05:12<8:23:40, 5.49s/it] {'loss': 0.0123, 'grad_norm': 0.46315494179725647, 'learning_rate': 2.4192019390658595e-05, 'epoch': 4.5} 45%|████▍ | 4498/10000 [7:05:12<8:23:40, 5.49s/it][2025-06-19 20:34:57,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:34:57,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.10 | bwd_microstep: 3365.39 | bwd_inner_microstep: 3364.25 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.73 [2025-06-19 20:34:57,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.10 | bwd: 3365.41 | bwd_inner: 3364.25 | bwd_allreduce: 1.11 | step: 7.75 45%|████▍ | 4499/10000 [7:05:18<8:27:56, 5.54s/it] {'loss': 0.0425, 'grad_norm': 2.667973041534424, 'learning_rate': 2.4185685545171547e-05, 'epoch': 4.5} 45%|████▍ | 4499/10000 [7:05:18<8:27:56, 5.54s/it][2025-06-19 20:35:03,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.79 [2025-06-19 20:35:03,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.08 | bwd_microstep: 3356.30 | bwd_inner_microstep: 3355.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 20:35:03,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.09 | bwd: 3356.31 | bwd_inner: 3355.51 | bwd_allreduce: 0.76 | step: 6.67 45%|████▌ | 4500/10000 [7:05:23<8:27:48, 5.54s/it] {'loss': 0.0195, 'grad_norm': 1.5506211519241333, 'learning_rate': 2.417935126062551e-05, 'epoch': 4.5} 45%|████▌ | 4500/10000 [7:05:23<8:27:48, 5.54s/it][2025-06-19 20:35:08,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:35:08,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.20 | bwd_microstep: 3370.16 | bwd_inner_microstep: 3369.06 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.60 [2025-06-19 20:35:08,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.20 | bwd: 3370.18 | bwd_inner: 3369.06 | bwd_allreduce: 1.07 | step: 7.60 45%|████▌ | 4501/10000 [7:05:29<8:27:33, 5.54s/it] {'loss': 0.1672, 'grad_norm': 3.702868700027466, 'learning_rate': 2.4173016537684918e-05, 'epoch': 4.5} 45%|████▌ | 4501/10000 [7:05:29<8:27:33, 5.54s/it][2025-06-19 20:35:14,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:35:14,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.07 | bwd_microstep: 3313.55 | bwd_inner_microstep: 3312.77 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.55 [2025-06-19 20:35:14,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.07 | bwd: 3313.56 | bwd_inner: 3312.77 | bwd_allreduce: 0.75 | step: 6.55 45%|████▌ | 4502/10000 [7:05:34<8:25:22, 5.52s/it] {'loss': 0.0853, 'grad_norm': 3.1731159687042236, 'learning_rate': 
2.416668137701426e-05, 'epoch': 4.5} 45%|████▌ | 4502/10000 [7:05:34<8:25:22, 5.52s/it][2025-06-19 20:35:19,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:35:19,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.79 | bwd_microstep: 3377.37 | bwd_inner_microstep: 3376.25 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.81 [2025-06-19 20:35:19,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.79 | bwd: 3377.39 | bwd_inner: 3376.25 | bwd_allreduce: 1.08 | step: 7.82 45%|████▌ | 4503/10000 [7:05:40<8:26:18, 5.53s/it] {'loss': 0.0331, 'grad_norm': 3.7508010864257812, 'learning_rate': 2.4160345779278066e-05, 'epoch': 4.5} 45%|████▌ | 4503/10000 [7:05:40<8:26:18, 5.53s/it][2025-06-19 20:35:25,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:35:25,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.99 | bwd_microstep: 3325.66 | bwd_inner_microstep: 3324.76 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.83 [2025-06-19 20:35:25,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.00 | bwd: 3325.68 | bwd_inner: 3324.76 | bwd_allreduce: 0.87 | step: 6.83 45%|████▌ | 4504/10000 [7:05:45<8:25:15, 5.52s/it] {'loss': 0.0185, 'grad_norm': 1.3392797708511353, 'learning_rate': 2.415400974514091e-05, 'epoch': 4.5} 45%|████▌ | 4504/10000 [7:05:45<8:25:15, 5.52s/it][2025-06-19 20:35:30,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:35:30,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.04 | bwd_microstep: 3371.25 | bwd_inner_microstep: 3370.43 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.83 [2025-06-19 20:35:30,750] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2142.04 | bwd: 3371.27 | bwd_inner: 3370.43 | bwd_allreduce: 0.80 | step: 6.83 45%|████▌ | 4505/10000 [7:05:51<8:26:12, 5.53s/it] {'loss': 0.1537, 'grad_norm': 6.94144344329834, 'learning_rate': 2.414767327526741e-05, 'epoch': 4.5} 45%|████▌ | 4505/10000 [7:05:51<8:26:12, 5.53s/it][2025-06-19 20:35:36,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:35:36,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.72 | bwd_microstep: 3310.80 | bwd_inner_microstep: 3309.92 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.11 [2025-06-19 20:35:36,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.72 | bwd: 3310.82 | bwd_inner: 3309.92 | bwd_allreduce: 0.86 | step: 7.11 45%|████▌ | 4506/10000 [7:05:57<8:24:02, 5.50s/it] {'loss': 0.0051, 'grad_norm': 0.37360435724258423, 'learning_rate': 2.4141336370322233e-05, 'epoch': 4.51} 45%|████▌ | 4506/10000 [7:05:57<8:24:02, 5.50s/it][2025-06-19 20:35:41,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:35:41,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.03 | bwd_microstep: 3317.76 | bwd_inner_microstep: 3316.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 20:35:41,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.03 | bwd: 3317.77 | bwd_inner: 3316.97 | bwd_allreduce: 0.76 | step: 6.82 45%|████▌ | 4507/10000 [7:06:02<8:22:56, 5.49s/it] {'loss': 0.0054, 'grad_norm': 0.5102024078369141, 'learning_rate': 2.4134999030970092e-05, 'epoch': 4.51} 45%|████▌ | 4507/10000 [7:06:02<8:22:56, 5.49s/it][2025-06-19 20:35:47,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:35:47,137] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.80 | bwd_microstep: 3312.01 | bwd_inner_microstep: 3311.02 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.11 [2025-06-19 20:35:47,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.81 | bwd: 3312.04 | bwd_inner: 3311.02 | bwd_allreduce: 0.96 | step: 7.11 45%|████▌ | 4508/10000 [7:06:07<8:22:08, 5.49s/it] {'loss': 0.0351, 'grad_norm': 4.078675746917725, 'learning_rate': 2.412866125787574e-05, 'epoch': 4.51} 45%|████▌ | 4508/10000 [7:06:07<8:22:08, 5.49s/it][2025-06-19 20:35:52,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:35:52,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.24 | bwd_microstep: 3312.98 | bwd_inner_microstep: 3312.06 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.90 [2025-06-19 20:35:52,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.24 | bwd: 3313.00 | bwd_inner: 3312.06 | bwd_allreduce: 0.89 | step: 6.90 45%|████▌ | 4509/10000 [7:06:13<8:21:30, 5.48s/it] {'loss': 0.0501, 'grad_norm': 1.5916324853897095, 'learning_rate': 2.4122323051703983e-05, 'epoch': 4.51} 45%|████▌ | 4509/10000 [7:06:13<8:21:30, 5.48s/it][2025-06-19 20:35:58,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:35:58,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.21 | bwd_microstep: 3361.44 | bwd_inner_microstep: 3360.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 20:35:58,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.21 | bwd: 3361.45 | bwd_inner: 3360.64 | bwd_allreduce: 0.77 | step: 6.86 45%|████▌ | 4510/10000 [7:06:18<8:22:47, 5.50s/it] {'loss': 0.0222, 'grad_norm': 1.1542668342590332, 'learning_rate': 2.4115984413119673e-05, 'epoch': 4.51} 45%|████▌ | 
4510/10000 [7:06:18<8:22:47, 5.50s/it][2025-06-19 20:36:03,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:36:03,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.64 | bwd_microstep: 3309.13 | bwd_inner_microstep: 3308.17 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.41 [2025-06-19 20:36:03,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.64 | bwd: 3309.15 | bwd_inner: 3308.17 | bwd_allreduce: 0.93 | step: 7.42 45%|████▌ | 4511/10000 [7:06:24<8:21:27, 5.48s/it] {'loss': 0.0189, 'grad_norm': 0.9623106122016907, 'learning_rate': 2.4109645342787705e-05, 'epoch': 4.51} 45%|████▌ | 4511/10000 [7:06:24<8:21:27, 5.48s/it][2025-06-19 20:36:09,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:36:09,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.41 | bwd_microstep: 3364.47 | bwd_inner_microstep: 3363.35 | bwd_allreduce_microstep: 1.04 | step_microstep: 8.02 [2025-06-19 20:36:09,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.41 | bwd: 3364.49 | bwd_inner: 3363.35 | bwd_allreduce: 1.08 | step: 8.03 45%|████▌ | 4512/10000 [7:06:29<8:22:59, 5.50s/it] {'loss': 0.0028, 'grad_norm': 0.2186700999736786, 'learning_rate': 2.410330584137301e-05, 'epoch': 4.51} 45%|████▌ | 4512/10000 [7:06:29<8:22:59, 5.50s/it][2025-06-19 20:36:14,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:36:14,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.22 | bwd_microstep: 3359.39 | bwd_inner_microstep: 3358.54 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.87 [2025-06-19 20:36:14,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.22 | bwd: 3359.41 
| bwd_inner: 3358.54 | bwd_allreduce: 0.81 | step: 6.87 45%|████▌ | 4513/10000 [7:06:35<8:23:48, 5.51s/it] {'loss': 0.0147, 'grad_norm': 0.9020642638206482, 'learning_rate': 2.4096965909540578e-05, 'epoch': 4.51} 45%|████▌ | 4513/10000 [7:06:35<8:23:48, 5.51s/it][2025-06-19 20:36:20,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:36:20,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.47 | bwd_microstep: 3363.87 | bwd_inner_microstep: 3363.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 20:36:20,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.47 | bwd: 3363.88 | bwd_inner: 3363.07 | bwd_allreduce: 0.76 | step: 6.88 45%|████▌ | 4514/10000 [7:06:40<8:24:17, 5.52s/it] {'loss': 0.009, 'grad_norm': 0.9971122741699219, 'learning_rate': 2.409062554795544e-05, 'epoch': 4.51} 45%|████▌ | 4514/10000 [7:06:40<8:24:17, 5.52s/it][2025-06-19 20:36:25,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:36:25,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.34 | bwd_microstep: 3309.06 | bwd_inner_microstep: 3308.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 20:36:25,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.34 | bwd: 3309.08 | bwd_inner: 3308.27 | bwd_allreduce: 0.76 | step: 6.81 45%|████▌ | 4515/10000 [7:06:46<8:22:23, 5.50s/it] {'loss': 0.0043, 'grad_norm': 0.3340526521205902, 'learning_rate': 2.408428475728266e-05, 'epoch': 4.51} 45%|████▌ | 4515/10000 [7:06:46<8:22:23, 5.50s/it][2025-06-19 20:36:31,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:36:31,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2101.80 | bwd_microstep: 3311.53 | bwd_inner_microstep: 3310.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 20:36:31,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.80 | bwd: 3311.54 | bwd_inner: 3310.75 | bwd_allreduce: 0.75 | step: 6.55 45%|████▌ | 4516/10000 [7:06:51<8:21:02, 5.48s/it] {'loss': 0.0237, 'grad_norm': 1.2490533590316772, 'learning_rate': 2.407794353818737e-05, 'epoch': 4.52} 45%|████▌ | 4516/10000 [7:06:51<8:21:02, 5.48s/it][2025-06-19 20:36:36,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:36:36,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.37 | bwd_microstep: 3316.35 | bwd_inner_microstep: 3315.43 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.39 [2025-06-19 20:36:36,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.37 | bwd: 3316.36 | bwd_inner: 3315.43 | bwd_allreduce: 0.89 | step: 7.39 45%|████▌ | 4517/10000 [7:06:57<8:20:25, 5.48s/it] {'loss': 0.0129, 'grad_norm': 1.286571741104126, 'learning_rate': 2.407160189133473e-05, 'epoch': 4.52} 45%|████▌ | 4517/10000 [7:06:57<8:20:25, 5.48s/it][2025-06-19 20:36:42,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:36:42,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.10 | bwd_microstep: 3365.92 | bwd_inner_microstep: 3365.10 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.93 [2025-06-19 20:36:42,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.10 | bwd: 3365.93 | bwd_inner: 3365.10 | bwd_allreduce: 0.79 | step: 6.93 45%|████▌ | 4518/10000 [7:07:02<8:22:00, 5.49s/it] {'loss': 0.0018, 'grad_norm': 0.14581233263015747, 'learning_rate': 2.4065259817389938e-05, 'epoch': 4.52} 45%|████▌ | 4518/10000 [7:07:02<8:22:00, 5.49s/it][2025-06-19 20:36:47,605] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:36:47,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.21 | bwd_microstep: 3350.43 | bwd_inner_microstep: 3349.60 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.31 [2025-06-19 20:36:47,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.21 | bwd: 3350.44 | bwd_inner: 3349.60 | bwd_allreduce: 0.80 | step: 7.31 45%|████▌ | 4519/10000 [7:07:08<8:22:34, 5.50s/it] {'loss': 0.0584, 'grad_norm': 1.8010085821151733, 'learning_rate': 2.405891731701827e-05, 'epoch': 4.52} 45%|████▌ | 4519/10000 [7:07:08<8:22:34, 5.50s/it][2025-06-19 20:36:53,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:36:53,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.40 | bwd_microstep: 3320.98 | bwd_inner_microstep: 3320.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 20:36:53,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.40 | bwd: 3321.00 | bwd_inner: 3320.19 | bwd_allreduce: 0.77 | step: 6.94 45%|████▌ | 4520/10000 [7:07:13<8:21:25, 5.49s/it] {'loss': 0.0043, 'grad_norm': 0.38676512241363525, 'learning_rate': 2.4052574390885007e-05, 'epoch': 4.52} 45%|████▌ | 4520/10000 [7:07:13<8:21:25, 5.49s/it][2025-06-19 20:36:58,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:36:58,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.18 | bwd_microstep: 3321.75 | bwd_inner_microstep: 3320.83 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.03 [2025-06-19 20:36:58,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.18 | bwd: 3321.77 | bwd_inner: 3320.83 | bwd_allreduce: 0.89 | step: 7.03 
45%|████▌ | 4521/10000 [7:07:19<8:20:31, 5.48s/it] {'loss': 0.0013, 'grad_norm': 0.18132613599300385, 'learning_rate': 2.4046231039655503e-05, 'epoch': 4.52} 45%|████▌ | 4521/10000 [7:07:19<8:20:31, 5.48s/it][2025-06-19 20:37:03,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:37:03,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.26 | bwd_microstep: 3317.06 | bwd_inner_microstep: 3316.23 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.07 [2025-06-19 20:37:03,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.26 | bwd: 3317.07 | bwd_inner: 3316.23 | bwd_allreduce: 0.79 | step: 7.08 45%|████▌ | 4522/10000 [7:07:24<8:19:44, 5.47s/it] {'loss': 0.0528, 'grad_norm': 2.1138854026794434, 'learning_rate': 2.403988726399514e-05, 'epoch': 4.52} 45%|████▌ | 4522/10000 [7:07:24<8:19:44, 5.47s/it][2025-06-19 20:37:09,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:37:09,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.27 | bwd_microstep: 3310.36 | bwd_inner_microstep: 3309.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 20:37:09,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.27 | bwd: 3310.38 | bwd_inner: 3309.56 | bwd_allreduce: 0.77 | step: 7.05 45%|████▌ | 4523/10000 [7:07:30<8:19:11, 5.47s/it] {'loss': 0.0131, 'grad_norm': 0.768855631351471, 'learning_rate': 2.403354306456935e-05, 'epoch': 4.52} 45%|████▌ | 4523/10000 [7:07:30<8:19:11, 5.47s/it][2025-06-19 20:37:14,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:37:14,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.83 | bwd_microstep: 3313.79 | bwd_inner_microstep: 
3313.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 20:37:14,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.83 | bwd: 3313.80 | bwd_inner: 3313.01 | bwd_allreduce: 0.76 | step: 6.61 45%|████▌ | 4524/10000 [7:07:35<8:18:44, 5.46s/it] {'loss': 0.0158, 'grad_norm': 0.7530921101570129, 'learning_rate': 2.402719844204362e-05, 'epoch': 4.52} 45%|████▌ | 4524/10000 [7:07:35<8:18:44, 5.46s/it][2025-06-19 20:37:20,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:37:20,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.80 | bwd_microstep: 3325.19 | bwd_inner_microstep: 3324.37 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.76 [2025-06-19 20:37:20,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.80 | bwd: 3325.20 | bwd_inner: 3324.37 | bwd_allreduce: 0.78 | step: 6.76 45%|████▌ | 4525/10000 [7:07:41<8:18:39, 5.46s/it] {'loss': 0.0128, 'grad_norm': 1.833683729171753, 'learning_rate': 2.4020853397083456e-05, 'epoch': 4.53} 45%|████▌ | 4525/10000 [7:07:41<8:18:39, 5.46s/it][2025-06-19 20:37:25,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:37:25,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.23 | bwd_microstep: 3317.63 | bwd_inner_microstep: 3316.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 20:37:25,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.23 | bwd: 3317.64 | bwd_inner: 3316.83 | bwd_allreduce: 0.77 | step: 6.72 45%|████▌ | 4526/10000 [7:07:46<8:18:40, 5.47s/it] {'loss': 0.0148, 'grad_norm': 1.7751466035842896, 'learning_rate': 2.4014507930354438e-05, 'epoch': 4.53} 45%|████▌ | 4526/10000 [7:07:46<8:18:40, 5.47s/it][2025-06-19 20:37:31,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:37:31,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.31 | bwd_microstep: 3306.21 | bwd_inner_microstep: 3305.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 20:37:31,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.31 | bwd: 3306.22 | bwd_inner: 3305.40 | bwd_allreduce: 0.77 | step: 6.73 45%|████▌ | 4527/10000 [7:07:52<8:18:06, 5.46s/it] {'loss': 0.0042, 'grad_norm': 0.296958863735199, 'learning_rate': 2.4008162042522166e-05, 'epoch': 4.53} 45%|████▌ | 4527/10000 [7:07:52<8:18:06, 5.46s/it][2025-06-19 20:37:36,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:37:36,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.24 | bwd_microstep: 3316.79 | bwd_inner_microstep: 3315.95 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.96 [2025-06-19 20:37:36,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.24 | bwd: 3316.80 | bwd_inner: 3315.95 | bwd_allreduce: 0.80 | step: 6.96 45%|████▌ | 4528/10000 [7:07:57<8:18:00, 5.46s/it] {'loss': 0.0376, 'grad_norm': 2.439784049987793, 'learning_rate': 2.40018157342523e-05, 'epoch': 4.53} 45%|████▌ | 4528/10000 [7:07:57<8:18:00, 5.46s/it][2025-06-19 20:37:42,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:37:42,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.36 | bwd_microstep: 3364.61 | bwd_inner_microstep: 3363.47 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.54 [2025-06-19 20:37:42,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.36 | bwd: 3364.63 | bwd_inner: 3363.47 | bwd_allreduce: 1.11 | step: 7.55 45%|████▌ | 4529/10000 [7:08:03<8:19:52, 5.48s/it] {'loss': 
0.0045, 'grad_norm': 0.29302600026130676, 'learning_rate': 2.399546900621054e-05, 'epoch': 4.53} 45%|████▌ | 4529/10000 [7:08:03<8:19:52, 5.48s/it][2025-06-19 20:37:47,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:37:47,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.79 | bwd_microstep: 3372.46 | bwd_inner_microstep: 3371.53 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.08 [2025-06-19 20:37:47,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.80 | bwd: 3372.48 | bwd_inner: 3371.53 | bwd_allreduce: 0.91 | step: 7.08 45%|████▌ | 4530/10000 [7:08:08<8:21:36, 5.50s/it] {'loss': 0.112, 'grad_norm': 3.969141960144043, 'learning_rate': 2.398912185906262e-05, 'epoch': 4.53} 45%|████▌ | 4530/10000 [7:08:08<8:21:36, 5.50s/it][2025-06-19 20:37:53,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:37:53,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.00 | bwd_microstep: 3366.88 | bwd_inner_microstep: 3366.06 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.09 [2025-06-19 20:37:53,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.00 | bwd: 3366.89 | bwd_inner: 3366.06 | bwd_allreduce: 0.79 | step: 7.09 45%|████▌ | 4531/10000 [7:08:14<8:22:27, 5.51s/it] {'loss': 0.0136, 'grad_norm': 1.5838398933410645, 'learning_rate': 2.398277429347433e-05, 'epoch': 4.53} 45%|████▌ | 4531/10000 [7:08:14<8:22:27, 5.51s/it][2025-06-19 20:37:58,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.82 [2025-06-19 20:37:58,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.52 | bwd_microstep: 3367.71 | bwd_inner_microstep: 3366.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 
[2025-06-19 20:37:58,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.52 | bwd: 3367.73 | bwd_inner: 3366.90 | bwd_allreduce: 0.78 | step: 6.73 45%|████▌ | 4532/10000 [7:08:19<8:23:07, 5.52s/it] {'loss': 0.0357, 'grad_norm': 1.8589463233947754, 'learning_rate': 2.397642631011151e-05, 'epoch': 4.53} 45%|████▌ | 4532/10000 [7:08:19<8:23:07, 5.52s/it][2025-06-19 20:38:04,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 20:38:04,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.87 | bwd_microstep: 3363.69 | bwd_inner_microstep: 3362.91 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 20:38:04,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.87 | bwd: 3363.70 | bwd_inner: 3362.91 | bwd_allreduce: 0.75 | step: 6.68 45%|████▌ | 4533/10000 [7:08:25<8:23:20, 5.52s/it] {'loss': 0.0357, 'grad_norm': 1.8524894714355469, 'learning_rate': 2.3970077909640015e-05, 'epoch': 4.53} 45%|████▌ | 4533/10000 [7:08:25<8:23:20, 5.52s/it][2025-06-19 20:38:09,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:38:09,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.09 | bwd_microstep: 3368.35 | bwd_inner_microstep: 3367.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 20:38:09,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.09 | bwd: 3368.36 | bwd_inner: 3367.56 | bwd_allreduce: 0.76 | step: 6.67 45%|████▌ | 4534/10000 [7:08:30<8:23:43, 5.53s/it] {'loss': 0.0211, 'grad_norm': 1.1956512928009033, 'learning_rate': 2.3963729092725783e-05, 'epoch': 4.53} 45%|████▌ | 4534/10000 [7:08:30<8:23:43, 5.53s/it][2025-06-19 20:38:15,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | 
optimizer_step: 2.72 [2025-06-19 20:38:15,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.86 | bwd_microstep: 3405.33 | bwd_inner_microstep: 3404.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-19 20:38:15,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.86 | bwd: 3405.35 | bwd_inner: 3404.55 | bwd_allreduce: 0.76 | step: 6.79 45%|████▌ | 4535/10000 [7:08:36<8:25:22, 5.55s/it] {'loss': 0.0398, 'grad_norm': 1.1846919059753418, 'learning_rate': 2.395737986003476e-05, 'epoch': 4.54} 45%|████▌ | 4535/10000 [7:08:36<8:25:22, 5.55s/it][2025-06-19 20:38:21,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:38:21,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.09 | bwd_microstep: 3377.51 | bwd_inner_microstep: 3376.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 20:38:21,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.09 | bwd: 3377.53 | bwd_inner: 3376.72 | bwd_allreduce: 0.77 | step: 6.99 45%|████▌ | 4536/10000 [7:08:41<8:25:24, 5.55s/it] {'loss': 0.0049, 'grad_norm': 0.7432671189308167, 'learning_rate': 2.3951030212232945e-05, 'epoch': 4.54} 45%|████▌ | 4536/10000 [7:08:41<8:25:24, 5.55s/it][2025-06-19 20:38:26,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 20:38:26,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.20 | bwd_microstep: 3380.43 | bwd_inner_microstep: 3379.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 20:38:26,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.20 | bwd: 3380.44 | bwd_inner: 3379.65 | bwd_allreduce: 0.75 | step: 6.71 45%|████▌ | 4537/10000 [7:08:47<8:25:17, 5.55s/it] {'loss': 0.0471, 'grad_norm': 1.2831673622131348, 'learning_rate': 
2.3944680149986414e-05, 'epoch': 4.54} 45%|████▌ | 4537/10000 [7:08:47<8:25:17, 5.55s/it][2025-06-19 20:38:32,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:38:32,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.47 | bwd_microstep: 3313.51 | bwd_inner_microstep: 3312.53 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.35 [2025-06-19 20:38:32,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.47 | bwd: 3313.53 | bwd_inner: 3312.53 | bwd_allreduce: 0.95 | step: 7.36 45%|████▌ | 4538/10000 [7:08:52<8:22:45, 5.52s/it] {'loss': 0.0059, 'grad_norm': 0.4774174392223358, 'learning_rate': 2.3938329673961236e-05, 'epoch': 4.54} 45%|████▌ | 4538/10000 [7:08:52<8:22:45, 5.52s/it][2025-06-19 20:38:37,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:38:37,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.69 | bwd_microstep: 3334.89 | bwd_inner_microstep: 3334.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 20:38:37,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.69 | bwd: 3334.90 | bwd_inner: 3334.10 | bwd_allreduce: 0.76 | step: 6.64 45%|████▌ | 4539/10000 [7:08:58<8:21:34, 5.51s/it] {'loss': 0.0239, 'grad_norm': 1.9856479167938232, 'learning_rate': 2.393197878482356e-05, 'epoch': 4.54} 45%|████▌ | 4539/10000 [7:08:58<8:21:34, 5.51s/it][2025-06-19 20:38:43,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:38:43,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.32 | bwd_microstep: 3376.56 | bwd_inner_microstep: 3375.76 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 20:38:43,168] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.32 | bwd: 3376.57 | bwd_inner: 3375.76 | bwd_allreduce: 0.77 | step: 7.13 45%|████▌ | 4540/10000 [7:09:03<8:22:46, 5.52s/it] {'loss': 0.0006, 'grad_norm': 0.03152293711900711, 'learning_rate': 2.3925627483239554e-05, 'epoch': 4.54} 45%|████▌ | 4540/10000 [7:09:03<8:22:46, 5.52s/it][2025-06-19 20:38:48,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:38:48,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.75 | bwd_microstep: 3382.67 | bwd_inner_microstep: 3381.82 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.85 [2025-06-19 20:38:48,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.75 | bwd: 3382.68 | bwd_inner: 3381.82 | bwd_allreduce: 0.82 | step: 6.86 45%|████▌ | 4541/10000 [7:09:09<8:23:28, 5.53s/it] {'loss': 0.0048, 'grad_norm': 0.48894038796424866, 'learning_rate': 2.391927576987544e-05, 'epoch': 4.54} 45%|████▌ | 4541/10000 [7:09:09<8:23:28, 5.53s/it][2025-06-19 20:38:54,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:38:54,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.97 | bwd_microstep: 3379.18 | bwd_inner_microstep: 3378.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 20:38:54,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.97 | bwd: 3379.19 | bwd_inner: 3378.38 | bwd_allreduce: 0.77 | step: 6.75 45%|████▌ | 4542/10000 [7:09:15<8:24:03, 5.54s/it] {'loss': 0.0178, 'grad_norm': 1.5179392099380493, 'learning_rate': 2.3912923645397495e-05, 'epoch': 4.54} 45%|████▌ | 4542/10000 [7:09:15<8:24:03, 5.54s/it][2025-06-19 20:38:59,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 
20:38:59,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.81 | bwd_microstep: 3341.22 | bwd_inner_microstep: 3340.32 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.04 [2025-06-19 20:38:59,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.81 | bwd: 3341.24 | bwd_inner: 3340.32 | bwd_allreduce: 0.88 | step: 7.04 45%|████▌ | 4543/10000 [7:09:20<8:22:30, 5.53s/it] {'loss': 0.0207, 'grad_norm': 2.385341167449951, 'learning_rate': 2.390657111047202e-05, 'epoch': 4.54} 45%|████▌ | 4543/10000 [7:09:20<8:22:30, 5.53s/it][2025-06-19 20:39:05,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:39:05,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.37 | bwd_microstep: 3319.17 | bwd_inner_microstep: 3318.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 20:39:05,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.37 | bwd: 3319.18 | bwd_inner: 3318.38 | bwd_allreduce: 0.75 | step: 6.62 45%|████▌ | 4544/10000 [7:09:26<8:20:53, 5.51s/it] {'loss': 0.0222, 'grad_norm': 1.7947213649749756, 'learning_rate': 2.3900218165765362e-05, 'epoch': 4.54} 45%|████▌ | 4544/10000 [7:09:26<8:20:53, 5.51s/it][2025-06-19 20:39:10,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 20:39:10,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.61 | bwd_microstep: 3381.20 | bwd_inner_microstep: 3380.08 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.29 [2025-06-19 20:39:10,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.61 | bwd: 3381.22 | bwd_inner: 3380.08 | bwd_allreduce: 1.09 | step: 8.29 45%|████▌ | 4545/10000 [7:09:31<8:22:06, 5.52s/it] {'loss': 0.1413, 'grad_norm': 4.775030136108398, 'learning_rate': 2.389386481194392e-05, 'epoch': 
4.54} 45%|████▌ | 4545/10000 [7:09:31<8:22:06, 5.52s/it][2025-06-19 20:39:16,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.74 | optimizer_step: 2.72 [2025-06-19 20:39:16,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.60 | bwd_microstep: 3328.45 | bwd_inner_microstep: 3327.43 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.56 [2025-06-19 20:39:16,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.60 | bwd: 3328.47 | bwd_inner: 3327.43 | bwd_allreduce: 0.99 | step: 7.56 45%|████▌ | 4546/10000 [7:09:37<8:21:06, 5.51s/it] {'loss': 0.0013, 'grad_norm': 0.12936142086982727, 'learning_rate': 2.3887511049674133e-05, 'epoch': 4.55} 45%|████▌ | 4546/10000 [7:09:37<8:21:06, 5.51s/it][2025-06-19 20:39:21,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.81 [2025-06-19 20:39:21,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.95 | bwd_microstep: 3330.71 | bwd_inner_microstep: 3329.46 | bwd_allreduce_microstep: 1.18 | step_microstep: 7.45 [2025-06-19 20:39:21,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.95 | bwd: 3330.73 | bwd_inner: 3329.46 | bwd_allreduce: 1.21 | step: 7.46 45%|████▌ | 4547/10000 [7:09:42<8:20:18, 5.50s/it] {'loss': 0.0522, 'grad_norm': 5.567262172698975, 'learning_rate': 2.3881156879622483e-05, 'epoch': 4.55} 45%|████▌ | 4547/10000 [7:09:42<8:20:18, 5.50s/it][2025-06-19 20:39:27,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:39:27,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.25 | bwd_microstep: 3377.97 | bwd_inner_microstep: 3377.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 20:39:27,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2137.24 | bwd: 3377.98 | bwd_inner: 3377.18 | bwd_allreduce: 0.76 | step: 6.65 45%|████▌ | 4548/10000 [7:09:48<8:21:36, 5.52s/it] {'loss': 0.0048, 'grad_norm': 0.459017276763916, 'learning_rate': 2.387480230245548e-05, 'epoch': 4.55} 45%|████▌ | 4548/10000 [7:09:48<8:21:36, 5.52s/it][2025-06-19 20:39:32,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:39:32,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.97 | bwd_microstep: 3374.44 | bwd_inner_microstep: 3373.47 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.39 [2025-06-19 20:39:32,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.97 | bwd: 3374.46 | bwd_inner: 3373.47 | bwd_allreduce: 0.93 | step: 7.39 45%|████▌ | 4549/10000 [7:09:53<8:22:18, 5.53s/it] {'loss': 0.0071, 'grad_norm': 0.4089880585670471, 'learning_rate': 2.386844731883971e-05, 'epoch': 4.55} 45%|████▌ | 4549/10000 [7:09:53<8:22:18, 5.53s/it][2025-06-19 20:39:38,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:39:38,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.30 | bwd_microstep: 3393.31 | bwd_inner_microstep: 3392.36 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.20 [2025-06-19 20:39:38,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.30 | bwd: 3393.32 | bwd_inner: 3392.36 | bwd_allreduce: 0.92 | step: 7.20 46%|████▌ | 4550/10000 [7:09:59<8:23:26, 5.54s/it] {'loss': 0.0066, 'grad_norm': 0.3822520673274994, 'learning_rate': 2.3862091929441764e-05, 'epoch': 4.55} 46%|████▌ | 4550/10000 [7:09:59<8:23:26, 5.54s/it][2025-06-19 20:39:43,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:39:43,938] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2112.95 | bwd_microstep: 3334.37 | bwd_inner_microstep: 3333.59 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 20:39:43,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.95 | bwd: 3334.38 | bwd_inner: 3333.59 | bwd_allreduce: 0.75 | step: 6.59 46%|████▌ | 4551/10000 [7:10:04<8:21:51, 5.53s/it] {'loss': 0.0603, 'grad_norm': 4.695008754730225, 'learning_rate': 2.3855736134928294e-05, 'epoch': 4.55} 46%|████▌ | 4551/10000 [7:10:04<8:21:51, 5.53s/it][2025-06-19 20:39:49,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:39:49,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.60 | bwd_microstep: 3376.18 | bwd_inner_microstep: 3375.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 20:39:49,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.61 | bwd: 3376.19 | bwd_inner: 3375.38 | bwd_allreduce: 0.76 | step: 6.66 46%|████▌ | 4552/10000 [7:10:10<8:22:16, 5.53s/it] {'loss': 0.1013, 'grad_norm': 3.21096134185791, 'learning_rate': 2.384937993596601e-05, 'epoch': 4.55} 46%|████▌ | 4552/10000 [7:10:10<8:22:16, 5.53s/it][2025-06-19 20:39:55,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:39:55,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.44 | bwd_microstep: 3375.61 | bwd_inner_microstep: 3374.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 20:39:55,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.44 | bwd: 3375.63 | bwd_inner: 3374.82 | bwd_allreduce: 0.76 | step: 6.72 46%|████▌ | 4553/10000 [7:10:15<8:22:38, 5.54s/it] {'loss': 0.0058, 'grad_norm': 0.6978119611740112, 'learning_rate': 2.3843023333221625e-05, 'epoch': 4.55} 46%|████▌ | 4553/10000 [7:10:15<8:22:38, 
5.54s/it][2025-06-19 20:40:00,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:40:00,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.04 | bwd_microstep: 3328.17 | bwd_inner_microstep: 3327.39 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.72 [2025-06-19 20:40:00,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.04 | bwd: 3328.19 | bwd_inner: 3327.39 | bwd_allreduce: 0.75 | step: 6.72 46%|████▌ | 4554/10000 [7:10:21<8:21:00, 5.52s/it] {'loss': 0.0046, 'grad_norm': 0.6202977895736694, 'learning_rate': 2.3836666327361936e-05, 'epoch': 4.55} 46%|████▌ | 4554/10000 [7:10:21<8:21:00, 5.52s/it][2025-06-19 20:40:05,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:40:05,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.76 | bwd_microstep: 3326.17 | bwd_inner_microstep: 3325.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.18 [2025-06-19 20:40:05,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.76 | bwd: 3326.19 | bwd_inner: 3325.38 | bwd_allreduce: 0.77 | step: 7.18 46%|████▌ | 4555/10000 [7:10:26<8:19:43, 5.51s/it] {'loss': 0.0092, 'grad_norm': 1.1103471517562866, 'learning_rate': 2.3830308919053753e-05, 'epoch': 4.55} 46%|████▌ | 4555/10000 [7:10:26<8:19:43, 5.51s/it][2025-06-19 20:40:11,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:40:11,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.12 | bwd_microstep: 3328.04 | bwd_inner_microstep: 3327.20 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.83 [2025-06-19 20:40:11,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.12 | bwd: 3328.06 | bwd_inner: 3327.20 | 
bwd_allreduce: 0.81 | step: 6.83 46%|████▌ | 4556/10000 [7:10:32<8:18:53, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.0886402279138565, 'learning_rate': 2.382395110896394e-05, 'epoch': 4.56} 46%|████▌ | 4556/10000 [7:10:32<8:18:53, 5.50s/it][2025-06-19 20:40:16,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:40:16,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.57 | bwd_microstep: 3334.51 | bwd_inner_microstep: 3333.58 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.94 [2025-06-19 20:40:16,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.57 | bwd: 3334.52 | bwd_inner: 3333.58 | bwd_allreduce: 0.89 | step: 6.94 46%|████▌ | 4557/10000 [7:10:37<8:18:29, 5.50s/it] {'loss': 0.01, 'grad_norm': 1.3563103675842285, 'learning_rate': 2.381759289775941e-05, 'epoch': 4.56} 46%|████▌ | 4557/10000 [7:10:37<8:18:29, 5.50s/it][2025-06-19 20:40:22,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:40:22,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.75 | bwd_microstep: 3323.87 | bwd_inner_microstep: 3323.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 20:40:22,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.75 | bwd: 3323.89 | bwd_inner: 3323.08 | bwd_allreduce: 0.76 | step: 6.72 46%|████▌ | 4558/10000 [7:10:43<8:17:49, 5.49s/it] {'loss': 0.0009, 'grad_norm': 0.06312508136034012, 'learning_rate': 2.38112342861071e-05, 'epoch': 4.56} 46%|████▌ | 4558/10000 [7:10:43<8:17:49, 5.49s/it][2025-06-19 20:40:27,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.74 [2025-06-19 20:40:27,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.77 | bwd_microstep: 
3325.14 | bwd_inner_microstep: 3324.26 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.98 [2025-06-19 20:40:27,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.77 | bwd: 3325.16 | bwd_inner: 3324.26 | bwd_allreduce: 0.85 | step: 6.98 46%|████▌ | 4559/10000 [7:10:48<8:17:31, 5.49s/it] {'loss': 0.0956, 'grad_norm': 4.212218284606934, 'learning_rate': 2.3804875274674e-05, 'epoch': 4.56} 46%|████▌ | 4559/10000 [7:10:48<8:17:31, 5.49s/it][2025-06-19 20:40:33,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:40:33,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.49 | bwd_microstep: 3325.29 | bwd_inner_microstep: 3324.37 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.22 [2025-06-19 20:40:33,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.49 | bwd: 3325.30 | bwd_inner: 3324.37 | bwd_allreduce: 0.89 | step: 7.23 46%|████▌ | 4560/10000 [7:10:54<8:17:32, 5.49s/it] {'loss': 0.1068, 'grad_norm': 5.183826923370361, 'learning_rate': 2.379851586412714e-05, 'epoch': 4.56} 46%|████▌ | 4560/10000 [7:10:54<8:17:32, 5.49s/it][2025-06-19 20:40:38,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:40:38,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.07 | bwd_microstep: 3330.60 | bwd_inner_microstep: 3329.72 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.92 [2025-06-19 20:40:38,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.07 | bwd: 3330.61 | bwd_inner: 3329.72 | bwd_allreduce: 0.85 | step: 6.92 46%|████▌ | 4561/10000 [7:10:59<8:17:24, 5.49s/it] {'loss': 0.0055, 'grad_norm': 0.33260825276374817, 'learning_rate': 2.37921560551336e-05, 'epoch': 4.56} 46%|████▌ | 4561/10000 [7:10:59<8:17:24, 5.49s/it][2025-06-19 20:40:44,363] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:40:44,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.03 | bwd_microstep: 3327.41 | bwd_inner_microstep: 3326.48 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.08 [2025-06-19 20:40:44,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.03 | bwd: 3327.43 | bwd_inner: 3326.48 | bwd_allreduce: 0.89 | step: 7.08 46%|████▌ | 4562/10000 [7:11:05<8:17:06, 5.48s/it] {'loss': 0.0331, 'grad_norm': 1.8290191888809204, 'learning_rate': 2.3785795848360483e-05, 'epoch': 4.56} 46%|████▌ | 4562/10000 [7:11:05<8:17:06, 5.48s/it][2025-06-19 20:40:49,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:40:49,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.06 | bwd_microstep: 3369.01 | bwd_inner_microstep: 3368.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 20:40:49,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.06 | bwd: 3369.02 | bwd_inner: 3368.21 | bwd_allreduce: 0.77 | step: 6.60 46%|████▌ | 4563/10000 [7:11:10<8:18:40, 5.50s/it] {'loss': 0.028, 'grad_norm': 3.0490639209747314, 'learning_rate': 2.377943524447496e-05, 'epoch': 4.56} 46%|████▌ | 4563/10000 [7:11:10<8:18:40, 5.50s/it][2025-06-19 20:40:55,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:40:55,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.62 | bwd_microstep: 3338.19 | bwd_inner_microstep: 3337.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-19 20:40:55,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.62 | bwd: 3338.21 | bwd_inner: 3337.37 | bwd_allreduce: 0.79 | step: 6.78 46%|████▌ | 
4564/10000 [7:11:16<8:18:10, 5.50s/it] {'loss': 0.0033, 'grad_norm': 0.6012350916862488, 'learning_rate': 2.3773074244144213e-05, 'epoch': 4.56} 46%|████▌ | 4564/10000 [7:11:16<8:18:10, 5.50s/it][2025-06-19 20:41:00,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:41:00,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.10 | bwd_microstep: 3379.09 | bwd_inner_microstep: 3378.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 20:41:00,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.10 | bwd: 3379.10 | bwd_inner: 3378.29 | bwd_allreduce: 0.77 | step: 6.96 46%|████▌ | 4565/10000 [7:11:21<8:19:17, 5.51s/it] {'loss': 0.0565, 'grad_norm': 6.6907453536987305, 'learning_rate': 2.37667128480355e-05, 'epoch': 4.56} 46%|████▌ | 4565/10000 [7:11:21<8:19:17, 5.51s/it][2025-06-19 20:41:06,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:41:06,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.10 | bwd_microstep: 3334.85 | bwd_inner_microstep: 3334.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 20:41:06,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.10 | bwd: 3334.86 | bwd_inner: 3334.04 | bwd_allreduce: 0.78 | step: 7.11 46%|████▌ | 4566/10000 [7:11:27<8:18:32, 5.50s/it] {'loss': 0.1635, 'grad_norm': 5.30240535736084, 'learning_rate': 2.376035105681608e-05, 'epoch': 4.57} 46%|████▌ | 4566/10000 [7:11:27<8:18:32, 5.50s/it][2025-06-19 20:41:11,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:41:11,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.24 | bwd_microstep: 3376.57 | bwd_inner_microstep: 3375.43 | 
bwd_allreduce_microstep: 1.07 | step_microstep: 7.36 [2025-06-19 20:41:11,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.24 | bwd: 3376.59 | bwd_inner: 3375.43 | bwd_allreduce: 1.10 | step: 7.36 46%|████▌ | 4567/10000 [7:11:32<8:19:46, 5.52s/it] {'loss': 0.0072, 'grad_norm': 1.2119566202163696, 'learning_rate': 2.3753988871153295e-05, 'epoch': 4.57} 46%|████▌ | 4567/10000 [7:11:32<8:19:46, 5.52s/it][2025-06-19 20:41:17,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:41:17,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.35 | bwd_microstep: 3328.58 | bwd_inner_microstep: 3327.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 20:41:17,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.35 | bwd: 3328.60 | bwd_inner: 3327.78 | bwd_allreduce: 0.77 | step: 7.15 46%|████▌ | 4568/10000 [7:11:38<8:18:44, 5.51s/it] {'loss': 0.0334, 'grad_norm': 1.5533326864242554, 'learning_rate': 2.37476262917145e-05, 'epoch': 4.57} 46%|████▌ | 4568/10000 [7:11:38<8:18:44, 5.51s/it][2025-06-19 20:41:22,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:41:22,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.95 | bwd_microstep: 3318.84 | bwd_inner_microstep: 3317.99 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.36 [2025-06-19 20:41:22,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.95 | bwd: 3318.86 | bwd_inner: 3317.99 | bwd_allreduce: 0.82 | step: 7.36 46%|████▌ | 4569/10000 [7:11:43<8:17:39, 5.50s/it] {'loss': 0.1393, 'grad_norm': 3.7011098861694336, 'learning_rate': 2.3741263319167095e-05, 'epoch': 4.57} 46%|████▌ | 4569/10000 [7:11:43<8:17:39, 5.50s/it][2025-06-19 20:41:28,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:41:28,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.38 | bwd_microstep: 3323.61 | bwd_inner_microstep: 3322.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 20:41:28,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.38 | bwd: 3323.62 | bwd_inner: 3322.82 | bwd_allreduce: 0.76 | step: 6.68 46%|████▌ | 4570/10000 [7:11:49<8:16:47, 5.49s/it] {'loss': 0.1395, 'grad_norm': 3.7924163341522217, 'learning_rate': 2.3734899954178528e-05, 'epoch': 4.57} 46%|████▌ | 4570/10000 [7:11:49<8:16:47, 5.49s/it][2025-06-19 20:41:33,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:41:33,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.00 | bwd_microstep: 3327.28 | bwd_inner_microstep: 3326.34 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.94 [2025-06-19 20:41:33,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.00 | bwd: 3327.30 | bwd_inner: 3326.34 | bwd_allreduce: 0.91 | step: 6.95 46%|████▌ | 4571/10000 [7:11:54<8:16:27, 5.49s/it] {'loss': 0.0263, 'grad_norm': 1.9937829971313477, 'learning_rate': 2.3728536197416295e-05, 'epoch': 4.57} 46%|████▌ | 4571/10000 [7:11:54<8:16:27, 5.49s/it][2025-06-19 20:41:39,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:41:39,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.14 | bwd_microstep: 3326.95 | bwd_inner_microstep: 3326.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 20:41:39,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.14 | bwd: 3326.96 | bwd_inner: 3326.17 | bwd_allreduce: 0.76 | step: 6.55 46%|████▌ | 4572/10000 [7:12:00<8:16:27, 5.49s/it] {'loss': 
0.0187, 'grad_norm': 1.3421804904937744, 'learning_rate': 2.3722172049547925e-05, 'epoch': 4.57} 46%|████▌ | 4572/10000 [7:12:00<8:16:27, 5.49s/it][2025-06-19 20:41:44,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:41:44,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.44 | bwd_microstep: 3329.55 | bwd_inner_microstep: 3328.73 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.06 [2025-06-19 20:41:44,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.44 | bwd: 3329.56 | bwd_inner: 3328.73 | bwd_allreduce: 0.78 | step: 7.07 46%|████▌ | 4573/10000 [7:12:05<8:16:09, 5.49s/it] {'loss': 0.0084, 'grad_norm': 0.7123849987983704, 'learning_rate': 2.3715807511240976e-05, 'epoch': 4.57} 46%|████▌ | 4573/10000 [7:12:05<8:16:09, 5.49s/it][2025-06-19 20:41:50,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:41:50,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.69 | bwd_microstep: 3318.95 | bwd_inner_microstep: 3318.11 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.01 [2025-06-19 20:41:50,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.69 | bwd: 3318.97 | bwd_inner: 3318.11 | bwd_allreduce: 0.80 | step: 7.01 46%|████▌ | 4574/10000 [7:12:11<8:15:32, 5.48s/it] {'loss': 0.0687, 'grad_norm': 1.9031726121902466, 'learning_rate': 2.370944258316306e-05, 'epoch': 4.57} 46%|████▌ | 4574/10000 [7:12:11<8:15:32, 5.48s/it][2025-06-19 20:41:55,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:41:55,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.17 | bwd_microstep: 3313.58 | bwd_inner_microstep: 3312.75 | bwd_allreduce_microstep: 0.78 | step_microstep: 
7.00 [2025-06-19 20:41:55,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.17 | bwd: 3313.59 | bwd_inner: 3312.75 | bwd_allreduce: 0.80 | step: 7.00 46%|████▌ | 4575/10000 [7:12:16<8:14:50, 5.47s/it] {'loss': 0.0206, 'grad_norm': 1.0330617427825928, 'learning_rate': 2.3703077265981844e-05, 'epoch': 4.58} 46%|████▌ | 4575/10000 [7:12:16<8:14:50, 5.47s/it][2025-06-19 20:42:01,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:42:01,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.93 | bwd_microstep: 3326.70 | bwd_inner_microstep: 3325.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.68 [2025-06-19 20:42:01,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.93 | bwd: 3326.71 | bwd_inner: 3325.90 | bwd_allreduce: 0.77 | step: 6.69 46%|████▌ | 4576/10000 [7:12:22<8:14:54, 5.47s/it] {'loss': 0.0018, 'grad_norm': 0.04656433314085007, 'learning_rate': 2.3696711560365003e-05, 'epoch': 4.58} 46%|████▌ | 4576/10000 [7:12:22<8:14:54, 5.47s/it][2025-06-19 20:42:06,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:42:06,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.88 | bwd_microstep: 3321.73 | bwd_inner_microstep: 3320.68 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.35 [2025-06-19 20:42:06,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.88 | bwd: 3321.75 | bwd_inner: 3320.68 | bwd_allreduce: 1.02 | step: 7.36 46%|████▌ | 4577/10000 [7:12:27<8:14:55, 5.48s/it] {'loss': 0.0034, 'grad_norm': 0.2902161777019501, 'learning_rate': 2.369034546698028e-05, 'epoch': 4.58} 46%|████▌ | 4577/10000 [7:12:27<8:14:55, 5.48s/it][2025-06-19 20:42:12,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.73 [2025-06-19 20:42:12,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.63 | bwd_microstep: 3365.70 | bwd_inner_microstep: 3364.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 20:42:12,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.63 | bwd: 3365.72 | bwd_inner: 3364.91 | bwd_allreduce: 0.77 | step: 6.71 46%|████▌ | 4578/10000 [7:12:33<8:16:36, 5.50s/it] {'loss': 0.0124, 'grad_norm': 0.7251436710357666, 'learning_rate': 2.368397898649544e-05, 'epoch': 4.58} 46%|████▌ | 4578/10000 [7:12:33<8:16:36, 5.50s/it][2025-06-19 20:42:17,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:42:17,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.73 | bwd_microstep: 3373.00 | bwd_inner_microstep: 3372.19 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-19 20:42:17,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.73 | bwd: 3373.01 | bwd_inner: 3372.19 | bwd_allreduce: 0.77 | step: 6.79 46%|████▌ | 4579/10000 [7:12:38<8:18:00, 5.51s/it] {'loss': 0.0256, 'grad_norm': 2.2775490283966064, 'learning_rate': 2.3677612119578302e-05, 'epoch': 4.58} 46%|████▌ | 4579/10000 [7:12:38<8:18:00, 5.51s/it][2025-06-19 20:42:23,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:42:23,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.87 | bwd_microstep: 3315.42 | bwd_inner_microstep: 3314.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-19 20:42:23,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.87 | bwd: 3315.43 | bwd_inner: 3314.62 | bwd_allreduce: 0.77 | step: 7.13 46%|████▌ | 4580/10000 [7:12:44<8:16:30, 5.50s/it] {'loss': 0.0025, 'grad_norm': 0.17378702759742737, 
'learning_rate': 2.3671244866896722e-05, 'epoch': 4.58} 46%|████▌ | 4580/10000 [7:12:44<8:16:30, 5.50s/it][2025-06-19 20:42:28,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:42:28,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.57 | bwd_microstep: 3370.19 | bwd_inner_microstep: 3369.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 20:42:28,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.58 | bwd: 3370.20 | bwd_inner: 3369.41 | bwd_allreduce: 0.75 | step: 6.61 46%|████▌ | 4581/10000 [7:12:49<8:17:27, 5.51s/it] {'loss': 0.0087, 'grad_norm': 0.6710587739944458, 'learning_rate': 2.3664877229118595e-05, 'epoch': 4.58} 46%|████▌ | 4581/10000 [7:12:49<8:17:27, 5.51s/it][2025-06-19 20:42:34,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 20:42:34,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.53 | bwd_microstep: 3320.57 | bwd_inner_microstep: 3319.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 20:42:34,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.53 | bwd: 3320.58 | bwd_inner: 3319.78 | bwd_allreduce: 0.76 | step: 6.92 46%|████▌ | 4582/10000 [7:12:55<8:16:22, 5.50s/it] {'loss': 0.1202, 'grad_norm': 7.233997821807861, 'learning_rate': 2.3658509206911862e-05, 'epoch': 4.58} 46%|████▌ | 4582/10000 [7:12:55<8:16:22, 5.50s/it][2025-06-19 20:42:39,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:42:39,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.90 | bwd_microstep: 3316.71 | bwd_inner_microstep: 3315.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 20:42:39,757] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.90 | bwd: 3316.72 | bwd_inner: 3315.93 | bwd_allreduce: 0.75 | step: 6.55 46%|████▌ | 4583/10000 [7:13:00<8:15:12, 5.49s/it] {'loss': 0.0113, 'grad_norm': 1.10073721408844, 'learning_rate': 2.3652140800944485e-05, 'epoch': 4.58} 46%|████▌ | 4583/10000 [7:13:00<8:15:12, 5.49s/it][2025-06-19 20:42:45,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:42:45,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.29 | bwd_microstep: 3366.14 | bwd_inner_microstep: 3365.36 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 20:42:45,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.29 | bwd: 3366.15 | bwd_inner: 3365.36 | bwd_allreduce: 0.75 | step: 6.59 46%|████▌ | 4584/10000 [7:13:06<8:16:20, 5.50s/it] {'loss': 0.0249, 'grad_norm': 4.844740867614746, 'learning_rate': 2.364577201188449e-05, 'epoch': 4.58} 46%|████▌ | 4584/10000 [7:13:06<8:16:20, 5.50s/it][2025-06-19 20:42:50,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 20:42:50,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.45 | bwd_microstep: 3388.44 | bwd_inner_microstep: 3387.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 20:42:50,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.45 | bwd: 3388.45 | bwd_inner: 3387.66 | bwd_allreduce: 0.75 | step: 6.72 46%|████▌ | 4585/10000 [7:13:11<8:17:53, 5.52s/it] {'loss': 0.0283, 'grad_norm': 2.265205144882202, 'learning_rate': 2.363940284039994e-05, 'epoch': 4.58} 46%|████▌ | 4585/10000 [7:13:11<8:17:53, 5.52s/it][2025-06-19 20:42:56,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:42:56,310] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.35 | bwd_microstep: 3320.83 | bwd_inner_microstep: 3320.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-19 20:42:56,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.35 | bwd: 3320.85 | bwd_inner: 3320.04 | bwd_allreduce: 0.77 | step: 7.15 46%|████▌ | 4586/10000 [7:13:17<8:16:24, 5.50s/it] {'loss': 0.1182, 'grad_norm': 5.212631702423096, 'learning_rate': 2.3633033287158913e-05, 'epoch': 4.59} 46%|████▌ | 4586/10000 [7:13:17<8:16:24, 5.50s/it][2025-06-19 20:43:01,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:43:01,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.84 | bwd_microstep: 3311.46 | bwd_inner_microstep: 3310.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 20:43:01,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.84 | bwd: 3311.48 | bwd_inner: 3310.68 | bwd_allreduce: 0.76 | step: 6.56 46%|████▌ | 4587/10000 [7:13:22<8:15:08, 5.49s/it] {'loss': 0.0026, 'grad_norm': 0.12416455149650574, 'learning_rate': 2.362666335282956e-05, 'epoch': 4.59} 46%|████▌ | 4587/10000 [7:13:22<8:15:08, 5.49s/it][2025-06-19 20:43:07,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:43:07,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.90 | bwd_microstep: 3308.67 | bwd_inner_microstep: 3307.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 20:43:07,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.90 | bwd: 3308.68 | bwd_inner: 3307.88 | bwd_allreduce: 0.75 | step: 6.55 46%|████▌ | 4588/10000 [7:13:28<8:13:57, 5.48s/it] {'loss': 0.0728, 'grad_norm': 2.157722234725952, 'learning_rate': 2.3620293038080046e-05, 'epoch': 4.59} 
46%|████▌ | 4588/10000 [7:13:28<8:13:57, 5.48s/it][2025-06-19 20:43:12,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:43:12,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.47 | bwd_microstep: 3362.14 | bwd_inner_microstep: 3361.36 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 20:43:12,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.47 | bwd: 3362.15 | bwd_inner: 3361.36 | bwd_allreduce: 0.75 | step: 6.63 46%|████▌ | 4589/10000 [7:13:33<8:15:15, 5.49s/it] {'loss': 0.0041, 'grad_norm': 0.585604190826416, 'learning_rate': 2.3613922343578594e-05, 'epoch': 4.59} 46%|████▌ | 4589/10000 [7:13:33<8:15:15, 5.49s/it][2025-06-19 20:43:18,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:43:18,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.51 | bwd_microstep: 3324.65 | bwd_inner_microstep: 3323.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 20:43:18,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.51 | bwd: 3324.66 | bwd_inner: 3323.86 | bwd_allreduce: 0.76 | step: 6.65 46%|████▌ | 4590/10000 [7:13:39<8:14:33, 5.48s/it] {'loss': 0.0023, 'grad_norm': 0.3436421751976013, 'learning_rate': 2.360755126999347e-05, 'epoch': 4.59} 46%|████▌ | 4590/10000 [7:13:39<8:14:33, 5.48s/it][2025-06-19 20:43:23,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:43:23,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.63 | bwd_microstep: 3316.31 | bwd_inner_microstep: 3315.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 20:43:23,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.63 | 
bwd: 3316.32 | bwd_inner: 3315.50 | bwd_allreduce: 0.78 | step: 7.13 46%|████▌ | 4591/10000 [7:13:44<8:13:52, 5.48s/it] {'loss': 0.113, 'grad_norm': 3.6316640377044678, 'learning_rate': 2.3601179817992957e-05, 'epoch': 4.59} 46%|████▌ | 4591/10000 [7:13:44<8:13:52, 5.48s/it][2025-06-19 20:43:29,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:43:29,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.61 | bwd_microstep: 3317.53 | bwd_inner_microstep: 3316.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 20:43:29,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.61 | bwd: 3317.55 | bwd_inner: 3316.74 | bwd_allreduce: 0.77 | step: 6.62 46%|████▌ | 4592/10000 [7:13:49<8:13:12, 5.47s/it] {'loss': 0.0267, 'grad_norm': 2.879835367202759, 'learning_rate': 2.3594807988245397e-05, 'epoch': 4.59} 46%|████▌ | 4592/10000 [7:13:49<8:13:12, 5.47s/it][2025-06-19 20:43:34,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.86 [2025-06-19 20:43:34,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.98 | bwd_microstep: 3314.63 | bwd_inner_microstep: 3313.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 20:43:34,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.98 | bwd: 3314.64 | bwd_inner: 3313.83 | bwd_allreduce: 0.76 | step: 6.83 46%|████▌ | 4593/10000 [7:13:55<8:12:36, 5.47s/it] {'loss': 0.0258, 'grad_norm': 1.2876962423324585, 'learning_rate': 2.358843578141916e-05, 'epoch': 4.59} 46%|████▌ | 4593/10000 [7:13:55<8:12:36, 5.47s/it][2025-06-19 20:43:40,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:43:40,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2125.79 | bwd_microstep: 3359.93 | bwd_inner_microstep: 3359.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 20:43:40,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.79 | bwd: 3359.94 | bwd_inner: 3359.15 | bwd_allreduce: 0.75 | step: 6.55 46%|████▌ | 4594/10000 [7:14:00<8:14:03, 5.48s/it] {'loss': 0.0026, 'grad_norm': 0.23889504373073578, 'learning_rate': 2.358206319818266e-05, 'epoch': 4.59} 46%|████▌ | 4594/10000 [7:14:00<8:14:03, 5.48s/it][2025-06-19 20:43:45,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:43:45,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.19 | bwd_microstep: 3368.08 | bwd_inner_microstep: 3367.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 20:43:45,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.19 | bwd: 3368.09 | bwd_inner: 3367.30 | bwd_allreduce: 0.76 | step: 6.59 46%|████▌ | 4595/10000 [7:14:06<8:15:13, 5.50s/it] {'loss': 0.0069, 'grad_norm': 0.36098167300224304, 'learning_rate': 2.357569023920437e-05, 'epoch': 4.59} 46%|████▌ | 4595/10000 [7:14:06<8:15:13, 5.50s/it][2025-06-19 20:43:51,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:43:51,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.33 | bwd_microstep: 3361.73 | bwd_inner_microstep: 3360.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 20:43:51,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.33 | bwd: 3361.74 | bwd_inner: 3360.95 | bwd_allreduce: 0.75 | step: 6.58 46%|████▌ | 4596/10000 [7:14:11<8:16:13, 5.51s/it] {'loss': 0.058, 'grad_norm': 2.7065768241882324, 'learning_rate': 2.356931690515276e-05, 'epoch': 4.6} 46%|████▌ | 4596/10000 [7:14:11<8:16:13, 5.51s/it][2025-06-19 
20:43:56,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.91 [2025-06-19 20:43:56,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.72 | bwd_microstep: 3365.37 | bwd_inner_microstep: 3364.45 | bwd_allreduce_microstep: 0.86 | step_microstep: 8.15 [2025-06-19 20:43:56,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.72 | bwd: 3365.38 | bwd_inner: 3364.45 | bwd_allreduce: 0.88 | step: 8.16 46%|████▌ | 4597/10000 [7:14:17<8:16:44, 5.52s/it] {'loss': 0.0027, 'grad_norm': 0.15904779732227325, 'learning_rate': 2.356294319669637e-05, 'epoch': 4.6} 46%|████▌ | 4597/10000 [7:14:17<8:16:44, 5.52s/it][2025-06-19 20:44:02,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:44:02,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.79 | bwd_microstep: 3382.81 | bwd_inner_microstep: 3382.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 20:44:02,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.80 | bwd: 3382.82 | bwd_inner: 3382.02 | bwd_allreduce: 0.76 | step: 6.64 46%|████▌ | 4598/10000 [7:14:23<8:17:35, 5.53s/it] {'loss': 0.0481, 'grad_norm': 1.5044201612472534, 'learning_rate': 2.3556569114503773e-05, 'epoch': 4.6} 46%|████▌ | 4598/10000 [7:14:23<8:17:35, 5.53s/it][2025-06-19 20:44:07,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:44:07,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.97 | bwd_microstep: 3315.59 | bwd_inner_microstep: 3314.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 20:44:07,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.97 | bwd: 3315.61 | bwd_inner: 3314.80 | bwd_allreduce: 0.77 | step: 
6.68 46%|████▌ | 4599/10000 [7:14:28<8:15:30, 5.50s/it] {'loss': 0.0443, 'grad_norm': 1.6022703647613525, 'learning_rate': 2.3550194659243585e-05, 'epoch': 4.6} 46%|████▌ | 4599/10000 [7:14:28<8:15:30, 5.50s/it][2025-06-19 20:44:13,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:44:13,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.33 | bwd_microstep: 3363.15 | bwd_inner_microstep: 3362.36 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.84 [2025-06-19 20:44:13,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.33 | bwd: 3363.16 | bwd_inner: 3362.36 | bwd_allreduce: 0.76 | step: 6.84 46%|████▌ | 4600/10000 [7:14:34<8:16:01, 5.51s/it] {'loss': 0.0241, 'grad_norm': 1.6322216987609863, 'learning_rate': 2.354381983158446e-05, 'epoch': 4.6} 46%|████▌ | 4600/10000 [7:14:34<8:16:01, 5.51s/it][2025-06-19 20:44:18,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:44:18,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.36 | bwd_microstep: 3315.52 | bwd_inner_microstep: 3314.57 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.08 [2025-06-19 20:44:18,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.36 | bwd: 3315.53 | bwd_inner: 3314.57 | bwd_allreduce: 0.92 | step: 7.08 46%|████▌ | 4601/10000 [7:14:39<8:14:24, 5.49s/it] {'loss': 0.0921, 'grad_norm': 2.1458845138549805, 'learning_rate': 2.3537444632195078e-05, 'epoch': 4.6} 46%|████▌ | 4601/10000 [7:14:39<8:14:24, 5.49s/it][2025-06-19 20:44:24,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:44:24,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.58 | bwd_microstep: 3359.96 | bwd_inner_microstep: 
3359.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 20:44:24,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.58 | bwd: 3359.97 | bwd_inner: 3359.15 | bwd_allreduce: 0.78 | step: 7.14 46%|████▌ | 4602/10000 [7:14:45<8:15:06, 5.50s/it] {'loss': 0.073, 'grad_norm': 4.0266337394714355, 'learning_rate': 2.353106906174417e-05, 'epoch': 4.6} 46%|████▌ | 4602/10000 [7:14:45<8:15:06, 5.50s/it][2025-06-19 20:44:29,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 20:44:29,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.48 | bwd_microstep: 3369.97 | bwd_inner_microstep: 3369.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 20:44:29,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.48 | bwd: 3369.99 | bwd_inner: 3369.19 | bwd_allreduce: 0.76 | step: 6.76 46%|████▌ | 4603/10000 [7:14:50<8:15:55, 5.51s/it] {'loss': 0.0019, 'grad_norm': 0.13732852041721344, 'learning_rate': 2.3524693120900505e-05, 'epoch': 4.6} 46%|████▌ | 4603/10000 [7:14:50<8:15:55, 5.51s/it][2025-06-19 20:44:35,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:44:35,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.05 | bwd_microstep: 3310.33 | bwd_inner_microstep: 3309.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 20:44:35,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.05 | bwd: 3310.34 | bwd_inner: 3309.54 | bwd_allreduce: 0.76 | step: 6.58 46%|████▌ | 4604/10000 [7:14:56<8:14:10, 5.49s/it] {'loss': 0.0623, 'grad_norm': 3.150158643722534, 'learning_rate': 2.3518316810332892e-05, 'epoch': 4.6} 46%|████▌ | 4604/10000 [7:14:56<8:14:10, 5.49s/it][2025-06-19 20:44:40,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:44:40,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.60 | bwd_microstep: 3311.71 | bwd_inner_microstep: 3310.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-19 20:44:40,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.60 | bwd: 3311.72 | bwd_inner: 3310.91 | bwd_allreduce: 0.76 | step: 6.92 46%|████▌ | 4605/10000 [7:15:01<8:12:51, 5.48s/it] {'loss': 0.0057, 'grad_norm': 0.5555583834648132, 'learning_rate': 2.3511940130710174e-05, 'epoch': 4.61} 46%|████▌ | 4605/10000 [7:15:01<8:12:51, 5.48s/it][2025-06-19 20:44:46,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:44:46,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.14 | bwd_microstep: 3366.01 | bwd_inner_microstep: 3365.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 20:44:46,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.14 | bwd: 3366.02 | bwd_inner: 3365.22 | bwd_allreduce: 0.76 | step: 6.63 46%|████▌ | 4606/10000 [7:15:06<8:13:57, 5.49s/it] {'loss': 0.0369, 'grad_norm': 2.178889513015747, 'learning_rate': 2.3505563082701234e-05, 'epoch': 4.61} 46%|████▌ | 4606/10000 [7:15:06<8:13:57, 5.49s/it][2025-06-19 20:44:51,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:44:51,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.29 | bwd_microstep: 3361.34 | bwd_inner_microstep: 3360.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 20:44:51,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.29 | bwd: 3361.35 | bwd_inner: 3360.55 | bwd_allreduce: 0.76 | step: 6.98 46%|████▌ | 4607/10000 [7:15:12<8:14:45, 5.50s/it] {'loss': 
0.0356, 'grad_norm': 1.4952547550201416, 'learning_rate': 2.3499185666975e-05, 'epoch': 4.61} 46%|████▌ | 4607/10000 [7:15:12<8:14:45, 5.50s/it][2025-06-19 20:44:57,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:44:57,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.16 | bwd_microstep: 3320.18 | bwd_inner_microstep: 3319.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 20:44:57,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.16 | bwd: 3320.19 | bwd_inner: 3319.40 | bwd_allreduce: 0.75 | step: 6.62 46%|████▌ | 4608/10000 [7:15:17<8:13:26, 5.49s/it] {'loss': 0.0208, 'grad_norm': 1.8492027521133423, 'learning_rate': 2.349280788420043e-05, 'epoch': 4.61} 46%|████▌ | 4608/10000 [7:15:17<8:13:26, 5.49s/it][2025-06-19 20:45:02,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:45:02,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.14 | bwd_microstep: 3306.97 | bwd_inner_microstep: 3306.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 20:45:02,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.14 | bwd: 3306.99 | bwd_inner: 3306.19 | bwd_allreduce: 0.76 | step: 6.59 46%|████▌ | 4609/10000 [7:15:23<8:12:03, 5.48s/it] {'loss': 0.1232, 'grad_norm': 3.5592494010925293, 'learning_rate': 2.348642973504652e-05, 'epoch': 4.61} 46%|████▌ | 4609/10000 [7:15:23<8:12:03, 5.48s/it][2025-06-19 20:45:08,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:45:08,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.39 | bwd_microstep: 3315.94 | bwd_inner_microstep: 3315.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 
[2025-06-19 20:45:08,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.39 | bwd: 3315.95 | bwd_inner: 3315.14 | bwd_allreduce: 0.77 | step: 6.66 46%|████▌ | 4610/10000 [7:15:28<8:11:26, 5.47s/it] {'loss': 0.0027, 'grad_norm': 0.33047229051589966, 'learning_rate': 2.348005122018232e-05, 'epoch': 4.61} 46%|████▌ | 4610/10000 [7:15:28<8:11:26, 5.47s/it][2025-06-19 20:45:13,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:45:13,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.52 | bwd_microstep: 3315.85 | bwd_inner_microstep: 3315.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 20:45:13,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.52 | bwd: 3315.87 | bwd_inner: 3315.07 | bwd_allreduce: 0.75 | step: 6.61 46%|████▌ | 4611/10000 [7:15:34<8:11:12, 5.47s/it] {'loss': 0.0185, 'grad_norm': 1.2355057001113892, 'learning_rate': 2.347367234027689e-05, 'epoch': 4.61} 46%|████▌ | 4611/10000 [7:15:34<8:11:12, 5.47s/it][2025-06-19 20:45:19,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:45:19,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.20 | bwd_microstep: 3326.87 | bwd_inner_microstep: 3326.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 20:45:19,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.19 | bwd: 3326.89 | bwd_inner: 3326.08 | bwd_allreduce: 0.76 | step: 6.74 46%|████▌ | 4612/10000 [7:15:39<8:11:08, 5.47s/it] {'loss': 0.0068, 'grad_norm': 0.5993508100509644, 'learning_rate': 2.3467293095999356e-05, 'epoch': 4.61} 46%|████▌ | 4612/10000 [7:15:39<8:11:08, 5.47s/it][2025-06-19 20:45:24,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.73 [2025-06-19 20:45:24,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.67 | bwd_microstep: 3320.27 | bwd_inner_microstep: 3319.32 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.27 [2025-06-19 20:45:24,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.67 | bwd: 3320.28 | bwd_inner: 3319.33 | bwd_allreduce: 0.92 | step: 7.28 46%|████▌ | 4613/10000 [7:15:45<8:10:52, 5.47s/it] {'loss': 0.0041, 'grad_norm': 0.2788010835647583, 'learning_rate': 2.346091348801887e-05, 'epoch': 4.61} 46%|████▌ | 4613/10000 [7:15:45<8:10:52, 5.47s/it][2025-06-19 20:45:29,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:45:29,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.45 | bwd_microstep: 3312.92 | bwd_inner_microstep: 3312.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 20:45:29,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.45 | bwd: 3312.94 | bwd_inner: 3312.10 | bwd_allreduce: 0.79 | step: 7.07 46%|████▌ | 4614/10000 [7:15:50<8:10:48, 5.47s/it] {'loss': 0.0024, 'grad_norm': 0.19104738533496857, 'learning_rate': 2.345453351700462e-05, 'epoch': 4.61} 46%|████▌ | 4614/10000 [7:15:50<8:10:48, 5.47s/it][2025-06-19 20:45:35,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:45:35,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.02 | bwd_microstep: 3305.58 | bwd_inner_microstep: 3304.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.70 [2025-06-19 20:45:35,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.02 | bwd: 3305.59 | bwd_inner: 3304.78 | bwd_allreduce: 0.77 | step: 6.70 46%|████▌ | 4615/10000 [7:15:56<8:10:06, 5.46s/it] {'loss': 0.0407, 'grad_norm': 2.0632925033569336, 'learning_rate': 
2.3448153183625838e-05, 'epoch': 4.62} 46%|████▌ | 4615/10000 [7:15:56<8:10:06, 5.46s/it][2025-06-19 20:45:40,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 20:45:40,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.77 | bwd_microstep: 3365.71 | bwd_inner_microstep: 3364.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 20:45:40,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.77 | bwd: 3365.72 | bwd_inner: 3364.92 | bwd_allreduce: 0.76 | step: 6.52 46%|████▌ | 4616/10000 [7:16:01<8:11:42, 5.48s/it] {'loss': 0.0566, 'grad_norm': 2.4795405864715576, 'learning_rate': 2.3441772488551785e-05, 'epoch': 4.62} 46%|████▌ | 4616/10000 [7:16:01<8:11:42, 5.48s/it][h264 @ 0xb440840] Reference 5 >= 5 [h264 @ 0xb440840] error while decoding MB 15 42, bytestream 9292 [h264 @ 0xb44bd00] left block unavailable for requested intra mode [h264 @ 0xb44bd00] error while decoding MB 0 25, bytestream 45493 [h264 @ 0xd7cdf40] Reference 5 >= 5 [h264 @ 0xd7cdf40] error while decoding MB 15 42, bytestream 9292 [h264 @ 0xd7cdf40] left block unavailable for requested intra mode [h264 @ 0xd7cdf40] error while decoding MB 0 25, bytestream 45493 [2025-06-19 20:45:46,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:45:46,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.54 | bwd_microstep: 3322.72 | bwd_inner_microstep: 3321.65 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.16 [2025-06-19 20:45:46,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.54 | bwd: 3322.74 | bwd_inner: 3321.65 | bwd_allreduce: 1.04 | step: 7.16 46%|████▌ | 4617/10000 [7:16:07<8:11:18, 5.48s/it] {'loss': 0.0366, 'grad_norm': 1.6628133058547974, 'learning_rate': 2.3435391432451775e-05, 'epoch': 4.62} 
46%|████▌ | 4617/10000 [7:16:07<8:11:18, 5.48s/it][2025-06-19 20:45:51,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:45:51,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.54 | bwd_microstep: 3321.30 | bwd_inner_microstep: 3320.26 | bwd_allreduce_microstep: 0.98 | step_microstep: 6.96 [2025-06-19 20:45:51,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.54 | bwd: 3321.31 | bwd_inner: 3320.26 | bwd_allreduce: 1.01 | step: 6.96 46%|████▌ | 4618/10000 [7:16:12<8:11:04, 5.47s/it] {'loss': 0.0023, 'grad_norm': 0.1247340738773346, 'learning_rate': 2.3429010015995153e-05, 'epoch': 4.62} 46%|████▌ | 4618/10000 [7:16:12<8:11:04, 5.47s/it][2025-06-19 20:45:57,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 20:45:57,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.09 | bwd_microstep: 3313.87 | bwd_inner_microstep: 3312.78 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.67 [2025-06-19 20:45:57,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.09 | bwd: 3313.90 | bwd_inner: 3312.78 | bwd_allreduce: 1.05 | step: 7.67 46%|████▌ | 4619/10000 [7:16:18<8:10:44, 5.47s/it] {'loss': 0.0904, 'grad_norm': 3.7738869190216064, 'learning_rate': 2.342262823985129e-05, 'epoch': 4.62} 46%|████▌ | 4619/10000 [7:16:18<8:10:44, 5.47s/it][2025-06-19 20:46:02,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:46:02,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.50 | bwd_microstep: 3314.01 | bwd_inner_microstep: 3312.95 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.78 [2025-06-19 20:46:02,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.50 | 
bwd: 3314.03 | bwd_inner: 3312.95 | bwd_allreduce: 1.03 | step: 7.79 46%|████▌ | 4620/10000 [7:16:23<8:10:35, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.026565086096525192, 'learning_rate': 2.3416246104689607e-05, 'epoch': 4.62} 46%|████▌ | 4620/10000 [7:16:23<8:10:35, 5.47s/it][2025-06-19 20:46:08,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 20:46:08,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.65 | bwd_microstep: 3307.06 | bwd_inner_microstep: 3305.88 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.65 [2025-06-19 20:46:08,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.65 | bwd: 3307.08 | bwd_inner: 3305.88 | bwd_allreduce: 1.14 | step: 7.65 46%|████▌ | 4621/10000 [7:16:29<8:10:22, 5.47s/it] {'loss': 0.019, 'grad_norm': 2.6219940185546875, 'learning_rate': 2.340986361117957e-05, 'epoch': 4.62} 46%|████▌ | 4621/10000 [7:16:29<8:10:22, 5.47s/it][2025-06-19 20:46:13,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:46:13,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.93 | bwd_microstep: 3312.96 | bwd_inner_microstep: 3312.15 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.75 [2025-06-19 20:46:13,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.93 | bwd: 3312.97 | bwd_inner: 3312.15 | bwd_allreduce: 0.78 | step: 6.75 46%|████▌ | 4622/10000 [7:16:34<8:10:18, 5.47s/it] {'loss': 0.0564, 'grad_norm': 4.48884916305542, 'learning_rate': 2.340348075999066e-05, 'epoch': 4.62} 46%|████▌ | 4622/10000 [7:16:34<8:10:18, 5.47s/it][2025-06-19 20:46:19,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:46:19,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2106.62 | bwd_microstep: 3316.17 | bwd_inner_microstep: 3315.36 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.35 [2025-06-19 20:46:19,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.62 | bwd: 3316.19 | bwd_inner: 3315.36 | bwd_allreduce: 0.79 | step: 7.35 46%|████▌ | 4623/10000 [7:16:39<8:10:02, 5.47s/it] {'loss': 0.0031, 'grad_norm': 0.20026247203350067, 'learning_rate': 2.3397097551792416e-05, 'epoch': 4.62} 46%|████▌ | 4623/10000 [7:16:39<8:10:02, 5.47s/it][2025-06-19 20:46:24,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:46:24,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2154.57 | bwd_microstep: 3313.67 | bwd_inner_microstep: 3312.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 20:46:24,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2154.57 | bwd: 3313.68 | bwd_inner: 3312.88 | bwd_allreduce: 0.76 | step: 6.63 46%|████▌ | 4624/10000 [7:16:45<8:11:20, 5.48s/it] {'loss': 0.0616, 'grad_norm': 4.110487937927246, 'learning_rate': 2.3390713987254405e-05, 'epoch': 4.62} 46%|████▌ | 4624/10000 [7:16:45<8:11:20, 5.48s/it][2025-06-19 20:46:30,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:46:30,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.34 | bwd_microstep: 3323.80 | bwd_inner_microstep: 3322.62 | bwd_allreduce_microstep: 1.11 | step_microstep: 7.56 [2025-06-19 20:46:30,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.34 | bwd: 3323.82 | bwd_inner: 3322.62 | bwd_allreduce: 1.14 | step: 7.55 46%|████▋ | 4625/10000 [7:16:50<8:10:52, 5.48s/it] {'loss': 0.0101, 'grad_norm': 0.8828909397125244, 'learning_rate': 2.3384330067046233e-05, 'epoch': 4.62} 46%|████▋ | 4625/10000 [7:16:50<8:10:52, 
5.48s/it][2025-06-19 20:46:35,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:46:35,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.52 | bwd_microstep: 3315.11 | bwd_inner_microstep: 3314.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 20:46:35,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.52 | bwd: 3315.12 | bwd_inner: 3314.32 | bwd_allreduce: 0.76 | step: 6.60 46%|████▋ | 4626/10000 [7:16:56<8:10:07, 5.47s/it] {'loss': 0.0041, 'grad_norm': 0.20347236096858978, 'learning_rate': 2.337794579183755e-05, 'epoch': 4.63} 46%|████▋ | 4626/10000 [7:16:56<8:10:07, 5.47s/it][2025-06-19 20:46:41,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:46:41,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.36 | bwd_microstep: 3310.03 | bwd_inner_microstep: 3309.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 20:46:41,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.36 | bwd: 3310.04 | bwd_inner: 3309.24 | bwd_allreduce: 0.76 | step: 6.91 46%|████▋ | 4627/10000 [7:17:01<8:09:22, 5.46s/it] {'loss': 0.0335, 'grad_norm': 1.2751675844192505, 'learning_rate': 2.3371561162298023e-05, 'epoch': 4.63} 46%|████▋ | 4627/10000 [7:17:01<8:09:22, 5.46s/it][2025-06-19 20:46:46,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:46:46,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.63 | bwd_microstep: 3308.23 | bwd_inner_microstep: 3307.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 20:46:46,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.63 | bwd: 3308.25 | bwd_inner: 3307.45 | 
bwd_allreduce: 0.76 | step: 6.70 46%|████▋ | 4628/10000 [7:17:07<8:09:00, 5.46s/it] {'loss': 0.0141, 'grad_norm': 0.639579713344574, 'learning_rate': 2.3365176179097384e-05, 'epoch': 4.63} 46%|████▋ | 4628/10000 [7:17:07<8:09:00, 5.46s/it][2025-06-19 20:46:51,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:46:51,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.06 | bwd_microstep: 3319.09 | bwd_inner_microstep: 3318.28 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.25 [2025-06-19 20:46:51,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.06 | bwd: 3319.11 | bwd_inner: 3318.28 | bwd_allreduce: 0.78 | step: 7.25 46%|████▋ | 4629/10000 [7:17:12<8:08:47, 5.46s/it] {'loss': 0.1509, 'grad_norm': 2.7969110012054443, 'learning_rate': 2.3358790842905376e-05, 'epoch': 4.63} 46%|████▋ | 4629/10000 [7:17:12<8:08:47, 5.46s/it][2025-06-19 20:46:57,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:46:57,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.44 | bwd_microstep: 3362.78 | bwd_inner_microstep: 3361.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 20:46:57,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.44 | bwd: 3362.80 | bwd_inner: 3361.97 | bwd_allreduce: 0.78 | step: 7.21 46%|████▋ | 4630/10000 [7:17:18<8:10:37, 5.48s/it] {'loss': 0.0063, 'grad_norm': 0.4647670090198517, 'learning_rate': 2.3352405154391803e-05, 'epoch': 4.63} 46%|████▋ | 4630/10000 [7:17:18<8:10:37, 5.48s/it][2025-06-19 20:47:03,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 20:47:03,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.19 | bwd_microstep: 
3366.24 | bwd_inner_microstep: 3365.11 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.52 [2025-06-19 20:47:03,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.19 | bwd: 3366.25 | bwd_inner: 3365.11 | bwd_allreduce: 1.09 | step: 7.53 46%|████▋ | 4631/10000 [7:17:23<8:12:15, 5.50s/it] {'loss': 0.0007, 'grad_norm': 0.02326088398694992, 'learning_rate': 2.3346019114226488e-05, 'epoch': 4.63} 46%|████▋ | 4631/10000 [7:17:23<8:12:15, 5.50s/it][2025-06-19 20:47:08,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:47:08,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.04 | bwd_microstep: 3361.88 | bwd_inner_microstep: 3361.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 20:47:08,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.04 | bwd: 3361.89 | bwd_inner: 3361.08 | bwd_allreduce: 0.76 | step: 6.81 46%|████▋ | 4632/10000 [7:17:29<8:12:49, 5.51s/it] {'loss': 0.0025, 'grad_norm': 0.09696074575185776, 'learning_rate': 2.3339632723079294e-05, 'epoch': 4.63} 46%|████▋ | 4632/10000 [7:17:29<8:12:49, 5.51s/it][2025-06-19 20:47:14,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:47:14,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.48 | bwd_microstep: 3325.04 | bwd_inner_microstep: 3324.07 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.38 [2025-06-19 20:47:14,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.48 | bwd: 3325.06 | bwd_inner: 3324.07 | bwd_allreduce: 0.94 | step: 7.38 46%|████▋ | 4633/10000 [7:17:34<8:11:47, 5.50s/it] {'loss': 0.0348, 'grad_norm': 2.030768632888794, 'learning_rate': 2.333324598162013e-05, 'epoch': 4.63} 46%|████▋ | 4633/10000 [7:17:34<8:11:47, 5.50s/it][2025-06-19 20:47:19,615] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:47:19,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.08 | bwd_microstep: 3379.23 | bwd_inner_microstep: 3378.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.26 [2025-06-19 20:47:19,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.08 | bwd: 3379.25 | bwd_inner: 3378.43 | bwd_allreduce: 0.78 | step: 7.26 46%|████▋ | 4634/10000 [7:17:40<8:13:13, 5.52s/it] {'loss': 0.003, 'grad_norm': 0.3319012522697449, 'learning_rate': 2.3326858890518928e-05, 'epoch': 4.63} 46%|████▋ | 4634/10000 [7:17:40<8:13:13, 5.52s/it][2025-06-19 20:47:25,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:47:25,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.28 | bwd_microstep: 3383.62 | bwd_inner_microstep: 3382.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 20:47:25,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.28 | bwd: 3383.64 | bwd_inner: 3382.82 | bwd_allreduce: 0.77 | step: 6.69 46%|████▋ | 4635/10000 [7:17:45<8:14:18, 5.53s/it] {'loss': 0.0104, 'grad_norm': 0.5344914793968201, 'learning_rate': 2.332047145044567e-05, 'epoch': 4.63} 46%|████▋ | 4635/10000 [7:17:45<8:14:18, 5.53s/it][2025-06-19 20:47:30,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:47:30,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.43 | bwd_microstep: 3367.25 | bwd_inner_microstep: 3366.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 20:47:30,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.43 | bwd: 3367.27 | bwd_inner: 3366.46 | bwd_allreduce: 0.76 | step: 6.69 46%|████▋ | 
4636/10000 [7:17:51<8:14:18, 5.53s/it] {'loss': 0.0626, 'grad_norm': 2.7029082775115967, 'learning_rate': 2.3314083662070372e-05, 'epoch': 4.64} 46%|████▋ | 4636/10000 [7:17:51<8:14:18, 5.53s/it][2025-06-19 20:47:36,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:47:36,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.51 | bwd_microstep: 3317.95 | bwd_inner_microstep: 3316.98 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.52 [2025-06-19 20:47:36,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.51 | bwd: 3317.97 | bwd_inner: 3316.98 | bwd_allreduce: 0.93 | step: 7.52 46%|████▋ | 4637/10000 [7:17:56<8:12:37, 5.51s/it] {'loss': 0.0711, 'grad_norm': 6.30676794052124, 'learning_rate': 2.3307695526063077e-05, 'epoch': 4.64} 46%|████▋ | 4637/10000 [7:17:56<8:12:37, 5.51s/it][2025-06-19 20:47:41,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:47:41,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.35 | bwd_microstep: 3324.81 | bwd_inner_microstep: 3324.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 20:47:41,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.35 | bwd: 3324.82 | bwd_inner: 3324.02 | bwd_allreduce: 0.76 | step: 6.72 46%|████▋ | 4638/10000 [7:18:02<8:11:31, 5.50s/it] {'loss': 0.054, 'grad_norm': 2.8441367149353027, 'learning_rate': 2.3301307043093874e-05, 'epoch': 4.64} 46%|████▋ | 4638/10000 [7:18:02<8:11:31, 5.50s/it][2025-06-19 20:47:47,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:47:47,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.87 | bwd_microstep: 3332.70 | bwd_inner_microstep: 3331.74 | 
bwd_allreduce_microstep: 0.89 | step_microstep: 7.05 [2025-06-19 20:47:47,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.87 | bwd: 3332.73 | bwd_inner: 3331.74 | bwd_allreduce: 0.92 | step: 7.05 46%|████▋ | 4639/10000 [7:18:07<8:11:05, 5.50s/it] {'loss': 0.013, 'grad_norm': 1.2343257665634155, 'learning_rate': 2.3294918213832877e-05, 'epoch': 4.64} 46%|████▋ | 4639/10000 [7:18:07<8:11:05, 5.50s/it][2025-06-19 20:47:52,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:47:52,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.17 | bwd_microstep: 3315.39 | bwd_inner_microstep: 3314.48 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.71 [2025-06-19 20:47:52,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.17 | bwd: 3315.40 | bwd_inner: 3314.48 | bwd_allreduce: 0.88 | step: 7.72 46%|████▋ | 4640/10000 [7:18:13<8:10:16, 5.49s/it] {'loss': 0.0675, 'grad_norm': 3.1139767169952393, 'learning_rate': 2.328852903895026e-05, 'epoch': 4.64} 46%|████▋ | 4640/10000 [7:18:13<8:10:16, 5.49s/it][2025-06-19 20:47:58,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:47:58,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.62 | bwd_microstep: 3332.97 | bwd_inner_microstep: 3332.13 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.18 [2025-06-19 20:47:58,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.62 | bwd: 3332.98 | bwd_inner: 3332.13 | bwd_allreduce: 0.81 | step: 7.18 46%|████▋ | 4641/10000 [7:18:18<8:10:10, 5.49s/it] {'loss': 0.0468, 'grad_norm': 2.194852828979492, 'learning_rate': 2.3282139519116202e-05, 'epoch': 4.64} 46%|████▋ | 4641/10000 [7:18:18<8:10:10, 5.49s/it][2025-06-19 20:48:03,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:48:03,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.00 | bwd_microstep: 3372.65 | bwd_inner_microstep: 3371.76 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.96 [2025-06-19 20:48:03,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.00 | bwd: 3372.66 | bwd_inner: 3371.76 | bwd_allreduce: 0.86 | step: 6.96 46%|████▋ | 4642/10000 [7:18:24<8:11:33, 5.50s/it] {'loss': 0.0404, 'grad_norm': 1.720653772354126, 'learning_rate': 2.3275749655000944e-05, 'epoch': 4.64} 46%|████▋ | 4642/10000 [7:18:24<8:11:33, 5.50s/it][2025-06-19 20:48:09,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:48:09,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.69 | bwd_microstep: 3376.26 | bwd_inner_microstep: 3375.27 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.13 [2025-06-19 20:48:09,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.69 | bwd: 3376.27 | bwd_inner: 3375.27 | bwd_allreduce: 0.96 | step: 7.14 46%|████▋ | 4643/10000 [7:18:29<8:12:46, 5.52s/it] {'loss': 0.0104, 'grad_norm': 0.6233898997306824, 'learning_rate': 2.3269359447274757e-05, 'epoch': 4.64} 46%|████▋ | 4643/10000 [7:18:29<8:12:46, 5.52s/it][2025-06-19 20:48:14,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:48:14,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.07 | bwd_microstep: 3321.72 | bwd_inner_microstep: 3320.84 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.06 [2025-06-19 20:48:14,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.07 | bwd: 3321.74 | bwd_inner: 3320.84 | bwd_allreduce: 0.85 | step: 7.06 46%|████▋ | 4644/10000 [7:18:35<8:11:26, 5.51s/it] {'loss': 
0.0553, 'grad_norm': 2.5069167613983154, 'learning_rate': 2.326296889660793e-05, 'epoch': 4.64} 46%|████▋ | 4644/10000 [7:18:35<8:11:26, 5.51s/it][2025-06-19 20:48:20,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:48:20,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.22 | bwd_microstep: 3334.09 | bwd_inner_microstep: 3333.22 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.27 [2025-06-19 20:48:20,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.22 | bwd: 3334.11 | bwd_inner: 3333.22 | bwd_allreduce: 0.84 | step: 7.27 46%|████▋ | 4645/10000 [7:18:40<8:10:56, 5.50s/it] {'loss': 0.012, 'grad_norm': 0.7410423159599304, 'learning_rate': 2.325657800367081e-05, 'epoch': 4.64} 46%|████▋ | 4645/10000 [7:18:40<8:10:56, 5.50s/it][2025-06-19 20:48:25,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:48:25,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.19 | bwd_microstep: 3372.18 | bwd_inner_microstep: 3371.31 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.35 [2025-06-19 20:48:25,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.19 | bwd: 3372.20 | bwd_inner: 3371.31 | bwd_allreduce: 0.84 | step: 7.35 46%|████▋ | 4646/10000 [7:18:46<8:12:13, 5.52s/it] {'loss': 0.0022, 'grad_norm': 0.23952239751815796, 'learning_rate': 2.3250186769133775e-05, 'epoch': 4.65} 46%|████▋ | 4646/10000 [7:18:46<8:12:13, 5.52s/it][2025-06-19 20:48:31,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 20:48:31,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.31 | bwd_microstep: 3334.88 | bwd_inner_microstep: 3333.72 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.31 
[2025-06-19 20:48:31,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.31 | bwd: 3334.90 | bwd_inner: 3333.72 | bwd_allreduce: 1.12 | step: 8.32 46%|████▋ | 4647/10000 [7:18:51<8:11:29, 5.51s/it] {'loss': 0.0386, 'grad_norm': 1.7477127313613892, 'learning_rate': 2.324379519366723e-05, 'epoch': 4.65} 46%|████▋ | 4647/10000 [7:18:52<8:11:29, 5.51s/it][2025-06-19 20:48:36,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:48:36,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.61 | bwd_microstep: 3323.87 | bwd_inner_microstep: 3323.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 20:48:36,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.61 | bwd: 3323.88 | bwd_inner: 3323.09 | bwd_allreduce: 0.76 | step: 6.62 46%|████▋ | 4648/10000 [7:18:57<8:10:30, 5.50s/it] {'loss': 0.0144, 'grad_norm': 0.8608933687210083, 'learning_rate': 2.323740327794163e-05, 'epoch': 4.65} 46%|████▋ | 4648/10000 [7:18:57<8:10:30, 5.50s/it][2025-06-19 20:48:42,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 20:48:42,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.84 | bwd_microstep: 3406.41 | bwd_inner_microstep: 3405.20 | bwd_allreduce_microstep: 1.14 | step_microstep: 8.51 [2025-06-19 20:48:42,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.84 | bwd: 3406.43 | bwd_inner: 3405.20 | bwd_allreduce: 1.17 | step: 8.52 46%|████▋ | 4649/10000 [7:19:03<8:13:04, 5.53s/it] {'loss': 0.0281, 'grad_norm': 2.388376474380493, 'learning_rate': 2.3231011022627447e-05, 'epoch': 4.65} 46%|████▋ | 4649/10000 [7:19:03<8:13:04, 5.53s/it][2025-06-19 20:48:47,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | 
optimizer_step: 2.73 [2025-06-19 20:48:47,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.23 | bwd_microstep: 3333.64 | bwd_inner_microstep: 3332.79 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.31 [2025-06-19 20:48:47,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.23 | bwd: 3333.66 | bwd_inner: 3332.79 | bwd_allreduce: 0.81 | step: 7.31 46%|████▋ | 4650/10000 [7:19:08<8:12:06, 5.52s/it] {'loss': 0.0256, 'grad_norm': 1.7752662897109985, 'learning_rate': 2.3224618428395198e-05, 'epoch': 4.65} 46%|████▋ | 4650/10000 [7:19:08<8:12:06, 5.52s/it][2025-06-19 20:48:53,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:48:53,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.12 | bwd_microstep: 3373.75 | bwd_inner_microstep: 3372.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.61 [2025-06-19 20:48:53,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.12 | bwd: 3373.76 | bwd_inner: 3372.96 | bwd_allreduce: 0.76 | step: 6.61 47%|████▋ | 4651/10000 [7:19:14<8:12:37, 5.53s/it] {'loss': 0.0451, 'grad_norm': 2.953640937805176, 'learning_rate': 2.321822549591545e-05, 'epoch': 4.65} 47%|████▋ | 4651/10000 [7:19:14<8:12:37, 5.53s/it][2025-06-19 20:48:58,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:48:58,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.71 | bwd_microstep: 3320.30 | bwd_inner_microstep: 3319.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 20:48:58,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.71 | bwd: 3320.32 | bwd_inner: 3319.51 | bwd_allreduce: 0.76 | step: 6.62 47%|████▋ | 4652/10000 [7:19:19<8:10:56, 5.51s/it] {'loss': 0.0032, 'grad_norm': 0.2446785718202591, 'learning_rate': 
2.3211832225858775e-05, 'epoch': 4.65} 47%|████▋ | 4652/10000 [7:19:19<8:10:56, 5.51s/it][2025-06-19 20:49:04,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:49:04,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.13 | bwd_microstep: 3327.54 | bwd_inner_microstep: 3326.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 20:49:04,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.13 | bwd: 3327.55 | bwd_inner: 3326.74 | bwd_allreduce: 0.77 | step: 7.02 47%|████▋ | 4653/10000 [7:19:25<8:10:04, 5.50s/it] {'loss': 0.0086, 'grad_norm': 1.0566837787628174, 'learning_rate': 2.320543861889581e-05, 'epoch': 4.65} 47%|████▋ | 4653/10000 [7:19:25<8:10:04, 5.50s/it][2025-06-19 20:49:09,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:49:09,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.48 | bwd_microstep: 3321.44 | bwd_inner_microstep: 3320.58 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.81 [2025-06-19 20:49:09,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.48 | bwd: 3321.45 | bwd_inner: 3320.58 | bwd_allreduce: 0.83 | step: 6.82 47%|████▋ | 4654/10000 [7:19:30<8:09:04, 5.49s/it] {'loss': 0.0074, 'grad_norm': 1.1309984922409058, 'learning_rate': 2.319904467569722e-05, 'epoch': 4.65} 47%|████▋ | 4654/10000 [7:19:30<8:09:04, 5.49s/it][2025-06-19 20:49:15,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:49:15,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.44 | bwd_microstep: 3341.97 | bwd_inner_microstep: 3341.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 20:49:15,219] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.44 | bwd: 3341.99 | bwd_inner: 3341.17 | bwd_allreduce: 0.78 | step: 7.15 47%|████▋ | 4655/10000 [7:19:36<8:09:12, 5.49s/it] {'loss': 0.029, 'grad_norm': 1.2949601411819458, 'learning_rate': 2.3192650396933682e-05, 'epoch': 4.66} 47%|████▋ | 4655/10000 [7:19:36<8:09:12, 5.49s/it][2025-06-19 20:49:20,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:49:20,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.45 | bwd_microstep: 3376.25 | bwd_inner_microstep: 3375.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-19 20:49:20,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.45 | bwd: 3376.26 | bwd_inner: 3375.46 | bwd_allreduce: 0.76 | step: 6.76 47%|████▋ | 4656/10000 [7:19:41<8:10:34, 5.51s/it] {'loss': 0.0126, 'grad_norm': 0.6078200340270996, 'learning_rate': 2.3186255783275936e-05, 'epoch': 4.66} 47%|████▋ | 4656/10000 [7:19:41<8:10:34, 5.51s/it][2025-06-19 20:49:26,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:49:26,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.11 | bwd_microstep: 3320.96 | bwd_inner_microstep: 3319.99 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.62 [2025-06-19 20:49:26,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.11 | bwd: 3320.98 | bwd_inner: 3319.98 | bwd_allreduce: 0.95 | step: 7.63 47%|████▋ | 4657/10000 [7:19:47<8:09:37, 5.50s/it] {'loss': 0.0183, 'grad_norm': 1.3819174766540527, 'learning_rate': 2.3179860835394744e-05, 'epoch': 4.66} 47%|████▋ | 4657/10000 [7:19:47<8:09:37, 5.50s/it][2025-06-19 20:49:31,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 
20:49:31,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.80 | bwd_microstep: 3384.43 | bwd_inner_microstep: 3383.47 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.15 [2025-06-19 20:49:31,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.80 | bwd: 3384.45 | bwd_inner: 3383.47 | bwd_allreduce: 0.93 | step: 7.15 47%|████▋ | 4658/10000 [7:19:52<8:11:11, 5.52s/it] {'loss': 0.0045, 'grad_norm': 0.4342837333679199, 'learning_rate': 2.317346555396091e-05, 'epoch': 4.66} 47%|████▋ | 4658/10000 [7:19:52<8:11:11, 5.52s/it][2025-06-19 20:49:37,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:49:37,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.37 | bwd_microstep: 3332.58 | bwd_inner_microstep: 3331.64 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.94 [2025-06-19 20:49:37,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.37 | bwd: 3332.59 | bwd_inner: 3331.64 | bwd_allreduce: 0.90 | step: 6.94 47%|████▋ | 4659/10000 [7:19:58<8:10:18, 5.51s/it] {'loss': 0.0028, 'grad_norm': 0.24542897939682007, 'learning_rate': 2.316706993964527e-05, 'epoch': 4.66} 47%|████▋ | 4659/10000 [7:19:58<8:10:18, 5.51s/it][2025-06-19 20:49:42,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:49:42,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.84 | bwd_microstep: 3322.81 | bwd_inner_microstep: 3322.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 20:49:42,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.84 | bwd: 3322.83 | bwd_inner: 3322.00 | bwd_allreduce: 0.78 | step: 7.08 47%|████▋ | 4660/10000 [7:20:03<8:09:17, 5.50s/it] {'loss': 0.0097, 'grad_norm': 0.5556994080543518, 'learning_rate': 2.3160673993118686e-05, 'epoch': 
4.66} 47%|████▋ | 4660/10000 [7:20:03<8:09:17, 5.50s/it][2025-06-19 20:49:48,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:49:48,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.62 | bwd_microstep: 3325.52 | bwd_inner_microstep: 3324.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-19 20:49:48,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.62 | bwd: 3325.54 | bwd_inner: 3324.73 | bwd_allreduce: 0.77 | step: 6.97 47%|████▋ | 4661/10000 [7:20:09<8:08:27, 5.49s/it] {'loss': 0.0024, 'grad_norm': 0.24062290787696838, 'learning_rate': 2.315427771505208e-05, 'epoch': 4.66} 47%|████▋ | 4661/10000 [7:20:09<8:08:27, 5.49s/it][2025-06-19 20:49:53,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:49:53,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.07 | bwd_microstep: 3375.43 | bwd_inner_microstep: 3374.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 20:49:53,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.07 | bwd: 3375.44 | bwd_inner: 3374.64 | bwd_allreduce: 0.76 | step: 6.64 47%|████▋ | 4662/10000 [7:20:14<8:09:49, 5.51s/it] {'loss': 0.0007, 'grad_norm': 0.053769633173942566, 'learning_rate': 2.3147881106116373e-05, 'epoch': 4.66} 47%|████▋ | 4662/10000 [7:20:14<8:09:49, 5.51s/it][2025-06-19 20:49:59,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:49:59,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.84 | bwd_microstep: 3329.16 | bwd_inner_microstep: 3328.28 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.68 [2025-06-19 20:49:59,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2120.84 | bwd: 3329.18 | bwd_inner: 3328.28 | bwd_allreduce: 0.85 | step: 6.68 47%|████▋ | 4663/10000 [7:20:20<8:09:12, 5.50s/it] {'loss': 0.0016, 'grad_norm': 0.14111678302288055, 'learning_rate': 2.3141484166982545e-05, 'epoch': 4.66} 47%|████▋ | 4663/10000 [7:20:20<8:09:12, 5.50s/it][2025-06-19 20:50:04,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:50:04,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.54 | bwd_microstep: 3379.77 | bwd_inner_microstep: 3378.96 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 20:50:04,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.54 | bwd: 3379.79 | bwd_inner: 3378.96 | bwd_allreduce: 0.78 | step: 7.21 47%|████▋ | 4664/10000 [7:20:25<8:10:29, 5.52s/it] {'loss': 0.0221, 'grad_norm': 2.2564198970794678, 'learning_rate': 2.313508689832162e-05, 'epoch': 4.66} 47%|████▋ | 4664/10000 [7:20:25<8:10:29, 5.52s/it][2025-06-19 20:50:10,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:50:10,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.66 | bwd_microstep: 3373.60 | bwd_inner_microstep: 3372.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 20:50:10,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.66 | bwd: 3373.61 | bwd_inner: 3372.81 | bwd_allreduce: 0.76 | step: 6.73 47%|████▋ | 4665/10000 [7:20:31<8:11:05, 5.52s/it] {'loss': 0.0047, 'grad_norm': 0.24492418766021729, 'learning_rate': 2.312868930080462e-05, 'epoch': 4.67} 47%|████▋ | 4665/10000 [7:20:31<8:11:05, 5.52s/it][2025-06-19 20:50:15,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:50:15,829] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2105.63 | bwd_microstep: 3327.31 | bwd_inner_microstep: 3326.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 20:50:15,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.64 | bwd: 3327.33 | bwd_inner: 3326.50 | bwd_allreduce: 0.78 | step: 7.24 47%|████▋ | 4666/10000 [7:20:36<8:09:42, 5.51s/it] {'loss': 0.1479, 'grad_norm': 4.714169025421143, 'learning_rate': 2.312229137510264e-05, 'epoch': 4.67} 47%|████▋ | 4666/10000 [7:20:36<8:09:42, 5.51s/it][2025-06-19 20:50:21,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:50:21,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.53 | bwd_microstep: 3328.10 | bwd_inner_microstep: 3327.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 20:50:21,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.53 | bwd: 3328.11 | bwd_inner: 3327.31 | bwd_allreduce: 0.76 | step: 6.69 47%|████▋ | 4667/10000 [7:20:42<8:08:53, 5.50s/it] {'loss': 0.0915, 'grad_norm': 3.3617794513702393, 'learning_rate': 2.3115893121886778e-05, 'epoch': 4.67} 47%|████▋ | 4667/10000 [7:20:42<8:08:53, 5.50s/it][2025-06-19 20:50:26,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:50:26,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.93 | bwd_microstep: 3405.02 | bwd_inner_microstep: 3404.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 20:50:26,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.93 | bwd: 3405.04 | bwd_inner: 3404.22 | bwd_allreduce: 0.78 | step: 6.86 47%|████▋ | 4668/10000 [7:20:47<8:11:07, 5.53s/it] {'loss': 0.0044, 'grad_norm': 0.3111865520477295, 'learning_rate': 2.3109494541828196e-05, 'epoch': 4.67} 47%|████▋ | 4668/10000 [7:20:47<8:11:07, 
5.53s/it][2025-06-19 20:50:32,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:50:32,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.56 | bwd_microstep: 3326.05 | bwd_inner_microstep: 3325.10 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.31 [2025-06-19 20:50:32,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.56 | bwd: 3326.07 | bwd_inner: 3325.10 | bwd_allreduce: 0.93 | step: 7.32 47%|████▋ | 4669/10000 [7:20:53<8:09:43, 5.51s/it] {'loss': 0.0296, 'grad_norm': 1.7431925535202026, 'learning_rate': 2.310309563559806e-05, 'epoch': 4.67} 47%|████▋ | 4669/10000 [7:20:53<8:09:43, 5.51s/it][2025-06-19 20:50:37,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:50:37,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.59 | bwd_microstep: 3371.90 | bwd_inner_microstep: 3371.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 20:50:37,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.59 | bwd: 3371.92 | bwd_inner: 3371.11 | bwd_allreduce: 0.76 | step: 6.82 47%|████▋ | 4670/10000 [7:20:58<8:10:39, 5.52s/it] {'loss': 0.0104, 'grad_norm': 0.9108227491378784, 'learning_rate': 2.3096696403867597e-05, 'epoch': 4.67} 47%|████▋ | 4670/10000 [7:20:58<8:10:39, 5.52s/it][2025-06-19 20:50:43,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:50:43,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.36 | bwd_microstep: 3327.07 | bwd_inner_microstep: 3326.21 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.84 [2025-06-19 20:50:43,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.36 | bwd: 3327.08 | bwd_inner: 3326.21 | 
bwd_allreduce: 0.83 | step: 6.84 47%|████▋ | 4671/10000 [7:21:04<8:09:18, 5.51s/it] {'loss': 0.0059, 'grad_norm': 0.5850721001625061, 'learning_rate': 2.3090296847308057e-05, 'epoch': 4.67} 47%|████▋ | 4671/10000 [7:21:04<8:09:18, 5.51s/it][2025-06-19 20:50:48,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:50:48,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.13 | bwd_microstep: 3331.49 | bwd_inner_microstep: 3330.67 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 20:50:48,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.13 | bwd: 3331.50 | bwd_inner: 3330.67 | bwd_allreduce: 0.79 | step: 7.12 47%|████▋ | 4672/10000 [7:21:09<8:08:37, 5.50s/it] {'loss': 0.1102, 'grad_norm': 3.4270694255828857, 'learning_rate': 2.3083896966590716e-05, 'epoch': 4.67} 47%|████▋ | 4672/10000 [7:21:09<8:08:37, 5.50s/it][2025-06-19 20:50:54,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:50:54,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.91 | bwd_microstep: 3322.32 | bwd_inner_microstep: 3321.35 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.58 [2025-06-19 20:50:54,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.91 | bwd: 3322.33 | bwd_inner: 3321.35 | bwd_allreduce: 0.93 | step: 7.58 47%|████▋ | 4673/10000 [7:21:15<8:08:06, 5.50s/it] {'loss': 0.0037, 'grad_norm': 0.19171638786792755, 'learning_rate': 2.3077496762386895e-05, 'epoch': 4.67} 47%|████▋ | 4673/10000 [7:21:15<8:08:06, 5.50s/it][2025-06-19 20:50:59,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:50:59,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.36 | 
bwd_microstep: 3318.58 | bwd_inner_microstep: 3317.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.68 [2025-06-19 20:50:59,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.36 | bwd: 3318.59 | bwd_inner: 3317.77 | bwd_allreduce: 0.78 | step: 6.68 47%|████▋ | 4674/10000 [7:21:20<8:07:22, 5.49s/it] {'loss': 0.0261, 'grad_norm': 1.8978440761566162, 'learning_rate': 2.3071096235367955e-05, 'epoch': 4.67} 47%|████▋ | 4674/10000 [7:21:20<8:07:22, 5.49s/it][2025-06-19 20:51:05,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:51:05,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.01 | bwd_microstep: 3323.58 | bwd_inner_microstep: 3322.60 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.21 [2025-06-19 20:51:05,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.01 | bwd: 3323.60 | bwd_inner: 3322.60 | bwd_allreduce: 0.95 | step: 7.22 47%|████▋ | 4675/10000 [7:21:26<8:06:54, 5.49s/it] {'loss': 0.0011, 'grad_norm': 0.08472167700529099, 'learning_rate': 2.3064695386205264e-05, 'epoch': 4.67} 47%|████▋ | 4675/10000 [7:21:26<8:06:54, 5.49s/it][2025-06-19 20:51:10,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:51:10,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.51 | bwd_microstep: 3404.47 | bwd_inner_microstep: 3403.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 20:51:10,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.51 | bwd: 3404.49 | bwd_inner: 3403.68 | bwd_allreduce: 0.76 | step: 6.68 47%|████▋ | 4676/10000 [7:21:31<8:09:33, 5.52s/it] {'loss': 0.0319, 'grad_norm': 1.667412281036377, 'learning_rate': 2.3058294215570257e-05, 'epoch': 4.68} 47%|████▋ | 4676/10000 [7:21:31<8:09:33, 5.52s/it][2025-06-19 20:51:16,379] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:51:16,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.66 | bwd_microstep: 3320.52 | bwd_inner_microstep: 3319.69 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.72 [2025-06-19 20:51:16,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.66 | bwd: 3320.53 | bwd_inner: 3319.69 | bwd_allreduce: 0.80 | step: 6.72 47%|████▋ | 4677/10000 [7:21:37<8:08:03, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.037549663335084915, 'learning_rate': 2.3051892724134376e-05, 'epoch': 4.68} 47%|████▋ | 4677/10000 [7:21:37<8:08:03, 5.50s/it][2025-06-19 20:51:21,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:51:21,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.07 | bwd_microstep: 3399.87 | bwd_inner_microstep: 3398.97 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.34 [2025-06-19 20:51:21,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.07 | bwd: 3399.89 | bwd_inner: 3398.97 | bwd_allreduce: 0.87 | step: 7.35 47%|████▋ | 4678/10000 [7:21:42<8:09:58, 5.52s/it] {'loss': 0.006, 'grad_norm': 0.47487661242485046, 'learning_rate': 2.3045490912569114e-05, 'epoch': 4.68} 47%|████▋ | 4678/10000 [7:21:42<8:09:58, 5.52s/it][2025-06-19 20:51:27,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:51:27,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.34 | bwd_microstep: 3374.72 | bwd_inner_microstep: 3373.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 20:51:27,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.34 | bwd: 3374.74 | bwd_inner: 3373.94 | bwd_allreduce: 0.76 | step: 6.71 47%|████▋ | 
4679/10000 [7:21:48<8:10:25, 5.53s/it] {'loss': 0.0251, 'grad_norm': 1.6155362129211426, 'learning_rate': 2.3039088781545992e-05, 'epoch': 4.68} 47%|████▋ | 4679/10000 [7:21:48<8:10:25, 5.53s/it][2025-06-19 20:51:32,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:51:32,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.93 | bwd_microstep: 3323.55 | bwd_inner_microstep: 3322.56 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.30 [2025-06-19 20:51:32,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.93 | bwd: 3323.57 | bwd_inner: 3322.56 | bwd_allreduce: 0.96 | step: 7.30 47%|████▋ | 4680/10000 [7:21:53<8:09:00, 5.52s/it] {'loss': 0.0125, 'grad_norm': 1.5703198909759521, 'learning_rate': 2.303268633173656e-05, 'epoch': 4.68} 47%|████▋ | 4680/10000 [7:21:53<8:09:00, 5.52s/it][2025-06-19 20:51:38,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 20:51:38,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.76 | bwd_microstep: 3365.37 | bwd_inner_microstep: 3364.59 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 20:51:38,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.76 | bwd: 3365.38 | bwd_inner: 3364.59 | bwd_allreduce: 0.75 | step: 6.57 47%|████▋ | 4681/10000 [7:21:59<8:09:26, 5.52s/it] {'loss': 0.0115, 'grad_norm': 0.8771171569824219, 'learning_rate': 2.3026283563812404e-05, 'epoch': 4.68} 47%|████▋ | 4681/10000 [7:21:59<8:09:26, 5.52s/it][2025-06-19 20:51:44,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.80 [2025-06-19 20:51:44,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.77 | bwd_microstep: 3368.79 | bwd_inner_microstep: 3368.01 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.89 [2025-06-19 20:51:44,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.77 | bwd: 3368.80 | bwd_inner: 3368.01 | bwd_allreduce: 0.75 | step: 6.89 47%|████▋ | 4682/10000 [7:22:04<8:09:37, 5.52s/it] {'loss': 0.1286, 'grad_norm': 3.8212509155273438, 'learning_rate': 2.301988047844516e-05, 'epoch': 4.68} 47%|████▋ | 4682/10000 [7:22:04<8:09:37, 5.52s/it][2025-06-19 20:51:49,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:51:49,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.89 | bwd_microstep: 3374.11 | bwd_inner_microstep: 3373.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-19 20:51:49,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.89 | bwd: 3374.12 | bwd_inner: 3373.33 | bwd_allreduce: 0.75 | step: 6.76 47%|████▋ | 4683/10000 [7:22:10<8:10:07, 5.53s/it] {'loss': 0.0084, 'grad_norm': 0.5786840915679932, 'learning_rate': 2.3013477076306457e-05, 'epoch': 4.68} 47%|████▋ | 4683/10000 [7:22:10<8:10:07, 5.53s/it][2025-06-19 20:51:55,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:51:55,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.75 | bwd_microstep: 3396.97 | bwd_inner_microstep: 3395.96 | bwd_allreduce_microstep: 0.94 | step_microstep: 8.08 [2025-06-19 20:51:55,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.75 | bwd: 3396.98 | bwd_inner: 3395.96 | bwd_allreduce: 0.97 | step: 8.10 47%|████▋ | 4684/10000 [7:22:15<8:11:13, 5.54s/it] {'loss': 0.0131, 'grad_norm': 0.6522448658943176, 'learning_rate': 2.3007073358068e-05, 'epoch': 4.68} 47%|████▋ | 4684/10000 [7:22:15<8:11:13, 5.54s/it][2025-06-19 20:52:00,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 20:52:00,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.07 | bwd_microstep: 3321.69 | bwd_inner_microstep: 3320.91 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 20:52:00,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.07 | bwd: 3321.70 | bwd_inner: 3320.91 | bwd_allreduce: 0.75 | step: 6.66 47%|████▋ | 4685/10000 [7:22:21<8:09:01, 5.52s/it] {'loss': 0.0379, 'grad_norm': 4.808115482330322, 'learning_rate': 2.30006693244015e-05, 'epoch': 4.69} 47%|████▋ | 4685/10000 [7:22:21<8:09:01, 5.52s/it][2025-06-19 20:52:06,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.84 [2025-06-19 20:52:06,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.36 | bwd_microstep: 3325.00 | bwd_inner_microstep: 3324.17 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.09 [2025-06-19 20:52:06,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.36 | bwd: 3325.02 | bwd_inner: 3324.17 | bwd_allreduce: 0.80 | step: 7.09 47%|████▋ | 4686/10000 [7:22:26<8:07:40, 5.51s/it] {'loss': 0.0022, 'grad_norm': 0.2754237651824951, 'learning_rate': 2.2994264975978713e-05, 'epoch': 4.69} 47%|████▋ | 4686/10000 [7:22:26<8:07:40, 5.51s/it][2025-06-19 20:52:11,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 20:52:11,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.36 | bwd_microstep: 3374.05 | bwd_inner_microstep: 3373.10 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.09 [2025-06-19 20:52:11,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.36 | bwd: 3374.07 | bwd_inner: 3373.10 | bwd_allreduce: 0.92 | step: 7.09 47%|████▋ | 4687/10000 [7:22:32<8:08:44, 5.52s/it] {'loss': 
0.0088, 'grad_norm': 0.44865164160728455, 'learning_rate': 2.298786031347143e-05, 'epoch': 4.69} 47%|████▋ | 4687/10000 [7:22:32<8:08:44, 5.52s/it][2025-06-19 20:52:17,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:52:17,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.51 | bwd_microstep: 3392.17 | bwd_inner_microstep: 3391.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 20:52:17,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.51 | bwd: 3392.19 | bwd_inner: 3391.37 | bwd_allreduce: 0.77 | step: 6.77 47%|████▋ | 4688/10000 [7:22:38<8:09:46, 5.53s/it] {'loss': 0.021, 'grad_norm': 2.6449432373046875, 'learning_rate': 2.298145533755147e-05, 'epoch': 4.69} 47%|████▋ | 4688/10000 [7:22:38<8:09:46, 5.53s/it][2025-06-19 20:52:22,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:52:22,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.05 | bwd_microstep: 3332.54 | bwd_inner_microstep: 3331.73 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.03 [2025-06-19 20:52:22,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.05 | bwd: 3332.56 | bwd_inner: 3331.73 | bwd_allreduce: 0.79 | step: 7.03 47%|████▋ | 4689/10000 [7:22:43<8:08:05, 5.51s/it] {'loss': 0.0044, 'grad_norm': 0.5252096652984619, 'learning_rate': 2.297505004889068e-05, 'epoch': 4.69} 47%|████▋ | 4689/10000 [7:22:43<8:08:05, 5.51s/it][2025-06-19 20:52:28,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:52:28,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.60 | bwd_microstep: 3323.31 | bwd_inner_microstep: 3322.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 
[2025-06-19 20:52:28,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.60 | bwd: 3323.32 | bwd_inner: 3322.51 | bwd_allreduce: 0.77 | step: 6.93 47%|████▋ | 4690/10000 [7:22:48<8:06:52, 5.50s/it] {'loss': 0.0042, 'grad_norm': 0.40037810802459717, 'learning_rate': 2.2968644448160947e-05, 'epoch': 4.69} 47%|████▋ | 4690/10000 [7:22:48<8:06:52, 5.50s/it][2025-06-19 20:52:33,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:52:33,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.71 | bwd_microstep: 3314.85 | bwd_inner_microstep: 3313.91 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.01 [2025-06-19 20:52:33,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.71 | bwd: 3314.87 | bwd_inner: 3313.91 | bwd_allreduce: 0.92 | step: 7.01 47%|████▋ | 4691/10000 [7:22:54<8:05:30, 5.49s/it] {'loss': 0.0606, 'grad_norm': 3.671215534210205, 'learning_rate': 2.2962238536034188e-05, 'epoch': 4.69} 47%|████▋ | 4691/10000 [7:22:54<8:05:30, 5.49s/it][2025-06-19 20:52:39,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:52:39,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.86 | bwd_microstep: 3316.19 | bwd_inner_microstep: 3315.31 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.94 [2025-06-19 20:52:39,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.86 | bwd: 3316.20 | bwd_inner: 3315.31 | bwd_allreduce: 0.86 | step: 6.96 47%|████▋ | 4692/10000 [7:22:59<8:04:50, 5.48s/it] {'loss': 0.0061, 'grad_norm': 0.5214993357658386, 'learning_rate': 2.295583231318236e-05, 'epoch': 4.69} 47%|████▋ | 4692/10000 [7:22:59<8:04:50, 5.48s/it][2025-06-19 20:52:44,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | 
optimizer_step: 2.73 [2025-06-19 20:52:44,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.70 | bwd_microstep: 3363.47 | bwd_inner_microstep: 3362.56 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.91 [2025-06-19 20:52:44,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.70 | bwd: 3363.48 | bwd_inner: 3362.56 | bwd_allreduce: 0.87 | step: 6.92 47%|████▋ | 4693/10000 [7:23:05<8:06:04, 5.50s/it] {'loss': 0.0118, 'grad_norm': 1.0614997148513794, 'learning_rate': 2.2949425780277433e-05, 'epoch': 4.69} 47%|████▋ | 4693/10000 [7:23:05<8:06:04, 5.50s/it][2025-06-19 20:52:50,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:52:50,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.99 | bwd_microstep: 3315.69 | bwd_inner_microstep: 3314.86 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.93 [2025-06-19 20:52:50,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.99 | bwd: 3315.71 | bwd_inner: 3314.86 | bwd_allreduce: 0.80 | step: 6.94 47%|████▋ | 4694/10000 [7:23:10<8:05:06, 5.49s/it] {'loss': 0.2282, 'grad_norm': 4.148348331451416, 'learning_rate': 2.2943018937991438e-05, 'epoch': 4.69} 47%|████▋ | 4694/10000 [7:23:10<8:05:06, 5.49s/it][2025-06-19 20:52:55,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:52:55,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.89 | bwd_microstep: 3320.52 | bwd_inner_microstep: 3319.71 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.28 [2025-06-19 20:52:55,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.89 | bwd: 3320.54 | bwd_inner: 3319.71 | bwd_allreduce: 0.78 | step: 7.28 47%|████▋ | 4695/10000 [7:23:16<8:04:42, 5.48s/it] {'loss': 0.0338, 'grad_norm': 4.698261260986328, 'learning_rate': 
2.2936611786996407e-05, 'epoch': 4.7} 47%|████▋ | 4695/10000 [7:23:16<8:04:42, 5.48s/it][2025-06-19 20:53:01,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:53:01,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.79 | bwd_microstep: 3316.59 | bwd_inner_microstep: 3315.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 20:53:01,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.79 | bwd: 3316.61 | bwd_inner: 3315.79 | bwd_allreduce: 0.77 | step: 6.72 47%|████▋ | 4696/10000 [7:23:21<8:04:01, 5.48s/it] {'loss': 0.0046, 'grad_norm': 0.27174657583236694, 'learning_rate': 2.2930204327964436e-05, 'epoch': 4.7} 47%|████▋ | 4696/10000 [7:23:21<8:04:01, 5.48s/it][2025-06-19 20:53:06,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.74 | optimizer_step: 2.72 [2025-06-19 20:53:06,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.51 | bwd_microstep: 3366.16 | bwd_inner_microstep: 3365.24 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.81 [2025-06-19 20:53:06,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.51 | bwd: 3366.18 | bwd_inner: 3365.24 | bwd_allreduce: 0.89 | step: 7.82 47%|████▋ | 4697/10000 [7:23:27<8:05:22, 5.49s/it] {'loss': 0.009, 'grad_norm': 1.495690941810608, 'learning_rate': 2.292379656156763e-05, 'epoch': 4.7} 47%|████▋ | 4697/10000 [7:23:27<8:05:22, 5.49s/it][2025-06-19 20:53:12,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:53:12,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.80 | bwd_microstep: 3323.32 | bwd_inner_microstep: 3322.45 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.93 [2025-06-19 20:53:12,009] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2107.80 | bwd: 3323.33 | bwd_inner: 3322.45 | bwd_allreduce: 0.84 | step: 6.93 47%|████▋ | 4698/10000 [7:23:32<8:04:42, 5.49s/it] {'loss': 0.0815, 'grad_norm': 3.1488773822784424, 'learning_rate': 2.2917388488478133e-05, 'epoch': 4.7} 47%|████▋ | 4698/10000 [7:23:32<8:04:42, 5.49s/it][2025-06-19 20:53:17,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.73 [2025-06-19 20:53:17,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.36 | bwd_microstep: 3371.08 | bwd_inner_microstep: 3370.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 20:53:17,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.36 | bwd: 3371.09 | bwd_inner: 3370.29 | bwd_allreduce: 0.75 | step: 6.69 47%|████▋ | 4699/10000 [7:23:38<8:06:03, 5.50s/it] {'loss': 0.0211, 'grad_norm': 2.1030609607696533, 'learning_rate': 2.2910980109368125e-05, 'epoch': 4.7} 47%|████▋ | 4699/10000 [7:23:38<8:06:03, 5.50s/it][2025-06-19 20:53:23,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:53:23,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.58 | bwd_microstep: 3363.51 | bwd_inner_microstep: 3362.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 20:53:23,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.58 | bwd: 3363.53 | bwd_inner: 3362.69 | bwd_allreduce: 0.79 | step: 6.95 47%|████▋ | 4700/10000 [7:23:43<8:06:33, 5.51s/it] {'loss': 0.0206, 'grad_norm': 2.0229172706604004, 'learning_rate': 2.290457142490981e-05, 'epoch': 4.7} 47%|████▋ | 4700/10000 [7:23:43<8:06:33, 5.51s/it][2025-06-19 20:53:28,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:53:28,579] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.05 | bwd_microstep: 3324.90 | bwd_inner_microstep: 3324.01 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.38 [2025-06-19 20:53:28,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.05 | bwd: 3324.91 | bwd_inner: 3324.01 | bwd_allreduce: 0.85 | step: 7.38 47%|████▋ | 4701/10000 [7:23:49<8:06:26, 5.51s/it] {'loss': 0.0039, 'grad_norm': 0.31874707341194153, 'learning_rate': 2.2898162435775432e-05, 'epoch': 4.7} 47%|████▋ | 4701/10000 [7:23:49<8:06:26, 5.51s/it][2025-06-19 20:53:34,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:53:34,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.83 | bwd_microstep: 3377.92 | bwd_inner_microstep: 3377.05 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.27 [2025-06-19 20:53:34,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.83 | bwd: 3377.94 | bwd_inner: 3377.05 | bwd_allreduce: 0.83 | step: 7.27 47%|████▋ | 4702/10000 [7:23:54<8:08:00, 5.53s/it] {'loss': 0.057, 'grad_norm': 2.2576310634613037, 'learning_rate': 2.2891753142637273e-05, 'epoch': 4.7} 47%|████▋ | 4702/10000 [7:23:54<8:08:00, 5.53s/it][2025-06-19 20:53:39,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:53:39,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.03 | bwd_microstep: 3329.25 | bwd_inner_microstep: 3328.40 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.28 [2025-06-19 20:53:39,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.03 | bwd: 3329.28 | bwd_inner: 3328.40 | bwd_allreduce: 0.82 | step: 7.28 47%|████▋ | 4703/10000 [7:24:00<8:07:15, 5.52s/it] {'loss': 0.0011, 'grad_norm': 0.07285618782043457, 'learning_rate': 2.288534354616762e-05, 'epoch': 4.7} 47%|████▋ | 
4703/10000 [7:24:00<8:07:15, 5.52s/it][2025-06-19 20:53:45,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:53:45,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.75 | bwd_microstep: 3331.22 | bwd_inner_microstep: 3330.19 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.24 [2025-06-19 20:53:45,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.75 | bwd: 3331.24 | bwd_inner: 3330.19 | bwd_allreduce: 1.00 | step: 7.25 47%|████▋ | 4704/10000 [7:24:05<8:07:11, 5.52s/it] {'loss': 0.0105, 'grad_norm': 0.9434320330619812, 'learning_rate': 2.2878933647038828e-05, 'epoch': 4.7} 47%|████▋ | 4704/10000 [7:24:05<8:07:11, 5.52s/it][2025-06-19 20:53:50,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.73 [2025-06-19 20:53:50,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.85 | bwd_microstep: 3332.91 | bwd_inner_microstep: 3331.55 | bwd_allreduce_microstep: 1.26 | step_microstep: 9.10 [2025-06-19 20:53:50,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.85 | bwd: 3332.94 | bwd_inner: 3331.55 | bwd_allreduce: 1.31 | step: 9.11 47%|████▋ | 4705/10000 [7:24:11<8:06:24, 5.51s/it] {'loss': 0.003, 'grad_norm': 0.3764071762561798, 'learning_rate': 2.2872523445923254e-05, 'epoch': 4.71} 47%|████▋ | 4705/10000 [7:24:11<8:06:24, 5.51s/it][2025-06-19 20:53:56,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:53:56,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.14 | bwd_microstep: 3322.18 | bwd_inner_microstep: 3321.23 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.56 [2025-06-19 20:53:56,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.14 | bwd: 3322.20 
| bwd_inner: 3321.23 | bwd_allreduce: 0.93 | step: 7.56 47%|████▋ | 4706/10000 [7:24:16<8:05:43, 5.50s/it] {'loss': 0.1271, 'grad_norm': 5.125049591064453, 'learning_rate': 2.2866112943493307e-05, 'epoch': 4.71} 47%|████▋ | 4706/10000 [7:24:16<8:05:43, 5.50s/it][2025-06-19 20:54:01,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:54:01,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.73 | bwd_microstep: 3378.12 | bwd_inner_microstep: 3377.13 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.10 [2025-06-19 20:54:01,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.73 | bwd: 3378.15 | bwd_inner: 3377.13 | bwd_allreduce: 0.95 | step: 7.11 47%|████▋ | 4707/10000 [7:24:22<8:06:44, 5.52s/it] {'loss': 0.0027, 'grad_norm': 0.23431962728500366, 'learning_rate': 2.2859702140421413e-05, 'epoch': 4.71} 47%|████▋ | 4707/10000 [7:24:22<8:06:44, 5.52s/it][2025-06-19 20:54:07,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:54:07,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.82 | bwd_microstep: 3372.23 | bwd_inner_microstep: 3371.39 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.79 [2025-06-19 20:54:07,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.82 | bwd: 3372.25 | bwd_inner: 3371.39 | bwd_allreduce: 0.81 | step: 6.79 47%|████▋ | 4708/10000 [7:24:28<8:07:25, 5.53s/it] {'loss': 0.0787, 'grad_norm': 1.6604583263397217, 'learning_rate': 2.285329103738003e-05, 'epoch': 4.71} 47%|████▋ | 4708/10000 [7:24:28<8:07:25, 5.53s/it][2025-06-19 20:54:12,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 20:54:12,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2114.88 | bwd_microstep: 3327.62 | bwd_inner_microstep: 3326.41 | bwd_allreduce_microstep: 1.12 | step_microstep: 8.06 [2025-06-19 20:54:12,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.88 | bwd: 3327.65 | bwd_inner: 3326.41 | bwd_allreduce: 1.16 | step: 8.06 47%|████▋ | 4709/10000 [7:24:33<8:06:13, 5.51s/it] {'loss': 0.0128, 'grad_norm': 1.7572399377822876, 'learning_rate': 2.2846879635041674e-05, 'epoch': 4.71} 47%|████▋ | 4709/10000 [7:24:33<8:06:13, 5.51s/it][2025-06-19 20:54:18,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 20:54:18,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.34 | bwd_microstep: 3333.73 | bwd_inner_microstep: 3332.51 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.95 [2025-06-19 20:54:18,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.34 | bwd: 3333.77 | bwd_inner: 3332.51 | bwd_allreduce: 1.17 | step: 7.94 47%|████▋ | 4710/10000 [7:24:39<8:06:03, 5.51s/it] {'loss': 0.0092, 'grad_norm': 1.1389031410217285, 'learning_rate': 2.2840467934078846e-05, 'epoch': 4.71} 47%|████▋ | 4710/10000 [7:24:39<8:06:03, 5.51s/it][2025-06-19 20:54:23,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:54:23,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2157.51 | bwd_microstep: 3384.54 | bwd_inner_microstep: 3383.67 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.23 [2025-06-19 20:54:23,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2157.51 | bwd: 3384.57 | bwd_inner: 3383.67 | bwd_allreduce: 0.83 | step: 7.23 47%|████▋ | 4711/10000 [7:24:44<8:07:56, 5.54s/it] {'loss': 0.0039, 'grad_norm': 0.3090524673461914, 'learning_rate': 2.2834055935164116e-05, 'epoch': 4.71} 47%|████▋ | 4711/10000 [7:24:44<8:07:56, 5.54s/it][2025-06-19 20:54:29,342] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 20:54:29,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.87 | bwd_microstep: 3331.02 | bwd_inner_microstep: 3330.01 | bwd_allreduce_microstep: 0.93 | step_microstep: 8.15 [2025-06-19 20:54:29,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.87 | bwd: 3331.04 | bwd_inner: 3330.01 | bwd_allreduce: 0.96 | step: 8.16 47%|████▋ | 4712/10000 [7:24:50<8:07:20, 5.53s/it] {'loss': 0.0017, 'grad_norm': 0.07886789739131927, 'learning_rate': 2.282764363897007e-05, 'epoch': 4.71} 47%|████▋ | 4712/10000 [7:24:50<8:07:20, 5.53s/it][2025-06-19 20:54:34,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:54:34,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2169.42 | bwd_microstep: 3398.56 | bwd_inner_microstep: 3397.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 20:54:34,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2169.42 | bwd: 3398.58 | bwd_inner: 3397.76 | bwd_allreduce: 0.77 | step: 6.60 47%|████▋ | 4713/10000 [7:24:55<8:09:27, 5.55s/it] {'loss': 0.0026, 'grad_norm': 0.27840107679367065, 'learning_rate': 2.282123104616933e-05, 'epoch': 4.71} 47%|████▋ | 4713/10000 [7:24:55<8:09:27, 5.55s/it][2025-06-19 20:54:40,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:54:40,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.59 | bwd_microstep: 3400.85 | bwd_inner_microstep: 3400.01 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.86 [2025-06-19 20:54:40,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.59 | bwd: 3400.87 | bwd_inner: 3400.01 | bwd_allreduce: 0.80 | step: 6.86 
47%|████▋ | 4714/10000 [7:25:01<8:10:17, 5.57s/it] {'loss': 0.0359, 'grad_norm': 2.1428780555725098, 'learning_rate': 2.281481815743455e-05, 'epoch': 4.71} 47%|████▋ | 4714/10000 [7:25:01<8:10:17, 5.57s/it][2025-06-19 20:54:46,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.72 [2025-06-19 20:54:46,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.64 | bwd_microstep: 3326.11 | bwd_inner_microstep: 3324.58 | bwd_allreduce_microstep: 1.36 | step_microstep: 10.08 [2025-06-19 20:54:46,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.64 | bwd: 3326.13 | bwd_inner: 3324.58 | bwd_allreduce: 1.47 | step: 10.10 47%|████▋ | 4715/10000 [7:25:06<8:08:41, 5.55s/it] {'loss': 0.0014, 'grad_norm': 0.0931818038225174, 'learning_rate': 2.280840497343841e-05, 'epoch': 4.71} 47%|████▋ | 4715/10000 [7:25:06<8:08:41, 5.55s/it][2025-06-19 20:54:51,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:54:51,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.09 | bwd_microstep: 3380.96 | bwd_inner_microstep: 3379.98 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.29 [2025-06-19 20:54:51,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.09 | bwd: 3380.98 | bwd_inner: 3379.98 | bwd_allreduce: 0.95 | step: 7.30 47%|████▋ | 4716/10000 [7:25:12<8:10:01, 5.56s/it] {'loss': 0.0011, 'grad_norm': 0.0691501721739769, 'learning_rate': 2.280199149485362e-05, 'epoch': 4.72} 47%|████▋ | 4716/10000 [7:25:12<8:10:01, 5.56s/it][2025-06-19 20:54:57,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.72 [2025-06-19 20:54:57,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.37 | bwd_microstep: 3387.28 | bwd_inner_microstep: 
3386.07 | bwd_allreduce_microstep: 1.08 | step_microstep: 9.26 [2025-06-19 20:54:57,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.37 | bwd: 3387.33 | bwd_inner: 3386.07 | bwd_allreduce: 1.15 | step: 9.24 47%|████▋ | 4717/10000 [7:25:18<8:10:30, 5.57s/it] {'loss': 0.0125, 'grad_norm': 1.4233806133270264, 'learning_rate': 2.2795577722352933e-05, 'epoch': 4.72} 47%|████▋ | 4717/10000 [7:25:18<8:10:30, 5.57s/it][2025-06-19 20:55:02,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:55:02,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.71 | bwd_microstep: 3332.98 | bwd_inner_microstep: 3331.85 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.77 [2025-06-19 20:55:02,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.71 | bwd: 3333.00 | bwd_inner: 3331.85 | bwd_allreduce: 1.09 | step: 7.78 47%|████▋ | 4718/10000 [7:25:23<8:08:57, 5.55s/it] {'loss': 0.0197, 'grad_norm': 2.4665427207946777, 'learning_rate': 2.278916365660911e-05, 'epoch': 4.72} 47%|████▋ | 4718/10000 [7:25:23<8:08:57, 5.55s/it][2025-06-19 20:55:08,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 20:55:08,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.28 | bwd_microstep: 3328.05 | bwd_inner_microstep: 3327.22 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.86 [2025-06-19 20:55:08,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.28 | bwd: 3328.06 | bwd_inner: 3327.22 | bwd_allreduce: 0.79 | step: 6.86 47%|████▋ | 4719/10000 [7:25:29<8:07:40, 5.54s/it] {'loss': 0.016, 'grad_norm': 1.7587498426437378, 'learning_rate': 2.2782749298294963e-05, 'epoch': 4.72} 47%|████▋ | 4719/10000 [7:25:29<8:07:40, 5.54s/it][2025-06-19 20:55:13,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:55:13,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.86 | bwd_microstep: 3378.03 | bwd_inner_microstep: 3377.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 20:55:13,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.86 | bwd: 3378.04 | bwd_inner: 3377.24 | bwd_allreduce: 0.76 | step: 6.68 47%|████▋ | 4720/10000 [7:25:34<8:08:05, 5.55s/it] {'loss': 0.0037, 'grad_norm': 0.5680421590805054, 'learning_rate': 2.277633464808334e-05, 'epoch': 4.72} 47%|████▋ | 4720/10000 [7:25:34<8:08:05, 5.55s/it][2025-06-19 20:55:19,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:55:19,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.60 | bwd_microstep: 3315.33 | bwd_inner_microstep: 3314.50 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.82 [2025-06-19 20:55:19,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.60 | bwd: 3315.35 | bwd_inner: 3314.50 | bwd_allreduce: 0.79 | step: 6.82 47%|████▋ | 4721/10000 [7:25:40<8:05:37, 5.52s/it] {'loss': 0.0131, 'grad_norm': 0.9861254096031189, 'learning_rate': 2.276991970664709e-05, 'epoch': 4.72} 47%|████▋ | 4721/10000 [7:25:40<8:05:37, 5.52s/it][2025-06-19 20:55:24,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:55:24,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.09 | bwd_microstep: 3317.06 | bwd_inner_microstep: 3316.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.95 [2025-06-19 20:55:24,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.09 | bwd: 3317.07 | bwd_inner: 3316.27 | bwd_allreduce: 0.76 | step: 6.95 47%|████▋ | 4722/10000 [7:25:45<8:04:20, 5.51s/it] {'loss': 
0.0834, 'grad_norm': 7.651373863220215, 'learning_rate': 2.2763504474659114e-05, 'epoch': 4.72} 47%|████▋ | 4722/10000 [7:25:45<8:04:20, 5.51s/it][2025-06-19 20:55:30,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:55:30,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.78 | bwd_microstep: 3326.63 | bwd_inner_microstep: 3325.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-19 20:55:30,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.78 | bwd: 3326.64 | bwd_inner: 3325.84 | bwd_allreduce: 0.76 | step: 6.81 47%|████▋ | 4723/10000 [7:25:51<8:03:31, 5.50s/it] {'loss': 0.0029, 'grad_norm': 0.3330802321434021, 'learning_rate': 2.2757088952792353e-05, 'epoch': 4.72} 47%|████▋ | 4723/10000 [7:25:51<8:03:31, 5.50s/it][2025-06-19 20:55:35,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:55:35,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.71 | bwd_microstep: 3314.05 | bwd_inner_microstep: 3313.13 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.00 [2025-06-19 20:55:35,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.71 | bwd: 3314.07 | bwd_inner: 3313.13 | bwd_allreduce: 0.89 | step: 7.00 47%|████▋ | 4724/10000 [7:25:56<8:02:54, 5.49s/it] {'loss': 0.006, 'grad_norm': 0.7024018168449402, 'learning_rate': 2.2750673141719754e-05, 'epoch': 4.72} 47%|████▋ | 4724/10000 [7:25:56<8:02:54, 5.49s/it][2025-06-19 20:55:41,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:55:41,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.12 | bwd_microstep: 3322.59 | bwd_inner_microstep: 3321.52 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.51 
[2025-06-19 20:55:41,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.12 | bwd: 3322.61 | bwd_inner: 3321.52 | bwd_allreduce: 1.03 | step: 7.52 47%|████▋ | 4725/10000 [7:26:01<8:02:14, 5.49s/it] {'loss': 0.0015, 'grad_norm': 0.0730510726571083, 'learning_rate': 2.274425704211431e-05, 'epoch': 4.72} 47%|████▋ | 4725/10000 [7:26:01<8:02:14, 5.49s/it][2025-06-19 20:55:46,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:55:46,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.33 | bwd_microstep: 3313.05 | bwd_inner_microstep: 3312.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 20:55:46,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.33 | bwd: 3313.06 | bwd_inner: 3312.26 | bwd_allreduce: 0.75 | step: 6.55 47%|████▋ | 4726/10000 [7:26:07<8:01:41, 5.48s/it] {'loss': 0.1628, 'grad_norm': 3.7354063987731934, 'learning_rate': 2.2737840654649034e-05, 'epoch': 4.73} 47%|████▋ | 4726/10000 [7:26:07<8:01:41, 5.48s/it][2025-06-19 20:55:52,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 20:55:52,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.58 | bwd_microstep: 3311.98 | bwd_inner_microstep: 3311.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 20:55:52,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.58 | bwd: 3311.99 | bwd_inner: 3311.19 | bwd_allreduce: 0.76 | step: 6.61 47%|████▋ | 4727/10000 [7:26:12<8:01:02, 5.47s/it] {'loss': 0.09, 'grad_norm': 2.804441452026367, 'learning_rate': 2.2731423979996988e-05, 'epoch': 4.73} 47%|████▋ | 4727/10000 [7:26:12<8:01:02, 5.47s/it][2025-06-19 20:55:57,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.73 [2025-06-19 20:55:57,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.56 | bwd_microstep: 3322.22 | bwd_inner_microstep: 3321.23 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.05 [2025-06-19 20:55:57,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.56 | bwd: 3322.23 | bwd_inner: 3321.23 | bwd_allreduce: 0.95 | step: 7.06 47%|████▋ | 4728/10000 [7:26:18<8:01:18, 5.48s/it] {'loss': 0.0658, 'grad_norm': 3.059149742126465, 'learning_rate': 2.272500701883124e-05, 'epoch': 4.73} 47%|████▋ | 4728/10000 [7:26:18<8:01:18, 5.48s/it][2025-06-19 20:56:03,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.82 [2025-06-19 20:56:03,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.53 | bwd_microstep: 3326.95 | bwd_inner_microstep: 3326.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 20:56:03,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.53 | bwd: 3326.96 | bwd_inner: 3326.15 | bwd_allreduce: 0.78 | step: 6.72 47%|████▋ | 4729/10000 [7:26:23<8:01:03, 5.48s/it] {'loss': 0.0947, 'grad_norm': 13.407383918762207, 'learning_rate': 2.2718589771824898e-05, 'epoch': 4.73} 47%|████▋ | 4729/10000 [7:26:23<8:01:03, 5.48s/it][2025-06-19 20:56:08,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:56:08,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.98 | bwd_microstep: 3372.22 | bwd_inner_microstep: 3371.29 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.00 [2025-06-19 20:56:08,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.99 | bwd: 3372.24 | bwd_inner: 3371.29 | bwd_allreduce: 0.91 | step: 7.00 47%|████▋ | 4730/10000 [7:26:29<8:02:39, 5.50s/it] {'loss': 0.0192, 'grad_norm': 1.9698171615600586, 'learning_rate': 
2.2712172239651106e-05, 'epoch': 4.73} 47%|████▋ | 4730/10000 [7:26:29<8:02:39, 5.50s/it][2025-06-19 20:56:14,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:56:14,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.27 | bwd_microstep: 3319.36 | bwd_inner_microstep: 3318.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 20:56:14,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.27 | bwd: 3319.37 | bwd_inner: 3318.58 | bwd_allreduce: 0.75 | step: 6.59 47%|████▋ | 4731/10000 [7:26:34<8:01:52, 5.49s/it] {'loss': 0.0045, 'grad_norm': 0.2514333426952362, 'learning_rate': 2.270575442298304e-05, 'epoch': 4.73} 47%|████▋ | 4731/10000 [7:26:34<8:01:52, 5.49s/it][2025-06-19 20:56:19,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:56:19,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.91 | bwd_microstep: 3317.21 | bwd_inner_microstep: 3316.39 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.76 [2025-06-19 20:56:19,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.91 | bwd: 3317.23 | bwd_inner: 3316.39 | bwd_allreduce: 0.79 | step: 6.76 47%|████▋ | 4732/10000 [7:26:40<8:01:05, 5.48s/it] {'loss': 0.0011, 'grad_norm': 0.12985992431640625, 'learning_rate': 2.269933632249389e-05, 'epoch': 4.73} 47%|████▋ | 4732/10000 [7:26:40<8:01:05, 5.48s/it][2025-06-19 20:56:25,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:56:25,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.55 | bwd_microstep: 3309.25 | bwd_inner_microstep: 3308.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.03 [2025-06-19 20:56:25,006] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.55 | bwd: 3309.26 | bwd_inner: 3308.45 | bwd_allreduce: 0.77 | step: 7.03 47%|████▋ | 4733/10000 [7:26:45<8:00:32, 5.47s/it] {'loss': 0.0041, 'grad_norm': 0.2863548994064331, 'learning_rate': 2.269291793885689e-05, 'epoch': 4.73} 47%|████▋ | 4733/10000 [7:26:45<8:00:32, 5.47s/it][2025-06-19 20:56:30,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 20:56:30,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.15 | bwd_microstep: 3373.22 | bwd_inner_microstep: 3372.07 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.97 [2025-06-19 20:56:30,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.15 | bwd: 3373.24 | bwd_inner: 3372.07 | bwd_allreduce: 1.11 | step: 7.97 47%|████▋ | 4734/10000 [7:26:51<8:02:37, 5.50s/it] {'loss': 0.0386, 'grad_norm': 3.328376531600952, 'learning_rate': 2.268649927274529e-05, 'epoch': 4.73} 47%|████▋ | 4734/10000 [7:26:51<8:02:37, 5.50s/it][2025-06-19 20:56:36,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:56:36,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.71 | bwd_microstep: 3373.37 | bwd_inner_microstep: 3372.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 20:56:36,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.71 | bwd: 3373.39 | bwd_inner: 3372.57 | bwd_allreduce: 0.77 | step: 6.76 47%|████▋ | 4735/10000 [7:26:56<8:04:07, 5.52s/it] {'loss': 0.1393, 'grad_norm': 5.591927528381348, 'learning_rate': 2.2680080324832394e-05, 'epoch': 4.74} 47%|████▋ | 4735/10000 [7:26:56<8:04:07, 5.52s/it][2025-06-19 20:56:41,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 
20:56:41,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.80 | bwd_microstep: 3316.21 | bwd_inner_microstep: 3315.35 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.77 [2025-06-19 20:56:41,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.80 | bwd: 3316.22 | bwd_inner: 3315.35 | bwd_allreduce: 0.83 | step: 6.77 47%|████▋ | 4736/10000 [7:27:02<8:02:30, 5.50s/it] {'loss': 0.0061, 'grad_norm': 0.4985906183719635, 'learning_rate': 2.2673661095791504e-05, 'epoch': 4.74} 47%|████▋ | 4736/10000 [7:27:02<8:02:30, 5.50s/it][2025-06-19 20:56:47,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:56:47,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.07 | bwd_microstep: 3394.94 | bwd_inner_microstep: 3393.99 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.16 [2025-06-19 20:56:47,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.07 | bwd: 3394.96 | bwd_inner: 3394.00 | bwd_allreduce: 0.92 | step: 7.17 47%|████▋ | 4737/10000 [7:27:07<8:04:07, 5.52s/it] {'loss': 0.0056, 'grad_norm': 0.2971128523349762, 'learning_rate': 2.266724158629598e-05, 'epoch': 4.74} 47%|████▋ | 4737/10000 [7:27:07<8:04:07, 5.52s/it][2025-06-19 20:56:52,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:56:52,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.71 | bwd_microstep: 3317.03 | bwd_inner_microstep: 3316.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 20:56:52,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.71 | bwd: 3317.05 | bwd_inner: 3316.24 | bwd_allreduce: 0.76 | step: 6.77 47%|████▋ | 4738/10000 [7:27:13<8:02:31, 5.50s/it] {'loss': 0.018, 'grad_norm': 1.0635440349578857, 'learning_rate': 2.2660821797019188e-05, 'epoch': 
4.74} 47%|████▋ | 4738/10000 [7:27:13<8:02:31, 5.50s/it][2025-06-19 20:56:58,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 20:56:58,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.20 | bwd_microstep: 3323.07 | bwd_inner_microstep: 3322.16 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.44 [2025-06-19 20:56:58,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.20 | bwd: 3323.08 | bwd_inner: 3322.16 | bwd_allreduce: 0.87 | step: 7.44 47%|████▋ | 4739/10000 [7:27:18<8:01:49, 5.50s/it] {'loss': 0.0151, 'grad_norm': 0.8781974911689758, 'learning_rate': 2.265440172863454e-05, 'epoch': 4.74} 47%|████▋ | 4739/10000 [7:27:18<8:01:49, 5.50s/it][2025-06-19 20:57:03,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 20:57:03,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.17 | bwd_microstep: 3315.85 | bwd_inner_microstep: 3314.93 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.15 [2025-06-19 20:57:03,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.17 | bwd: 3315.87 | bwd_inner: 3314.93 | bwd_allreduce: 0.90 | step: 7.16 47%|████▋ | 4740/10000 [7:27:24<8:00:50, 5.48s/it] {'loss': 0.0657, 'grad_norm': 2.2308237552642822, 'learning_rate': 2.2647981381815468e-05, 'epoch': 4.74} 47%|████▋ | 4740/10000 [7:27:24<8:00:50, 5.48s/it][2025-06-19 20:57:09,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:57:09,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.14 | bwd_microstep: 3312.07 | bwd_inner_microstep: 3311.28 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 20:57:09,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2107.14 | bwd: 3312.08 | bwd_inner: 3311.28 | bwd_allreduce: 0.76 | step: 6.66 47%|████▋ | 4741/10000 [7:27:29<8:00:05, 5.48s/it] {'loss': 0.0237, 'grad_norm': 1.9486303329467773, 'learning_rate': 2.2641560757235435e-05, 'epoch': 4.74} 47%|████▋ | 4741/10000 [7:27:29<8:00:05, 5.48s/it][2025-06-19 20:57:14,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:57:14,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.41 | bwd_microstep: 3365.98 | bwd_inner_microstep: 3365.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 20:57:14,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.41 | bwd: 3366.00 | bwd_inner: 3365.20 | bwd_allreduce: 0.76 | step: 6.62 47%|████▋ | 4742/10000 [7:27:35<8:01:19, 5.49s/it] {'loss': 0.1437, 'grad_norm': 5.118139266967773, 'learning_rate': 2.2635139855567942e-05, 'epoch': 4.74} 47%|████▋ | 4742/10000 [7:27:35<8:01:19, 5.49s/it][2025-06-19 20:57:20,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:57:20,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.76 | bwd_microstep: 3321.04 | bwd_inner_microstep: 3320.14 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.89 [2025-06-19 20:57:20,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.76 | bwd: 3321.06 | bwd_inner: 3320.14 | bwd_allreduce: 0.88 | step: 6.90 47%|████▋ | 4743/10000 [7:27:40<8:00:33, 5.48s/it] {'loss': 0.0109, 'grad_norm': 1.2359623908996582, 'learning_rate': 2.2628718677486514e-05, 'epoch': 4.74} 47%|████▋ | 4743/10000 [7:27:40<8:00:33, 5.48s/it][2025-06-19 20:57:25,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 20:57:25,473] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2107.41 | bwd_microstep: 3322.71 | bwd_inner_microstep: 3321.68 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.94 [2025-06-19 20:57:25,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.41 | bwd: 3322.73 | bwd_inner: 3321.68 | bwd_allreduce: 1.00 | step: 7.95 47%|████▋ | 4744/10000 [7:27:46<8:00:09, 5.48s/it] {'loss': 0.0057, 'grad_norm': 0.9808027744293213, 'learning_rate': 2.262229722366469e-05, 'epoch': 4.74} 47%|████▋ | 4744/10000 [7:27:46<8:00:09, 5.48s/it][2025-06-19 20:57:31,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:57:31,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.38 | bwd_microstep: 3378.70 | bwd_inner_microstep: 3377.78 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.98 [2025-06-19 20:57:31,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.38 | bwd: 3378.72 | bwd_inner: 3377.78 | bwd_allreduce: 0.89 | step: 6.98 47%|████▋ | 4745/10000 [7:27:51<8:01:57, 5.50s/it] {'loss': 0.0699, 'grad_norm': 2.2578163146972656, 'learning_rate': 2.2615875494776068e-05, 'epoch': 4.75} 47%|████▋ | 4745/10000 [7:27:51<8:01:57, 5.50s/it][2025-06-19 20:57:36,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 20:57:36,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.70 | bwd_microstep: 3320.68 | bwd_inner_microstep: 3319.73 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.43 [2025-06-19 20:57:36,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.70 | bwd: 3320.69 | bwd_inner: 3319.73 | bwd_allreduce: 0.91 | step: 7.43 47%|████▋ | 4746/10000 [7:27:57<8:01:04, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.24201491475105286, 'learning_rate': 2.2609453491494238e-05, 'epoch': 4.75} 47%|████▋ | 4746/10000 [7:27:57<8:01:04, 
5.49s/it][2025-06-19 20:57:41,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:57:41,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.86 | bwd_microstep: 3330.85 | bwd_inner_microstep: 3330.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 20:57:41,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.86 | bwd: 3330.87 | bwd_inner: 3330.06 | bwd_allreduce: 0.76 | step: 6.88 47%|████▋ | 4747/10000 [7:28:02<8:00:43, 5.49s/it] {'loss': 0.1007, 'grad_norm': 7.416281223297119, 'learning_rate': 2.2603031214492844e-05, 'epoch': 4.75} 47%|████▋ | 4747/10000 [7:28:02<8:00:43, 5.49s/it][2025-06-19 20:57:47,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:57:47,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.88 | bwd_microstep: 3322.40 | bwd_inner_microstep: 3321.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 20:57:47,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.88 | bwd: 3322.41 | bwd_inner: 3321.62 | bwd_allreduce: 0.75 | step: 6.64 47%|████▋ | 4748/10000 [7:28:08<7:59:59, 5.48s/it] {'loss': 0.0353, 'grad_norm': 3.4331769943237305, 'learning_rate': 2.259660866444556e-05, 'epoch': 4.75} 47%|████▋ | 4748/10000 [7:28:08<7:59:59, 5.48s/it][2025-06-19 20:57:52,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 20:57:52,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.86 | bwd_microstep: 3332.35 | bwd_inner_microstep: 3331.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 20:57:52,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.86 | bwd: 3332.37 | bwd_inner: 3331.57 | 
bwd_allreduce: 0.76 | step: 6.65 47%|████▋ | 4749/10000 [7:28:13<7:59:48, 5.48s/it] {'loss': 0.0085, 'grad_norm': 1.2120493650436401, 'learning_rate': 2.259018584202608e-05, 'epoch': 4.75} 47%|████▋ | 4749/10000 [7:28:13<7:59:48, 5.48s/it][2025-06-19 20:57:58,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:57:58,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.52 | bwd_microstep: 3326.28 | bwd_inner_microstep: 3325.44 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.90 [2025-06-19 20:57:58,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.52 | bwd: 3326.29 | bwd_inner: 3325.44 | bwd_allreduce: 0.81 | step: 6.91 48%|████▊ | 4750/10000 [7:28:19<7:59:30, 5.48s/it] {'loss': 0.0729, 'grad_norm': 3.467040538787842, 'learning_rate': 2.2583762747908132e-05, 'epoch': 4.75} 48%|████▊ | 4750/10000 [7:28:19<7:59:30, 5.48s/it][2025-06-19 20:58:03,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:58:03,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.38 | bwd_microstep: 3318.63 | bwd_inner_microstep: 3317.80 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.77 [2025-06-19 20:58:03,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.38 | bwd: 3318.64 | bwd_inner: 3317.80 | bwd_allreduce: 0.79 | step: 6.79 48%|████▊ | 4751/10000 [7:28:24<7:58:57, 5.47s/it] {'loss': 0.0026, 'grad_norm': 0.2085937112569809, 'learning_rate': 2.2577339382765456e-05, 'epoch': 4.75} 48%|████▊ | 4751/10000 [7:28:24<7:58:57, 5.47s/it][2025-06-19 20:58:09,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:58:09,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.41 | bwd_microstep: 
3385.42 | bwd_inner_microstep: 3384.57 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.90 [2025-06-19 20:58:09,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.41 | bwd: 3385.44 | bwd_inner: 3384.57 | bwd_allreduce: 0.81 | step: 6.90 48%|████▊ | 4752/10000 [7:28:30<8:01:12, 5.50s/it] {'loss': 0.0278, 'grad_norm': 1.5577867031097412, 'learning_rate': 2.257091574727184e-05, 'epoch': 4.75} 48%|████▊ | 4752/10000 [7:28:30<8:01:12, 5.50s/it][2025-06-19 20:58:14,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:58:14,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.19 | bwd_microstep: 3326.69 | bwd_inner_microstep: 3325.70 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.37 [2025-06-19 20:58:14,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.19 | bwd: 3326.71 | bwd_inner: 3325.70 | bwd_allreduce: 0.96 | step: 7.37 48%|████▊ | 4753/10000 [7:28:35<8:00:29, 5.49s/it] {'loss': 0.018, 'grad_norm': 1.5206879377365112, 'learning_rate': 2.2564491842101104e-05, 'epoch': 4.75} 48%|████▊ | 4753/10000 [7:28:35<8:00:29, 5.49s/it][2025-06-19 20:58:20,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:58:20,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.18 | bwd_microstep: 3383.30 | bwd_inner_microstep: 3382.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 20:58:20,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.18 | bwd: 3383.32 | bwd_inner: 3382.51 | bwd_allreduce: 0.76 | step: 6.77 48%|████▊ | 4754/10000 [7:28:41<8:02:04, 5.51s/it] {'loss': 0.1015, 'grad_norm': 4.149294853210449, 'learning_rate': 2.2558067667927065e-05, 'epoch': 4.75} 48%|████▊ | 4754/10000 [7:28:41<8:02:04, 5.51s/it][2025-06-19 20:58:26,025] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:58:26,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.64 | bwd_microstep: 3380.44 | bwd_inner_microstep: 3379.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 20:58:26,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.64 | bwd: 3380.46 | bwd_inner: 3379.65 | bwd_allreduce: 0.76 | step: 6.65 48%|████▊ | 4755/10000 [7:28:46<8:03:04, 5.53s/it] {'loss': 0.0947, 'grad_norm': 4.317079067230225, 'learning_rate': 2.2551643225423606e-05, 'epoch': 4.75} 48%|████▊ | 4755/10000 [7:28:46<8:03:04, 5.53s/it][2025-06-19 20:58:31,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:58:31,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.12 | bwd_microstep: 3386.10 | bwd_inner_microstep: 3385.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 20:58:31,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.12 | bwd: 3386.12 | bwd_inner: 3385.31 | bwd_allreduce: 0.76 | step: 6.65 48%|████▊ | 4756/10000 [7:28:52<8:03:51, 5.54s/it] {'loss': 0.044, 'grad_norm': 3.2484898567199707, 'learning_rate': 2.254521851526461e-05, 'epoch': 4.76} 48%|████▊ | 4756/10000 [7:28:52<8:03:51, 5.54s/it][2025-06-19 20:58:37,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:58:37,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.60 | bwd_microstep: 3341.65 | bwd_inner_microstep: 3340.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 20:58:37,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.60 | bwd: 3341.67 | bwd_inner: 3340.86 | bwd_allreduce: 0.77 | step: 6.74 48%|████▊ | 
4757/10000 [7:28:57<8:02:32, 5.52s/it] {'loss': 0.0461, 'grad_norm': 2.9908761978149414, 'learning_rate': 2.2538793538124005e-05, 'epoch': 4.76} 48%|████▊ | 4757/10000 [7:28:57<8:02:32, 5.52s/it][2025-06-19 20:58:42,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 20:58:42,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.26 | bwd_microstep: 3325.20 | bwd_inner_microstep: 3324.27 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.14 [2025-06-19 20:58:42,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.26 | bwd: 3325.22 | bwd_inner: 3324.27 | bwd_allreduce: 0.91 | step: 7.14 48%|████▊ | 4758/10000 [7:29:03<8:01:04, 5.51s/it] {'loss': 0.0011, 'grad_norm': 0.04648406058549881, 'learning_rate': 2.2532368294675746e-05, 'epoch': 4.76} 48%|████▊ | 4758/10000 [7:29:03<8:01:04, 5.51s/it][2025-06-19 20:58:48,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:58:48,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.04 | bwd_microstep: 3371.08 | bwd_inner_microstep: 3370.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 20:58:48,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.04 | bwd: 3371.10 | bwd_inner: 3370.28 | bwd_allreduce: 0.77 | step: 6.69 48%|████▊ | 4759/10000 [7:29:08<8:02:03, 5.52s/it] {'loss': 0.0025, 'grad_norm': 0.13003933429718018, 'learning_rate': 2.25259427855938e-05, 'epoch': 4.76} 48%|████▊ | 4759/10000 [7:29:08<8:02:03, 5.52s/it][2025-06-19 20:58:53,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:58:53,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.09 | bwd_microstep: 3375.71 | bwd_inner_microstep: 3374.91 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 20:58:53,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.09 | bwd: 3375.73 | bwd_inner: 3374.91 | bwd_allreduce: 0.77 | step: 6.86 48%|████▊ | 4760/10000 [7:29:14<8:02:37, 5.53s/it] {'loss': 0.0598, 'grad_norm': 2.3723065853118896, 'learning_rate': 2.2519517011552184e-05, 'epoch': 4.76} 48%|████▊ | 4760/10000 [7:29:14<8:02:37, 5.53s/it][2025-06-19 20:58:59,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:58:59,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.83 | bwd_microstep: 3330.14 | bwd_inner_microstep: 3329.32 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.02 [2025-06-19 20:58:59,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.83 | bwd: 3330.15 | bwd_inner: 3329.32 | bwd_allreduce: 0.78 | step: 7.03 48%|████▊ | 4761/10000 [7:29:19<8:01:26, 5.51s/it] {'loss': 0.0014, 'grad_norm': 0.1541648805141449, 'learning_rate': 2.2513090973224924e-05, 'epoch': 4.76} 48%|████▊ | 4761/10000 [7:29:19<8:01:26, 5.51s/it][2025-06-19 20:59:04,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:59:04,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.82 | bwd_microstep: 3337.79 | bwd_inner_microstep: 3337.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 20:59:04,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.83 | bwd: 3337.80 | bwd_inner: 3337.00 | bwd_allreduce: 0.76 | step: 6.62 48%|████▊ | 4762/10000 [7:29:25<8:00:44, 5.51s/it] {'loss': 0.0443, 'grad_norm': 3.049783706665039, 'learning_rate': 2.2506664671286087e-05, 'epoch': 4.76} 48%|████▊ | 4762/10000 [7:29:25<8:00:44, 5.51s/it][2025-06-19 20:59:10,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 20:59:10,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.07 | bwd_microstep: 3412.57 | bwd_inner_microstep: 3411.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 20:59:10,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.07 | bwd: 3412.58 | bwd_inner: 3411.78 | bwd_allreduce: 0.76 | step: 6.68 48%|████▊ | 4763/10000 [7:29:31<8:02:55, 5.53s/it] {'loss': 0.0144, 'grad_norm': 0.7169895172119141, 'learning_rate': 2.250023810640976e-05, 'epoch': 4.76} 48%|████▊ | 4763/10000 [7:29:31<8:02:55, 5.53s/it][2025-06-19 20:59:15,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 20:59:15,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.07 | bwd_microstep: 3328.84 | bwd_inner_microstep: 3328.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 20:59:15,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.07 | bwd: 3328.86 | bwd_inner: 3328.04 | bwd_allreduce: 0.77 | step: 7.16 48%|████▊ | 4764/10000 [7:29:36<8:01:30, 5.52s/it] {'loss': 0.0249, 'grad_norm': 3.7146615982055664, 'learning_rate': 2.2493811279270057e-05, 'epoch': 4.76} 48%|████▊ | 4764/10000 [7:29:36<8:01:30, 5.52s/it][2025-06-19 20:59:21,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 20:59:21,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.89 | bwd_microstep: 3331.92 | bwd_inner_microstep: 3330.90 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.34 [2025-06-19 20:59:21,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.89 | bwd: 3331.94 | bwd_inner: 3330.90 | bwd_allreduce: 0.99 | step: 7.34 48%|████▊ | 4765/10000 [7:29:41<8:00:26, 5.51s/it] {'loss': 
0.0098, 'grad_norm': 0.517828106880188, 'learning_rate': 2.248738419054113e-05, 'epoch': 4.76} 48%|████▊ | 4765/10000 [7:29:41<8:00:26, 5.51s/it][2025-06-19 20:59:26,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 20:59:26,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.59 | bwd_microstep: 3372.93 | bwd_inner_microstep: 3372.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 20:59:26,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.59 | bwd: 3372.94 | bwd_inner: 3372.13 | bwd_allreduce: 0.76 | step: 6.69 48%|████▊ | 4766/10000 [7:29:47<8:01:19, 5.52s/it] {'loss': 0.0232, 'grad_norm': 2.220825672149658, 'learning_rate': 2.2480956840897135e-05, 'epoch': 4.77} 48%|████▊ | 4766/10000 [7:29:47<8:01:19, 5.52s/it][2025-06-19 20:59:32,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 20:59:32,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.14 | bwd_microstep: 3378.73 | bwd_inner_microstep: 3377.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 20:59:32,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.14 | bwd: 3378.75 | bwd_inner: 3377.92 | bwd_allreduce: 0.78 | step: 7.24 48%|████▊ | 4767/10000 [7:29:53<8:02:18, 5.53s/it] {'loss': 0.0054, 'grad_norm': 0.5627459287643433, 'learning_rate': 2.247452923101229e-05, 'epoch': 4.77} 48%|████▊ | 4767/10000 [7:29:53<8:02:18, 5.53s/it][2025-06-19 20:59:37,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:59:37,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.86 | bwd_microstep: 3328.58 | bwd_inner_microstep: 3327.75 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.91 
[2025-06-19 20:59:37,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.86 | bwd: 3328.60 | bwd_inner: 3327.75 | bwd_allreduce: 0.80 | step: 6.92 48%|████▊ | 4768/10000 [7:29:58<8:00:56, 5.52s/it] {'loss': 0.0151, 'grad_norm': 1.4290602207183838, 'learning_rate': 2.2468101361560813e-05, 'epoch': 4.77} 48%|████▊ | 4768/10000 [7:29:58<8:00:56, 5.52s/it][2025-06-19 20:59:43,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:59:43,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.62 | bwd_microstep: 3372.47 | bwd_inner_microstep: 3371.63 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.41 [2025-06-19 20:59:43,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.62 | bwd: 3372.48 | bwd_inner: 3371.63 | bwd_allreduce: 0.81 | step: 7.42 48%|████▊ | 4769/10000 [7:30:04<8:01:33, 5.52s/it] {'loss': 0.0261, 'grad_norm': 2.1126904487609863, 'learning_rate': 2.2461673233216958e-05, 'epoch': 4.77} 48%|████▊ | 4769/10000 [7:30:04<8:01:33, 5.52s/it][2025-06-19 20:59:48,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 20:59:48,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.79 | bwd_microstep: 3330.12 | bwd_inner_microstep: 3329.17 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.14 [2025-06-19 20:59:48,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.79 | bwd: 3330.14 | bwd_inner: 3329.17 | bwd_allreduce: 0.92 | step: 7.15 48%|████▊ | 4770/10000 [7:30:09<8:00:29, 5.51s/it] {'loss': 0.0039, 'grad_norm': 0.4259125888347626, 'learning_rate': 2.2455244846655003e-05, 'epoch': 4.77} 48%|████▊ | 4770/10000 [7:30:09<8:00:29, 5.51s/it][2025-06-19 20:59:54,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | 
optimizer_step: 2.73 [2025-06-19 20:59:54,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.16 | bwd_microstep: 3329.74 | bwd_inner_microstep: 3328.75 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.56 [2025-06-19 20:59:54,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.16 | bwd: 3329.76 | bwd_inner: 3328.75 | bwd_allreduce: 0.96 | step: 7.57 48%|████▊ | 4771/10000 [7:30:15<7:59:37, 5.50s/it] {'loss': 0.0421, 'grad_norm': 1.7485255002975464, 'learning_rate': 2.2448816202549262e-05, 'epoch': 4.77} 48%|████▊ | 4771/10000 [7:30:15<7:59:37, 5.50s/it][2025-06-19 20:59:59,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 20:59:59,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.56 | bwd_microstep: 3330.82 | bwd_inner_microstep: 3329.94 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.14 [2025-06-19 20:59:59,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.56 | bwd: 3330.83 | bwd_inner: 3329.94 | bwd_allreduce: 0.85 | step: 7.14 48%|████▊ | 4772/10000 [7:30:20<7:59:17, 5.50s/it] {'loss': 0.0918, 'grad_norm': 3.3337814807891846, 'learning_rate': 2.2442387301574065e-05, 'epoch': 4.77} 48%|████▊ | 4772/10000 [7:30:20<7:59:17, 5.50s/it][2025-06-19 21:00:05,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:00:05,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.61 | bwd_microstep: 3330.77 | bwd_inner_microstep: 3329.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 21:00:05,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.61 | bwd: 3330.78 | bwd_inner: 3329.97 | bwd_allreduce: 0.77 | step: 6.68 48%|████▊ | 4773/10000 [7:30:26<7:58:33, 5.49s/it] {'loss': 0.1306, 'grad_norm': 3.8825509548187256, 
'learning_rate': 2.2435958144403774e-05, 'epoch': 4.77} 48%|████▊ | 4773/10000 [7:30:26<7:58:33, 5.49s/it][2025-06-19 21:00:10,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 21:00:10,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.42 | bwd_microstep: 3327.47 | bwd_inner_microstep: 3326.44 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.75 [2025-06-19 21:00:10,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.42 | bwd: 3327.49 | bwd_inner: 3326.44 | bwd_allreduce: 1.00 | step: 7.75 48%|████▊ | 4774/10000 [7:30:31<7:57:59, 5.49s/it] {'loss': 0.0089, 'grad_norm': 0.5579816699028015, 'learning_rate': 2.2429528731712785e-05, 'epoch': 4.77} 48%|████▊ | 4774/10000 [7:30:31<7:57:59, 5.49s/it][2025-06-19 21:00:16,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:00:16,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.40 | bwd_microstep: 3379.86 | bwd_inner_microstep: 3378.88 | bwd_allreduce_microstep: 0.93 | step_microstep: 6.95 [2025-06-19 21:00:16,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.40 | bwd: 3379.87 | bwd_inner: 3378.88 | bwd_allreduce: 0.95 | step: 6.95 48%|████▊ | 4775/10000 [7:30:37<7:59:33, 5.51s/it] {'loss': 0.0061, 'grad_norm': 0.4716922342777252, 'learning_rate': 2.2423099064175498e-05, 'epoch': 4.78} 48%|████▊ | 4775/10000 [7:30:37<7:59:33, 5.51s/it][2025-06-19 21:00:21,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:00:21,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.07 | bwd_microstep: 3370.47 | bwd_inner_microstep: 3369.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 21:00:21,796] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.07 | bwd: 3370.49 | bwd_inner: 3369.67 | bwd_allreduce: 0.77 | step: 6.97 48%|████▊ | 4776/10000 [7:30:42<8:00:16, 5.52s/it] {'loss': 0.0063, 'grad_norm': 0.3796176612377167, 'learning_rate': 2.241666914246637e-05, 'epoch': 4.78} 48%|████▊ | 4776/10000 [7:30:42<8:00:16, 5.52s/it][2025-06-19 21:00:27,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:00:27,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.83 | bwd_microstep: 3383.85 | bwd_inner_microstep: 3383.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 21:00:27,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.83 | bwd: 3383.87 | bwd_inner: 3383.06 | bwd_allreduce: 0.77 | step: 6.99 48%|████▊ | 4777/10000 [7:30:48<8:01:03, 5.53s/it] {'loss': 0.0611, 'grad_norm': 2.109771728515625, 'learning_rate': 2.2410238967259867e-05, 'epoch': 4.78} 48%|████▊ | 4777/10000 [7:30:48<8:01:03, 5.53s/it][2025-06-19 21:00:32,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:00:32,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2161.03 | bwd_microstep: 3371.64 | bwd_inner_microstep: 3370.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 21:00:32,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2161.03 | bwd: 3371.65 | bwd_inner: 3370.86 | bwd_allreduce: 0.75 | step: 6.59 48%|████▊ | 4778/10000 [7:30:53<8:02:02, 5.54s/it] {'loss': 0.0089, 'grad_norm': 1.4406790733337402, 'learning_rate': 2.2403808539230484e-05, 'epoch': 4.78} 48%|████▊ | 4778/10000 [7:30:53<8:02:02, 5.54s/it][2025-06-19 21:00:38,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
21:00:38,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.53 | bwd_microstep: 3335.57 | bwd_inner_microstep: 3334.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 21:00:38,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.53 | bwd: 3335.59 | bwd_inner: 3334.78 | bwd_allreduce: 0.76 | step: 6.70 48%|████▊ | 4779/10000 [7:30:59<8:00:26, 5.52s/it] {'loss': 0.0955, 'grad_norm': 2.8542251586914062, 'learning_rate': 2.2397377859052738e-05, 'epoch': 4.78} 48%|████▊ | 4779/10000 [7:30:59<8:00:26, 5.52s/it][2025-06-19 21:00:43,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:00:43,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2163.32 | bwd_microstep: 3374.23 | bwd_inner_microstep: 3373.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 21:00:43,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2163.32 | bwd: 3374.25 | bwd_inner: 3373.44 | bwd_allreduce: 0.76 | step: 6.73 48%|████▊ | 4780/10000 [7:31:04<8:01:43, 5.54s/it] {'loss': 0.0122, 'grad_norm': 1.234274983406067, 'learning_rate': 2.239094692740118e-05, 'epoch': 4.78} 48%|████▊ | 4780/10000 [7:31:04<8:01:43, 5.54s/it][2025-06-19 21:00:49,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:00:49,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.36 | bwd_microstep: 3326.48 | bwd_inner_microstep: 3325.59 | bwd_allreduce_microstep: 0.82 | step_microstep: 8.14 [2025-06-19 21:00:49,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.36 | bwd: 3326.50 | bwd_inner: 3325.59 | bwd_allreduce: 0.85 | step: 8.14 48%|████▊ | 4781/10000 [7:31:10<8:00:00, 5.52s/it] {'loss': 0.0021, 'grad_norm': 0.08563416451215744, 'learning_rate': 2.2384515744950393e-05, 'epoch': 
4.78} 48%|████▊ | 4781/10000 [7:31:10<8:00:00, 5.52s/it][2025-06-19 21:00:54,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:00:54,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.20 | bwd_microstep: 3323.01 | bwd_inner_microstep: 3322.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 21:00:54,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.20 | bwd: 3323.03 | bwd_inner: 3322.23 | bwd_allreduce: 0.76 | step: 6.64 48%|████▊ | 4782/10000 [7:31:15<7:59:17, 5.51s/it] {'loss': 0.0141, 'grad_norm': 0.8017041087150574, 'learning_rate': 2.2378084312374964e-05, 'epoch': 4.78} 48%|████▊ | 4782/10000 [7:31:15<7:59:17, 5.51s/it][2025-06-19 21:01:00,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:01:00,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.23 | bwd_microstep: 3327.30 | bwd_inner_microstep: 3326.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-19 21:01:00,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.23 | bwd: 3327.31 | bwd_inner: 3326.51 | bwd_allreduce: 0.76 | step: 7.01 48%|████▊ | 4783/10000 [7:31:21<7:58:09, 5.50s/it] {'loss': 0.0696, 'grad_norm': 3.1244757175445557, 'learning_rate': 2.2371652630349532e-05, 'epoch': 4.78} 48%|████▊ | 4783/10000 [7:31:21<7:58:09, 5.50s/it][2025-06-19 21:01:05,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:01:05,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.38 | bwd_microstep: 3320.28 | bwd_inner_microstep: 3319.41 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.26 [2025-06-19 21:01:05,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2138.38 | bwd: 3320.31 | bwd_inner: 3319.41 | bwd_allreduce: 0.83 | step: 7.26 48%|████▊ | 4784/10000 [7:31:26<7:58:13, 5.50s/it] {'loss': 0.0055, 'grad_norm': 0.9733108878135681, 'learning_rate': 2.236522069954874e-05, 'epoch': 4.78} 48%|████▊ | 4784/10000 [7:31:26<7:58:13, 5.50s/it][2025-06-19 21:01:11,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:01:11,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.62 | bwd_microstep: 3334.16 | bwd_inner_microstep: 3333.27 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.78 [2025-06-19 21:01:11,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.62 | bwd: 3334.19 | bwd_inner: 3333.27 | bwd_allreduce: 0.85 | step: 7.78 48%|████▊ | 4785/10000 [7:31:32<7:58:53, 5.51s/it] {'loss': 0.0121, 'grad_norm': 0.6982332468032837, 'learning_rate': 2.2358788520647277e-05, 'epoch': 4.79} 48%|████▊ | 4785/10000 [7:31:32<7:58:53, 5.51s/it][2025-06-19 21:01:16,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:01:16,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.02 | bwd_microstep: 3331.48 | bwd_inner_microstep: 3330.63 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.24 [2025-06-19 21:01:16,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.02 | bwd: 3331.51 | bwd_inner: 3330.63 | bwd_allreduce: 0.81 | step: 7.24 48%|████▊ | 4786/10000 [7:31:37<7:59:09, 5.51s/it] {'loss': 0.0023, 'grad_norm': 0.134089395403862, 'learning_rate': 2.2352356094319852e-05, 'epoch': 4.79} 48%|████▊ | 4786/10000 [7:31:37<7:59:09, 5.51s/it][2025-06-19 21:01:22,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:01:22,448] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2125.89 | bwd_microstep: 3315.54 | bwd_inner_microstep: 3314.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 21:01:22,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.89 | bwd: 3315.56 | bwd_inner: 3314.73 | bwd_allreduce: 0.77 | step: 6.68 48%|████▊ | 4787/10000 [7:31:43<7:58:13, 5.50s/it] {'loss': 0.0128, 'grad_norm': 1.1950981616973877, 'learning_rate': 2.2345923421241182e-05, 'epoch': 4.79} 48%|████▊ | 4787/10000 [7:31:43<7:58:13, 5.50s/it][2025-06-19 21:01:27,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:01:27,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.21 | bwd_microstep: 3327.92 | bwd_inner_microstep: 3327.06 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.44 [2025-06-19 21:01:27,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.21 | bwd: 3327.94 | bwd_inner: 3327.06 | bwd_allreduce: 0.82 | step: 7.47 48%|████▊ | 4788/10000 [7:31:48<7:57:53, 5.50s/it] {'loss': 0.0339, 'grad_norm': 2.5488014221191406, 'learning_rate': 2.233949050208604e-05, 'epoch': 4.79} 48%|████▊ | 4788/10000 [7:31:48<7:57:53, 5.50s/it][2025-06-19 21:01:33,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:01:33,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2198.13 | bwd_microstep: 3382.08 | bwd_inner_microstep: 3381.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 21:01:33,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2198.13 | bwd: 3382.09 | bwd_inner: 3381.27 | bwd_allreduce: 0.77 | step: 6.72 48%|████▊ | 4789/10000 [7:31:54<8:01:01, 5.54s/it] {'loss': 0.0055, 'grad_norm': 1.7142796516418457, 'learning_rate': 2.23330573375292e-05, 'epoch': 4.79} 48%|████▊ | 4789/10000 [7:31:54<8:01:01, 
5.54s/it][2025-06-19 21:01:39,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:01:39,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.26 | bwd_microstep: 3372.08 | bwd_inner_microstep: 3371.11 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.40 [2025-06-19 21:01:39,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.26 | bwd: 3372.09 | bwd_inner: 3371.11 | bwd_allreduce: 0.93 | step: 7.40 48%|████▊ | 4790/10000 [7:31:59<8:01:24, 5.54s/it] {'loss': 0.0003, 'grad_norm': 0.03902313485741615, 'learning_rate': 2.2326623928245467e-05, 'epoch': 4.79} 48%|████▊ | 4790/10000 [7:31:59<8:01:24, 5.54s/it][2025-06-19 21:01:44,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:01:44,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.32 | bwd_microstep: 3322.63 | bwd_inner_microstep: 3321.84 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 21:01:44,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.32 | bwd: 3322.65 | bwd_inner: 3321.84 | bwd_allreduce: 0.76 | step: 6.85 48%|████▊ | 4791/10000 [7:32:05<8:00:03, 5.53s/it] {'loss': 0.0044, 'grad_norm': 0.2792944610118866, 'learning_rate': 2.232019027490969e-05, 'epoch': 4.79} 48%|████▊ | 4791/10000 [7:32:05<8:00:03, 5.53s/it][2025-06-19 21:01:50,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:01:50,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.39 | bwd_microstep: 3370.19 | bwd_inner_microstep: 3369.21 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.36 [2025-06-19 21:01:50,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.39 | bwd: 3370.20 | bwd_inner: 3369.21 | 
bwd_allreduce: 0.95 | step: 7.36 48%|████▊ | 4792/10000 [7:32:10<8:00:15, 5.53s/it] {'loss': 0.0159, 'grad_norm': 0.9567880630493164, 'learning_rate': 2.2313756378196714e-05, 'epoch': 4.79} 48%|████▊ | 4792/10000 [7:32:10<8:00:15, 5.53s/it][2025-06-19 21:01:55,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 21:01:55,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.87 | bwd_microstep: 3323.73 | bwd_inner_microstep: 3322.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 21:01:55,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.87 | bwd: 3323.74 | bwd_inner: 3322.95 | bwd_allreduce: 0.75 | step: 6.65 48%|████▊ | 4793/10000 [7:32:16<7:58:41, 5.52s/it] {'loss': 0.0079, 'grad_norm': 0.8339968919754028, 'learning_rate': 2.230732223878144e-05, 'epoch': 4.79} 48%|████▊ | 4793/10000 [7:32:16<7:58:41, 5.52s/it][2025-06-19 21:02:01,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 21:02:01,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.50 | bwd_microstep: 3366.67 | bwd_inner_microstep: 3365.65 | bwd_allreduce_microstep: 0.97 | step_microstep: 8.20 [2025-06-19 21:02:01,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.50 | bwd: 3366.68 | bwd_inner: 3365.65 | bwd_allreduce: 0.99 | step: 8.20 48%|████▊ | 4794/10000 [7:32:21<7:58:55, 5.52s/it] {'loss': 0.0369, 'grad_norm': 1.6176533699035645, 'learning_rate': 2.2300887857338766e-05, 'epoch': 4.79} 48%|████▊ | 4794/10000 [7:32:21<7:58:55, 5.52s/it][2025-06-19 21:02:06,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:02:06,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.47 | bwd_microstep: 
3321.77 | bwd_inner_microstep: 3320.87 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.85 [2025-06-19 21:02:06,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.47 | bwd: 3321.78 | bwd_inner: 3320.87 | bwd_allreduce: 0.87 | step: 6.85 48%|████▊ | 4795/10000 [7:32:27<7:57:31, 5.50s/it] {'loss': 0.0513, 'grad_norm': 4.288932800292969, 'learning_rate': 2.229445323454364e-05, 'epoch': 4.79} 48%|████▊ | 4795/10000 [7:32:27<7:57:31, 5.50s/it][2025-06-19 21:02:12,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 21:02:12,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.00 | bwd_microstep: 3333.63 | bwd_inner_microstep: 3332.42 | bwd_allreduce_microstep: 1.14 | step_microstep: 7.54 [2025-06-19 21:02:12,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.00 | bwd: 3333.65 | bwd_inner: 3332.42 | bwd_allreduce: 1.17 | step: 7.54 48%|████▊ | 4796/10000 [7:32:32<7:56:56, 5.50s/it] {'loss': 0.1218, 'grad_norm': 6.3232598304748535, 'learning_rate': 2.228801837107102e-05, 'epoch': 4.8} 48%|████▊ | 4796/10000 [7:32:32<7:56:56, 5.50s/it][2025-06-19 21:02:17,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:02:17,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.49 | bwd_microstep: 3372.52 | bwd_inner_microstep: 3371.66 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.83 [2025-06-19 21:02:17,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.49 | bwd: 3372.54 | bwd_inner: 3371.66 | bwd_allreduce: 0.84 | step: 6.83 48%|████▊ | 4797/10000 [7:32:38<7:58:18, 5.52s/it] {'loss': 0.0199, 'grad_norm': 2.1551167964935303, 'learning_rate': 2.2281583267595886e-05, 'epoch': 4.8} 48%|████▊ | 4797/10000 [7:32:38<7:58:18, 5.52s/it][2025-06-19 21:02:23,219] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:02:23,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.90 | bwd_microstep: 3369.37 | bwd_inner_microstep: 3368.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 21:02:23,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.90 | bwd: 3369.39 | bwd_inner: 3368.58 | bwd_allreduce: 0.76 | step: 6.73 48%|████▊ | 4798/10000 [7:32:44<7:58:53, 5.52s/it] {'loss': 0.1136, 'grad_norm': 3.987699031829834, 'learning_rate': 2.227514792479326e-05, 'epoch': 4.8} 48%|████▊ | 4798/10000 [7:32:44<7:58:53, 5.52s/it][2025-06-19 21:02:28,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:02:28,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.66 | bwd_microstep: 3319.07 | bwd_inner_microstep: 3318.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 21:02:28,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.66 | bwd: 3319.09 | bwd_inner: 3318.27 | bwd_allreduce: 0.77 | step: 7.08 48%|████▊ | 4799/10000 [7:32:49<7:57:21, 5.51s/it] {'loss': 0.0214, 'grad_norm': 2.172675371170044, 'learning_rate': 2.226871234333817e-05, 'epoch': 4.8} 48%|████▊ | 4799/10000 [7:32:49<7:57:21, 5.51s/it][2025-06-19 21:02:34,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 21:02:34,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.95 | bwd_microstep: 3318.45 | bwd_inner_microstep: 3317.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 21:02:34,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.95 | bwd: 3318.47 | bwd_inner: 3317.67 | bwd_allreduce: 0.75 | step: 6.66 48%|████▊ | 
4800/10000 [7:32:54<7:56:06, 5.49s/it] {'loss': 0.0237, 'grad_norm': 2.3017776012420654, 'learning_rate': 2.226227652390569e-05, 'epoch': 4.8} 48%|████▊ | 4800/10000 [7:32:54<7:56:06, 5.49s/it][2025-06-19 21:02:39,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:02:39,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.90 | bwd_microstep: 3370.68 | bwd_inner_microstep: 3369.79 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.78 [2025-06-19 21:02:39,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.90 | bwd: 3370.69 | bwd_inner: 3369.79 | bwd_allreduce: 0.86 | step: 6.78 48%|████▊ | 4801/10000 [7:33:00<7:57:16, 5.51s/it] {'loss': 0.0028, 'grad_norm': 0.23022767901420593, 'learning_rate': 2.22558404671709e-05, 'epoch': 4.8} 48%|████▊ | 4801/10000 [7:33:00<7:57:16, 5.51s/it][2025-06-19 21:02:45,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 21:02:45,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.50 | bwd_microstep: 3365.87 | bwd_inner_microstep: 3365.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-19 21:02:45,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.50 | bwd: 3365.88 | bwd_inner: 3365.08 | bwd_allreduce: 0.76 | step: 6.52 48%|████▊ | 4802/10000 [7:33:06<7:57:52, 5.52s/it] {'loss': 0.0051, 'grad_norm': 0.40231558680534363, 'learning_rate': 2.224940417380891e-05, 'epoch': 4.8} 48%|████▊ | 4802/10000 [7:33:06<7:57:52, 5.52s/it][2025-06-19 21:02:50,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:02:50,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.86 | bwd_microstep: 3322.18 | bwd_inner_microstep: 3321.40 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 21:02:50,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.86 | bwd: 3322.20 | bwd_inner: 3321.40 | bwd_allreduce: 0.76 | step: 6.62 48%|████▊ | 4803/10000 [7:33:11<7:56:33, 5.50s/it] {'loss': 0.0095, 'grad_norm': 0.7599790692329407, 'learning_rate': 2.2242967644494867e-05, 'epoch': 4.8} 48%|████▊ | 4803/10000 [7:33:11<7:56:33, 5.50s/it][2025-06-19 21:02:56,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:02:56,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.74 | bwd_microstep: 3329.48 | bwd_inner_microstep: 3328.51 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.20 [2025-06-19 21:02:56,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.74 | bwd: 3329.50 | bwd_inner: 3328.51 | bwd_allreduce: 0.94 | step: 7.21 48%|████▊ | 4804/10000 [7:33:16<7:55:47, 5.49s/it] {'loss': 0.0061, 'grad_norm': 0.8979577422142029, 'learning_rate': 2.2236530879903936e-05, 'epoch': 4.8} 48%|████▊ | 4804/10000 [7:33:16<7:55:47, 5.49s/it][2025-06-19 21:03:01,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:03:01,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.28 | bwd_microstep: 3370.19 | bwd_inner_microstep: 3369.35 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.84 [2025-06-19 21:03:01,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.28 | bwd: 3370.20 | bwd_inner: 3369.35 | bwd_allreduce: 0.81 | step: 6.84 48%|████▊ | 4805/10000 [7:33:22<7:56:52, 5.51s/it] {'loss': 0.0037, 'grad_norm': 0.4054155647754669, 'learning_rate': 2.2230093880711286e-05, 'epoch': 4.8} 48%|████▊ | 4805/10000 [7:33:22<7:56:52, 5.51s/it][2025-06-19 21:03:07,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:03:07,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.73 | bwd_microstep: 3364.38 | bwd_inner_microstep: 3363.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 21:03:07,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.73 | bwd: 3364.39 | bwd_inner: 3363.60 | bwd_allreduce: 0.75 | step: 6.55 48%|████▊ | 4806/10000 [7:33:28<7:57:33, 5.52s/it] {'loss': 0.0136, 'grad_norm': 0.7638832926750183, 'learning_rate': 2.222365664759214e-05, 'epoch': 4.81} 48%|████▊ | 4806/10000 [7:33:28<7:57:33, 5.52s/it][2025-06-19 21:03:12,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:03:12,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.40 | bwd_microstep: 3321.18 | bwd_inner_microstep: 3320.35 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.77 [2025-06-19 21:03:12,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.40 | bwd: 3321.20 | bwd_inner: 3320.35 | bwd_allreduce: 0.79 | step: 6.77 48%|████▊ | 4807/10000 [7:33:33<7:56:18, 5.50s/it] {'loss': 0.001, 'grad_norm': 0.092927947640419, 'learning_rate': 2.2217219181221733e-05, 'epoch': 4.81} 48%|████▊ | 4807/10000 [7:33:33<7:56:18, 5.50s/it][2025-06-19 21:03:18,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:03:18,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.01 | bwd_microstep: 3315.99 | bwd_inner_microstep: 3315.01 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.11 [2025-06-19 21:03:18,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.01 | bwd: 3316.00 | bwd_inner: 3315.01 | bwd_allreduce: 0.94 | step: 7.12 48%|████▊ | 4808/10000 [7:33:38<7:55:00, 5.49s/it] {'loss': 
0.0097, 'grad_norm': 1.2429279088974, 'learning_rate': 2.221078148227532e-05, 'epoch': 4.81} 48%|████▊ | 4808/10000 [7:33:38<7:55:00, 5.49s/it][2025-06-19 21:03:23,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:03:23,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.70 | bwd_microstep: 3317.96 | bwd_inner_microstep: 3317.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 21:03:23,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.70 | bwd: 3317.98 | bwd_inner: 3317.16 | bwd_allreduce: 0.77 | step: 6.77 48%|████▊ | 4809/10000 [7:33:44<7:54:22, 5.48s/it] {'loss': 0.0021, 'grad_norm': 0.27744370698928833, 'learning_rate': 2.2204343551428194e-05, 'epoch': 4.81} 48%|████▊ | 4809/10000 [7:33:44<7:54:22, 5.48s/it][2025-06-19 21:03:29,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 21:03:29,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.94 | bwd_microstep: 3368.67 | bwd_inner_microstep: 3367.63 | bwd_allreduce_microstep: 0.96 | step_microstep: 8.04 [2025-06-19 21:03:29,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.94 | bwd: 3368.69 | bwd_inner: 3367.63 | bwd_allreduce: 0.99 | step: 8.05 48%|████▊ | 4810/10000 [7:33:49<7:55:38, 5.50s/it] {'loss': 0.2138, 'grad_norm': 5.284377098083496, 'learning_rate': 2.219790538935566e-05, 'epoch': 4.81} 48%|████▊ | 4810/10000 [7:33:49<7:55:38, 5.50s/it][2025-06-19 21:03:34,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:03:34,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.26 | bwd_microstep: 3312.05 | bwd_inner_microstep: 3311.25 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 
[2025-06-19 21:03:34,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.26 | bwd: 3312.07 | bwd_inner: 3311.25 | bwd_allreduce: 0.78 | step: 6.89 48%|████▊ | 4811/10000 [7:33:55<7:54:32, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.21216250956058502, 'learning_rate': 2.2191466996733054e-05, 'epoch': 4.81} 48%|████▊ | 4811/10000 [7:33:55<7:54:32, 5.49s/it][2025-06-19 21:03:40,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:03:40,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.74 | bwd_microstep: 3378.82 | bwd_inner_microstep: 3377.89 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.37 [2025-06-19 21:03:40,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.74 | bwd: 3378.83 | bwd_inner: 3377.89 | bwd_allreduce: 0.90 | step: 7.38 48%|████▊ | 4812/10000 [7:34:00<7:56:04, 5.51s/it] {'loss': 0.0156, 'grad_norm': 1.0680556297302246, 'learning_rate': 2.218502837423573e-05, 'epoch': 4.81} 48%|████▊ | 4812/10000 [7:34:00<7:56:04, 5.51s/it][2025-06-19 21:03:45,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:03:45,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.92 | bwd_microstep: 3308.46 | bwd_inner_microstep: 3307.52 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.06 [2025-06-19 21:03:45,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.93 | bwd: 3308.48 | bwd_inner: 3307.52 | bwd_allreduce: 0.91 | step: 7.06 48%|████▊ | 4813/10000 [7:34:06<7:54:47, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.023572847247123718, 'learning_rate': 2.217858952253907e-05, 'epoch': 4.81} 48%|████▊ | 4813/10000 [7:34:06<7:54:47, 5.49s/it][2025-06-19 21:03:51,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.73 [2025-06-19 21:03:51,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.41 | bwd_microstep: 3319.00 | bwd_inner_microstep: 3318.10 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.00 [2025-06-19 21:03:51,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.41 | bwd: 3319.02 | bwd_inner: 3318.10 | bwd_allreduce: 0.88 | step: 7.01 48%|████▊ | 4814/10000 [7:34:11<7:54:03, 5.48s/it] {'loss': 0.0593, 'grad_norm': 4.477458953857422, 'learning_rate': 2.2172150442318475e-05, 'epoch': 4.81} 48%|████▊ | 4814/10000 [7:34:11<7:54:03, 5.48s/it][2025-06-19 21:03:56,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:03:56,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.75 | bwd_microstep: 3314.62 | bwd_inner_microstep: 3313.62 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 21:03:56,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.75 | bwd: 3314.64 | bwd_inner: 3313.62 | bwd_allreduce: 0.97 | step: 7.13 48%|████▊ | 4815/10000 [7:34:17<7:53:12, 5.48s/it] {'loss': 0.0128, 'grad_norm': 1.342499852180481, 'learning_rate': 2.2165711134249386e-05, 'epoch': 4.81} 48%|████▊ | 4815/10000 [7:34:17<7:53:12, 5.48s/it][2025-06-19 21:04:02,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:04:02,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.89 | bwd_microstep: 3350.34 | bwd_inner_microstep: 3349.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 21:04:02,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.89 | bwd: 3350.35 | bwd_inner: 3349.53 | bwd_allreduce: 0.78 | step: 6.83 48%|████▊ | 4816/10000 [7:34:22<7:54:05, 5.49s/it] {'loss': 0.0009, 'grad_norm': 0.11031406372785568, 'learning_rate': 
2.215927159900725e-05, 'epoch': 4.82} 48%|████▊ | 4816/10000 [7:34:22<7:54:05, 5.49s/it][2025-06-19 21:04:07,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:04:07,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.01 | bwd_microstep: 3319.39 | bwd_inner_microstep: 3318.51 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.95 [2025-06-19 21:04:07,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.01 | bwd: 3319.41 | bwd_inner: 3318.51 | bwd_allreduce: 0.84 | step: 6.94 48%|████▊ | 4817/10000 [7:34:28<7:53:28, 5.48s/it] {'loss': 0.0232, 'grad_norm': 1.7623417377471924, 'learning_rate': 2.2152831837267542e-05, 'epoch': 4.82} 48%|████▊ | 4817/10000 [7:34:28<7:53:28, 5.48s/it][2025-06-19 21:04:13,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:04:13,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.01 | bwd_microstep: 3311.40 | bwd_inner_microstep: 3310.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 21:04:13,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.01 | bwd: 3311.41 | bwd_inner: 3310.61 | bwd_allreduce: 0.75 | step: 6.59 48%|████▊ | 4818/10000 [7:34:33<7:52:59, 5.48s/it] {'loss': 0.0746, 'grad_norm': 3.3869264125823975, 'learning_rate': 2.2146391849705764e-05, 'epoch': 4.82} 48%|████▊ | 4818/10000 [7:34:33<7:52:59, 5.48s/it][2025-06-19 21:04:18,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:04:18,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.43 | bwd_microstep: 3362.51 | bwd_inner_microstep: 3361.65 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.30 [2025-06-19 21:04:18,538] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.43 | bwd: 3362.54 | bwd_inner: 3361.65 | bwd_allreduce: 0.83 | step: 7.31 48%|████▊ | 4819/10000 [7:34:39<7:54:00, 5.49s/it] {'loss': 0.0032, 'grad_norm': 0.2455255091190338, 'learning_rate': 2.2139951636997447e-05, 'epoch': 4.82} 48%|████▊ | 4819/10000 [7:34:39<7:54:00, 5.49s/it][2025-06-19 21:04:24,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:04:24,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.47 | bwd_microstep: 3371.86 | bwd_inner_microstep: 3371.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 21:04:24,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.47 | bwd: 3371.88 | bwd_inner: 3371.06 | bwd_allreduce: 0.77 | step: 6.91 48%|████▊ | 4820/10000 [7:34:44<7:55:14, 5.50s/it] {'loss': 0.0362, 'grad_norm': 3.1104254722595215, 'learning_rate': 2.213351119981813e-05, 'epoch': 4.82} 48%|████▊ | 4820/10000 [7:34:44<7:55:14, 5.50s/it][2025-06-19 21:04:29,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 21:04:29,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.39 | bwd_microstep: 3322.82 | bwd_inner_microstep: 3321.95 | bwd_allreduce_microstep: 0.82 | step_microstep: 8.57 [2025-06-19 21:04:29,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.39 | bwd: 3322.83 | bwd_inner: 3321.95 | bwd_allreduce: 0.84 | step: 8.58 48%|████▊ | 4821/10000 [7:34:50<7:54:16, 5.49s/it] {'loss': 0.065, 'grad_norm': 4.292520046234131, 'learning_rate': 2.21270705388434e-05, 'epoch': 4.82} 48%|████▊ | 4821/10000 [7:34:50<7:54:16, 5.49s/it][2025-06-19 21:04:35,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:04:35,008] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.79 | bwd_microstep: 3310.01 | bwd_inner_microstep: 3309.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 21:04:35,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.79 | bwd: 3310.03 | bwd_inner: 3309.21 | bwd_allreduce: 0.77 | step: 6.82 48%|████▊ | 4822/10000 [7:34:55<7:53:12, 5.48s/it] {'loss': 0.0109, 'grad_norm': 0.7134408354759216, 'learning_rate': 2.2120629654748834e-05, 'epoch': 4.82} 48%|████▊ | 4822/10000 [7:34:55<7:53:12, 5.48s/it][2025-06-19 21:04:40,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 21:04:40,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.22 | bwd_microstep: 3323.39 | bwd_inner_microstep: 3322.51 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.36 [2025-06-19 21:04:40,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.22 | bwd: 3323.41 | bwd_inner: 3322.51 | bwd_allreduce: 0.86 | step: 7.37 48%|████▊ | 4823/10000 [7:35:01<7:52:38, 5.48s/it] {'loss': 0.0112, 'grad_norm': 1.0446516275405884, 'learning_rate': 2.2114188548210057e-05, 'epoch': 4.82} 48%|████▊ | 4823/10000 [7:35:01<7:52:38, 5.48s/it][2025-06-19 21:04:45,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:04:45,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.67 | bwd_microstep: 3311.44 | bwd_inner_microstep: 3310.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 21:04:45,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.67 | bwd: 3311.46 | bwd_inner: 3310.65 | bwd_allreduce: 0.77 | step: 6.77 48%|████▊ | 4824/10000 [7:35:06<7:51:55, 5.47s/it] {'loss': 0.0542, 'grad_norm': 3.9425439834594727, 'learning_rate': 2.2107747219902723e-05, 'epoch': 4.82} 
48%|████▊ | 4824/10000 [7:35:06<7:51:55, 5.47s/it][2025-06-19 21:04:51,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:04:51,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.89 | bwd_microstep: 3312.45 | bwd_inner_microstep: 3311.56 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.32 [2025-06-19 21:04:51,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.89 | bwd: 3312.48 | bwd_inner: 3311.56 | bwd_allreduce: 0.85 | step: 7.32 48%|████▊ | 4825/10000 [7:35:12<7:51:25, 5.47s/it] {'loss': 0.0286, 'grad_norm': 3.0937387943267822, 'learning_rate': 2.210130567050248e-05, 'epoch': 4.83} 48%|████▊ | 4825/10000 [7:35:12<7:51:25, 5.47s/it][2025-06-19 21:04:56,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:04:56,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2155.05 | bwd_microstep: 3365.51 | bwd_inner_microstep: 3364.60 | bwd_allreduce_microstep: 0.83 | step_microstep: 8.17 [2025-06-19 21:04:56,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2155.05 | bwd: 3365.54 | bwd_inner: 3364.60 | bwd_allreduce: 0.86 | step: 8.18 48%|████▊ | 4826/10000 [7:35:17<7:54:00, 5.50s/it] {'loss': 0.0034, 'grad_norm': 0.5003339648246765, 'learning_rate': 2.2094863900685034e-05, 'epoch': 4.83} 48%|████▊ | 4826/10000 [7:35:17<7:54:00, 5.50s/it][2025-06-19 21:05:02,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:05:02,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2179.78 | bwd_microstep: 3369.62 | bwd_inner_microstep: 3368.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 21:05:02,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2179.78 | 
bwd: 3369.64 | bwd_inner: 3368.84 | bwd_allreduce: 0.76 | step: 6.65 48%|████▊ | 4827/10000 [7:35:23<7:56:23, 5.53s/it] {'loss': 0.0028, 'grad_norm': 0.21312442421913147, 'learning_rate': 2.2088421911126075e-05, 'epoch': 4.83} 48%|████▊ | 4827/10000 [7:35:23<7:56:23, 5.53s/it][2025-06-19 21:05:08,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:05:08,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.74 | bwd_microstep: 3358.10 | bwd_inner_microstep: 3357.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 21:05:08,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.74 | bwd: 3358.11 | bwd_inner: 3357.31 | bwd_allreduce: 0.76 | step: 6.68 48%|████▊ | 4828/10000 [7:35:28<7:56:00, 5.52s/it] {'loss': 0.0013, 'grad_norm': 0.1318712681531906, 'learning_rate': 2.2081979702501356e-05, 'epoch': 4.83} 48%|████▊ | 4828/10000 [7:35:28<7:56:00, 5.52s/it][2025-06-19 21:05:13,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:05:13,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.37 | bwd_microstep: 3314.68 | bwd_inner_microstep: 3313.89 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 21:05:13,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.37 | bwd: 3314.69 | bwd_inner: 3313.89 | bwd_allreduce: 0.76 | step: 6.68 48%|████▊ | 4829/10000 [7:35:34<7:54:13, 5.50s/it] {'loss': 0.0079, 'grad_norm': 0.3350273370742798, 'learning_rate': 2.207553727548663e-05, 'epoch': 4.83} 48%|████▊ | 4829/10000 [7:35:34<7:54:13, 5.50s/it][2025-06-19 21:05:18,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:05:18,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2099.54 | bwd_microstep: 3310.29 | bwd_inner_microstep: 3309.48 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 21:05:18,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.54 | bwd: 3310.30 | bwd_inner: 3309.48 | bwd_allreduce: 0.78 | step: 7.22 48%|████▊ | 4830/10000 [7:35:39<7:52:43, 5.49s/it] {'loss': 0.0043, 'grad_norm': 0.27625522017478943, 'learning_rate': 2.2069094630757676e-05, 'epoch': 4.83} 48%|████▊ | 4830/10000 [7:35:39<7:52:43, 5.49s/it][2025-06-19 21:05:24,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:05:24,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.99 | bwd_microstep: 3313.35 | bwd_inner_microstep: 3312.40 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.06 [2025-06-19 21:05:24,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.99 | bwd: 3313.37 | bwd_inner: 3312.40 | bwd_allreduce: 0.92 | step: 7.07 48%|████▊ | 4831/10000 [7:35:45<7:51:56, 5.48s/it] {'loss': 0.0033, 'grad_norm': 0.34914836287498474, 'learning_rate': 2.2062651768990304e-05, 'epoch': 4.83} 48%|████▊ | 4831/10000 [7:35:45<7:51:56, 5.48s/it][2025-06-19 21:05:29,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:05:29,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.12 | bwd_microstep: 3313.87 | bwd_inner_microstep: 3313.05 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.07 [2025-06-19 21:05:29,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.12 | bwd: 3313.88 | bwd_inner: 3313.05 | bwd_allreduce: 0.79 | step: 7.07 48%|████▊ | 4832/10000 [7:35:50<7:51:51, 5.48s/it] {'loss': 0.0905, 'grad_norm': 2.688633680343628, 'learning_rate': 2.2056208690860322e-05, 'epoch': 4.83} 48%|████▊ | 4832/10000 [7:35:50<7:51:51, 
5.48s/it][2025-06-19 21:05:35,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:05:35,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.65 | bwd_microstep: 3314.30 | bwd_inner_microstep: 3313.49 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 21:05:35,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.65 | bwd: 3314.32 | bwd_inner: 3313.49 | bwd_allreduce: 0.78 | step: 7.08 48%|████▊ | 4833/10000 [7:35:56<7:51:14, 5.47s/it] {'loss': 0.0425, 'grad_norm': 2.4248201847076416, 'learning_rate': 2.20497653970436e-05, 'epoch': 4.83} 48%|████▊ | 4833/10000 [7:35:56<7:51:14, 5.47s/it][2025-06-19 21:05:40,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:05:40,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.69 | bwd_microstep: 3305.57 | bwd_inner_microstep: 3304.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 21:05:40,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.69 | bwd: 3305.59 | bwd_inner: 3304.78 | bwd_allreduce: 0.76 | step: 6.87 48%|████▊ | 4834/10000 [7:36:01<7:50:26, 5.46s/it] {'loss': 0.0013, 'grad_norm': 0.09899941086769104, 'learning_rate': 2.2043321888216e-05, 'epoch': 4.83} 48%|████▊ | 4834/10000 [7:36:01<7:50:26, 5.46s/it][2025-06-19 21:05:46,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 21:05:46,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.40 | bwd_microstep: 3365.17 | bwd_inner_microstep: 3364.09 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.12 [2025-06-19 21:05:46,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.40 | bwd: 3365.20 | bwd_inner: 3364.09 | 
bwd_allreduce: 1.03 | step: 8.13 48%|████▊ | 4835/10000 [7:36:07<7:52:05, 5.48s/it] {'loss': 0.0041, 'grad_norm': 0.756037712097168, 'learning_rate': 2.203687816505342e-05, 'epoch': 4.83} 48%|████▊ | 4835/10000 [7:36:07<7:52:05, 5.48s/it][2025-06-19 21:05:51,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:05:51,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.43 | bwd_microstep: 3318.75 | bwd_inner_microstep: 3317.84 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.80 [2025-06-19 21:05:51,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.43 | bwd: 3318.78 | bwd_inner: 3317.84 | bwd_allreduce: 0.86 | step: 7.81 48%|████▊ | 4836/10000 [7:36:12<7:52:58, 5.50s/it] {'loss': 0.0123, 'grad_norm': 1.6962499618530273, 'learning_rate': 2.203043422823177e-05, 'epoch': 4.84} 48%|████▊ | 4836/10000 [7:36:12<7:52:58, 5.50s/it][2025-06-19 21:05:57,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 21:05:57,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.44 | bwd_microstep: 3310.59 | bwd_inner_microstep: 3309.67 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.18 [2025-06-19 21:05:57,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.44 | bwd: 3310.61 | bwd_inner: 3309.67 | bwd_allreduce: 0.87 | step: 8.19 48%|████▊ | 4837/10000 [7:36:18<7:52:54, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.030511129647493362, 'learning_rate': 2.2023990078426988e-05, 'epoch': 4.84} 48%|████▊ | 4837/10000 [7:36:18<7:52:54, 5.50s/it][2025-06-19 21:06:02,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:06:02,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.23 | bwd_microstep: 
3322.02 | bwd_inner_microstep: 3321.14 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.36 [2025-06-19 21:06:02,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.23 | bwd: 3322.05 | bwd_inner: 3321.14 | bwd_allreduce: 0.84 | step: 7.36 48%|████▊ | 4838/10000 [7:36:23<7:53:12, 5.50s/it] {'loss': 0.0043, 'grad_norm': 0.2752573490142822, 'learning_rate': 2.201754571631505e-05, 'epoch': 4.84} 48%|████▊ | 4838/10000 [7:36:23<7:53:12, 5.50s/it][2025-06-19 21:06:08,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-19 21:06:08,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.65 | bwd_microstep: 3307.23 | bwd_inner_microstep: 3305.96 | bwd_allreduce_microstep: 1.18 | step_microstep: 7.95 [2025-06-19 21:06:08,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.65 | bwd: 3307.26 | bwd_inner: 3305.96 | bwd_allreduce: 1.22 | step: 7.94 48%|████▊ | 4839/10000 [7:36:29<7:52:56, 5.50s/it] {'loss': 0.0052, 'grad_norm': 0.2983705699443817, 'learning_rate': 2.2011101142571928e-05, 'epoch': 4.84} 48%|████▊ | 4839/10000 [7:36:29<7:52:56, 5.50s/it][2025-06-19 21:06:13,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:06:13,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.08 | bwd_microstep: 3308.77 | bwd_inner_microstep: 3307.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 21:06:13,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.08 | bwd: 3308.79 | bwd_inner: 3307.98 | bwd_allreduce: 0.76 | step: 6.88 48%|████▊ | 4840/10000 [7:36:34<7:52:12, 5.49s/it] {'loss': 0.0085, 'grad_norm': 0.48888054490089417, 'learning_rate': 2.2004656357873625e-05, 'epoch': 4.84} 48%|████▊ | 4840/10000 [7:36:34<7:52:12, 5.49s/it][2025-06-19 21:06:19,304] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 3.05 [2025-06-19 21:06:19,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.72 | bwd_microstep: 3318.39 | bwd_inner_microstep: 3317.42 | bwd_allreduce_microstep: 0.91 | step_microstep: 8.31 [2025-06-19 21:06:19,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.72 | bwd: 3318.41 | bwd_inner: 3317.42 | bwd_allreduce: 0.94 | step: 8.33 48%|████▊ | 4841/10000 [7:36:40<7:51:53, 5.49s/it] {'loss': 0.0336, 'grad_norm': 5.2579216957092285, 'learning_rate': 2.1998211362896174e-05, 'epoch': 4.84} 48%|████▊ | 4841/10000 [7:36:40<7:51:53, 5.49s/it][2025-06-19 21:06:24,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:06:24,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.13 | bwd_microstep: 3368.79 | bwd_inner_microstep: 3367.93 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.23 [2025-06-19 21:06:24,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.13 | bwd: 3368.82 | bwd_inner: 3367.93 | bwd_allreduce: 0.82 | step: 7.23 48%|████▊ | 4842/10000 [7:36:45<7:53:44, 5.51s/it] {'loss': 0.0033, 'grad_norm': 0.20908544957637787, 'learning_rate': 2.1991766158315634e-05, 'epoch': 4.84} 48%|████▊ | 4842/10000 [7:36:45<7:53:44, 5.51s/it][2025-06-19 21:06:30,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:06:30,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.25 | bwd_microstep: 3367.96 | bwd_inner_microstep: 3367.08 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.82 [2025-06-19 21:06:30,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.25 | bwd: 3367.99 | bwd_inner: 3367.08 | bwd_allreduce: 0.84 | step: 6.81 48%|████▊ | 
4843/10000 [7:36:51<7:54:29, 5.52s/it] {'loss': 0.0202, 'grad_norm': 1.630983829498291, 'learning_rate': 2.198532074480806e-05, 'epoch': 4.84} 48%|████▊ | 4843/10000 [7:36:51<7:54:29, 5.52s/it][2025-06-19 21:06:35,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:06:35,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.67 | bwd_microstep: 3322.42 | bwd_inner_microstep: 3321.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 21:06:35,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.67 | bwd: 3322.43 | bwd_inner: 3321.63 | bwd_allreduce: 0.76 | step: 6.65 48%|████▊ | 4844/10000 [7:36:56<7:52:57, 5.50s/it] {'loss': 0.0038, 'grad_norm': 0.3594643175601959, 'learning_rate': 2.1978875123049553e-05, 'epoch': 4.84} 48%|████▊ | 4844/10000 [7:36:56<7:52:57, 5.50s/it][2025-06-19 21:06:41,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:06:41,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.96 | bwd_microstep: 3323.43 | bwd_inner_microstep: 3322.61 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 21:06:41,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.96 | bwd: 3323.44 | bwd_inner: 3322.61 | bwd_allreduce: 0.78 | step: 7.16 48%|████▊ | 4845/10000 [7:37:02<7:51:49, 5.49s/it] {'loss': 0.0345, 'grad_norm': 3.43672776222229, 'learning_rate': 2.197242929371623e-05, 'epoch': 4.84} 48%|████▊ | 4845/10000 [7:37:02<7:51:49, 5.49s/it][2025-06-19 21:06:46,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:06:46,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.97 | bwd_microstep: 3377.72 | bwd_inner_microstep: 3376.93 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 21:06:46,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.97 | bwd: 3377.73 | bwd_inner: 3376.93 | bwd_allreduce: 0.76 | step: 6.72 48%|████▊ | 4846/10000 [7:37:07<7:53:23, 5.51s/it] {'loss': 0.0034, 'grad_norm': 0.36964187026023865, 'learning_rate': 2.1965983257484233e-05, 'epoch': 4.85} 48%|████▊ | 4846/10000 [7:37:07<7:53:23, 5.51s/it][2025-06-19 21:06:52,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:06:52,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.90 | bwd_microstep: 3321.28 | bwd_inner_microstep: 3320.39 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.72 [2025-06-19 21:06:52,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.90 | bwd: 3321.30 | bwd_inner: 3320.40 | bwd_allreduce: 0.84 | step: 6.72 48%|████▊ | 4847/10000 [7:37:13<7:52:09, 5.50s/it] {'loss': 0.0028, 'grad_norm': 0.23989225924015045, 'learning_rate': 2.195953701502972e-05, 'epoch': 4.85} 48%|████▊ | 4847/10000 [7:37:13<7:52:09, 5.50s/it][2025-06-19 21:06:57,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:06:57,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.63 | bwd_microstep: 3369.81 | bwd_inner_microstep: 3368.99 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.79 [2025-06-19 21:06:57,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.63 | bwd: 3369.82 | bwd_inner: 3368.99 | bwd_allreduce: 0.79 | step: 6.79 48%|████▊ | 4848/10000 [7:37:18<7:53:31, 5.51s/it] {'loss': 0.1133, 'grad_norm': 2.2938973903656006, 'learning_rate': 2.195309056702886e-05, 'epoch': 4.85} 48%|████▊ | 4848/10000 [7:37:18<7:53:31, 5.51s/it][2025-06-19 21:07:03,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:07:03,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.19 | bwd_microstep: 3371.13 | bwd_inner_microstep: 3370.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 21:07:03,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.19 | bwd: 3371.14 | bwd_inner: 3370.34 | bwd_allreduce: 0.76 | step: 6.68 48%|████▊ | 4849/10000 [7:37:24<7:53:57, 5.52s/it] {'loss': 0.0768, 'grad_norm': 4.061734676361084, 'learning_rate': 2.194664391415787e-05, 'epoch': 4.85} 48%|████▊ | 4849/10000 [7:37:24<7:53:57, 5.52s/it][2025-06-19 21:07:08,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:07:08,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.87 | bwd_microstep: 3323.28 | bwd_inner_microstep: 3322.48 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 21:07:08,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.88 | bwd: 3323.30 | bwd_inner: 3322.48 | bwd_allreduce: 0.78 | step: 6.75 48%|████▊ | 4850/10000 [7:37:29<7:52:45, 5.51s/it] {'loss': 0.002, 'grad_norm': 0.2037533074617386, 'learning_rate': 2.1940197057092964e-05, 'epoch': 4.85} 48%|████▊ | 4850/10000 [7:37:29<7:52:45, 5.51s/it][2025-06-19 21:07:14,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:07:14,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.86 | bwd_microstep: 3372.49 | bwd_inner_microstep: 3371.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 21:07:14,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.87 | bwd: 3372.50 | bwd_inner: 3371.69 | bwd_allreduce: 0.77 | step: 6.94 49%|████▊ | 4851/10000 [7:37:35<7:53:41, 5.52s/it] {'loss': 
0.038, 'grad_norm': 1.95197331905365, 'learning_rate': 2.1933749996510394e-05, 'epoch': 4.85} 49%|████▊ | 4851/10000 [7:37:35<7:53:41, 5.52s/it][2025-06-19 21:07:20,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:07:20,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.03 | bwd_microstep: 3376.44 | bwd_inner_microstep: 3375.53 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.84 [2025-06-19 21:07:20,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.03 | bwd: 3376.46 | bwd_inner: 3375.53 | bwd_allreduce: 0.88 | step: 6.84 49%|████▊ | 4852/10000 [7:37:40<7:54:11, 5.53s/it] {'loss': 0.1773, 'grad_norm': 5.406773567199707, 'learning_rate': 2.1927302733086428e-05, 'epoch': 4.85} 49%|████▊ | 4852/10000 [7:37:40<7:54:11, 5.53s/it][2025-06-19 21:07:25,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 21:07:25,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.10 | bwd_microstep: 3378.37 | bwd_inner_microstep: 3377.49 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.44 [2025-06-19 21:07:25,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.10 | bwd: 3378.40 | bwd_inner: 3377.49 | bwd_allreduce: 0.84 | step: 7.44 49%|████▊ | 4853/10000 [7:37:46<7:55:16, 5.54s/it] {'loss': 0.0079, 'grad_norm': 0.9210001230239868, 'learning_rate': 2.1920855267497344e-05, 'epoch': 4.85} 49%|████▊ | 4853/10000 [7:37:46<7:55:16, 5.54s/it][2025-06-19 21:07:31,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:07:31,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.14 | bwd_microstep: 3326.12 | bwd_inner_microstep: 3325.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 
[2025-06-19 21:07:31,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.14 | bwd: 3326.13 | bwd_inner: 3325.33 | bwd_allreduce: 0.76 | step: 6.65 49%|████▊ | 4854/10000 [7:37:51<7:53:54, 5.53s/it] {'loss': 0.0452, 'grad_norm': 3.091329336166382, 'learning_rate': 2.1914407600419458e-05, 'epoch': 4.85} 49%|████▊ | 4854/10000 [7:37:51<7:53:54, 5.53s/it][2025-06-19 21:07:36,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:07:36,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.27 | bwd_microstep: 3366.57 | bwd_inner_microstep: 3365.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 21:07:36,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.27 | bwd: 3366.59 | bwd_inner: 3365.78 | bwd_allreduce: 0.76 | step: 6.63 49%|████▊ | 4855/10000 [7:37:57<7:53:58, 5.53s/it] {'loss': 0.058, 'grad_norm': 3.9383604526519775, 'learning_rate': 2.1907959732529098e-05, 'epoch': 4.86} 49%|████▊ | 4855/10000 [7:37:57<7:53:58, 5.53s/it][2025-06-19 21:07:42,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:07:42,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.54 | bwd_microstep: 3371.74 | bwd_inner_microstep: 3370.93 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.75 [2025-06-19 21:07:42,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.54 | bwd: 3371.75 | bwd_inner: 3370.93 | bwd_allreduce: 0.77 | step: 6.75 49%|████▊ | 4856/10000 [7:38:02<7:54:29, 5.53s/it] {'loss': 0.0055, 'grad_norm': 0.49075847864151, 'learning_rate': 2.190151166450262e-05, 'epoch': 4.86} 49%|████▊ | 4856/10000 [7:38:02<7:54:29, 5.53s/it][2025-06-19 21:07:47,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.73 [2025-06-19 21:07:47,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.65 | bwd_microstep: 3325.49 | bwd_inner_microstep: 3324.36 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.02 [2025-06-19 21:07:47,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.65 | bwd: 3325.50 | bwd_inner: 3324.36 | bwd_allreduce: 1.10 | step: 7.02 49%|████▊ | 4857/10000 [7:38:08<7:52:57, 5.52s/it] {'loss': 0.107, 'grad_norm': 2.852187395095825, 'learning_rate': 2.1895063397016396e-05, 'epoch': 4.86} 49%|████▊ | 4857/10000 [7:38:08<7:52:57, 5.52s/it][2025-06-19 21:07:53,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:07:53,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.34 | bwd_microstep: 3324.14 | bwd_inner_microstep: 3323.33 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-19 21:07:53,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.34 | bwd: 3324.15 | bwd_inner: 3323.33 | bwd_allreduce: 0.78 | step: 6.83 49%|████▊ | 4858/10000 [7:38:13<7:51:37, 5.50s/it] {'loss': 0.0293, 'grad_norm': 1.482893705368042, 'learning_rate': 2.188861493074681e-05, 'epoch': 4.86} 49%|████▊ | 4858/10000 [7:38:13<7:51:37, 5.50s/it][2025-06-19 21:07:58,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:07:58,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.58 | bwd_microstep: 3331.05 | bwd_inner_microstep: 3330.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 21:07:58,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.58 | bwd: 3331.08 | bwd_inner: 3330.26 | bwd_allreduce: 0.76 | step: 6.70 49%|████▊ | 4859/10000 [7:38:19<7:50:52, 5.50s/it] {'loss': 0.0073, 'grad_norm': 0.45566341280937195, 'learning_rate': 
2.1882166266370292e-05, 'epoch': 4.86} 49%|████▊ | 4859/10000 [7:38:19<7:50:52, 5.50s/it][2025-06-19 21:08:04,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:08:04,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.72 | bwd_microstep: 3375.38 | bwd_inner_microstep: 3374.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 21:08:04,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.72 | bwd: 3375.40 | bwd_inner: 3374.58 | bwd_allreduce: 0.77 | step: 6.87 49%|████▊ | 4860/10000 [7:38:24<7:52:01, 5.51s/it] {'loss': 0.0015, 'grad_norm': 0.14897100627422333, 'learning_rate': 2.1875717404563262e-05, 'epoch': 4.86} 49%|████▊ | 4860/10000 [7:38:24<7:52:01, 5.51s/it][2025-06-19 21:08:09,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:08:09,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.61 | bwd_microstep: 3328.84 | bwd_inner_microstep: 3327.99 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.00 [2025-06-19 21:08:09,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.61 | bwd: 3328.85 | bwd_inner: 3327.99 | bwd_allreduce: 0.82 | step: 7.00 49%|████▊ | 4861/10000 [7:38:30<7:51:01, 5.50s/it] {'loss': 0.0122, 'grad_norm': 0.6367447376251221, 'learning_rate': 2.186926834600218e-05, 'epoch': 4.86} 49%|████▊ | 4861/10000 [7:38:30<7:51:01, 5.50s/it][2025-06-19 21:08:15,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:08:15,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.85 | bwd_microstep: 3325.12 | bwd_inner_microstep: 3324.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.11 [2025-06-19 21:08:15,095] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.85 | bwd: 3325.14 | bwd_inner: 3324.33 | bwd_allreduce: 0.76 | step: 7.12 49%|████▊ | 4862/10000 [7:38:35<7:50:23, 5.49s/it] {'loss': 0.181, 'grad_norm': 3.7397079467773438, 'learning_rate': 2.1862819091363527e-05, 'epoch': 4.86} 49%|████▊ | 4862/10000 [7:38:35<7:50:23, 5.49s/it][2025-06-19 21:08:20,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:08:20,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.50 | bwd_microstep: 3322.97 | bwd_inner_microstep: 3322.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 21:08:20,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.50 | bwd: 3322.99 | bwd_inner: 3322.18 | bwd_allreduce: 0.76 | step: 6.81 49%|████▊ | 4863/10000 [7:38:41<7:49:37, 5.49s/it] {'loss': 0.0351, 'grad_norm': 3.7457363605499268, 'learning_rate': 2.1856369641323794e-05, 'epoch': 4.86} 49%|████▊ | 4863/10000 [7:38:41<7:49:37, 5.49s/it][2025-06-19 21:08:26,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:08:26,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.19 | bwd_microstep: 3375.20 | bwd_inner_microstep: 3374.39 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 21:08:26,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.19 | bwd: 3375.21 | bwd_inner: 3374.39 | bwd_allreduce: 0.78 | step: 6.80 49%|████▊ | 4864/10000 [7:38:46<7:51:03, 5.50s/it] {'loss': 0.005, 'grad_norm': 0.31053611636161804, 'learning_rate': 2.184991999655951e-05, 'epoch': 4.86} 49%|████▊ | 4864/10000 [7:38:46<7:51:03, 5.50s/it][2025-06-19 21:08:31,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
21:08:31,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.11 | bwd_microstep: 3384.66 | bwd_inner_microstep: 3383.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 21:08:31,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.11 | bwd: 3384.68 | bwd_inner: 3383.87 | bwd_allreduce: 0.76 | step: 6.63 49%|████▊ | 4865/10000 [7:38:52<7:52:12, 5.52s/it] {'loss': 0.0309, 'grad_norm': 2.335364818572998, 'learning_rate': 2.1843470157747198e-05, 'epoch': 4.87} 49%|████▊ | 4865/10000 [7:38:52<7:52:12, 5.52s/it][2025-06-19 21:08:37,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.80 [2025-06-19 21:08:37,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.04 | bwd_microstep: 3328.90 | bwd_inner_microstep: 3328.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.48 [2025-06-19 21:08:37,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.04 | bwd: 3328.92 | bwd_inner: 3328.10 | bwd_allreduce: 0.77 | step: 7.48 49%|████▊ | 4866/10000 [7:38:57<7:50:58, 5.50s/it] {'loss': 0.0342, 'grad_norm': 2.6091861724853516, 'learning_rate': 2.183702012556342e-05, 'epoch': 4.87} 49%|████▊ | 4866/10000 [7:38:57<7:50:58, 5.50s/it][2025-06-19 21:08:42,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.87 [2025-06-19 21:08:42,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.77 | bwd_microstep: 3382.65 | bwd_inner_microstep: 3381.54 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.88 [2025-06-19 21:08:42,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.77 | bwd: 3382.67 | bwd_inner: 3381.54 | bwd_allreduce: 1.06 | step: 7.89 49%|████▊ | 4867/10000 [7:39:03<7:52:11, 5.52s/it] {'loss': 0.0374, 'grad_norm': 2.025806427001953, 'learning_rate': 2.1830569900684762e-05, 'epoch': 
4.87} 49%|████▊ | 4867/10000 [7:39:03<7:52:11, 5.52s/it][2025-06-19 21:08:48,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:08:48,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.87 | bwd_microstep: 3326.79 | bwd_inner_microstep: 3326.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 21:08:48,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.87 | bwd: 3326.81 | bwd_inner: 3326.00 | bwd_allreduce: 0.76 | step: 6.73 49%|████▊ | 4868/10000 [7:39:08<7:51:02, 5.51s/it] {'loss': 0.0396, 'grad_norm': 3.8868229389190674, 'learning_rate': 2.1824119483787816e-05, 'epoch': 4.87} 49%|████▊ | 4868/10000 [7:39:08<7:51:02, 5.51s/it][2025-06-19 21:08:53,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:08:53,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.16 | bwd_microstep: 3334.67 | bwd_inner_microstep: 3333.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.90 [2025-06-19 21:08:53,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.16 | bwd: 3334.68 | bwd_inner: 3333.88 | bwd_allreduce: 0.76 | step: 6.90 49%|████▊ | 4869/10000 [7:39:14<7:50:11, 5.50s/it] {'loss': 0.0141, 'grad_norm': 0.830684244632721, 'learning_rate': 2.18176688755492e-05, 'epoch': 4.87} 49%|████▊ | 4869/10000 [7:39:14<7:50:11, 5.50s/it][2025-06-19 21:08:59,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:08:59,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.90 | bwd_microstep: 3373.66 | bwd_inner_microstep: 3372.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 21:08:59,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2129.90 | bwd: 3373.68 | bwd_inner: 3372.85 | bwd_allreduce: 0.78 | step: 7.12 49%|████▊ | 4870/10000 [7:39:19<7:51:12, 5.51s/it] {'loss': 0.0771, 'grad_norm': 5.090855121612549, 'learning_rate': 2.181121807664556e-05, 'epoch': 4.87} 49%|████▊ | 4870/10000 [7:39:19<7:51:12, 5.51s/it][2025-06-19 21:09:04,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:09:04,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.48 | bwd_microstep: 3379.22 | bwd_inner_microstep: 3378.34 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.95 [2025-06-19 21:09:04,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.48 | bwd: 3379.24 | bwd_inner: 3378.34 | bwd_allreduce: 0.86 | step: 6.95 49%|████▊ | 4871/10000 [7:39:25<7:52:04, 5.52s/it] {'loss': 0.0079, 'grad_norm': 0.8602548837661743, 'learning_rate': 2.1804767087753546e-05, 'epoch': 4.87} 49%|████▊ | 4871/10000 [7:39:25<7:52:04, 5.52s/it][2025-06-19 21:09:10,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:09:10,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.89 | bwd_microstep: 3328.14 | bwd_inner_microstep: 3327.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 21:09:10,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.90 | bwd: 3328.16 | bwd_inner: 3327.35 | bwd_allreduce: 0.76 | step: 6.66 49%|████▊ | 4872/10000 [7:39:31<7:50:41, 5.51s/it] {'loss': 0.1593, 'grad_norm': 5.43239164352417, 'learning_rate': 2.179831590954984e-05, 'epoch': 4.87} 49%|████▊ | 4872/10000 [7:39:31<7:50:41, 5.51s/it][2025-06-19 21:09:15,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:09:15,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 2113.53 | bwd_microstep: 3330.55 | bwd_inner_microstep: 3329.57 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.35 [2025-06-19 21:09:15,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.53 | bwd: 3330.57 | bwd_inner: 3329.57 | bwd_allreduce: 0.95 | step: 7.36 49%|████▊ | 4873/10000 [7:39:36<7:49:59, 5.50s/it] {'loss': 0.0427, 'grad_norm': 5.015896320343018, 'learning_rate': 2.179186454271114e-05, 'epoch': 4.87} 49%|████▊ | 4873/10000 [7:39:36<7:49:59, 5.50s/it][2025-06-19 21:09:21,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:09:21,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.72 | bwd_microstep: 3371.08 | bwd_inner_microstep: 3369.96 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.67 [2025-06-19 21:09:21,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.72 | bwd: 3371.10 | bwd_inner: 3369.96 | bwd_allreduce: 1.08 | step: 7.67 49%|████▊ | 4874/10000 [7:39:42<7:51:02, 5.51s/it] {'loss': 0.0033, 'grad_norm': 0.2851528227329254, 'learning_rate': 2.1785412987914167e-05, 'epoch': 4.87} 49%|████▊ | 4874/10000 [7:39:42<7:51:02, 5.51s/it][2025-06-19 21:09:26,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:09:26,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.76 | bwd_microstep: 3373.02 | bwd_inner_microstep: 3372.05 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.81 [2025-06-19 21:09:26,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.76 | bwd: 3373.04 | bwd_inner: 3372.05 | bwd_allreduce: 0.95 | step: 7.81 49%|████▉ | 4875/10000 [7:39:47<7:51:48, 5.52s/it] {'loss': 0.0029, 'grad_norm': 0.26477164030075073, 'learning_rate': 2.177896124583566e-05, 'epoch': 4.88} 49%|████▉ | 4875/10000 [7:39:47<7:51:48, 
5.52s/it][2025-06-19 21:09:32,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:09:32,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.50 | bwd_microstep: 3320.49 | bwd_inner_microstep: 3319.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 21:09:32,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.50 | bwd: 3320.51 | bwd_inner: 3319.70 | bwd_allreduce: 0.77 | step: 6.73 49%|████▉ | 4876/10000 [7:39:53<7:50:41, 5.51s/it] {'loss': 0.0028, 'grad_norm': 0.233273446559906, 'learning_rate': 2.177250931715237e-05, 'epoch': 4.88} 49%|████▉ | 4876/10000 [7:39:53<7:50:41, 5.51s/it][2025-06-19 21:09:37,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:09:37,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.36 | bwd_microstep: 3328.37 | bwd_inner_microstep: 3327.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 21:09:37,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.36 | bwd: 3328.39 | bwd_inner: 3327.58 | bwd_allreduce: 0.76 | step: 6.70 49%|████▉ | 4877/10000 [7:39:58<7:49:38, 5.50s/it] {'loss': 0.0069, 'grad_norm': 0.5306877493858337, 'learning_rate': 2.176605720254109e-05, 'epoch': 4.88} 49%|████▉ | 4877/10000 [7:39:58<7:49:38, 5.50s/it][2025-06-19 21:09:43,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:09:43,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.44 | bwd_microstep: 3376.18 | bwd_inner_microstep: 3375.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 21:09:43,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.44 | bwd: 3376.19 | bwd_inner: 3375.39 | 
bwd_allreduce: 0.76 | step: 6.68 49%|████▉ | 4878/10000 [7:40:04<7:50:38, 5.51s/it] {'loss': 0.035, 'grad_norm': 2.4863228797912598, 'learning_rate': 2.1759604902678596e-05, 'epoch': 4.88} 49%|████▉ | 4878/10000 [7:40:04<7:50:38, 5.51s/it][2025-06-19 21:09:48,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:09:48,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.01 | bwd_microstep: 3327.67 | bwd_inner_microstep: 3326.86 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.34 [2025-06-19 21:09:48,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.01 | bwd: 3327.68 | bwd_inner: 3326.86 | bwd_allreduce: 0.78 | step: 7.34 49%|████▉ | 4879/10000 [7:40:09<7:49:31, 5.50s/it] {'loss': 0.0033, 'grad_norm': 0.41895991563796997, 'learning_rate': 2.1753152418241714e-05, 'epoch': 4.88} 49%|████▉ | 4879/10000 [7:40:09<7:49:31, 5.50s/it][2025-06-19 21:09:54,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:09:54,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.19 | bwd_microstep: 3321.27 | bwd_inner_microstep: 3320.32 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.27 [2025-06-19 21:09:54,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.19 | bwd: 3321.28 | bwd_inner: 3320.32 | bwd_allreduce: 0.92 | step: 7.27 49%|████▉ | 4880/10000 [7:40:15<7:48:47, 5.49s/it] {'loss': 0.0094, 'grad_norm': 0.7583282589912415, 'learning_rate': 2.174669974990728e-05, 'epoch': 4.88} 49%|████▉ | 4880/10000 [7:40:15<7:48:47, 5.49s/it][2025-06-19 21:09:59,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:09:59,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.25 | bwd_microstep: 
3378.96 | bwd_inner_microstep: 3378.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.03 [2025-06-19 21:09:59,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.25 | bwd: 3378.97 | bwd_inner: 3378.16 | bwd_allreduce: 0.77 | step: 7.04 49%|████▉ | 4881/10000 [7:40:20<7:50:21, 5.51s/it] {'loss': 0.0154, 'grad_norm': 1.6482338905334473, 'learning_rate': 2.174024689835215e-05, 'epoch': 4.88} 49%|████▉ | 4881/10000 [7:40:20<7:50:21, 5.51s/it][2025-06-19 21:10:05,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.89 [2025-06-19 21:10:05,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.36 | bwd_microstep: 3329.98 | bwd_inner_microstep: 3329.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 21:10:05,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.36 | bwd: 3330.00 | bwd_inner: 3329.17 | bwd_allreduce: 0.78 | step: 6.95 49%|████▉ | 4882/10000 [7:40:26<7:49:28, 5.50s/it] {'loss': 0.0048, 'grad_norm': 0.5448824763298035, 'learning_rate': 2.1733793864253204e-05, 'epoch': 4.88} 49%|████▉ | 4882/10000 [7:40:26<7:49:28, 5.50s/it][2025-06-19 21:10:10,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:10:10,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.97 | bwd_microstep: 3323.97 | bwd_inner_microstep: 3323.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-19 21:10:10,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.97 | bwd: 3323.98 | bwd_inner: 3323.16 | bwd_allreduce: 0.77 | step: 6.93 49%|████▉ | 4883/10000 [7:40:31<7:48:32, 5.49s/it] {'loss': 0.0061, 'grad_norm': 0.7555716037750244, 'learning_rate': 2.172734064828732e-05, 'epoch': 4.88} 49%|████▉ | 4883/10000 [7:40:31<7:48:32, 5.49s/it][2025-06-19 21:10:16,289] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:10:16,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.34 | bwd_microstep: 3378.37 | bwd_inner_microstep: 3377.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-19 21:10:16,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.34 | bwd: 3378.38 | bwd_inner: 3377.58 | bwd_allreduce: 0.76 | step: 6.75 49%|████▉ | 4884/10000 [7:40:37<7:49:49, 5.51s/it] {'loss': 0.0351, 'grad_norm': 2.073011875152588, 'learning_rate': 2.1720887251131423e-05, 'epoch': 4.88} 49%|████▉ | 4884/10000 [7:40:37<7:49:49, 5.51s/it][2025-06-19 21:10:21,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:10:21,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.92 | bwd_microstep: 3318.95 | bwd_inner_microstep: 3318.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 21:10:21,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.92 | bwd: 3318.96 | bwd_inner: 3318.15 | bwd_allreduce: 0.78 | step: 6.80 49%|████▉ | 4885/10000 [7:40:42<7:48:29, 5.50s/it] {'loss': 0.0055, 'grad_norm': 0.7818731665611267, 'learning_rate': 2.1714433673462442e-05, 'epoch': 4.88} 49%|████▉ | 4885/10000 [7:40:42<7:48:29, 5.50s/it][2025-06-19 21:10:27,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:10:27,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.61 | bwd_microstep: 3322.59 | bwd_inner_microstep: 3321.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 21:10:27,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.61 | bwd: 3322.60 | bwd_inner: 3321.79 | bwd_allreduce: 0.77 | step: 6.71 49%|████▉ | 
4886/10000 [7:40:48<7:47:55, 5.49s/it] {'loss': 0.0683, 'grad_norm': 4.38103723526001, 'learning_rate': 2.170797991595732e-05, 'epoch': 4.89} 49%|████▉ | 4886/10000 [7:40:48<7:47:55, 5.49s/it][2025-06-19 21:10:32,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:10:32,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.43 | bwd_microstep: 3377.81 | bwd_inner_microstep: 3376.87 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.18 [2025-06-19 21:10:32,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.43 | bwd: 3377.83 | bwd_inner: 3376.87 | bwd_allreduce: 0.91 | step: 7.18 49%|████▉ | 4887/10000 [7:40:53<7:49:16, 5.51s/it] {'loss': 0.0011, 'grad_norm': 0.07565875351428986, 'learning_rate': 2.170152597929304e-05, 'epoch': 4.89} 49%|████▉ | 4887/10000 [7:40:53<7:49:16, 5.51s/it][2025-06-19 21:10:38,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:10:38,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.09 | bwd_microstep: 3324.24 | bwd_inner_microstep: 3323.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.10 [2025-06-19 21:10:38,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.09 | bwd: 3324.26 | bwd_inner: 3323.45 | bwd_allreduce: 0.76 | step: 7.10 49%|████▉ | 4888/10000 [7:40:59<7:48:23, 5.50s/it] {'loss': 0.0648, 'grad_norm': 4.2983012199401855, 'learning_rate': 2.1695071864146576e-05, 'epoch': 4.89} 49%|████▉ | 4888/10000 [7:40:59<7:48:23, 5.50s/it][2025-06-19 21:10:43,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:10:43,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.80 | bwd_microstep: 3316.53 | bwd_inner_microstep: 3315.75 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 21:10:43,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.80 | bwd: 3316.54 | bwd_inner: 3315.75 | bwd_allreduce: 0.75 | step: 6.55 49%|████▉ | 4889/10000 [7:41:04<7:47:23, 5.49s/it] {'loss': 0.0428, 'grad_norm': 2.2379908561706543, 'learning_rate': 2.168861757119494e-05, 'epoch': 4.89} 49%|████▉ | 4889/10000 [7:41:04<7:47:23, 5.49s/it][2025-06-19 21:10:49,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:10:49,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.46 | bwd_microstep: 3324.68 | bwd_inner_microstep: 3323.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 21:10:49,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.46 | bwd: 3324.70 | bwd_inner: 3323.89 | bwd_allreduce: 0.76 | step: 6.74 49%|████▉ | 4890/10000 [7:41:09<7:46:55, 5.48s/it] {'loss': 0.0027, 'grad_norm': 0.22087357938289642, 'learning_rate': 2.1682163101115154e-05, 'epoch': 4.89} 49%|████▉ | 4890/10000 [7:41:09<7:46:55, 5.48s/it][2025-06-19 21:10:54,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:10:54,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.22 | bwd_microstep: 3315.85 | bwd_inner_microstep: 3314.93 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.85 [2025-06-19 21:10:54,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.22 | bwd: 3315.87 | bwd_inner: 3314.93 | bwd_allreduce: 0.89 | step: 6.86 49%|████▉ | 4891/10000 [7:41:15<7:46:26, 5.48s/it] {'loss': 0.0029, 'grad_norm': 0.19824986159801483, 'learning_rate': 2.1675708454584275e-05, 'epoch': 4.89} 49%|████▉ | 4891/10000 [7:41:15<7:46:26, 5.48s/it][2025-06-19 21:11:00,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:11:00,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.69 | bwd_microstep: 3321.19 | bwd_inner_microstep: 3320.15 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.69 [2025-06-19 21:11:00,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.69 | bwd: 3321.21 | bwd_inner: 3320.15 | bwd_allreduce: 1.01 | step: 7.69 49%|████▉ | 4892/10000 [7:41:20<7:46:14, 5.48s/it] {'loss': 0.0054, 'grad_norm': 0.5359057188034058, 'learning_rate': 2.1669253632279355e-05, 'epoch': 4.89} 49%|████▉ | 4892/10000 [7:41:20<7:46:14, 5.48s/it][2025-06-19 21:11:05,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:11:05,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.60 | bwd_microstep: 3371.41 | bwd_inner_microstep: 3370.46 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.30 [2025-06-19 21:11:05,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.60 | bwd: 3371.43 | bwd_inner: 3370.46 | bwd_allreduce: 0.92 | step: 7.30 49%|████▉ | 4893/10000 [7:41:26<7:47:52, 5.50s/it] {'loss': 0.0866, 'grad_norm': 4.038430213928223, 'learning_rate': 2.1662798634877472e-05, 'epoch': 4.89} 49%|████▉ | 4893/10000 [7:41:26<7:47:52, 5.50s/it][2025-06-19 21:11:11,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:11:11,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.45 | bwd_microstep: 3370.46 | bwd_inner_microstep: 3369.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 21:11:11,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.45 | bwd: 3370.48 | bwd_inner: 3369.67 | bwd_allreduce: 0.76 | step: 6.75 49%|████▉ | 4894/10000 [7:41:32<7:48:52, 5.51s/it] {'loss': 
0.0092, 'grad_norm': 1.0499703884124756, 'learning_rate': 2.1656343463055727e-05, 'epoch': 4.89} 49%|████▉ | 4894/10000 [7:41:32<7:48:52, 5.51s/it][2025-06-19 21:11:16,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:11:16,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.46 | bwd_microstep: 3312.79 | bwd_inner_microstep: 3311.91 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.19 [2025-06-19 21:11:16,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.46 | bwd: 3312.81 | bwd_inner: 3311.91 | bwd_allreduce: 0.84 | step: 7.19 49%|████▉ | 4895/10000 [7:41:37<7:47:21, 5.49s/it] {'loss': 0.028, 'grad_norm': 1.8129453659057617, 'learning_rate': 2.1649888117491247e-05, 'epoch': 4.89} 49%|████▉ | 4895/10000 [7:41:37<7:47:21, 5.49s/it][2025-06-19 21:11:22,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:11:22,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.13 | bwd_microstep: 3321.01 | bwd_inner_microstep: 3320.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 21:11:22,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.13 | bwd: 3321.02 | bwd_inner: 3320.20 | bwd_allreduce: 0.78 | step: 7.13 49%|████▉ | 4896/10000 [7:41:42<7:46:45, 5.49s/it] {'loss': 0.0164, 'grad_norm': 1.1395845413208008, 'learning_rate': 2.1643432598861155e-05, 'epoch': 4.9} 49%|████▉ | 4896/10000 [7:41:42<7:46:45, 5.49s/it][2025-06-19 21:11:27,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:11:27,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.67 | bwd_microstep: 3316.23 | bwd_inner_microstep: 3315.31 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.87 
[2025-06-19 21:11:27,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.67 | bwd: 3316.24 | bwd_inner: 3315.31 | bwd_allreduce: 0.89 | step: 6.87 49%|████▉ | 4897/10000 [7:41:48<7:46:05, 5.48s/it] {'loss': 0.1371, 'grad_norm': 3.1856110095977783, 'learning_rate': 2.1636976907842616e-05, 'epoch': 4.9} 49%|████▉ | 4897/10000 [7:41:48<7:46:05, 5.48s/it][2025-06-19 21:11:33,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:11:33,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.22 | bwd_microstep: 3365.28 | bwd_inner_microstep: 3364.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 21:11:33,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.22 | bwd: 3365.29 | bwd_inner: 3364.49 | bwd_allreduce: 0.76 | step: 6.63 49%|████▉ | 4898/10000 [7:41:53<7:47:19, 5.50s/it] {'loss': 0.0162, 'grad_norm': 1.3806670904159546, 'learning_rate': 2.1630521045112788e-05, 'epoch': 4.9} 49%|████▉ | 4898/10000 [7:41:53<7:47:19, 5.50s/it][2025-06-19 21:11:38,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:11:38,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.14 | bwd_microstep: 3367.31 | bwd_inner_microstep: 3366.32 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.78 [2025-06-19 21:11:38,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.14 | bwd: 3367.33 | bwd_inner: 3366.32 | bwd_allreduce: 0.96 | step: 7.79 49%|████▉ | 4899/10000 [7:41:59<7:48:22, 5.51s/it] {'loss': 0.003, 'grad_norm': 0.2779972553253174, 'learning_rate': 2.162406501134888e-05, 'epoch': 4.9} 49%|████▉ | 4899/10000 [7:41:59<7:48:22, 5.51s/it][2025-06-19 21:11:44,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | 
optimizer_step: 2.73 [2025-06-19 21:11:44,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.19 | bwd_microstep: 3327.82 | bwd_inner_microstep: 3326.72 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.96 [2025-06-19 21:11:44,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.19 | bwd: 3327.84 | bwd_inner: 3326.73 | bwd_allreduce: 1.05 | step: 7.97 49%|████▉ | 4900/10000 [7:42:04<7:47:32, 5.50s/it] {'loss': 0.0155, 'grad_norm': 3.0190422534942627, 'learning_rate': 2.1617608807228087e-05, 'epoch': 4.9} 49%|████▉ | 4900/10000 [7:42:04<7:47:32, 5.50s/it][2025-06-19 21:11:49,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:11:49,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.88 | bwd_microstep: 3321.38 | bwd_inner_microstep: 3320.48 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.96 [2025-06-19 21:11:49,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.88 | bwd: 3321.39 | bwd_inner: 3320.48 | bwd_allreduce: 0.87 | step: 6.96 49%|████▉ | 4901/10000 [7:42:10<7:46:46, 5.49s/it] {'loss': 0.0196, 'grad_norm': 1.8082305192947388, 'learning_rate': 2.1611152433427636e-05, 'epoch': 4.9} 49%|████▉ | 4901/10000 [7:42:10<7:46:46, 5.49s/it][2025-06-19 21:11:55,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:11:55,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.62 | bwd_microstep: 3401.72 | bwd_inner_microstep: 3400.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-19 21:11:55,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.62 | bwd: 3401.73 | bwd_inner: 3400.92 | bwd_allreduce: 0.77 | step: 6.97 49%|████▉ | 4902/10000 [7:42:16<7:49:04, 5.52s/it] {'loss': 0.0147, 'grad_norm': 2.724804401397705, 'learning_rate': 
2.1604695890624772e-05, 'epoch': 4.9} 49%|████▉ | 4902/10000 [7:42:16<7:49:04, 5.52s/it][2025-06-19 21:12:00,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:12:00,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.39 | bwd_microstep: 3369.04 | bwd_inner_microstep: 3368.25 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 21:12:00,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.39 | bwd: 3369.06 | bwd_inner: 3368.25 | bwd_allreduce: 0.77 | step: 6.77 49%|████▉ | 4903/10000 [7:42:21<7:49:25, 5.53s/it] {'loss': 0.0028, 'grad_norm': 0.24802832305431366, 'learning_rate': 2.1598239179496757e-05, 'epoch': 4.9} 49%|████▉ | 4903/10000 [7:42:21<7:49:25, 5.53s/it][2025-06-19 21:12:06,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:12:06,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.69 | bwd_microstep: 3319.87 | bwd_inner_microstep: 3319.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 21:12:06,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.69 | bwd: 3319.89 | bwd_inner: 3319.08 | bwd_allreduce: 0.76 | step: 6.61 49%|████▉ | 4904/10000 [7:42:27<7:47:43, 5.51s/it] {'loss': 0.0303, 'grad_norm': 1.572609782218933, 'learning_rate': 2.159178230072087e-05, 'epoch': 4.9} 49%|████▉ | 4904/10000 [7:42:27<7:47:43, 5.51s/it][2025-06-19 21:12:11,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:12:11,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.27 | bwd_microstep: 3373.30 | bwd_inner_microstep: 3372.16 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.05 [2025-06-19 21:12:11,760] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2135.27 | bwd: 3373.32 | bwd_inner: 3372.17 | bwd_allreduce: 1.09 | step: 8.05 49%|████▉ | 4905/10000 [7:42:32<7:48:46, 5.52s/it] {'loss': 0.0008, 'grad_norm': 0.07125573605298996, 'learning_rate': 2.1585325254974413e-05, 'epoch': 4.91} 49%|████▉ | 4905/10000 [7:42:32<7:48:46, 5.52s/it][2025-06-19 21:12:17,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:12:17,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.50 | bwd_microstep: 3367.79 | bwd_inner_microstep: 3367.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 21:12:17,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.50 | bwd: 3367.81 | bwd_inner: 3367.00 | bwd_allreduce: 0.77 | step: 6.79 49%|████▉ | 4906/10000 [7:42:38<7:49:16, 5.53s/it] {'loss': 0.07, 'grad_norm': 4.704521179199219, 'learning_rate': 2.1578868042934687e-05, 'epoch': 4.91} 49%|████▉ | 4906/10000 [7:42:38<7:49:16, 5.53s/it][2025-06-19 21:12:22,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:12:22,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.85 | bwd_microstep: 3328.16 | bwd_inner_microstep: 3327.29 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.39 [2025-06-19 21:12:22,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.85 | bwd: 3328.18 | bwd_inner: 3327.29 | bwd_allreduce: 0.84 | step: 7.39 49%|████▉ | 4907/10000 [7:42:43<7:47:42, 5.51s/it] {'loss': 0.0181, 'grad_norm': 2.9458508491516113, 'learning_rate': 2.1572410665279034e-05, 'epoch': 4.91} 49%|████▉ | 4907/10000 [7:42:43<7:47:42, 5.51s/it][2025-06-19 21:12:28,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:12:28,250] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.63 | bwd_microstep: 3323.51 | bwd_inner_microstep: 3322.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 21:12:28,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.63 | bwd: 3323.52 | bwd_inner: 3322.71 | bwd_allreduce: 0.76 | step: 6.74 49%|████▉ | 4908/10000 [7:42:49<7:46:37, 5.50s/it] {'loss': 0.0411, 'grad_norm': 2.498537063598633, 'learning_rate': 2.1565953122684795e-05, 'epoch': 4.91} 49%|████▉ | 4908/10000 [7:42:49<7:46:37, 5.50s/it][2025-06-19 21:12:33,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:12:33,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.18 | bwd_microstep: 3323.08 | bwd_inner_microstep: 3322.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 21:12:33,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.18 | bwd: 3323.09 | bwd_inner: 3322.29 | bwd_allreduce: 0.76 | step: 6.67 49%|████▉ | 4909/10000 [7:42:54<7:45:43, 5.49s/it] {'loss': 0.0276, 'grad_norm': 3.5516092777252197, 'learning_rate': 2.1559495415829344e-05, 'epoch': 4.91} 49%|████▉ | 4909/10000 [7:42:54<7:45:43, 5.49s/it][2025-06-19 21:12:39,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:12:39,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.18 | bwd_microstep: 3374.11 | bwd_inner_microstep: 3373.29 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.08 [2025-06-19 21:12:39,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.18 | bwd: 3374.12 | bwd_inner: 3373.29 | bwd_allreduce: 0.79 | step: 7.09 49%|████▉ | 4910/10000 [7:43:00<7:47:25, 5.51s/it] {'loss': 0.0083, 'grad_norm': 0.7971515655517578, 'learning_rate': 2.155303754539007e-05, 'epoch': 4.91} 49%|████▉ | 
4910/10000 [7:43:00<7:47:25, 5.51s/it][2025-06-19 21:12:44,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:12:44,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.32 | bwd_microstep: 3327.18 | bwd_inner_microstep: 3326.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 21:12:44,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.32 | bwd: 3327.19 | bwd_inner: 3326.39 | bwd_allreduce: 0.76 | step: 6.67 49%|████▉ | 4911/10000 [7:43:05<7:46:23, 5.50s/it] {'loss': 0.003, 'grad_norm': 0.23184743523597717, 'learning_rate': 2.1546579512044356e-05, 'epoch': 4.91} 49%|████▉ | 4911/10000 [7:43:05<7:46:23, 5.50s/it][2025-06-19 21:12:50,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:12:50,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.57 | bwd_microstep: 3370.09 | bwd_inner_microstep: 3369.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 21:12:50,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.58 | bwd: 3370.10 | bwd_inner: 3369.28 | bwd_allreduce: 0.78 | step: 7.22 49%|████▉ | 4912/10000 [7:43:11<7:47:11, 5.51s/it] {'loss': 0.0033, 'grad_norm': 0.703371524810791, 'learning_rate': 2.154012131646963e-05, 'epoch': 4.91} 49%|████▉ | 4912/10000 [7:43:11<7:47:11, 5.51s/it][2025-06-19 21:12:55,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:12:55,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.67 | bwd_microstep: 3319.95 | bwd_inner_microstep: 3319.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 21:12:55,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.67 | bwd: 3319.96 
| bwd_inner: 3319.16 | bwd_allreduce: 0.76 | step: 6.70 49%|████▉ | 4913/10000 [7:43:16<7:45:51, 5.49s/it] {'loss': 0.0383, 'grad_norm': 2.742798328399658, 'learning_rate': 2.153366295934333e-05, 'epoch': 4.91} 49%|████▉ | 4913/10000 [7:43:16<7:45:51, 5.49s/it][2025-06-19 21:13:01,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:13:01,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.91 | bwd_microstep: 3368.45 | bwd_inner_microstep: 3367.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 21:13:01,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.91 | bwd: 3368.46 | bwd_inner: 3367.66 | bwd_allreduce: 0.76 | step: 6.66 49%|████▉ | 4914/10000 [7:43:22<7:46:52, 5.51s/it] {'loss': 0.0115, 'grad_norm': 1.0309292078018188, 'learning_rate': 2.15272044413429e-05, 'epoch': 4.91} 49%|████▉ | 4914/10000 [7:43:22<7:46:52, 5.51s/it][2025-06-19 21:13:06,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:13:06,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.70 | bwd_microstep: 3326.60 | bwd_inner_microstep: 3325.65 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.01 [2025-06-19 21:13:06,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.70 | bwd: 3326.61 | bwd_inner: 3325.65 | bwd_allreduce: 0.92 | step: 7.01 49%|████▉ | 4915/10000 [7:43:27<7:45:56, 5.50s/it] {'loss': 0.021, 'grad_norm': 1.4706662893295288, 'learning_rate': 2.1520745763145808e-05, 'epoch': 4.92} 49%|████▉ | 4915/10000 [7:43:27<7:45:56, 5.50s/it][2025-06-19 21:13:12,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:13:12,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2109.51 | bwd_microstep: 3325.71 | bwd_inner_microstep: 3324.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 21:13:12,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.51 | bwd: 3325.72 | bwd_inner: 3324.92 | bwd_allreduce: 0.77 | step: 6.66 49%|████▉ | 4916/10000 [7:43:33<7:45:13, 5.49s/it] {'loss': 0.0019, 'grad_norm': 0.2641068696975708, 'learning_rate': 2.1514286925429553e-05, 'epoch': 4.92} 49%|████▉ | 4916/10000 [7:43:33<7:45:13, 5.49s/it][2025-06-19 21:13:17,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:13:17,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.41 | bwd_microstep: 3310.84 | bwd_inner_microstep: 3309.96 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.41 [2025-06-19 21:13:17,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.41 | bwd: 3310.85 | bwd_inner: 3309.96 | bwd_allreduce: 0.85 | step: 7.41 49%|████▉ | 4917/10000 [7:43:38<7:44:09, 5.48s/it] {'loss': 0.0054, 'grad_norm': 0.6066991090774536, 'learning_rate': 2.150782792887162e-05, 'epoch': 4.92} 49%|████▉ | 4917/10000 [7:43:38<7:44:09, 5.48s/it][2025-06-19 21:13:23,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:13:23,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.09 | bwd_microstep: 3369.35 | bwd_inner_microstep: 3368.52 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.75 [2025-06-19 21:13:23,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.09 | bwd: 3369.37 | bwd_inner: 3368.52 | bwd_allreduce: 0.81 | step: 6.75 49%|████▉ | 4918/10000 [7:43:44<7:45:27, 5.50s/it] {'loss': 0.0013, 'grad_norm': 0.09257695078849792, 'learning_rate': 2.150136877414954e-05, 'epoch': 4.92} 49%|████▉ | 4918/10000 [7:43:44<7:45:27, 5.50s/it][2025-06-19 21:13:28,688] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:13:28,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.86 | bwd_microstep: 3328.38 | bwd_inner_microstep: 3327.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 21:13:28,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.86 | bwd: 3328.40 | bwd_inner: 3327.59 | bwd_allreduce: 0.77 | step: 6.73 49%|████▉ | 4919/10000 [7:43:49<7:44:48, 5.49s/it] {'loss': 0.0093, 'grad_norm': 0.677825927734375, 'learning_rate': 2.1494909461940844e-05, 'epoch': 4.92} 49%|████▉ | 4919/10000 [7:43:49<7:44:48, 5.49s/it][2025-06-19 21:13:34,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:13:34,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.06 | bwd_microstep: 3315.49 | bwd_inner_microstep: 3314.58 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.91 [2025-06-19 21:13:34,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.07 | bwd: 3315.50 | bwd_inner: 3314.58 | bwd_allreduce: 0.88 | step: 6.91 49%|████▉ | 4920/10000 [7:43:54<7:43:49, 5.48s/it] {'loss': 0.027, 'grad_norm': 1.3179965019226074, 'learning_rate': 2.1488449992923083e-05, 'epoch': 4.92} 49%|████▉ | 4920/10000 [7:43:54<7:43:49, 5.48s/it][2025-06-19 21:13:39,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:13:39,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.31 | bwd_microstep: 3309.98 | bwd_inner_microstep: 3309.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 21:13:39,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.31 | bwd: 3310.00 | bwd_inner: 3309.20 | bwd_allreduce: 0.76 | step: 6.57 
49%|████▉ | 4921/10000 [7:44:00<7:42:55, 5.47s/it] {'loss': 0.0063, 'grad_norm': 0.5915350317955017, 'learning_rate': 2.1481990367773823e-05, 'epoch': 4.92} 49%|████▉ | 4921/10000 [7:44:00<7:42:55, 5.47s/it][2025-06-19 21:13:45,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:13:45,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.70 | bwd_microstep: 3366.13 | bwd_inner_microstep: 3365.31 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.25 [2025-06-19 21:13:45,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.70 | bwd: 3366.15 | bwd_inner: 3365.31 | bwd_allreduce: 0.79 | step: 7.26 49%|████▉ | 4922/10000 [7:44:05<7:44:21, 5.49s/it] {'loss': 0.006, 'grad_norm': 0.7463179230690002, 'learning_rate': 2.147553058717065e-05, 'epoch': 4.92} 49%|████▉ | 4922/10000 [7:44:05<7:44:21, 5.49s/it][2025-06-19 21:13:50,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:13:50,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.23 | bwd_microstep: 3369.04 | bwd_inner_microstep: 3368.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 21:13:50,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.23 | bwd: 3369.05 | bwd_inner: 3368.24 | bwd_allreduce: 0.77 | step: 6.83 49%|████▉ | 4923/10000 [7:44:11<7:45:38, 5.50s/it] {'loss': 0.0648, 'grad_norm': 4.433161735534668, 'learning_rate': 2.1469070651791174e-05, 'epoch': 4.92} 49%|████▉ | 4923/10000 [7:44:11<7:45:38, 5.50s/it][2025-06-19 21:13:56,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:13:56,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.81 | bwd_microstep: 3318.28 | bwd_inner_microstep: 
3317.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 21:13:56,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.81 | bwd: 3318.29 | bwd_inner: 3317.49 | bwd_allreduce: 0.76 | step: 6.69 49%|████▉ | 4924/10000 [7:44:16<7:44:27, 5.49s/it] {'loss': 0.0141, 'grad_norm': 0.9976993203163147, 'learning_rate': 2.1462610562312997e-05, 'epoch': 4.92} 49%|████▉ | 4924/10000 [7:44:16<7:44:27, 5.49s/it][2025-06-19 21:14:01,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:14:01,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.62 | bwd_microstep: 3318.25 | bwd_inner_microstep: 3317.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 21:14:01,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.62 | bwd: 3318.26 | bwd_inner: 3317.44 | bwd_allreduce: 0.78 | step: 6.97 49%|████▉ | 4925/10000 [7:44:22<7:43:31, 5.48s/it] {'loss': 0.0673, 'grad_norm': 4.951849460601807, 'learning_rate': 2.1456150319413762e-05, 'epoch': 4.92} 49%|████▉ | 4925/10000 [7:44:22<7:43:31, 5.48s/it][2025-06-19 21:14:07,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:14:07,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.83 | bwd_microstep: 3317.62 | bwd_inner_microstep: 3316.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 21:14:07,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.83 | bwd: 3317.64 | bwd_inner: 3316.81 | bwd_allreduce: 0.78 | step: 6.95 49%|████▉ | 4926/10000 [7:44:27<7:42:51, 5.47s/it] {'loss': 0.0131, 'grad_norm': 2.4572854042053223, 'learning_rate': 2.144968992377112e-05, 'epoch': 4.93} 49%|████▉ | 4926/10000 [7:44:27<7:42:51, 5.47s/it][2025-06-19 21:14:12,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 21:14:12,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.49 | bwd_microstep: 3365.83 | bwd_inner_microstep: 3364.87 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.39 [2025-06-19 21:14:12,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.49 | bwd: 3365.85 | bwd_inner: 3364.87 | bwd_allreduce: 0.93 | step: 7.40 49%|████▉ | 4927/10000 [7:44:33<7:44:29, 5.49s/it] {'loss': 0.0087, 'grad_norm': 0.7498331069946289, 'learning_rate': 2.144322937606273e-05, 'epoch': 4.93} 49%|████▉ | 4927/10000 [7:44:33<7:44:29, 5.49s/it][2025-06-19 21:14:18,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:14:18,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.77 | bwd_microstep: 3315.72 | bwd_inner_microstep: 3314.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 21:14:18,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.77 | bwd: 3315.74 | bwd_inner: 3314.92 | bwd_allreduce: 0.77 | step: 6.76 49%|████▉ | 4928/10000 [7:44:38<7:43:41, 5.49s/it] {'loss': 0.118, 'grad_norm': 5.610801696777344, 'learning_rate': 2.1436768676966282e-05, 'epoch': 4.93} 49%|████▉ | 4928/10000 [7:44:38<7:43:41, 5.49s/it][2025-06-19 21:14:23,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:14:23,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.15 | bwd_microstep: 3322.96 | bwd_inner_microstep: 3322.04 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.07 [2025-06-19 21:14:23,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.15 | bwd: 3322.97 | bwd_inner: 3322.04 | bwd_allreduce: 0.88 | step: 7.08 49%|████▉ | 4929/10000 [7:44:44<7:43:09, 5.48s/it] {'loss': 
0.0764, 'grad_norm': 4.797173976898193, 'learning_rate': 2.143030782715946e-05, 'epoch': 4.93} 49%|████▉ | 4929/10000 [7:44:44<7:43:09, 5.48s/it][2025-06-19 21:14:28,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:14:28,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.31 | bwd_microstep: 3312.39 | bwd_inner_microstep: 3311.37 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.50 [2025-06-19 21:14:28,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.31 | bwd: 3312.41 | bwd_inner: 3311.37 | bwd_allreduce: 0.98 | step: 7.50 49%|████▉ | 4930/10000 [7:44:49<7:42:34, 5.47s/it] {'loss': 0.0499, 'grad_norm': 3.1817755699157715, 'learning_rate': 2.142384682731999e-05, 'epoch': 4.93} 49%|████▉ | 4930/10000 [7:44:49<7:42:34, 5.47s/it][2025-06-19 21:14:34,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 21:14:34,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.32 | bwd_microstep: 3372.49 | bwd_inner_microstep: 3371.33 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.98 [2025-06-19 21:14:34,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.32 | bwd: 3372.51 | bwd_inner: 3371.33 | bwd_allreduce: 1.12 | step: 7.99 49%|████▉ | 4931/10000 [7:44:55<7:44:32, 5.50s/it] {'loss': 0.4019, 'grad_norm': 9.507871627807617, 'learning_rate': 2.14173856781256e-05, 'epoch': 4.93} 49%|████▉ | 4931/10000 [7:44:55<7:44:32, 5.50s/it][2025-06-19 21:14:40,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:14:40,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.24 | bwd_microstep: 3373.89 | bwd_inner_microstep: 3373.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 
[2025-06-19 21:14:40,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.24 | bwd: 3373.91 | bwd_inner: 3373.10 | bwd_allreduce: 0.77 | step: 6.83 49%|████▉ | 4932/10000 [7:45:00<7:45:56, 5.52s/it] {'loss': 0.0987, 'grad_norm': 6.4781575202941895, 'learning_rate': 2.1410924380254023e-05, 'epoch': 4.93} 49%|████▉ | 4932/10000 [7:45:00<7:45:56, 5.52s/it][2025-06-19 21:14:45,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:14:45,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.06 | bwd_microstep: 3317.37 | bwd_inner_microstep: 3316.59 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.83 [2025-06-19 21:14:45,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.06 | bwd: 3317.38 | bwd_inner: 3316.59 | bwd_allreduce: 0.76 | step: 6.84 49%|████▉ | 4933/10000 [7:45:06<7:44:44, 5.50s/it] {'loss': 0.0189, 'grad_norm': 2.1924376487731934, 'learning_rate': 2.140446293438303e-05, 'epoch': 4.93} 49%|████▉ | 4933/10000 [7:45:06<7:44:44, 5.50s/it][2025-06-19 21:14:51,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:14:51,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.04 | bwd_microstep: 3325.45 | bwd_inner_microstep: 3324.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 21:14:51,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.04 | bwd: 3325.46 | bwd_inner: 3324.66 | bwd_allreduce: 0.76 | step: 6.67 49%|████▉ | 4934/10000 [7:45:11<7:43:55, 5.49s/it] {'loss': 0.0007, 'grad_norm': 0.03737352788448334, 'learning_rate': 2.1398001341190387e-05, 'epoch': 4.93} 49%|████▉ | 4934/10000 [7:45:11<7:43:55, 5.49s/it][2025-06-19 21:14:56,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.73 [2025-06-19 21:14:56,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.63 | bwd_microstep: 3362.41 | bwd_inner_microstep: 3361.45 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.12 [2025-06-19 21:14:56,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.63 | bwd: 3362.43 | bwd_inner: 3361.46 | bwd_allreduce: 0.91 | step: 7.13 49%|████▉ | 4935/10000 [7:45:17<7:44:43, 5.51s/it] {'loss': 0.0276, 'grad_norm': 3.25557804107666, 'learning_rate': 2.13915396013539e-05, 'epoch': 4.94} 49%|████▉ | 4935/10000 [7:45:17<7:44:43, 5.51s/it][2025-06-19 21:15:02,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:15:02,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.88 | bwd_microstep: 3366.99 | bwd_inner_microstep: 3366.14 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.01 [2025-06-19 21:15:02,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.88 | bwd: 3367.01 | bwd_inner: 3366.14 | bwd_allreduce: 0.82 | step: 7.01 49%|████▉ | 4936/10000 [7:45:22<7:45:29, 5.52s/it] {'loss': 0.0717, 'grad_norm': 6.124403476715088, 'learning_rate': 2.138507771555137e-05, 'epoch': 4.94} 49%|████▉ | 4936/10000 [7:45:22<7:45:29, 5.52s/it][2025-06-19 21:15:07,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:15:07,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.59 | bwd_microstep: 3321.27 | bwd_inner_microstep: 3320.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.12 [2025-06-19 21:15:07,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.59 | bwd: 3321.28 | bwd_inner: 3320.47 | bwd_allreduce: 0.77 | step: 7.12 49%|████▉ | 4937/10000 [7:45:28<7:44:20, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.024462169036269188, 'learning_rate': 
2.137861568446061e-05, 'epoch': 4.94} 49%|████▉ | 4937/10000 [7:45:28<7:44:20, 5.50s/it][2025-06-19 21:15:13,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.72 [2025-06-19 21:15:13,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.97 | bwd_microstep: 3313.83 | bwd_inner_microstep: 3312.87 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.15 [2025-06-19 21:15:13,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.97 | bwd: 3313.84 | bwd_inner: 3312.87 | bwd_allreduce: 0.92 | step: 7.16 49%|████▉ | 4938/10000 [7:45:33<7:43:07, 5.49s/it] {'loss': 0.0079, 'grad_norm': 0.7828224897384644, 'learning_rate': 2.1372153508759467e-05, 'epoch': 4.94} 49%|████▉ | 4938/10000 [7:45:33<7:43:07, 5.49s/it][2025-06-19 21:15:18,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:15:18,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.10 | bwd_microstep: 3316.67 | bwd_inner_microstep: 3315.86 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-19 21:15:18,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.10 | bwd: 3316.69 | bwd_inner: 3315.86 | bwd_allreduce: 0.78 | step: 7.25 49%|████▉ | 4939/10000 [7:45:39<7:42:11, 5.48s/it] {'loss': 0.0013, 'grad_norm': 0.2574246823787689, 'learning_rate': 2.1365691189125784e-05, 'epoch': 4.94} 49%|████▉ | 4939/10000 [7:45:39<7:42:11, 5.48s/it][2025-06-19 21:15:24,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:15:24,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.60 | bwd_microstep: 3360.74 | bwd_inner_microstep: 3359.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 21:15:24,005] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.60 | bwd: 3360.76 | bwd_inner: 3359.95 | bwd_allreduce: 0.76 | step: 6.77 49%|████▉ | 4940/10000 [7:45:44<7:43:07, 5.49s/it] {'loss': 0.0295, 'grad_norm': 2.8189029693603516, 'learning_rate': 2.1359228726237437e-05, 'epoch': 4.94} 49%|████▉ | 4940/10000 [7:45:44<7:43:07, 5.49s/it][2025-06-19 21:15:29,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:15:29,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.49 | bwd_microstep: 3372.61 | bwd_inner_microstep: 3371.75 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.53 [2025-06-19 21:15:29,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.49 | bwd: 3372.63 | bwd_inner: 3371.75 | bwd_allreduce: 0.83 | step: 7.53 49%|████▉ | 4941/10000 [7:45:50<7:44:27, 5.51s/it] {'loss': 0.0015, 'grad_norm': 0.20719771087169647, 'learning_rate': 2.1352766120772306e-05, 'epoch': 4.94} 49%|████▉ | 4941/10000 [7:45:50<7:44:27, 5.51s/it][2025-06-19 21:15:35,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:15:35,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.32 | bwd_microstep: 3358.75 | bwd_inner_microstep: 3357.92 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.81 [2025-06-19 21:15:35,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.32 | bwd: 3358.77 | bwd_inner: 3357.92 | bwd_allreduce: 0.80 | step: 6.82 49%|████▉ | 4942/10000 [7:45:55<7:45:04, 5.52s/it] {'loss': 0.0026, 'grad_norm': 0.1680568903684616, 'learning_rate': 2.1346303373408282e-05, 'epoch': 4.94} 49%|████▉ | 4942/10000 [7:45:55<7:45:04, 5.52s/it][2025-06-19 21:15:40,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 
21:15:40,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.64 | bwd_microstep: 3318.33 | bwd_inner_microstep: 3317.35 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.63 [2025-06-19 21:15:40,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.64 | bwd: 3318.35 | bwd_inner: 3317.35 | bwd_allreduce: 0.95 | step: 7.64 49%|████▉ | 4943/10000 [7:46:01<7:43:36, 5.50s/it] {'loss': 0.0349, 'grad_norm': 2.3720409870147705, 'learning_rate': 2.1339840484823283e-05, 'epoch': 4.94} 49%|████▉ | 4943/10000 [7:46:01<7:43:36, 5.50s/it][2025-06-19 21:15:46,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-19 21:15:46,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.88 | bwd_microstep: 3314.67 | bwd_inner_microstep: 3313.76 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.21 [2025-06-19 21:15:46,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.89 | bwd: 3314.70 | bwd_inner: 3313.76 | bwd_allreduce: 0.88 | step: 7.21 49%|████▉ | 4944/10000 [7:46:06<7:42:29, 5.49s/it] {'loss': 0.0276, 'grad_norm': 2.2563745975494385, 'learning_rate': 2.133337745569524e-05, 'epoch': 4.94} 49%|████▉ | 4944/10000 [7:46:06<7:42:29, 5.49s/it][2025-06-19 21:15:51,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 21:15:51,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.30 | bwd_microstep: 3324.00 | bwd_inner_microstep: 3322.83 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.00 [2025-06-19 21:15:51,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.30 | bwd: 3324.02 | bwd_inner: 3322.83 | bwd_allreduce: 1.13 | step: 8.00 49%|████▉ | 4945/10000 [7:46:12<7:42:11, 5.49s/it] {'loss': 0.0184, 'grad_norm': 1.9245845079421997, 'learning_rate': 2.1326914286702088e-05, 'epoch': 
4.95} 49%|████▉ | 4945/10000 [7:46:12<7:42:11, 5.49s/it][2025-06-19 21:15:56,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 21:15:56,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.32 | bwd_microstep: 3319.81 | bwd_inner_microstep: 3318.73 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.77 [2025-06-19 21:15:56,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.32 | bwd: 3319.83 | bwd_inner: 3318.73 | bwd_allreduce: 1.05 | step: 7.78 49%|████▉ | 4946/10000 [7:46:17<7:41:52, 5.48s/it] {'loss': 0.0272, 'grad_norm': 1.6661399602890015, 'learning_rate': 2.132045097852179e-05, 'epoch': 4.95} 49%|████▉ | 4946/10000 [7:46:17<7:41:52, 5.48s/it][2025-06-19 21:16:02,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:16:02,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.40 | bwd_microstep: 3314.89 | bwd_inner_microstep: 3314.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 21:16:02,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.40 | bwd: 3314.90 | bwd_inner: 3314.10 | bwd_allreduce: 0.76 | step: 6.80 49%|████▉ | 4947/10000 [7:46:23<7:41:19, 5.48s/it] {'loss': 0.0122, 'grad_norm': 1.0053961277008057, 'learning_rate': 2.1313987531832307e-05, 'epoch': 4.95} 49%|████▉ | 4947/10000 [7:46:23<7:41:19, 5.48s/it][2025-06-19 21:16:07,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:16:07,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.57 | bwd_microstep: 3361.31 | bwd_inner_microstep: 3360.37 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.06 [2025-06-19 21:16:07,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2132.57 | bwd: 3361.33 | bwd_inner: 3360.37 | bwd_allreduce: 0.91 | step: 7.06 49%|████▉ | 4948/10000 [7:46:28<7:42:44, 5.50s/it] {'loss': 0.0024, 'grad_norm': 0.5095145106315613, 'learning_rate': 2.1307523947311638e-05, 'epoch': 4.95} 49%|████▉ | 4948/10000 [7:46:28<7:42:44, 5.50s/it][2025-06-19 21:16:13,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:16:13,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.60 | bwd_microstep: 3370.77 | bwd_inner_microstep: 3369.88 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.41 [2025-06-19 21:16:13,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.60 | bwd: 3370.78 | bwd_inner: 3369.88 | bwd_allreduce: 0.85 | step: 7.41 49%|████▉ | 4949/10000 [7:46:34<7:43:38, 5.51s/it] {'loss': 0.0085, 'grad_norm': 0.61124187707901, 'learning_rate': 2.130106022563777e-05, 'epoch': 4.95} 49%|████▉ | 4949/10000 [7:46:34<7:43:38, 5.51s/it][2025-06-19 21:16:18,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:16:18,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.34 | bwd_microstep: 3321.21 | bwd_inner_microstep: 3320.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 21:16:18,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.34 | bwd: 3321.22 | bwd_inner: 3320.41 | bwd_allreduce: 0.77 | step: 6.86 50%|████▉ | 4950/10000 [7:46:39<7:42:24, 5.49s/it] {'loss': 0.2269, 'grad_norm': 4.117610931396484, 'learning_rate': 2.1294596367488717e-05, 'epoch': 4.95} 50%|████▉ | 4950/10000 [7:46:39<7:42:24, 5.49s/it][2025-06-19 21:16:24,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:16:24,453] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2116.08 | bwd_microstep: 3328.79 | bwd_inner_microstep: 3327.85 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.19 [2025-06-19 21:16:24,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.08 | bwd: 3328.81 | bwd_inner: 3327.85 | bwd_allreduce: 0.90 | step: 7.19 50%|████▉ | 4951/10000 [7:46:45<7:42:07, 5.49s/it] {'loss': 0.0048, 'grad_norm': 0.36416032910346985, 'learning_rate': 2.128813237354252e-05, 'epoch': 4.95} 50%|████▉ | 4951/10000 [7:46:45<7:42:07, 5.49s/it][2025-06-19 21:16:29,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.73 [2025-06-19 21:16:29,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.52 | bwd_microstep: 3317.02 | bwd_inner_microstep: 3316.15 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.96 [2025-06-19 21:16:29,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.52 | bwd: 3317.04 | bwd_inner: 3316.15 | bwd_allreduce: 0.84 | step: 6.96 50%|████▉ | 4952/10000 [7:46:50<7:41:15, 5.48s/it] {'loss': 0.031, 'grad_norm': 2.2593626976013184, 'learning_rate': 2.128166824447721e-05, 'epoch': 4.95} 50%|████▉ | 4952/10000 [7:46:50<7:41:15, 5.48s/it][2025-06-19 21:16:35,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:16:35,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.15 | bwd_microstep: 3404.83 | bwd_inner_microstep: 3403.86 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.32 [2025-06-19 21:16:35,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.15 | bwd: 3404.84 | bwd_inner: 3403.86 | bwd_allreduce: 0.94 | step: 7.32 50%|████▉ | 4953/10000 [7:46:56<7:43:44, 5.51s/it] {'loss': 0.1131, 'grad_norm': 3.705263614654541, 'learning_rate': 2.1275203980970853e-05, 'epoch': 4.95} 50%|████▉ | 4953/10000 [7:46:56<7:43:44, 
5.51s/it][2025-06-19 21:16:40,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:16:40,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.55 | bwd_microstep: 3327.64 | bwd_inner_microstep: 3326.74 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.98 [2025-06-19 21:16:40,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.55 | bwd: 3327.66 | bwd_inner: 3326.74 | bwd_allreduce: 0.87 | step: 6.98 50%|████▉ | 4954/10000 [7:47:01<7:42:40, 5.50s/it] {'loss': 0.0536, 'grad_norm': 2.9871037006378174, 'learning_rate': 2.1268739583701522e-05, 'epoch': 4.95} 50%|████▉ | 4954/10000 [7:47:01<7:42:40, 5.50s/it][2025-06-19 21:16:46,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:16:46,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.03 | bwd_microstep: 3327.94 | bwd_inner_microstep: 3327.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-19 21:16:46,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.03 | bwd: 3327.95 | bwd_inner: 3327.13 | bwd_allreduce: 0.77 | step: 6.91 50%|████▉ | 4955/10000 [7:47:07<7:42:06, 5.50s/it] {'loss': 0.0141, 'grad_norm': 2.534343719482422, 'learning_rate': 2.1262275053347293e-05, 'epoch': 4.96} 50%|████▉ | 4955/10000 [7:47:07<7:42:06, 5.50s/it][2025-06-19 21:16:51,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:16:51,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.25 | bwd_microstep: 3367.37 | bwd_inner_microstep: 3366.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.88 [2025-06-19 21:16:51,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.25 | bwd: 3367.38 | bwd_inner: 3366.58 | 
bwd_allreduce: 0.76 | step: 6.89 50%|████▉ | 4956/10000 [7:47:12<7:43:07, 5.51s/it] {'loss': 0.0023, 'grad_norm': 0.1658981293439865, 'learning_rate': 2.125581039058627e-05, 'epoch': 4.96} 50%|████▉ | 4956/10000 [7:47:12<7:43:07, 5.51s/it][2025-06-19 21:16:57,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.72 [2025-06-19 21:16:57,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.62 | bwd_microstep: 3319.31 | bwd_inner_microstep: 3318.20 | bwd_allreduce_microstep: 1.05 | step_microstep: 8.52 [2025-06-19 21:16:57,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.62 | bwd: 3319.33 | bwd_inner: 3318.20 | bwd_allreduce: 1.08 | step: 8.53 50%|████▉ | 4957/10000 [7:47:18<7:41:56, 5.50s/it] {'loss': 0.0296, 'grad_norm': 3.7132108211517334, 'learning_rate': 2.1249345596096567e-05, 'epoch': 4.96} 50%|████▉ | 4957/10000 [7:47:18<7:41:56, 5.50s/it][2025-06-19 21:17:02,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:17:02,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.01 | bwd_microstep: 3324.63 | bwd_inner_microstep: 3323.62 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.62 [2025-06-19 21:17:02,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.01 | bwd: 3324.65 | bwd_inner: 3323.62 | bwd_allreduce: 0.97 | step: 7.62 50%|████▉ | 4958/10000 [7:47:23<7:41:39, 5.49s/it] {'loss': 0.0054, 'grad_norm': 0.7024248242378235, 'learning_rate': 2.1242880670556304e-05, 'epoch': 4.96} 50%|████▉ | 4958/10000 [7:47:23<7:41:39, 5.49s/it][2025-06-19 21:17:08,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:17:08,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.77 | bwd_microstep: 
3378.85 | bwd_inner_microstep: 3378.00 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.03 [2025-06-19 21:17:08,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.77 | bwd: 3378.87 | bwd_inner: 3378.00 | bwd_allreduce: 0.82 | step: 7.04 50%|████▉ | 4959/10000 [7:47:29<7:43:13, 5.51s/it] {'loss': 0.0469, 'grad_norm': 2.5198922157287598, 'learning_rate': 2.1236415614643634e-05, 'epoch': 4.96} 50%|████▉ | 4959/10000 [7:47:29<7:43:13, 5.51s/it][2025-06-19 21:17:14,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:17:14,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.82 | bwd_microstep: 3370.37 | bwd_inner_microstep: 3369.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 21:17:14,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.82 | bwd: 3370.38 | bwd_inner: 3369.57 | bwd_allreduce: 0.77 | step: 6.72 50%|████▉ | 4960/10000 [7:47:34<7:43:55, 5.52s/it] {'loss': 0.0504, 'grad_norm': 3.449234962463379, 'learning_rate': 2.12299504290367e-05, 'epoch': 4.96} 50%|████▉ | 4960/10000 [7:47:34<7:43:55, 5.52s/it][2025-06-19 21:17:19,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.77 [2025-06-19 21:17:19,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.33 | bwd_microstep: 3327.19 | bwd_inner_microstep: 3326.02 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.93 [2025-06-19 21:17:19,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.33 | bwd: 3327.21 | bwd_inner: 3326.02 | bwd_allreduce: 1.14 | step: 7.94 50%|████▉ | 4961/10000 [7:47:40<7:42:52, 5.51s/it] {'loss': 0.0247, 'grad_norm': 2.4148623943328857, 'learning_rate': 2.1223485114413673e-05, 'epoch': 4.96} 50%|████▉ | 4961/10000 [7:47:40<7:42:52, 5.51s/it][2025-06-19 21:17:25,017] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:17:25,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.54 | bwd_microstep: 3324.78 | bwd_inner_microstep: 3323.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 21:17:25,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.55 | bwd: 3324.79 | bwd_inner: 3323.97 | bwd_allreduce: 0.78 | step: 6.83 50%|████▉ | 4962/10000 [7:47:45<7:41:54, 5.50s/it] {'loss': 0.0678, 'grad_norm': 6.136023044586182, 'learning_rate': 2.1217019671452737e-05, 'epoch': 4.96} 50%|████▉ | 4962/10000 [7:47:45<7:41:54, 5.50s/it][2025-06-19 21:17:30,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.92 [2025-06-19 21:17:30,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.67 | bwd_microstep: 3333.95 | bwd_inner_microstep: 3332.80 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.46 [2025-06-19 21:17:30,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.67 | bwd: 3333.97 | bwd_inner: 3332.80 | bwd_allreduce: 1.11 | step: 8.47 50%|████▉ | 4963/10000 [7:47:51<7:41:37, 5.50s/it] {'loss': 0.0015, 'grad_norm': 0.2746371626853943, 'learning_rate': 2.1210554100832087e-05, 'epoch': 4.96} 50%|████▉ | 4963/10000 [7:47:51<7:41:37, 5.50s/it][2025-06-19 21:17:36,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:17:36,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.14 | bwd_microstep: 3339.25 | bwd_inner_microstep: 3338.34 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.03 [2025-06-19 21:17:36,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.14 | bwd: 3339.26 | bwd_inner: 3338.34 | bwd_allreduce: 0.88 | step: 7.04 50%|████▉ | 
4964/10000 [7:47:56<7:41:46, 5.50s/it] {'loss': 0.0272, 'grad_norm': 2.0856099128723145, 'learning_rate': 2.120408840322993e-05, 'epoch': 4.96} 50%|████▉ | 4964/10000 [7:47:56<7:41:46, 5.50s/it][2025-06-19 21:17:41,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:17:41,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.25 | bwd_microstep: 3337.00 | bwd_inner_microstep: 3336.03 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.69 [2025-06-19 21:17:41,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.25 | bwd: 3337.02 | bwd_inner: 3336.03 | bwd_allreduce: 0.94 | step: 7.70 50%|████▉ | 4965/10000 [7:48:02<7:41:31, 5.50s/it] {'loss': 0.0148, 'grad_norm': 1.5662790536880493, 'learning_rate': 2.1197622579324486e-05, 'epoch': 4.96} 50%|████▉ | 4965/10000 [7:48:02<7:41:31, 5.50s/it][2025-06-19 21:17:47,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:17:47,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.94 | bwd_microstep: 3382.40 | bwd_inner_microstep: 3381.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 21:17:47,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.94 | bwd: 3382.41 | bwd_inner: 3381.62 | bwd_allreduce: 0.75 | step: 6.67 50%|████▉ | 4966/10000 [7:48:07<7:42:50, 5.52s/it] {'loss': 0.0026, 'grad_norm': 0.29536640644073486, 'learning_rate': 2.1191156629793998e-05, 'epoch': 4.97} 50%|████▉ | 4966/10000 [7:48:07<7:42:50, 5.52s/it][2025-06-19 21:17:52,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:17:52,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.01 | bwd_microstep: 3330.67 | bwd_inner_microstep: 3329.89 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 21:17:52,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.01 | bwd: 3330.69 | bwd_inner: 3329.89 | bwd_allreduce: 0.76 | step: 6.67 50%|████▉ | 4967/10000 [7:48:13<7:41:51, 5.51s/it] {'loss': 0.0085, 'grad_norm': 1.1824125051498413, 'learning_rate': 2.11846905553167e-05, 'epoch': 4.97} 50%|████▉ | 4967/10000 [7:48:13<7:41:51, 5.51s/it][2025-06-19 21:17:58,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:17:58,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.06 | bwd_microstep: 3386.91 | bwd_inner_microstep: 3385.74 | bwd_allreduce_microstep: 1.11 | step_microstep: 7.17 [2025-06-19 21:17:58,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.06 | bwd: 3386.92 | bwd_inner: 3385.74 | bwd_allreduce: 1.13 | step: 7.17 50%|████▉ | 4968/10000 [7:48:18<7:43:09, 5.52s/it] {'loss': 0.0004, 'grad_norm': 0.026829177513718605, 'learning_rate': 2.1178224356570864e-05, 'epoch': 4.97} 50%|████▉ | 4968/10000 [7:48:18<7:43:09, 5.52s/it][2025-06-19 21:18:03,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:18:03,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2154.55 | bwd_microstep: 3387.42 | bwd_inner_microstep: 3386.53 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.90 [2025-06-19 21:18:03,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2154.55 | bwd: 3387.43 | bwd_inner: 3386.53 | bwd_allreduce: 0.86 | step: 6.90 50%|████▉ | 4969/10000 [7:48:24<7:44:35, 5.54s/it] {'loss': 0.0118, 'grad_norm': 1.2054522037506104, 'learning_rate': 2.117175803423476e-05, 'epoch': 4.97} 50%|████▉ | 4969/10000 [7:48:24<7:44:35, 5.54s/it][2025-06-19 21:18:09,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:18:09,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.37 | bwd_microstep: 3331.55 | bwd_inner_microstep: 3330.77 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 21:18:09,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.37 | bwd: 3331.56 | bwd_inner: 3330.77 | bwd_allreduce: 0.76 | step: 6.56 50%|████▉ | 4970/10000 [7:48:29<7:43:11, 5.53s/it] {'loss': 0.0043, 'grad_norm': 0.3755446970462799, 'learning_rate': 2.1165291588986675e-05, 'epoch': 4.97} 50%|████▉ | 4970/10000 [7:48:29<7:43:11, 5.53s/it][2025-06-19 21:18:14,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:18:14,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.90 | bwd_microstep: 3331.02 | bwd_inner_microstep: 3330.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 21:18:14,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.90 | bwd: 3331.04 | bwd_inner: 3330.21 | bwd_allreduce: 0.78 | step: 6.99 50%|████▉ | 4971/10000 [7:48:35<7:42:03, 5.51s/it] {'loss': 0.0006, 'grad_norm': 0.05103178694844246, 'learning_rate': 2.115882502150492e-05, 'epoch': 4.97} 50%|████▉ | 4971/10000 [7:48:35<7:42:03, 5.51s/it][2025-06-19 21:18:20,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:18:20,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.05 | bwd_microstep: 3335.26 | bwd_inner_microstep: 3334.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 21:18:20,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.05 | bwd: 3335.27 | bwd_inner: 3334.46 | bwd_allreduce: 0.77 | step: 6.81 50%|████▉ | 4972/10000 [7:48:40<7:41:12, 5.50s/it] {'loss': 
0.0953, 'grad_norm': 4.300595760345459, 'learning_rate': 2.115235833246779e-05, 'epoch': 4.97} 50%|████▉ | 4972/10000 [7:48:40<7:41:12, 5.50s/it][2025-06-19 21:18:25,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:18:25,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.53 | bwd_microstep: 3343.84 | bwd_inner_microstep: 3342.87 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.06 [2025-06-19 21:18:25,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.53 | bwd: 3343.85 | bwd_inner: 3342.87 | bwd_allreduce: 0.94 | step: 7.07 50%|████▉ | 4973/10000 [7:48:46<7:40:46, 5.50s/it] {'loss': 0.0018, 'grad_norm': 0.1079876571893692, 'learning_rate': 2.1145891522553618e-05, 'epoch': 4.97} 50%|████▉ | 4973/10000 [7:48:46<7:40:46, 5.50s/it][2025-06-19 21:18:31,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:18:31,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.88 | bwd_microstep: 3383.59 | bwd_inner_microstep: 3382.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 21:18:31,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.89 | bwd: 3383.60 | bwd_inner: 3382.78 | bwd_allreduce: 0.78 | step: 6.77 50%|████▉ | 4974/10000 [7:48:52<7:42:16, 5.52s/it] {'loss': 0.089, 'grad_norm': 3.3830015659332275, 'learning_rate': 2.113942459244075e-05, 'epoch': 4.97} 50%|████▉ | 4974/10000 [7:48:52<7:42:16, 5.52s/it][2025-06-19 21:18:36,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:18:36,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.87 | bwd_microstep: 3387.57 | bwd_inner_microstep: 3386.57 | bwd_allreduce_microstep: 0.95 | step_microstep: 6.73 
[2025-06-19 21:18:36,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.87 | bwd: 3387.58 | bwd_inner: 3386.57 | bwd_allreduce: 0.97 | step: 6.74 50%|████▉ | 4975/10000 [7:48:57<7:43:27, 5.53s/it] {'loss': 0.0093, 'grad_norm': 0.8854764699935913, 'learning_rate': 2.1132957542807527e-05, 'epoch': 4.97} 50%|████▉ | 4975/10000 [7:48:57<7:43:27, 5.53s/it][2025-06-19 21:18:42,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:18:42,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.68 | bwd_microstep: 3320.05 | bwd_inner_microstep: 3319.15 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.70 [2025-06-19 21:18:42,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.68 | bwd: 3320.06 | bwd_inner: 3319.15 | bwd_allreduce: 0.87 | step: 6.72 50%|████▉ | 4976/10000 [7:49:03<7:41:46, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.01045356597751379, 'learning_rate': 2.112649037433232e-05, 'epoch': 4.98} 50%|████▉ | 4976/10000 [7:49:03<7:41:46, 5.51s/it][2025-06-19 21:18:47,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.76 [2025-06-19 21:18:47,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.44 | bwd_microstep: 3335.94 | bwd_inner_microstep: 3335.11 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.49 [2025-06-19 21:18:47,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.44 | bwd: 3335.96 | bwd_inner: 3335.11 | bwd_allreduce: 0.80 | step: 7.50 50%|████▉ | 4977/10000 [7:49:08<7:41:09, 5.51s/it] {'loss': 0.0301, 'grad_norm': 3.8696460723876953, 'learning_rate': 2.1120023087693498e-05, 'epoch': 4.98} 50%|████▉ | 4977/10000 [7:49:08<7:41:09, 5.51s/it][2025-06-19 21:18:53,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.73 [2025-06-19 21:18:53,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.29 | bwd_microstep: 3323.81 | bwd_inner_microstep: 3323.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.59 [2025-06-19 21:18:53,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.29 | bwd: 3323.82 | bwd_inner: 3323.00 | bwd_allreduce: 0.77 | step: 6.60 50%|████▉ | 4978/10000 [7:49:14<7:40:15, 5.50s/it] {'loss': 0.0035, 'grad_norm': 0.3836787939071655, 'learning_rate': 2.111355568356945e-05, 'epoch': 4.98} 50%|████▉ | 4978/10000 [7:49:14<7:40:15, 5.50s/it][2025-06-19 21:18:58,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:18:58,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.20 | bwd_microstep: 3332.38 | bwd_inner_microstep: 3331.52 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.06 [2025-06-19 21:18:58,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.20 | bwd: 3332.40 | bwd_inner: 3331.52 | bwd_allreduce: 0.82 | step: 7.06 50%|████▉ | 4979/10000 [7:49:19<7:39:40, 5.49s/it] {'loss': 0.1044, 'grad_norm': 6.05714225769043, 'learning_rate': 2.110708816263858e-05, 'epoch': 4.98} 50%|████▉ | 4979/10000 [7:49:19<7:39:40, 5.49s/it][2025-06-19 21:19:04,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:19:04,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.29 | bwd_microstep: 3383.24 | bwd_inner_microstep: 3382.37 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.98 [2025-06-19 21:19:04,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.29 | bwd: 3383.26 | bwd_inner: 3382.37 | bwd_allreduce: 0.83 | step: 6.98 50%|████▉ | 4980/10000 [7:49:25<7:41:32, 5.52s/it] {'loss': 0.0025, 'grad_norm': 0.19954484701156616, 'learning_rate': 
2.11006205255793e-05, 'epoch': 4.98} 50%|████▉ | 4980/10000 [7:49:25<7:41:32, 5.52s/it][2025-06-19 21:19:09,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:19:09,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.15 | bwd_microstep: 3383.40 | bwd_inner_microstep: 3382.45 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.18 [2025-06-19 21:19:09,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.15 | bwd: 3383.42 | bwd_inner: 3382.45 | bwd_allreduce: 0.92 | step: 7.18 50%|████▉ | 4981/10000 [7:49:30<7:42:29, 5.53s/it] {'loss': 0.0186, 'grad_norm': 1.6882191896438599, 'learning_rate': 2.1094152773070034e-05, 'epoch': 4.98} 50%|████▉ | 4981/10000 [7:49:30<7:42:29, 5.53s/it][2025-06-19 21:19:15,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:19:15,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.63 | bwd_microstep: 3325.02 | bwd_inner_microstep: 3324.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.61 [2025-06-19 21:19:15,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.63 | bwd: 3325.03 | bwd_inner: 3324.22 | bwd_allreduce: 0.77 | step: 6.62 50%|████▉ | 4982/10000 [7:49:36<7:41:09, 5.51s/it] {'loss': 0.0568, 'grad_norm': 2.2727606296539307, 'learning_rate': 2.108768490578922e-05, 'epoch': 4.98} 50%|████▉ | 4982/10000 [7:49:36<7:41:09, 5.51s/it][2025-06-19 21:19:20,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 21:19:20,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.67 | bwd_microstep: 3381.35 | bwd_inner_microstep: 3380.33 | bwd_allreduce_microstep: 0.96 | step_microstep: 8.28 [2025-06-19 21:19:20,858] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2135.68 | bwd: 3381.37 | bwd_inner: 3380.33 | bwd_allreduce: 0.99 | step: 8.29 50%|████▉ | 4983/10000 [7:49:41<7:42:12, 5.53s/it] {'loss': 0.2341, 'grad_norm': 5.2054033279418945, 'learning_rate': 2.108121692441531e-05, 'epoch': 4.98} 50%|████▉ | 4983/10000 [7:49:41<7:42:12, 5.53s/it][2025-06-19 21:19:26,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:19:26,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.02 | bwd_microstep: 3335.81 | bwd_inner_microstep: 3335.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 21:19:26,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.02 | bwd: 3335.82 | bwd_inner: 3335.01 | bwd_allreduce: 0.77 | step: 6.83 50%|████▉ | 4984/10000 [7:49:47<7:41:24, 5.52s/it] {'loss': 0.0257, 'grad_norm': 2.7510013580322266, 'learning_rate': 2.1074748829626767e-05, 'epoch': 4.98} 50%|████▉ | 4984/10000 [7:49:47<7:41:24, 5.52s/it][2025-06-19 21:19:31,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-19 21:19:31,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.53 | bwd_microstep: 3333.07 | bwd_inner_microstep: 3332.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 21:19:31,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.53 | bwd: 3333.09 | bwd_inner: 3332.27 | bwd_allreduce: 0.77 | step: 6.88 50%|████▉ | 4985/10000 [7:49:52<7:40:20, 5.51s/it] {'loss': 0.0004, 'grad_norm': 0.05105600506067276, 'learning_rate': 2.1068280622102052e-05, 'epoch': 4.99} 50%|████▉ | 4985/10000 [7:49:52<7:40:20, 5.51s/it][2025-06-19 21:19:37,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:19:37,432] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.89 | bwd_microstep: 3403.39 | bwd_inner_microstep: 3402.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 21:19:37,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.89 | bwd: 3403.41 | bwd_inner: 3402.60 | bwd_allreduce: 0.76 | step: 6.68 50%|████▉ | 4986/10000 [7:49:58<7:42:17, 5.53s/it] {'loss': 0.017, 'grad_norm': 1.0425430536270142, 'learning_rate': 2.106181230251966e-05, 'epoch': 4.99} 50%|████▉ | 4986/10000 [7:49:58<7:42:17, 5.53s/it][2025-06-19 21:19:42,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:19:42,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.79 | bwd_microstep: 3333.91 | bwd_inner_microstep: 3332.75 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.22 [2025-06-19 21:19:42,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.79 | bwd: 3333.93 | bwd_inner: 3332.75 | bwd_allreduce: 1.12 | step: 7.23 50%|████▉ | 4987/10000 [7:50:03<7:41:06, 5.52s/it] {'loss': 0.0177, 'grad_norm': 1.6237704753875732, 'learning_rate': 2.1055343871558084e-05, 'epoch': 4.99} 50%|████▉ | 4987/10000 [7:50:03<7:41:06, 5.52s/it][2025-06-19 21:19:48,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:19:48,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.79 | bwd_microstep: 3372.49 | bwd_inner_microstep: 3371.68 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-19 21:19:48,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.79 | bwd: 3372.50 | bwd_inner: 3371.68 | bwd_allreduce: 0.78 | step: 6.90 50%|████▉ | 4988/10000 [7:50:09<7:41:34, 5.53s/it] {'loss': 0.0028, 'grad_norm': 0.372662216424942, 'learning_rate': 2.1048875329895836e-05, 'epoch': 4.99} 50%|████▉ | 
4988/10000 [7:50:09<7:41:34, 5.53s/it][2025-06-19 21:19:53,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:19:53,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.07 | bwd_microstep: 3318.72 | bwd_inner_microstep: 3317.90 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.85 [2025-06-19 21:19:53,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.07 | bwd: 3318.74 | bwd_inner: 3317.90 | bwd_allreduce: 0.79 | step: 6.85 50%|████▉ | 4989/10000 [7:50:14<7:40:09, 5.51s/it] {'loss': 0.0019, 'grad_norm': 0.1893145591020584, 'learning_rate': 2.1042406678211437e-05, 'epoch': 4.99} 50%|████▉ | 4989/10000 [7:50:14<7:40:09, 5.51s/it][2025-06-19 21:19:59,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 21:19:59,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.28 | bwd_microstep: 3330.04 | bwd_inner_microstep: 3328.93 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.54 [2025-06-19 21:19:59,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.28 | bwd: 3330.06 | bwd_inner: 3328.93 | bwd_allreduce: 1.07 | step: 7.54 50%|████▉ | 4990/10000 [7:50:20<7:39:31, 5.50s/it] {'loss': 0.1217, 'grad_norm': 2.791330575942993, 'learning_rate': 2.103593791718341e-05, 'epoch': 4.99} 50%|████▉ | 4990/10000 [7:50:20<7:39:31, 5.50s/it][2025-06-19 21:20:04,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:20:04,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.98 | bwd_microstep: 3322.87 | bwd_inner_microstep: 3322.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 21:20:04,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.98 | bwd: 3322.89 
| bwd_inner: 3322.09 | bwd_allreduce: 0.75 | step: 6.56 50%|████▉ | 4991/10000 [7:50:25<7:38:34, 5.49s/it] {'loss': 0.0966, 'grad_norm': 3.438905715942383, 'learning_rate': 2.1029469047490303e-05, 'epoch': 4.99} 50%|████▉ | 4991/10000 [7:50:25<7:38:34, 5.49s/it][2025-06-19 21:20:10,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:20:10,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.66 | bwd_microstep: 3374.74 | bwd_inner_microstep: 3373.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 21:20:10,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.66 | bwd: 3374.75 | bwd_inner: 3373.95 | bwd_allreduce: 0.77 | step: 7.07 50%|████▉ | 4992/10000 [7:50:31<7:39:47, 5.51s/it] {'loss': 0.0896, 'grad_norm': 2.67653489112854, 'learning_rate': 2.102300006981068e-05, 'epoch': 4.99} 50%|████▉ | 4992/10000 [7:50:31<7:39:47, 5.51s/it][2025-06-19 21:20:15,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:20:15,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.28 | bwd_microstep: 3327.24 | bwd_inner_microstep: 3326.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 21:20:15,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.28 | bwd: 3327.26 | bwd_inner: 3326.47 | bwd_allreduce: 0.75 | step: 6.57 50%|████▉ | 4993/10000 [7:50:36<7:38:41, 5.50s/it] {'loss': 0.2038, 'grad_norm': 5.1436076164245605, 'learning_rate': 2.101653098482309e-05, 'epoch': 4.99} 50%|████▉ | 4993/10000 [7:50:36<7:38:41, 5.50s/it][2025-06-19 21:20:21,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:20:21,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2103.14 | bwd_microstep: 3322.01 | bwd_inner_microstep: 3321.18 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.88 [2025-06-19 21:20:21,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.14 | bwd: 3322.03 | bwd_inner: 3321.18 | bwd_allreduce: 0.80 | step: 6.88 50%|████▉ | 4994/10000 [7:50:42<7:37:46, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.02486005239188671, 'learning_rate': 2.1010061793206118e-05, 'epoch': 4.99} 50%|████▉ | 4994/10000 [7:50:42<7:37:46, 5.49s/it][2025-06-19 21:20:26,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:20:26,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.21 | bwd_microstep: 3327.04 | bwd_inner_microstep: 3326.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 21:20:26,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.21 | bwd: 3327.05 | bwd_inner: 3326.26 | bwd_allreduce: 0.75 | step: 6.57 50%|████▉ | 4995/10000 [7:50:47<7:37:24, 5.48s/it] {'loss': 0.0287, 'grad_norm': 1.9687923192977905, 'learning_rate': 2.1003592495638352e-05, 'epoch': 5.0} 50%|████▉ | 4995/10000 [7:50:47<7:37:24, 5.48s/it][2025-06-19 21:20:32,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:20:32,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.68 | bwd_microstep: 3321.98 | bwd_inner_microstep: 3321.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.92 [2025-06-19 21:20:32,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.68 | bwd: 3321.99 | bwd_inner: 3321.19 | bwd_allreduce: 0.76 | step: 6.92 50%|████▉ | 4996/10000 [7:50:53<7:36:53, 5.48s/it] {'loss': 0.0123, 'grad_norm': 1.0709799528121948, 'learning_rate': 2.0997123092798387e-05, 'epoch': 5.0} 50%|████▉ | 4996/10000 [7:50:53<7:36:53, 5.48s/it][2025-06-19 21:20:37,843] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 21:20:37,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.95 | bwd_microstep: 3371.82 | bwd_inner_microstep: 3371.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 21:20:37,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.95 | bwd: 3371.84 | bwd_inner: 3371.03 | bwd_allreduce: 0.76 | step: 6.66 50%|████▉ | 4997/10000 [7:50:58<7:38:09, 5.49s/it] {'loss': 0.0084, 'grad_norm': 0.7569308280944824, 'learning_rate': 2.0990653585364843e-05, 'epoch': 5.0} 50%|████▉ | 4997/10000 [7:50:58<7:38:09, 5.49s/it][2025-06-19 21:20:43,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:20:43,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.02 | bwd_microstep: 3379.62 | bwd_inner_microstep: 3378.71 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.13 [2025-06-19 21:20:43,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.02 | bwd: 3379.63 | bwd_inner: 3378.71 | bwd_allreduce: 0.89 | step: 7.13 50%|████▉ | 4998/10000 [7:51:04<7:39:42, 5.51s/it] {'loss': 0.0107, 'grad_norm': 0.850896418094635, 'learning_rate': 2.0984183974016334e-05, 'epoch': 5.0} 50%|████▉ | 4998/10000 [7:51:04<7:39:42, 5.51s/it][2025-06-19 21:20:48,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:20:48,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.07 | bwd_microstep: 3325.47 | bwd_inner_microstep: 3324.64 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.70 [2025-06-19 21:20:48,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.07 | bwd: 3325.49 | bwd_inner: 3324.64 | bwd_allreduce: 0.80 | step: 6.70 50%|████▉ 
| 4999/10000 [7:51:09<7:38:50, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01462555956095457, 'learning_rate': 2.0977714259431496e-05, 'epoch': 5.0} 50%|████▉ | 4999/10000 [7:51:09<7:38:50, 5.50s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2025-06-19 21:20:56,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 21:20:56,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2092.62 | bwd_microstep: 3305.60 | bwd_inner_microstep: 3304.64 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.01 [2025-06-19 21:20:56,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2092.62 | bwd: 3305.61 | bwd_inner: 3304.64 | bwd_allreduce: 0.93 | step: 7.01 50%|█████ | 5000/10000 [7:51:17<8:29:02, 6.11s/it] {'loss': 0.0035, 'grad_norm': 0.581424355506897, 'learning_rate': 2.097124444228897e-05, 'epoch': 5.0} 50%|█████ | 5000/10000 [7:51:17<8:29:02, 6.11s/it]evaluate!
[INFO|trainer.py:3910] 2025-06-19 21:21:06,193 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 21:21:06,198 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 21:21:06,199 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 21:22:01,211 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 21:22:01,214 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 21:22:01,214 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 21:22:01,214 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json evaluate! [INFO|trainer.py:3910] 2025-06-19 21:22:18,224 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 21:22:18,229 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 21:22:18,229 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 21:23:19,321 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. 
[INFO|tokenization_utils_base.py:2491] 2025-06-19 21:23:19,324 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 21:23:19,324 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 21:23:19,324 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json [2025-06-19 21:23:23,994] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 21:23:29,839] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 21:23:35,680] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 21:23:41,591] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers 
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 21:24:00,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:24:00,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.52 | bwd_microstep: 3352.21 | bwd_inner_microstep: 3351.39 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.43 [2025-06-19 21:24:00,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.47 | bwd: 3352.22 | bwd_inner: 3351.39 | bwd_allreduce: 0.78 | step: 7.43 50%|█████ | 5001/10000 [7:54:21<82:31:18, 59.43s/it] {'loss': 0.1956, 'grad_norm': 2.61954927444458, 'learning_rate': 2.0964774523267405e-05, 'epoch': 5.0} 50%|█████ | 5001/10000 [7:54:21<82:31:18, 59.43s/it][2025-06-19 21:24:05,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 21:24:05,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.05 | bwd_microstep: 3318.46 | bwd_inner_microstep: 3317.42 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.74 [2025-06-19 21:24:05,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.05 | bwd: 3318.49 | bwd_inner: 3317.42 | bwd_allreduce: 1.00 | step: 7.75 50%|█████ | 5002/10000 [7:54:26<60:02:10, 43.24s/it] {'loss': 0.0249, 'grad_norm': 2.9729795455932617, 'learning_rate': 2.0958304503045474e-05, 'epoch': 5.0} 50%|█████ | 5002/10000 [7:54:26<60:02:10, 43.24s/it][2025-06-19 21:24:11,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:24:11,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2093.09 | bwd_microstep: 3273.92 | bwd_inner_microstep: 3273.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-19 21:24:11,148] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.09 | bwd: 3273.94 | bwd_inner: 3273.13 | bwd_allreduce: 0.76 | step: 7.16 50%|█████ | 5003/10000 [7:54:31<44:16:38, 31.90s/it] {'loss': 0.0769, 'grad_norm': 2.2853994369506836, 'learning_rate': 2.0951834382301847e-05, 'epoch': 5.0} 50%|█████ | 5003/10000 [7:54:31<44:16:38, 31.90s/it][2025-06-19 21:24:16,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:24:16,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.84 | bwd_microstep: 3340.84 | bwd_inner_microstep: 3340.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 21:24:16,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.84 | bwd: 3340.86 | bwd_inner: 3340.06 | bwd_allreduce: 0.76 | step: 6.68 50%|█████ | 5004/10000 [7:54:37<33:16:30, 23.98s/it] {'loss': 0.0072, 'grad_norm': 0.3915591239929199, 'learning_rate': 2.094536416171521e-05, 'epoch': 5.0} 50%|█████ | 5004/10000 [7:54:37<33:16:30, 23.98s/it][2025-06-19 21:24:22,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:24:22,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2092.23 | bwd_microstep: 3295.08 | bwd_inner_microstep: 3294.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.21 [2025-06-19 21:24:22,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2092.23 | bwd: 3295.09 | bwd_inner: 3294.28 | bwd_allreduce: 0.77 | step: 7.21 50%|█████ | 5005/10000 [7:54:42<25:32:47, 18.41s/it] {'loss': 0.0053, 'grad_norm': 0.5758377909660339, 'learning_rate': 2.093889384196426e-05, 'epoch': 5.0} 50%|█████ | 5005/10000 [7:54:42<25:32:47, 18.41s/it][2025-06-19 21:24:27,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
21:24:27,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.42 | bwd_microstep: 3345.98 | bwd_inner_microstep: 3345.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 21:24:27,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.42 | bwd: 3346.00 | bwd_inner: 3345.19 | bwd_allreduce: 0.76 | step: 6.87 50%|█████ | 5006/10000 [7:54:48<20:10:06, 14.54s/it] {'loss': 0.025, 'grad_norm': 4.052544116973877, 'learning_rate': 2.0932423423727703e-05, 'epoch': 5.01} 50%|█████ | 5006/10000 [7:54:48<20:10:06, 14.54s/it][2025-06-19 21:24:33,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:24:33,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.92 | bwd_microstep: 3344.22 | bwd_inner_microstep: 3343.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 21:24:33,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.92 | bwd: 3344.23 | bwd_inner: 3343.42 | bwd_allreduce: 0.77 | step: 7.02 50%|█████ | 5007/10000 [7:54:53<16:24:16, 11.83s/it] {'loss': 0.0053, 'grad_norm': 0.3691086769104004, 'learning_rate': 2.092595290768426e-05, 'epoch': 5.01} 50%|█████ | 5007/10000 [7:54:53<16:24:16, 11.83s/it][2025-06-19 21:24:38,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:24:38,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.55 | bwd_microstep: 3352.56 | bwd_inner_microstep: 3351.66 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.79 [2025-06-19 21:24:38,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.55 | bwd: 3352.59 | bwd_inner: 3351.66 | bwd_allreduce: 0.85 | step: 7.79 50%|█████ | 5008/10000 [7:54:59<13:46:38, 9.94s/it] {'loss': 0.0009, 'grad_norm': 0.12291533499956131, 'learning_rate': 2.091948229451265e-05, 
'epoch': 5.01} 50%|█████ | 5008/10000 [7:54:59<13:46:38, 9.94s/it][2025-06-19 21:24:44,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:24:44,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2092.75 | bwd_microstep: 3280.94 | bwd_inner_microstep: 3280.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 21:24:44,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2092.75 | bwd: 3280.95 | bwd_inner: 3280.14 | bwd_allreduce: 0.77 | step: 7.00 50%|█████ | 5009/10000 [7:55:04<11:53:36, 8.58s/it] {'loss': 0.0013, 'grad_norm': 0.20189687609672546, 'learning_rate': 2.0913011584891623e-05, 'epoch': 5.01} 50%|█████ | 5009/10000 [7:55:04<11:53:36, 8.58s/it][2025-06-19 21:24:49,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:24:49,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2083.60 | bwd_microstep: 3291.50 | bwd_inner_microstep: 3290.62 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.43 [2025-06-19 21:24:49,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2083.60 | bwd: 3291.53 | bwd_inner: 3290.62 | bwd_allreduce: 0.84 | step: 7.43 50%|█████ | 5010/10000 [7:55:10<10:34:35, 7.63s/it] {'loss': 0.1012, 'grad_norm': 4.322022914886475, 'learning_rate': 2.090654077949991e-05, 'epoch': 5.01} 50%|█████ | 5010/10000 [7:55:10<10:34:35, 7.63s/it][2025-06-19 21:24:54,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:24:54,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.41 | bwd_microstep: 3346.52 | bwd_inner_microstep: 3345.62 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.66 [2025-06-19 21:24:54,943] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2122.41 | bwd: 3346.55 | bwd_inner: 3345.62 | bwd_allreduce: 0.85 | step: 7.66 50%|█████ | 5011/10000 [7:55:15<9:41:54, 7.00s/it] {'loss': 0.0166, 'grad_norm': 1.4371998310089111, 'learning_rate': 2.090006987901628e-05, 'epoch': 5.01} 50%|█████ | 5011/10000 [7:55:15<9:41:54, 7.00s/it][2025-06-19 21:25:00,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:25:00,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2086.66 | bwd_microstep: 3287.66 | bwd_inner_microstep: 3286.73 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.40 [2025-06-19 21:25:00,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2086.66 | bwd: 3287.68 | bwd_inner: 3286.73 | bwd_allreduce: 0.90 | step: 7.40 50%|█████ | 5012/10000 [7:55:21<9:02:19, 6.52s/it] {'loss': 0.0088, 'grad_norm': 1.09418785572052, 'learning_rate': 2.08935988841195e-05, 'epoch': 5.01} 50%|█████ | 5012/10000 [7:55:21<9:02:19, 6.52s/it][2025-06-19 21:25:05,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:25:05,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2087.67 | bwd_microstep: 3295.56 | bwd_inner_microstep: 3294.74 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-19 21:25:05,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2087.67 | bwd: 3295.57 | bwd_inner: 3294.74 | bwd_allreduce: 0.79 | step: 7.31 50%|█████ | 5013/10000 [7:55:26<8:34:55, 6.20s/it] {'loss': 0.0016, 'grad_norm': 0.11967093497514725, 'learning_rate': 2.088712779548834e-05, 'epoch': 5.01} 50%|█████ | 5013/10000 [7:55:26<8:34:55, 6.20s/it][2025-06-19 21:25:11,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-19 21:25:11,297] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2118.40 | bwd_microstep: 3346.16 | bwd_inner_microstep: 3345.35 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.74 [2025-06-19 21:25:11,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.40 | bwd: 3346.18 | bwd_inner: 3345.35 | bwd_allreduce: 0.79 | step: 7.75 50%|█████ | 5014/10000 [7:55:32<8:17:42, 5.99s/it] {'loss': 0.0017, 'grad_norm': 0.0873064398765564, 'learning_rate': 2.0880656613801583e-05, 'epoch': 5.01} 50%|█████ | 5014/10000 [7:55:32<8:17:42, 5.99s/it][2025-06-19 21:25:16,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:25:16,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2090.63 | bwd_microstep: 3298.72 | bwd_inner_microstep: 3297.87 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.06 [2025-06-19 21:25:16,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.63 | bwd: 3298.73 | bwd_inner: 3297.87 | bwd_allreduce: 0.82 | step: 7.06 50%|█████ | 5015/10000 [7:55:37<8:03:40, 5.82s/it] {'loss': 0.0655, 'grad_norm': 3.860530376434326, 'learning_rate': 2.087418533973805e-05, 'epoch': 5.01} 50%|█████ | 5015/10000 [7:55:37<8:03:40, 5.82s/it][2025-06-19 21:25:22,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:25:22,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.19 | bwd_microstep: 3345.39 | bwd_inner_microstep: 3344.42 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.52 [2025-06-19 21:25:22,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.19 | bwd: 3345.40 | bwd_inner: 3344.42 | bwd_allreduce: 0.94 | step: 7.52 50%|█████ | 5016/10000 [7:55:43<7:55:49, 5.73s/it] {'loss': 0.0629, 'grad_norm': 3.229691982269287, 'learning_rate': 2.086771397397652e-05, 'epoch': 5.02} 50%|█████ | 5016/10000 [7:55:43<7:55:49, 
5.73s/it][2025-06-19 21:25:27,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:25:27,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.01 | bwd_microstep: 3340.05 | bwd_inner_microstep: 3339.12 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.09 [2025-06-19 21:25:27,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.01 | bwd: 3340.07 | bwd_inner: 3339.12 | bwd_allreduce: 0.90 | step: 7.09 50%|█████ | 5017/10000 [7:55:48<7:50:23, 5.66s/it] {'loss': 0.0008, 'grad_norm': 0.05330978333950043, 'learning_rate': 2.086124251719583e-05, 'epoch': 5.02} 50%|█████ | 5017/10000 [7:55:48<7:50:23, 5.66s/it][2025-06-19 21:25:33,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.72 | optimizer_step: 2.73 [2025-06-19 21:25:33,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2090.52 | bwd_microstep: 3301.99 | bwd_inner_microstep: 3301.07 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.39 [2025-06-19 21:25:33,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.52 | bwd: 3302.01 | bwd_inner: 3301.07 | bwd_allreduce: 0.89 | step: 7.39 50%|█████ | 5018/10000 [7:55:53<7:44:31, 5.59s/it] {'loss': 0.0092, 'grad_norm': 0.6945359110832214, 'learning_rate': 2.0854770970074797e-05, 'epoch': 5.02} 50%|█████ | 5018/10000 [7:55:53<7:44:31, 5.59s/it][2025-06-19 21:25:38,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.72 [2025-06-19 21:25:38,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.13 | bwd_microstep: 3312.20 | bwd_inner_microstep: 3311.36 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.75 [2025-06-19 21:25:38,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.13 | bwd: 3312.22 | bwd_inner: 3311.36 | 
bwd_allreduce: 0.80 | step: 7.76 50%|█████ | 5019/10000 [7:55:59<7:41:09, 5.55s/it] {'loss': 0.0865, 'grad_norm': 6.531711578369141, 'learning_rate': 2.0848299333292247e-05, 'epoch': 5.02} 50%|█████ | 5019/10000 [7:55:59<7:41:09, 5.55s/it][2025-06-19 21:25:44,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:25:44,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.32 | bwd_microstep: 3370.23 | bwd_inner_microstep: 3369.40 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.19 [2025-06-19 21:25:44,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.32 | bwd: 3370.24 | bwd_inner: 3369.40 | bwd_allreduce: 0.80 | step: 7.19 50%|█████ | 5020/10000 [7:56:04<7:40:29, 5.55s/it] {'loss': 0.0101, 'grad_norm': 1.4088444709777832, 'learning_rate': 2.0841827607527044e-05, 'epoch': 5.02} 50%|█████ | 5020/10000 [7:56:04<7:40:29, 5.55s/it][2025-06-19 21:25:49,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:25:49,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.50 | bwd_microstep: 3317.41 | bwd_inner_microstep: 3316.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 21:25:49,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.50 | bwd: 3317.42 | bwd_inner: 3316.62 | bwd_allreduce: 0.76 | step: 6.77 50%|█████ | 5021/10000 [7:56:10<7:38:10, 5.52s/it] {'loss': 0.0034, 'grad_norm': 0.3005213737487793, 'learning_rate': 2.083535579345803e-05, 'epoch': 5.02} 50%|█████ | 5021/10000 [7:56:10<7:38:10, 5.52s/it][2025-06-19 21:25:55,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:25:55,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.17 | bwd_microstep: 
3369.78 | bwd_inner_microstep: 3368.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 21:25:55,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.17 | bwd: 3369.79 | bwd_inner: 3368.98 | bwd_allreduce: 0.76 | step: 6.99 50%|█████ | 5022/10000 [7:56:15<7:38:36, 5.53s/it] {'loss': 0.0004, 'grad_norm': 0.03621959313750267, 'learning_rate': 2.0828883891764066e-05, 'epoch': 5.02} 50%|█████ | 5022/10000 [7:56:15<7:38:36, 5.53s/it][2025-06-19 21:26:00,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:26:00,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.87 | bwd_microstep: 3326.62 | bwd_inner_microstep: 3325.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 21:26:00,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.87 | bwd: 3326.63 | bwd_inner: 3325.82 | bwd_allreduce: 0.77 | step: 6.76 50%|█████ | 5023/10000 [7:56:21<7:37:05, 5.51s/it] {'loss': 0.0128, 'grad_norm': 1.4339208602905273, 'learning_rate': 2.082241190312403e-05, 'epoch': 5.02} 50%|█████ | 5023/10000 [7:56:21<7:37:05, 5.51s/it][2025-06-19 21:26:06,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:26:06,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.90 | bwd_microstep: 3325.92 | bwd_inner_microstep: 3325.00 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.53 [2025-06-19 21:26:06,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.90 | bwd: 3325.94 | bwd_inner: 3325.00 | bwd_allreduce: 0.89 | step: 7.54 50%|█████ | 5024/10000 [7:56:26<7:36:19, 5.50s/it] {'loss': 0.0541, 'grad_norm': 2.5913116931915283, 'learning_rate': 2.08159398282168e-05, 'epoch': 5.02} 50%|█████ | 5024/10000 [7:56:26<7:36:19, 5.50s/it][2025-06-19 21:26:11,688] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:26:11,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.70 | bwd_microstep: 3375.12 | bwd_inner_microstep: 3374.17 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.17 [2025-06-19 21:26:11,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.70 | bwd: 3375.13 | bwd_inner: 3374.17 | bwd_allreduce: 0.92 | step: 7.18 50%|█████ | 5025/10000 [7:56:32<7:37:26, 5.52s/it] {'loss': 0.0019, 'grad_norm': 0.07183630764484406, 'learning_rate': 2.0809467667721277e-05, 'epoch': 5.03} 50%|█████ | 5025/10000 [7:56:32<7:37:26, 5.52s/it][2025-06-19 21:26:17,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:26:17,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.01 | bwd_microstep: 3401.15 | bwd_inner_microstep: 3400.12 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.10 [2025-06-19 21:26:17,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.01 | bwd: 3401.18 | bwd_inner: 3400.12 | bwd_allreduce: 1.00 | step: 7.11 50%|█████ | 5026/10000 [7:56:38<7:39:02, 5.54s/it] {'loss': 0.0016, 'grad_norm': 0.11379577219486237, 'learning_rate': 2.0802995422316347e-05, 'epoch': 5.03} 50%|█████ | 5026/10000 [7:56:38<7:39:02, 5.54s/it][2025-06-19 21:26:22,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:26:22,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.31 | bwd_microstep: 3371.65 | bwd_inner_microstep: 3370.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 21:26:22,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.31 | bwd: 3371.67 | bwd_inner: 3370.87 | bwd_allreduce: 0.76 | step: 6.63 50%|█████ | 
5027/10000 [7:56:43<7:39:03, 5.54s/it] {'loss': 0.0025, 'grad_norm': 0.2101111114025116, 'learning_rate': 2.0796523092680928e-05, 'epoch': 5.03} 50%|█████ | 5027/10000 [7:56:43<7:39:03, 5.54s/it][2025-06-19 21:26:28,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:26:28,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.97 | bwd_microstep: 3324.83 | bwd_inner_microstep: 3323.89 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.14 [2025-06-19 21:26:28,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.97 | bwd: 3324.84 | bwd_inner: 3323.89 | bwd_allreduce: 0.91 | step: 7.15 50%|█████ | 5028/10000 [7:56:49<7:37:21, 5.52s/it] {'loss': 0.0202, 'grad_norm': 4.357873916625977, 'learning_rate': 2.079005067949393e-05, 'epoch': 5.03} 50%|█████ | 5028/10000 [7:56:49<7:37:21, 5.52s/it][2025-06-19 21:26:33,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:26:33,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.27 | bwd_microstep: 3339.08 | bwd_inner_microstep: 3338.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-19 21:26:33,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.27 | bwd: 3339.10 | bwd_inner: 3338.29 | bwd_allreduce: 0.77 | step: 7.06 50%|█████ | 5029/10000 [7:56:54<7:36:35, 5.51s/it] {'loss': 0.0047, 'grad_norm': 0.3385443687438965, 'learning_rate': 2.0783578183434284e-05, 'epoch': 5.03} 50%|█████ | 5029/10000 [7:56:54<7:36:35, 5.51s/it][2025-06-19 21:26:39,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:26:39,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.66 | bwd_microstep: 3333.25 | bwd_inner_microstep: 3332.47 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 21:26:39,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.66 | bwd: 3333.26 | bwd_inner: 3332.47 | bwd_allreduce: 0.75 | step: 6.62 50%|█████ | 5030/10000 [7:57:00<7:35:46, 5.50s/it] {'loss': 0.0029, 'grad_norm': 0.20793575048446655, 'learning_rate': 2.0777105605180922e-05, 'epoch': 5.03} 50%|█████ | 5030/10000 [7:57:00<7:35:46, 5.50s/it][2025-06-19 21:26:44,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:26:44,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.46 | bwd_microstep: 3331.33 | bwd_inner_microstep: 3330.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 21:26:44,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.46 | bwd: 3331.35 | bwd_inner: 3330.54 | bwd_allreduce: 0.76 | step: 6.73 50%|█████ | 5031/10000 [7:57:05<7:35:03, 5.49s/it] {'loss': 0.001, 'grad_norm': 0.17273330688476562, 'learning_rate': 2.0770632945412786e-05, 'epoch': 5.03} 50%|█████ | 5031/10000 [7:57:05<7:35:03, 5.49s/it][2025-06-19 21:26:50,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:26:50,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.92 | bwd_microstep: 3380.54 | bwd_inner_microstep: 3379.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 21:26:50,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.92 | bwd: 3380.56 | bwd_inner: 3379.75 | bwd_allreduce: 0.76 | step: 6.70 50%|█████ | 5032/10000 [7:57:11<7:36:16, 5.51s/it] {'loss': 0.0116, 'grad_norm': 1.7629033327102661, 'learning_rate': 2.0764160204808834e-05, 'epoch': 5.03} 50%|█████ | 5032/10000 [7:57:11<7:36:16, 5.51s/it][2025-06-19 21:26:55,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:26:55,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.35 | bwd_microstep: 3376.18 | bwd_inner_microstep: 3375.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 21:26:55,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.35 | bwd: 3376.20 | bwd_inner: 3375.39 | bwd_allreduce: 0.76 | step: 6.67 50%|█████ | 5033/10000 [7:57:16<7:37:02, 5.52s/it] {'loss': 0.0122, 'grad_norm': 1.029558777809143, 'learning_rate': 2.075768738404802e-05, 'epoch': 5.03} 50%|█████ | 5033/10000 [7:57:16<7:37:02, 5.52s/it][2025-06-19 21:27:01,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:27:01,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.29 | bwd_microstep: 3323.98 | bwd_inner_microstep: 3322.95 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.50 [2025-06-19 21:27:01,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.29 | bwd: 3324.00 | bwd_inner: 3322.95 | bwd_allreduce: 1.00 | step: 7.51 50%|█████ | 5034/10000 [7:57:22<7:35:36, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.01837490126490593, 'learning_rate': 2.075121448380932e-05, 'epoch': 5.03} 50%|█████ | 5034/10000 [7:57:22<7:35:36, 5.50s/it][2025-06-19 21:27:06,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.73 [2025-06-19 21:27:06,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.82 | bwd_microstep: 3378.95 | bwd_inner_microstep: 3378.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 8.62 [2025-06-19 21:27:06,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.82 | bwd: 3378.96 | bwd_inner: 3378.15 | bwd_allreduce: 0.77 | step: 8.62 50%|█████ | 5035/10000 [7:57:27<7:36:44, 5.52s/it] {'loss': 
0.0333, 'grad_norm': 4.749019145965576, 'learning_rate': 2.0744741504771708e-05, 'epoch': 5.04} 50%|█████ | 5035/10000 [7:57:27<7:36:44, 5.52s/it][2025-06-19 21:27:12,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:27:12,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.20 | bwd_microstep: 3385.32 | bwd_inner_microstep: 3384.36 | bwd_allreduce_microstep: 0.91 | step_microstep: 6.93 [2025-06-19 21:27:12,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.20 | bwd: 3385.33 | bwd_inner: 3384.36 | bwd_allreduce: 0.93 | step: 6.94 50%|█████ | 5036/10000 [7:57:33<7:37:46, 5.53s/it] {'loss': 0.0947, 'grad_norm': 4.457238674163818, 'learning_rate': 2.0738268447614165e-05, 'epoch': 5.04} 50%|█████ | 5036/10000 [7:57:33<7:37:46, 5.53s/it][2025-06-19 21:27:17,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:27:17,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.39 | bwd_microstep: 3345.22 | bwd_inner_microstep: 3344.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 21:27:17,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.39 | bwd: 3345.23 | bwd_inner: 3344.43 | bwd_allreduce: 0.76 | step: 6.77 50%|█████ | 5037/10000 [7:57:38<7:36:43, 5.52s/it] {'loss': 0.0019, 'grad_norm': 0.17785069346427917, 'learning_rate': 2.0731795313015693e-05, 'epoch': 5.04} 50%|█████ | 5037/10000 [7:57:38<7:36:43, 5.52s/it][2025-06-19 21:27:23,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:27:23,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.79 | bwd_microstep: 3386.20 | bwd_inner_microstep: 3385.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 
6.71 [2025-06-19 21:27:23,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.79 | bwd: 3386.21 | bwd_inner: 3385.41 | bwd_allreduce: 0.77 | step: 6.71 50%|█████ | 5038/10000 [7:57:44<7:37:27, 5.53s/it] {'loss': 0.0092, 'grad_norm': 0.64700847864151, 'learning_rate': 2.0725322101655287e-05, 'epoch': 5.04} 50%|█████ | 5038/10000 [7:57:44<7:37:27, 5.53s/it][2025-06-19 21:27:28,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:27:28,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.52 | bwd_microstep: 3333.24 | bwd_inner_microstep: 3332.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 21:27:28,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.52 | bwd: 3333.26 | bwd_inner: 3332.47 | bwd_allreduce: 0.75 | step: 6.77 50%|█████ | 5039/10000 [7:57:49<7:36:23, 5.52s/it] {'loss': 0.0166, 'grad_norm': 1.138561487197876, 'learning_rate': 2.0718848814211952e-05, 'epoch': 5.04} 50%|█████ | 5039/10000 [7:57:49<7:36:23, 5.52s/it][2025-06-19 21:27:34,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:27:34,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.84 | bwd_microstep: 3331.97 | bwd_inner_microstep: 3330.86 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.72 [2025-06-19 21:27:34,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.84 | bwd: 3331.99 | bwd_inner: 3330.86 | bwd_allreduce: 1.07 | step: 7.72 50%|█████ | 5040/10000 [7:57:55<7:35:35, 5.51s/it] {'loss': 0.0024, 'grad_norm': 0.15615282952785492, 'learning_rate': 2.0712375451364718e-05, 'epoch': 5.04} 50%|█████ | 5040/10000 [7:57:55<7:35:35, 5.51s/it][2025-06-19 21:27:39,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-19 21:27:39,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.93 | bwd_microstep: 3338.60 | bwd_inner_microstep: 3337.78 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.87 [2025-06-19 21:27:39,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.93 | bwd: 3338.62 | bwd_inner: 3337.78 | bwd_allreduce: 0.79 | step: 6.88 50%|█████ | 5041/10000 [7:58:00<7:35:03, 5.51s/it] {'loss': 0.0051, 'grad_norm': 0.38862890005111694, 'learning_rate': 2.0705902013792603e-05, 'epoch': 5.04} 50%|█████ | 5041/10000 [7:58:00<7:35:03, 5.51s/it][2025-06-19 21:27:45,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:27:45,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.63 | bwd_microstep: 3341.33 | bwd_inner_microstep: 3340.38 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.78 [2025-06-19 21:27:45,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.63 | bwd: 3341.34 | bwd_inner: 3340.38 | bwd_allreduce: 0.92 | step: 6.78 50%|█████ | 5042/10000 [7:58:06<7:34:50, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.026568656787276268, 'learning_rate': 2.0699428502174647e-05, 'epoch': 5.04} 50%|█████ | 5042/10000 [7:58:06<7:34:50, 5.50s/it][2025-06-19 21:27:51,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:27:51,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.60 | bwd_microstep: 3387.92 | bwd_inner_microstep: 3387.01 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.48 [2025-06-19 21:27:51,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.60 | bwd: 3387.94 | bwd_inner: 3387.01 | bwd_allreduce: 0.88 | step: 7.48 50%|█████ | 5043/10000 [7:58:11<7:36:22, 5.52s/it] {'loss': 0.0015, 'grad_norm': 0.2400464564561844, 
'learning_rate': 2.069295491718989e-05, 'epoch': 5.04} 50%|█████ | 5043/10000 [7:58:11<7:36:22, 5.52s/it][2025-06-19 21:27:56,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:27:56,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.90 | bwd_microstep: 3345.45 | bwd_inner_microstep: 3344.54 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.14 [2025-06-19 21:27:56,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.90 | bwd: 3345.47 | bwd_inner: 3344.54 | bwd_allreduce: 0.88 | step: 7.14 50%|█████ | 5044/10000 [7:58:17<7:35:46, 5.52s/it] {'loss': 0.0018, 'grad_norm': 0.09057319164276123, 'learning_rate': 2.0686481259517367e-05, 'epoch': 5.04} 50%|█████ | 5044/10000 [7:58:17<7:35:46, 5.52s/it][2025-06-19 21:28:02,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:28:02,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.30 | bwd_microstep: 3384.51 | bwd_inner_microstep: 3383.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 21:28:02,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.30 | bwd: 3384.53 | bwd_inner: 3383.73 | bwd_allreduce: 0.76 | step: 6.75 50%|█████ | 5045/10000 [7:58:22<7:36:52, 5.53s/it] {'loss': 0.0123, 'grad_norm': 0.8252429962158203, 'learning_rate': 2.0680007529836155e-05, 'epoch': 5.04} 50%|█████ | 5045/10000 [7:58:22<7:36:52, 5.53s/it][2025-06-19 21:28:07,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:28:07,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.76 | bwd_microstep: 3337.96 | bwd_inner_microstep: 3337.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 21:28:07,576] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.76 | bwd: 3337.98 | bwd_inner: 3337.18 | bwd_allreduce: 0.75 | step: 6.60 50%|█████ | 5046/10000 [7:58:28<7:35:43, 5.52s/it] {'loss': 0.0169, 'grad_norm': 0.9122194647789001, 'learning_rate': 2.067353372882531e-05, 'epoch': 5.05} 50%|█████ | 5046/10000 [7:58:28<7:35:43, 5.52s/it][2025-06-19 21:28:13,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:28:13,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.66 | bwd_microstep: 3420.63 | bwd_inner_microstep: 3419.86 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.56 [2025-06-19 21:28:13,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.66 | bwd: 3420.65 | bwd_inner: 3419.86 | bwd_allreduce: 0.75 | step: 6.57 50%|█████ | 5047/10000 [7:58:33<7:37:43, 5.54s/it] {'loss': 0.0037, 'grad_norm': 0.2769322097301483, 'learning_rate': 2.0667059857163897e-05, 'epoch': 5.05} 50%|█████ | 5047/10000 [7:58:33<7:37:43, 5.54s/it][2025-06-19 21:28:18,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:28:18,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.19 | bwd_microstep: 3382.96 | bwd_inner_microstep: 3381.95 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.60 [2025-06-19 21:28:18,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.19 | bwd: 3382.97 | bwd_inner: 3381.95 | bwd_allreduce: 0.98 | step: 7.60 50%|█████ | 5048/10000 [7:58:39<7:38:04, 5.55s/it] {'loss': 0.0002, 'grad_norm': 0.031271032989025116, 'learning_rate': 2.0660585915531003e-05, 'epoch': 5.05} 50%|█████ | 5048/10000 [7:58:39<7:38:04, 5.55s/it][2025-06-19 21:28:24,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
21:28:24,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.36 | bwd_microstep: 3409.01 | bwd_inner_microstep: 3408.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 21:28:24,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.36 | bwd: 3409.02 | bwd_inner: 3408.23 | bwd_allreduce: 0.75 | step: 6.59 50%|█████ | 5049/10000 [7:58:45<7:39:15, 5.57s/it] {'loss': 0.0885, 'grad_norm': 3.946411609649658, 'learning_rate': 2.0654111904605712e-05, 'epoch': 5.05} 50%|█████ | 5049/10000 [7:58:45<7:39:15, 5.57s/it][2025-06-19 21:28:29,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 21:28:29,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.71 | bwd_microstep: 3321.95 | bwd_inner_microstep: 3321.04 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.42 [2025-06-19 21:28:29,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.71 | bwd: 3321.96 | bwd_inner: 3321.04 | bwd_allreduce: 0.88 | step: 7.43 50%|█████ | 5050/10000 [7:58:50<7:36:52, 5.54s/it] {'loss': 0.0119, 'grad_norm': 2.2638344764709473, 'learning_rate': 2.0647637825067123e-05, 'epoch': 5.05} 50%|█████ | 5050/10000 [7:58:50<7:36:52, 5.54s/it][2025-06-19 21:28:35,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:28:35,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.87 | bwd_microstep: 3326.07 | bwd_inner_microstep: 3325.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 21:28:35,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.87 | bwd: 3326.08 | bwd_inner: 3325.29 | bwd_allreduce: 0.75 | step: 6.62 51%|█████ | 5051/10000 [7:58:56<7:35:09, 5.52s/it] {'loss': 0.0045, 'grad_norm': 0.7045177221298218, 'learning_rate': 2.0641163677594327e-05, 'epoch': 
5.05} 51%|█████ | 5051/10000 [7:58:56<7:35:09, 5.52s/it][2025-06-19 21:28:40,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:28:40,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.23 | bwd_microstep: 3334.42 | bwd_inner_microstep: 3333.59 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.09 [2025-06-19 21:28:40,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.23 | bwd: 3334.44 | bwd_inner: 3333.59 | bwd_allreduce: 0.80 | step: 7.09 51%|█████ | 5052/10000 [7:59:01<7:34:19, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.038402557373046875, 'learning_rate': 2.0634689462866433e-05, 'epoch': 5.05} 51%|█████ | 5052/10000 [7:59:01<7:34:19, 5.51s/it][2025-06-19 21:28:46,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 21:28:46,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.51 | bwd_microstep: 3377.15 | bwd_inner_microstep: 3376.14 | bwd_allreduce_microstep: 0.95 | step_microstep: 8.24 [2025-06-19 21:28:46,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.51 | bwd: 3377.17 | bwd_inner: 3376.14 | bwd_allreduce: 0.98 | step: 8.25 51%|█████ | 5053/10000 [7:59:07<7:35:52, 5.53s/it] {'loss': 0.0039, 'grad_norm': 0.4913286566734314, 'learning_rate': 2.0628215181562567e-05, 'epoch': 5.05} 51%|█████ | 5053/10000 [7:59:07<7:35:52, 5.53s/it][2025-06-19 21:28:51,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:28:51,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.48 | bwd_microstep: 3373.38 | bwd_inner_microstep: 3372.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 21:28:51,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2142.48 | bwd: 3373.40 | bwd_inner: 3372.58 | bwd_allreduce: 0.77 | step: 6.80 51%|█████ | 5054/10000 [7:59:12<7:36:38, 5.54s/it] {'loss': 0.0008, 'grad_norm': 0.04217328876256943, 'learning_rate': 2.0621740834361845e-05, 'epoch': 5.05} 51%|█████ | 5054/10000 [7:59:12<7:36:38, 5.54s/it][2025-06-19 21:28:57,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 21:28:57,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.07 | bwd_microstep: 3325.71 | bwd_inner_microstep: 3324.60 | bwd_allreduce_microstep: 1.05 | step_microstep: 8.16 [2025-06-19 21:28:57,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.07 | bwd: 3325.73 | bwd_inner: 3324.60 | bwd_allreduce: 1.08 | step: 8.17 51%|█████ | 5055/10000 [7:59:18<7:35:19, 5.52s/it] {'loss': 0.0102, 'grad_norm': 0.6354070901870728, 'learning_rate': 2.0615266421943394e-05, 'epoch': 5.05} 51%|█████ | 5055/10000 [7:59:18<7:35:19, 5.52s/it][2025-06-19 21:29:02,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:29:02,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.75 | bwd_microstep: 3333.00 | bwd_inner_microstep: 3332.19 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.76 [2025-06-19 21:29:02,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.75 | bwd: 3333.02 | bwd_inner: 3332.19 | bwd_allreduce: 0.78 | step: 6.77 51%|█████ | 5056/10000 [7:59:23<7:34:15, 5.51s/it] {'loss': 0.0591, 'grad_norm': 3.096773862838745, 'learning_rate': 2.0608791944986345e-05, 'epoch': 5.06} 51%|█████ | 5056/10000 [7:59:23<7:34:15, 5.51s/it][2025-06-19 21:29:08,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:29:08,462] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2155.95 | bwd_microstep: 3376.64 | bwd_inner_microstep: 3375.66 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.53 [2025-06-19 21:29:08,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2155.95 | bwd: 3376.65 | bwd_inner: 3375.66 | bwd_allreduce: 0.95 | step: 7.54 51%|█████ | 5057/10000 [7:59:29<7:35:39, 5.53s/it] {'loss': 0.041, 'grad_norm': 3.037142753601074, 'learning_rate': 2.0602317404169852e-05, 'epoch': 5.06} 51%|█████ | 5057/10000 [7:59:29<7:35:39, 5.53s/it][2025-06-19 21:29:13,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:29:13,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.74 | bwd_microstep: 3326.27 | bwd_inner_microstep: 3325.28 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.30 [2025-06-19 21:29:13,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.74 | bwd: 3326.29 | bwd_inner: 3325.28 | bwd_allreduce: 0.96 | step: 7.31 51%|█████ | 5058/10000 [7:59:34<7:34:40, 5.52s/it] {'loss': 0.011, 'grad_norm': 1.2875392436981201, 'learning_rate': 2.0595842800173057e-05, 'epoch': 5.06} 51%|█████ | 5058/10000 [7:59:34<7:34:40, 5.52s/it][2025-06-19 21:29:19,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:29:19,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.23 | bwd_microstep: 3318.81 | bwd_inner_microstep: 3318.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.26 [2025-06-19 21:29:19,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.23 | bwd: 3318.83 | bwd_inner: 3318.00 | bwd_allreduce: 0.78 | step: 7.27 51%|█████ | 5059/10000 [7:59:40<7:33:39, 5.51s/it] {'loss': 0.0008, 'grad_norm': 0.11557305604219437, 'learning_rate': 2.0589368133675113e-05, 'epoch': 5.06} 51%|█████ | 5059/10000 [7:59:40<7:33:39, 
5.51s/it][2025-06-19 21:29:24,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:29:24,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.46 | bwd_microstep: 3371.53 | bwd_inner_microstep: 3370.57 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.42 [2025-06-19 21:29:24,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.46 | bwd: 3371.55 | bwd_inner: 3370.57 | bwd_allreduce: 0.94 | step: 7.43 51%|█████ | 5060/10000 [7:59:45<7:34:22, 5.52s/it] {'loss': 0.0478, 'grad_norm': 1.772493839263916, 'learning_rate': 2.0582893405355196e-05, 'epoch': 5.06} 51%|█████ | 5060/10000 [7:59:45<7:34:22, 5.52s/it][2025-06-19 21:29:30,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:29:30,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.34 | bwd_microstep: 3325.83 | bwd_inner_microstep: 3324.86 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.36 [2025-06-19 21:29:30,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.34 | bwd: 3325.85 | bwd_inner: 3324.86 | bwd_allreduce: 0.94 | step: 7.36 51%|█████ | 5061/10000 [7:59:51<7:33:30, 5.51s/it] {'loss': 0.0536, 'grad_norm': 2.3189711570739746, 'learning_rate': 2.0576418615892463e-05, 'epoch': 5.06} 51%|█████ | 5061/10000 [7:59:51<7:33:30, 5.51s/it][2025-06-19 21:29:36,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:29:36,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.21 | bwd_microstep: 3374.27 | bwd_inner_microstep: 3373.43 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.08 [2025-06-19 21:29:36,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.21 | bwd: 3374.28 | bwd_inner: 3373.43 | 
bwd_allreduce: 0.80 | step: 7.08 51%|█████ | 5062/10000 [7:59:56<7:34:29, 5.52s/it] {'loss': 0.0265, 'grad_norm': 3.8875343799591064, 'learning_rate': 2.0569943765966086e-05, 'epoch': 5.06} 51%|█████ | 5062/10000 [7:59:56<7:34:29, 5.52s/it][2025-06-19 21:29:41,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:29:41,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.12 | bwd_microstep: 3330.00 | bwd_inner_microstep: 3329.17 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.81 [2025-06-19 21:29:41,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.12 | bwd: 3330.02 | bwd_inner: 3329.17 | bwd_allreduce: 0.80 | step: 6.82 51%|█████ | 5063/10000 [8:00:02<7:33:28, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.028128620237112045, 'learning_rate': 2.0563468856255265e-05, 'epoch': 5.06} 51%|█████ | 5063/10000 [8:00:02<7:33:28, 5.51s/it][2025-06-19 21:29:46,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:29:46,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.15 | bwd_microstep: 3322.81 | bwd_inner_microstep: 3321.97 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.45 [2025-06-19 21:29:46,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.15 | bwd: 3322.82 | bwd_inner: 3321.97 | bwd_allreduce: 0.81 | step: 7.45 51%|█████ | 5064/10000 [8:00:07<7:32:26, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.0436500608921051, 'learning_rate': 2.055699388743917e-05, 'epoch': 5.06} 51%|█████ | 5064/10000 [8:00:07<7:32:26, 5.50s/it][2025-06-19 21:29:52,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:29:52,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.47 | 
bwd_microstep: 3313.80 | bwd_inner_microstep: 3313.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 21:29:52,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.47 | bwd: 3313.81 | bwd_inner: 3313.02 | bwd_allreduce: 0.75 | step: 6.57 51%|█████ | 5065/10000 [8:00:13<7:31:28, 5.49s/it] {'loss': 0.0035, 'grad_norm': 0.9720744490623474, 'learning_rate': 2.0550518860197003e-05, 'epoch': 5.07} 51%|█████ | 5065/10000 [8:00:13<7:31:28, 5.49s/it][2025-06-19 21:29:57,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 21:29:57,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.64 | bwd_microstep: 3327.29 | bwd_inner_microstep: 3326.35 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.46 [2025-06-19 21:29:57,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.64 | bwd: 3327.31 | bwd_inner: 3326.35 | bwd_allreduce: 0.92 | step: 7.46 51%|█████ | 5066/10000 [8:00:18<7:31:12, 5.49s/it] {'loss': 0.0122, 'grad_norm': 0.8686150312423706, 'learning_rate': 2.0544043775207954e-05, 'epoch': 5.07} 51%|█████ | 5066/10000 [8:00:18<7:31:12, 5.49s/it][2025-06-19 21:30:03,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.79 [2025-06-19 21:30:03,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.56 | bwd_microstep: 3378.66 | bwd_inner_microstep: 3377.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 21:30:03,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.56 | bwd: 3378.68 | bwd_inner: 3377.86 | bwd_allreduce: 0.77 | step: 6.99 51%|█████ | 5067/10000 [8:00:24<7:32:55, 5.51s/it] {'loss': 0.023, 'grad_norm': 1.5916012525558472, 'learning_rate': 2.0537568633151243e-05, 'epoch': 5.07} 51%|█████ | 5067/10000 [8:00:24<7:32:55, 5.51s/it][2025-06-19 21:30:08,966] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:30:08,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.72 | bwd_microstep: 3321.61 | bwd_inner_microstep: 3320.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.86 [2025-06-19 21:30:08,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.72 | bwd: 3321.63 | bwd_inner: 3320.83 | bwd_allreduce: 0.76 | step: 6.87 51%|█████ | 5068/10000 [8:00:29<7:32:02, 5.50s/it] {'loss': 0.0249, 'grad_norm': 1.9823453426361084, 'learning_rate': 2.0531093434706082e-05, 'epoch': 5.07} 51%|█████ | 5068/10000 [8:00:29<7:32:02, 5.50s/it][2025-06-19 21:30:14,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:30:14,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.56 | bwd_microstep: 3311.56 | bwd_inner_microstep: 3310.73 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.82 [2025-06-19 21:30:14,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.56 | bwd: 3311.57 | bwd_inner: 3310.73 | bwd_allreduce: 0.79 | step: 6.82 51%|█████ | 5069/10000 [8:00:35<7:30:54, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.05665217339992523, 'learning_rate': 2.0524618180551682e-05, 'epoch': 5.07} 51%|█████ | 5069/10000 [8:00:35<7:30:54, 5.49s/it][2025-06-19 21:30:19,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:30:19,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.54 | bwd_microstep: 3340.28 | bwd_inner_microstep: 3339.39 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.04 [2025-06-19 21:30:19,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.54 | bwd: 3340.29 | bwd_inner: 3339.39 | bwd_allreduce: 0.85 | step: 7.04 51%|█████ | 
5070/10000 [8:00:40<7:30:55, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.2530546486377716, 'learning_rate': 2.0518142871367273e-05, 'epoch': 5.07} 51%|█████ | 5070/10000 [8:00:40<7:30:55, 5.49s/it][2025-06-19 21:30:25,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:30:25,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.64 | bwd_microstep: 3325.31 | bwd_inner_microstep: 3324.38 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.32 [2025-06-19 21:30:25,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.64 | bwd: 3325.32 | bwd_inner: 3324.38 | bwd_allreduce: 0.90 | step: 7.32 51%|█████ | 5071/10000 [8:00:46<7:30:41, 5.49s/it] {'loss': 0.0035, 'grad_norm': 0.8115315437316895, 'learning_rate': 2.0511667507832076e-05, 'epoch': 5.07} 51%|█████ | 5071/10000 [8:00:46<7:30:41, 5.49s/it][2025-06-19 21:30:30,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:30:30,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.58 | bwd_microstep: 3368.04 | bwd_inner_microstep: 3367.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 21:30:30,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.58 | bwd: 3368.06 | bwd_inner: 3367.26 | bwd_allreduce: 0.75 | step: 6.78 51%|█████ | 5072/10000 [8:00:51<7:31:54, 5.50s/it] {'loss': 0.0035, 'grad_norm': 0.2741989195346832, 'learning_rate': 2.0505192090625332e-05, 'epoch': 5.07} 51%|█████ | 5072/10000 [8:00:51<7:31:54, 5.50s/it][2025-06-19 21:30:36,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:30:36,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.01 | bwd_microstep: 3368.57 | bwd_inner_microstep: 3367.78 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 7.16 [2025-06-19 21:30:36,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.01 | bwd: 3368.59 | bwd_inner: 3367.78 | bwd_allreduce: 0.76 | step: 7.16 51%|█████ | 5073/10000 [8:00:57<7:32:46, 5.51s/it] {'loss': 0.0188, 'grad_norm': 1.7574506998062134, 'learning_rate': 2.049871662042629e-05, 'epoch': 5.07} 51%|█████ | 5073/10000 [8:00:57<7:32:46, 5.51s/it][2025-06-19 21:30:41,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:30:41,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.55 | bwd_microstep: 3330.15 | bwd_inner_microstep: 3329.11 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.82 [2025-06-19 21:30:41,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.55 | bwd: 3330.18 | bwd_inner: 3329.11 | bwd_allreduce: 1.01 | step: 7.82 51%|█████ | 5074/10000 [8:01:02<7:32:02, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.02100927010178566, 'learning_rate': 2.0492241097914182e-05, 'epoch': 5.07} 51%|█████ | 5074/10000 [8:01:02<7:32:02, 5.51s/it][2025-06-19 21:30:47,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:30:47,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.40 | bwd_microstep: 3329.47 | bwd_inner_microstep: 3328.46 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.41 [2025-06-19 21:30:47,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.40 | bwd: 3329.49 | bwd_inner: 3328.46 | bwd_allreduce: 0.98 | step: 7.41 51%|█████ | 5075/10000 [8:01:08<7:31:23, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.1409754455089569, 'learning_rate': 2.0485765523768265e-05, 'epoch': 5.08} 51%|█████ | 5075/10000 [8:01:08<7:31:23, 5.50s/it][2025-06-19 21:30:52,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:30:52,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.27 | bwd_microstep: 3317.77 | bwd_inner_microstep: 3316.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 21:30:52,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.27 | bwd: 3317.79 | bwd_inner: 3316.98 | bwd_allreduce: 0.76 | step: 6.78 51%|█████ | 5076/10000 [8:01:13<7:30:39, 5.49s/it] {'loss': 0.043, 'grad_norm': 5.812139987945557, 'learning_rate': 2.04792898986678e-05, 'epoch': 5.08} 51%|█████ | 5076/10000 [8:01:13<7:30:39, 5.49s/it][2025-06-19 21:30:58,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:30:58,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.37 | bwd_microstep: 3314.98 | bwd_inner_microstep: 3314.18 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-19 21:30:58,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.37 | bwd: 3315.00 | bwd_inner: 3314.18 | bwd_allreduce: 0.78 | step: 6.82 51%|█████ | 5077/10000 [8:01:19<7:29:44, 5.48s/it] {'loss': 0.0123, 'grad_norm': 0.8417551517486572, 'learning_rate': 2.0472814223292054e-05, 'epoch': 5.08} 51%|█████ | 5077/10000 [8:01:19<7:29:44, 5.48s/it][2025-06-19 21:31:03,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:31:03,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.42 | bwd_microstep: 3323.50 | bwd_inner_microstep: 3322.64 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.99 [2025-06-19 21:31:03,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.42 | bwd: 3323.52 | bwd_inner: 3322.64 | bwd_allreduce: 0.82 | step: 6.99 51%|█████ | 5078/10000 [8:01:24<7:29:24, 5.48s/it] {'loss': 
0.0015, 'grad_norm': 0.15081487596035004, 'learning_rate': 2.046633849832029e-05, 'epoch': 5.08} 51%|█████ | 5078/10000 [8:01:24<7:29:24, 5.48s/it][2025-06-19 21:31:09,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:31:09,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.58 | bwd_microstep: 3316.45 | bwd_inner_microstep: 3315.28 | bwd_allreduce_microstep: 1.10 | step_microstep: 7.91 [2025-06-19 21:31:09,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.58 | bwd: 3316.47 | bwd_inner: 3315.28 | bwd_allreduce: 1.13 | step: 7.92 51%|█████ | 5079/10000 [8:01:30<7:29:16, 5.48s/it] {'loss': 0.0365, 'grad_norm': 5.403793811798096, 'learning_rate': 2.0459862724431782e-05, 'epoch': 5.08} 51%|█████ | 5079/10000 [8:01:30<7:29:16, 5.48s/it][2025-06-19 21:31:14,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:31:14,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.35 | bwd_microstep: 3368.15 | bwd_inner_microstep: 3367.24 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.14 [2025-06-19 21:31:14,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.35 | bwd: 3368.17 | bwd_inner: 3367.24 | bwd_allreduce: 0.88 | step: 7.14 51%|█████ | 5080/10000 [8:01:35<7:30:50, 5.50s/it] {'loss': 0.0278, 'grad_norm': 1.4323490858078003, 'learning_rate': 2.04533869023058e-05, 'epoch': 5.08} 51%|█████ | 5080/10000 [8:01:35<7:30:50, 5.50s/it][2025-06-19 21:31:20,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:31:20,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.76 | bwd_microstep: 3320.13 | bwd_inner_microstep: 3319.18 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.13 
[2025-06-19 21:31:20,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.76 | bwd: 3320.15 | bwd_inner: 3319.18 | bwd_allreduce: 0.93 | step: 7.13 51%|█████ | 5081/10000 [8:01:41<7:30:13, 5.49s/it] {'loss': 0.0012, 'grad_norm': 0.046878308057785034, 'learning_rate': 2.044691103262165e-05, 'epoch': 5.08} 51%|█████ | 5081/10000 [8:01:41<7:30:13, 5.49s/it][2025-06-19 21:31:25,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:31:25,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.37 | bwd_microstep: 3315.62 | bwd_inner_microstep: 3314.61 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.58 [2025-06-19 21:31:25,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.37 | bwd: 3315.64 | bwd_inner: 3314.61 | bwd_allreduce: 0.98 | step: 7.59 51%|█████ | 5082/10000 [8:01:46<7:29:31, 5.48s/it] {'loss': 0.0042, 'grad_norm': 0.6714569330215454, 'learning_rate': 2.0440435116058595e-05, 'epoch': 5.08} 51%|█████ | 5082/10000 [8:01:46<7:29:31, 5.48s/it][2025-06-19 21:31:31,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:31:31,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.09 | bwd_microstep: 3374.19 | bwd_inner_microstep: 3373.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 21:31:31,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.09 | bwd: 3374.21 | bwd_inner: 3373.41 | bwd_allreduce: 0.76 | step: 6.76 51%|█████ | 5083/10000 [8:01:52<7:30:50, 5.50s/it] {'loss': 0.0022, 'grad_norm': 0.14989635348320007, 'learning_rate': 2.0433959153295944e-05, 'epoch': 5.08} 51%|█████ | 5083/10000 [8:01:52<7:30:50, 5.50s/it][2025-06-19 21:31:36,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-19 21:31:36,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.96 | bwd_microstep: 3321.96 | bwd_inner_microstep: 3321.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 21:31:36,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.96 | bwd: 3321.98 | bwd_inner: 3321.16 | bwd_allreduce: 0.76 | step: 6.68 51%|█████ | 5084/10000 [8:01:57<7:30:01, 5.49s/it] {'loss': 0.0031, 'grad_norm': 0.2555248737335205, 'learning_rate': 2.0427483145012993e-05, 'epoch': 5.08} 51%|█████ | 5084/10000 [8:01:57<7:30:01, 5.49s/it][2025-06-19 21:31:42,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:31:42,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.10 | bwd_microstep: 3373.85 | bwd_inner_microstep: 3373.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 21:31:42,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.10 | bwd: 3373.87 | bwd_inner: 3373.04 | bwd_allreduce: 0.78 | step: 7.24 51%|█████ | 5085/10000 [8:02:03<7:31:01, 5.51s/it] {'loss': 0.0027, 'grad_norm': 0.37041595578193665, 'learning_rate': 2.0421007091889043e-05, 'epoch': 5.08} 51%|█████ | 5085/10000 [8:02:03<7:31:01, 5.51s/it][2025-06-19 21:31:47,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:31:47,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.07 | bwd_microstep: 3362.02 | bwd_inner_microstep: 3361.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 21:31:47,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.07 | bwd: 3362.04 | bwd_inner: 3361.22 | bwd_allreduce: 0.77 | step: 7.05 51%|█████ | 5086/10000 [8:02:08<7:31:23, 5.51s/it] {'loss': 0.0014, 'grad_norm': 0.17819151282310486, 
'learning_rate': 2.0414530994603408e-05, 'epoch': 5.09} 51%|█████ | 5086/10000 [8:02:08<7:31:23, 5.51s/it][2025-06-19 21:31:53,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:31:53,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.93 | bwd_microstep: 3362.02 | bwd_inner_microstep: 3361.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 21:31:53,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.93 | bwd: 3362.03 | bwd_inner: 3361.24 | bwd_allreduce: 0.75 | step: 6.57 51%|█████ | 5087/10000 [8:02:14<7:31:44, 5.52s/it] {'loss': 0.0174, 'grad_norm': 2.2504656314849854, 'learning_rate': 2.040805485383539e-05, 'epoch': 5.09} 51%|█████ | 5087/10000 [8:02:14<7:31:44, 5.52s/it][2025-06-19 21:31:58,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:31:58,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.52 | bwd_microstep: 3330.67 | bwd_inner_microstep: 3329.71 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.64 [2025-06-19 21:31:58,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.52 | bwd: 3330.69 | bwd_inner: 3329.71 | bwd_allreduce: 0.93 | step: 7.64 51%|█████ | 5088/10000 [8:02:19<7:30:42, 5.51s/it] {'loss': 0.0052, 'grad_norm': 0.2881771922111511, 'learning_rate': 2.0401578670264314e-05, 'epoch': 5.09} 51%|█████ | 5088/10000 [8:02:19<7:30:42, 5.51s/it][2025-06-19 21:32:04,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 21:32:04,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.73 | bwd_microstep: 3323.86 | bwd_inner_microstep: 3322.76 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.60 [2025-06-19 21:32:04,381] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.73 | bwd: 3323.88 | bwd_inner: 3322.76 | bwd_allreduce: 1.06 | step: 7.60 51%|█████ | 5089/10000 [8:02:25<7:30:08, 5.50s/it] {'loss': 0.0784, 'grad_norm': 5.790783882141113, 'learning_rate': 2.0395102444569497e-05, 'epoch': 5.09} 51%|█████ | 5089/10000 [8:02:25<7:30:08, 5.50s/it][2025-06-19 21:32:09,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:32:09,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.31 | bwd_microstep: 3311.63 | bwd_inner_microstep: 3310.60 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.36 [2025-06-19 21:32:09,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.31 | bwd: 3311.65 | bwd_inner: 3310.60 | bwd_allreduce: 1.00 | step: 7.37 51%|█████ | 5090/10000 [8:02:30<7:29:16, 5.49s/it] {'loss': 0.0034, 'grad_norm': 0.6446902751922607, 'learning_rate': 2.038862617743027e-05, 'epoch': 5.09} 51%|█████ | 5090/10000 [8:02:30<7:29:16, 5.49s/it][2025-06-19 21:32:15,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:32:15,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.98 | bwd_microstep: 3368.53 | bwd_inner_microstep: 3367.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 21:32:15,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.98 | bwd: 3368.54 | bwd_inner: 3367.73 | bwd_allreduce: 0.77 | step: 6.98 51%|█████ | 5091/10000 [8:02:36<7:30:26, 5.51s/it] {'loss': 0.0021, 'grad_norm': 0.25806963443756104, 'learning_rate': 2.038214986952596e-05, 'epoch': 5.09} 51%|█████ | 5091/10000 [8:02:36<7:30:26, 5.51s/it][2025-06-19 21:32:20,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 
21:32:20,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.14 | bwd_microstep: 3319.78 | bwd_inner_microstep: 3319.00 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.62 [2025-06-19 21:32:20,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.14 | bwd: 3319.79 | bwd_inner: 3319.00 | bwd_allreduce: 0.75 | step: 6.62 51%|█████ | 5092/10000 [8:02:41<7:29:20, 5.49s/it] {'loss': 0.0075, 'grad_norm': 0.510922908782959, 'learning_rate': 2.0375673521535907e-05, 'epoch': 5.09} 51%|█████ | 5092/10000 [8:02:41<7:29:20, 5.49s/it][2025-06-19 21:32:26,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:32:26,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.19 | bwd_microstep: 3320.10 | bwd_inner_microstep: 3319.21 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.02 [2025-06-19 21:32:26,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.19 | bwd: 3320.11 | bwd_inner: 3319.21 | bwd_allreduce: 0.86 | step: 7.02 51%|█████ | 5093/10000 [8:02:47<7:28:36, 5.49s/it] {'loss': 0.0034, 'grad_norm': 0.29155442118644714, 'learning_rate': 2.0369197134139437e-05, 'epoch': 5.09} 51%|█████ | 5093/10000 [8:02:47<7:28:36, 5.49s/it][2025-06-19 21:32:31,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:32:31,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.49 | bwd_microstep: 3322.78 | bwd_inner_microstep: 3321.96 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.69 [2025-06-19 21:32:31,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.49 | bwd: 3322.79 | bwd_inner: 3321.96 | bwd_allreduce: 0.79 | step: 6.70 51%|█████ | 5094/10000 [8:02:52<7:28:35, 5.49s/it] {'loss': 0.0331, 'grad_norm': 4.251161098480225, 'learning_rate': 2.0362720708015903e-05, 'epoch': 
5.09} 51%|█████ | 5094/10000 [8:02:52<7:28:35, 5.49s/it][2025-06-19 21:32:37,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:32:37,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.01 | bwd_microstep: 3325.08 | bwd_inner_microstep: 3324.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 21:32:37,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.01 | bwd: 3325.09 | bwd_inner: 3324.29 | bwd_allreduce: 0.76 | step: 6.65 51%|█████ | 5095/10000 [8:02:58<7:28:13, 5.48s/it] {'loss': 0.0488, 'grad_norm': 7.031146049499512, 'learning_rate': 2.0356244243844648e-05, 'epoch': 5.09} 51%|█████ | 5095/10000 [8:02:58<7:28:13, 5.48s/it][2025-06-19 21:32:42,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:32:42,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.31 | bwd_microstep: 3323.04 | bwd_inner_microstep: 3322.05 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.45 [2025-06-19 21:32:42,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.31 | bwd: 3323.05 | bwd_inner: 3322.05 | bwd_allreduce: 0.96 | step: 7.45 51%|█████ | 5096/10000 [8:03:03<7:27:37, 5.48s/it] {'loss': 0.0061, 'grad_norm': 0.6000083088874817, 'learning_rate': 2.0349767742305034e-05, 'epoch': 5.1} 51%|█████ | 5096/10000 [8:03:03<7:27:37, 5.48s/it][2025-06-19 21:32:48,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:32:48,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.32 | bwd_microstep: 3318.68 | bwd_inner_microstep: 3317.56 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.04 [2025-06-19 21:32:48,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2107.32 | bwd: 3318.69 | bwd_inner: 3317.56 | bwd_allreduce: 1.09 | step: 8.04 51%|█████ | 5097/10000 [8:03:09<7:27:19, 5.47s/it] {'loss': 0.0065, 'grad_norm': 0.837358832359314, 'learning_rate': 2.03432912040764e-05, 'epoch': 5.1} 51%|█████ | 5097/10000 [8:03:09<7:27:19, 5.47s/it][2025-06-19 21:32:53,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:32:53,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.41 | bwd_microstep: 3323.13 | bwd_inner_microstep: 3322.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 21:32:53,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.41 | bwd: 3323.14 | bwd_inner: 3322.34 | bwd_allreduce: 0.76 | step: 6.69 51%|█████ | 5098/10000 [8:03:14<7:27:24, 5.48s/it] {'loss': 0.0035, 'grad_norm': 0.5551018118858337, 'learning_rate': 2.0336814629838114e-05, 'epoch': 5.1} 51%|█████ | 5098/10000 [8:03:14<7:27:24, 5.48s/it][2025-06-19 21:32:59,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 21:32:59,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.29 | bwd_microstep: 3371.61 | bwd_inner_microstep: 3370.61 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.48 [2025-06-19 21:32:59,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.29 | bwd: 3371.62 | bwd_inner: 3370.61 | bwd_allreduce: 0.97 | step: 7.49 51%|█████ | 5099/10000 [8:03:20<7:29:03, 5.50s/it] {'loss': 0.0039, 'grad_norm': 0.20473812520503998, 'learning_rate': 2.0330338020269537e-05, 'epoch': 5.1} 51%|█████ | 5099/10000 [8:03:20<7:29:03, 5.50s/it][2025-06-19 21:33:04,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:33:04,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd_microstep: 2107.67 | bwd_microstep: 3311.24 | bwd_inner_microstep: 3310.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 21:33:04,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.67 | bwd: 3311.25 | bwd_inner: 3310.46 | bwd_allreduce: 0.75 | step: 6.56 51%|█████ | 5100/10000 [8:03:25<7:28:06, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.020312782377004623, 'learning_rate': 2.0323861376050035e-05, 'epoch': 5.1} 51%|█████ | 5100/10000 [8:03:25<7:28:06, 5.49s/it][2025-06-19 21:33:10,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:33:10,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.60 | bwd_microstep: 3315.25 | bwd_inner_microstep: 3314.44 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.04 [2025-06-19 21:33:10,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.60 | bwd: 3315.26 | bwd_inner: 3314.44 | bwd_allreduce: 0.79 | step: 7.04 51%|█████ | 5101/10000 [8:03:30<7:27:10, 5.48s/it] {'loss': 0.004, 'grad_norm': 0.6064722537994385, 'learning_rate': 2.0317384697858972e-05, 'epoch': 5.1} 51%|█████ | 5101/10000 [8:03:30<7:27:10, 5.48s/it][2025-06-19 21:33:15,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:33:15,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.80 | bwd_microstep: 3372.87 | bwd_inner_microstep: 3371.89 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.33 [2025-06-19 21:33:15,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.80 | bwd: 3372.89 | bwd_inner: 3371.89 | bwd_allreduce: 0.95 | step: 7.34 51%|█████ | 5102/10000 [8:03:36<7:28:47, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.04004286974668503, 'learning_rate': 2.0310907986375733e-05, 'epoch': 5.1} 51%|█████ | 5102/10000 [8:03:36<7:28:47, 
5.50s/it][2025-06-19 21:33:21,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:33:21,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.00 | bwd_microstep: 3366.97 | bwd_inner_microstep: 3366.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 21:33:21,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.00 | bwd: 3366.98 | bwd_inner: 3366.18 | bwd_allreduce: 0.76 | step: 6.75 51%|█████ | 5103/10000 [8:03:42<7:29:41, 5.51s/it] {'loss': 0.0016, 'grad_norm': 0.153132826089859, 'learning_rate': 2.030443124227969e-05, 'epoch': 5.1} 51%|█████ | 5103/10000 [8:03:42<7:29:41, 5.51s/it][2025-06-19 21:33:26,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:33:26,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.13 | bwd_microstep: 3312.32 | bwd_inner_microstep: 3311.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 21:33:26,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.13 | bwd: 3312.33 | bwd_inner: 3311.54 | bwd_allreduce: 0.75 | step: 6.57 51%|█████ | 5104/10000 [8:03:47<7:28:11, 5.49s/it] {'loss': 0.0041, 'grad_norm': 0.3283708989620209, 'learning_rate': 2.029795446625022e-05, 'epoch': 5.1} 51%|█████ | 5104/10000 [8:03:47<7:28:11, 5.49s/it][2025-06-19 21:33:32,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:33:32,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.33 | bwd_microstep: 3322.49 | bwd_inner_microstep: 3321.57 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.54 [2025-06-19 21:33:32,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.33 | bwd: 3322.50 | bwd_inner: 3321.57 | 
bwd_allreduce: 0.89 | step: 7.54 51%|█████ | 5105/10000 [8:03:52<7:27:32, 5.49s/it] {'loss': 0.0027, 'grad_norm': 0.17946398258209229, 'learning_rate': 2.0291477658966707e-05, 'epoch': 5.11} 51%|█████ | 5105/10000 [8:03:52<7:27:32, 5.49s/it][2025-06-19 21:33:37,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:33:37,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.45 | bwd_microstep: 3365.43 | bwd_inner_microstep: 3364.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.57 [2025-06-19 21:33:37,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.45 | bwd: 3365.44 | bwd_inner: 3364.63 | bwd_allreduce: 0.76 | step: 6.59 51%|█████ | 5106/10000 [8:03:58<7:28:24, 5.50s/it] {'loss': 0.0031, 'grad_norm': 0.2846596837043762, 'learning_rate': 2.0285000821108548e-05, 'epoch': 5.11} 51%|█████ | 5106/10000 [8:03:58<7:28:24, 5.50s/it][2025-06-19 21:33:43,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:33:43,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.64 | bwd_microstep: 3322.60 | bwd_inner_microstep: 3321.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 21:33:43,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.64 | bwd: 3322.62 | bwd_inner: 3321.82 | bwd_allreduce: 0.76 | step: 6.71 51%|█████ | 5107/10000 [8:04:03<7:27:41, 5.49s/it] {'loss': 0.0154, 'grad_norm': 1.8896701335906982, 'learning_rate': 2.0278523953355124e-05, 'epoch': 5.11} 51%|█████ | 5107/10000 [8:04:03<7:27:41, 5.49s/it][2025-06-19 21:33:48,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:33:48,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.87 | 
bwd_microstep: 3312.82 | bwd_inner_microstep: 3312.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 21:33:48,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.87 | bwd: 3312.84 | bwd_inner: 3312.02 | bwd_allreduce: 0.77 | step: 6.99 51%|█████ | 5108/10000 [8:04:09<7:27:00, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.053001247346401215, 'learning_rate': 2.027204705638582e-05, 'epoch': 5.11} 51%|█████ | 5108/10000 [8:04:09<7:27:00, 5.48s/it][2025-06-19 21:33:54,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:33:54,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.74 | bwd_microstep: 3315.84 | bwd_inner_microstep: 3315.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 21:33:54,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.74 | bwd: 3315.85 | bwd_inner: 3315.06 | bwd_allreduce: 0.75 | step: 6.70 51%|█████ | 5109/10000 [8:04:14<7:26:22, 5.48s/it] {'loss': 0.0254, 'grad_norm': 1.2754322290420532, 'learning_rate': 2.0265570130880054e-05, 'epoch': 5.11} 51%|█████ | 5109/10000 [8:04:14<7:26:22, 5.48s/it][2025-06-19 21:33:59,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:33:59,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.33 | bwd_microstep: 3315.85 | bwd_inner_microstep: 3315.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-19 21:33:59,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.33 | bwd: 3315.87 | bwd_inner: 3315.06 | bwd_allreduce: 0.77 | step: 6.61 51%|█████ | 5110/10000 [8:04:20<7:25:42, 5.47s/it] {'loss': 0.0028, 'grad_norm': 0.2866098880767822, 'learning_rate': 2.025909317751721e-05, 'epoch': 5.11} 51%|█████ | 5110/10000 [8:04:20<7:25:42, 5.47s/it][2025-06-19 21:34:05,003] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:34:05,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.28 | bwd_microstep: 3314.31 | bwd_inner_microstep: 3313.34 | bwd_allreduce_microstep: 0.93 | step_microstep: 6.97 [2025-06-19 21:34:05,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.28 | bwd: 3314.33 | bwd_inner: 3313.34 | bwd_allreduce: 0.95 | step: 6.97 51%|█████ | 5111/10000 [8:04:25<7:25:21, 5.47s/it] {'loss': 0.0727, 'grad_norm': 18.910400390625, 'learning_rate': 2.0252616196976692e-05, 'epoch': 5.11} 51%|█████ | 5111/10000 [8:04:25<7:25:21, 5.47s/it][2025-06-19 21:34:10,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 21:34:10,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.95 | bwd_microstep: 3318.65 | bwd_inner_microstep: 3317.69 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.50 [2025-06-19 21:34:10,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.96 | bwd: 3318.67 | bwd_inner: 3317.69 | bwd_allreduce: 0.93 | step: 7.51 51%|█████ | 5112/10000 [8:04:31<7:25:40, 5.47s/it] {'loss': 0.0015, 'grad_norm': 0.12070745974779129, 'learning_rate': 2.0246139189937905e-05, 'epoch': 5.11} 51%|█████ | 5112/10000 [8:04:31<7:25:40, 5.47s/it][2025-06-19 21:34:16,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:34:16,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.90 | bwd_microstep: 3372.45 | bwd_inner_microstep: 3371.42 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.14 [2025-06-19 21:34:16,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.90 | bwd: 3372.46 | bwd_inner: 3371.42 | bwd_allreduce: 0.99 | step: 7.14 51%|█████ | 
5113/10000 [8:04:36<7:27:24, 5.49s/it] {'loss': 0.0174, 'grad_norm': 3.794058084487915, 'learning_rate': 2.0239662157080266e-05, 'epoch': 5.11} 51%|█████ | 5113/10000 [8:04:36<7:27:24, 5.49s/it][2025-06-19 21:34:21,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:34:21,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.64 | bwd_microstep: 3379.40 | bwd_inner_microstep: 3378.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 21:34:21,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.64 | bwd: 3379.41 | bwd_inner: 3378.62 | bwd_allreduce: 0.75 | step: 6.69 51%|█████ | 5114/10000 [8:04:42<7:28:43, 5.51s/it] {'loss': 0.0131, 'grad_norm': 1.9620174169540405, 'learning_rate': 2.0233185099083177e-05, 'epoch': 5.11} 51%|█████ | 5114/10000 [8:04:42<7:28:43, 5.51s/it][2025-06-19 21:34:27,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.89 [2025-06-19 21:34:27,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.91 | bwd_microstep: 3310.37 | bwd_inner_microstep: 3309.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 21:34:27,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.91 | bwd: 3310.39 | bwd_inner: 3309.58 | bwd_allreduce: 0.76 | step: 6.82 51%|█████ | 5115/10000 [8:04:47<7:27:01, 5.49s/it] {'loss': 0.0079, 'grad_norm': 1.3711386919021606, 'learning_rate': 2.022670801662605e-05, 'epoch': 5.12} 51%|█████ | 5115/10000 [8:04:47<7:27:01, 5.49s/it][2025-06-19 21:34:32,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:34:32,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.81 | bwd_microstep: 3318.28 | bwd_inner_microstep: 3317.49 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 21:34:32,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.81 | bwd: 3318.29 | bwd_inner: 3317.49 | bwd_allreduce: 0.76 | step: 6.63 51%|█████ | 5116/10000 [8:04:53<7:26:01, 5.48s/it] {'loss': 0.0669, 'grad_norm': 7.483956813812256, 'learning_rate': 2.0220230910388313e-05, 'epoch': 5.12} 51%|█████ | 5116/10000 [8:04:53<7:26:01, 5.48s/it][2025-06-19 21:34:37,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:34:37,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.48 | bwd_microstep: 3315.87 | bwd_inner_microstep: 3314.88 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.07 [2025-06-19 21:34:37,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.48 | bwd: 3315.88 | bwd_inner: 3314.88 | bwd_allreduce: 0.96 | step: 7.07 51%|█████ | 5117/10000 [8:04:58<7:25:34, 5.48s/it] {'loss': 0.1518, 'grad_norm': 5.491461277008057, 'learning_rate': 2.0213753781049366e-05, 'epoch': 5.12} 51%|█████ | 5117/10000 [8:04:58<7:25:34, 5.48s/it][2025-06-19 21:34:43,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:34:43,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.51 | bwd_microstep: 3376.46 | bwd_inner_microstep: 3375.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-19 21:34:43,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.51 | bwd: 3376.47 | bwd_inner: 3375.66 | bwd_allreduce: 0.77 | step: 6.86 51%|█████ | 5118/10000 [8:05:04<7:27:16, 5.50s/it] {'loss': 0.0017, 'grad_norm': 0.27350902557373047, 'learning_rate': 2.0207276629288644e-05, 'epoch': 5.12} 51%|█████ | 5118/10000 [8:05:04<7:27:16, 5.50s/it][2025-06-19 21:34:48,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:34:48,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.27 | bwd_microstep: 3323.71 | bwd_inner_microstep: 3322.84 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.38 [2025-06-19 21:34:48,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.27 | bwd: 3323.74 | bwd_inner: 3322.84 | bwd_allreduce: 0.84 | step: 7.39 51%|█████ | 5119/10000 [8:05:09<7:26:25, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.12638290226459503, 'learning_rate': 2.020079945578556e-05, 'epoch': 5.12} 51%|█████ | 5119/10000 [8:05:09<7:26:25, 5.49s/it][2025-06-19 21:34:54,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:34:54,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.30 | bwd_microstep: 3315.22 | bwd_inner_microstep: 3314.44 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 21:34:54,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.30 | bwd: 3315.24 | bwd_inner: 3314.44 | bwd_allreduce: 0.76 | step: 6.71 51%|█████ | 5120/10000 [8:05:15<7:25:30, 5.48s/it] {'loss': 0.0006, 'grad_norm': 0.06988026201725006, 'learning_rate': 2.0194322261219548e-05, 'epoch': 5.12} 51%|█████ | 5120/10000 [8:05:15<7:25:30, 5.48s/it][2025-06-19 21:34:59,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:34:59,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.15 | bwd_microstep: 3314.07 | bwd_inner_microstep: 3313.21 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.87 [2025-06-19 21:34:59,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.15 | bwd: 3314.08 | bwd_inner: 3313.21 | bwd_allreduce: 0.83 | step: 6.88 51%|█████ | 5121/10000 [8:05:20<7:25:00, 5.47s/it] {'loss': 
0.0011, 'grad_norm': 0.13690395653247833, 'learning_rate': 2.0187845046270042e-05, 'epoch': 5.12} 51%|█████ | 5121/10000 [8:05:20<7:25:00, 5.47s/it][2025-06-19 21:35:05,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:35:05,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.18 | bwd_microstep: 3336.50 | bwd_inner_microstep: 3335.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 21:35:05,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.18 | bwd: 3336.52 | bwd_inner: 3335.72 | bwd_allreduce: 0.75 | step: 6.63 51%|█████ | 5122/10000 [8:05:26<7:25:21, 5.48s/it] {'loss': 0.0015, 'grad_norm': 0.21541288495063782, 'learning_rate': 2.018136781161645e-05, 'epoch': 5.12} 51%|█████ | 5122/10000 [8:05:26<7:25:21, 5.48s/it][2025-06-19 21:35:10,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:35:10,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.03 | bwd_microstep: 3321.44 | bwd_inner_microstep: 3320.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 21:35:10,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.03 | bwd: 3321.45 | bwd_inner: 3320.65 | bwd_allreduce: 0.77 | step: 6.64 51%|█████ | 5123/10000 [8:05:31<7:24:52, 5.47s/it] {'loss': 0.0537, 'grad_norm': 4.760541915893555, 'learning_rate': 2.017489055793822e-05, 'epoch': 5.12} 51%|█████ | 5123/10000 [8:05:31<7:24:52, 5.47s/it][2025-06-19 21:35:16,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:35:16,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.59 | bwd_microstep: 3369.03 | bwd_inner_microstep: 3368.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 
6.60 [2025-06-19 21:35:16,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.59 | bwd: 3369.04 | bwd_inner: 3368.24 | bwd_allreduce: 0.76 | step: 6.61 51%|█████ | 5124/10000 [8:05:37<7:26:27, 5.49s/it] {'loss': 0.0022, 'grad_norm': 0.17070835828781128, 'learning_rate': 2.0168413285914786e-05, 'epoch': 5.12} 51%|█████ | 5124/10000 [8:05:37<7:26:27, 5.49s/it][2025-06-19 21:35:21,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:35:21,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.47 | bwd_microstep: 3375.64 | bwd_inner_microstep: 3374.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 21:35:21,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.47 | bwd: 3375.66 | bwd_inner: 3374.85 | bwd_allreduce: 0.77 | step: 6.69 51%|█████▏ | 5125/10000 [8:05:42<7:27:22, 5.51s/it] {'loss': 0.0026, 'grad_norm': 0.6067229509353638, 'learning_rate': 2.0161935996225573e-05, 'epoch': 5.12} 51%|█████▏ | 5125/10000 [8:05:42<7:27:22, 5.51s/it][2025-06-19 21:35:27,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:35:27,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.64 | bwd_microstep: 3322.49 | bwd_inner_microstep: 3321.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 21:35:27,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.64 | bwd: 3322.50 | bwd_inner: 3321.69 | bwd_allreduce: 0.76 | step: 6.98 51%|█████▏ | 5126/10000 [8:05:48<7:26:29, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.020374495536088943, 'learning_rate': 2.0155458689550036e-05, 'epoch': 5.13} 51%|█████▏ | 5126/10000 [8:05:48<7:26:29, 5.50s/it][2025-06-19 21:35:32,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.50 | optimizer_step: 2.73 [2025-06-19 21:35:32,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.79 | bwd_microstep: 3313.98 | bwd_inner_microstep: 3313.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 21:35:32,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.79 | bwd: 3313.99 | bwd_inner: 3313.20 | bwd_allreduce: 0.76 | step: 6.68 51%|█████▏ | 5127/10000 [8:05:53<7:25:33, 5.49s/it] {'loss': 0.0079, 'grad_norm': 1.2734938859939575, 'learning_rate': 2.0148981366567593e-05, 'epoch': 5.13} 51%|█████▏ | 5127/10000 [8:05:53<7:25:33, 5.49s/it][2025-06-19 21:35:38,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 21:35:38,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.99 | bwd_microstep: 3370.46 | bwd_inner_microstep: 3369.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.51 [2025-06-19 21:35:38,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.99 | bwd: 3370.47 | bwd_inner: 3369.66 | bwd_allreduce: 0.76 | step: 6.52 51%|█████▏ | 5128/10000 [8:05:59<7:26:45, 5.50s/it] {'loss': 0.0177, 'grad_norm': 1.5907583236694336, 'learning_rate': 2.0142504027957705e-05, 'epoch': 5.13} 51%|█████▏ | 5128/10000 [8:05:59<7:26:45, 5.50s/it][2025-06-19 21:35:43,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:35:43,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.56 | bwd_microstep: 3318.92 | bwd_inner_microstep: 3318.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 21:35:43,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.56 | bwd: 3318.93 | bwd_inner: 3318.14 | bwd_allreduce: 0.75 | step: 6.58 51%|█████▏ | 5129/10000 [8:06:04<7:25:41, 5.49s/it] {'loss': 0.0053, 'grad_norm': 0.8765392303466797, 
'learning_rate': 2.0136026674399795e-05, 'epoch': 5.13} 51%|█████▏ | 5129/10000 [8:06:04<7:25:41, 5.49s/it][2025-06-19 21:35:49,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:35:49,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.60 | bwd_microstep: 3402.78 | bwd_inner_microstep: 3401.80 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.15 [2025-06-19 21:35:49,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.60 | bwd: 3402.79 | bwd_inner: 3401.80 | bwd_allreduce: 0.94 | step: 7.15 51%|█████▏ | 5130/10000 [8:06:10<7:27:53, 5.52s/it] {'loss': 0.0174, 'grad_norm': 1.2117329835891724, 'learning_rate': 2.0129549306573324e-05, 'epoch': 5.13} 51%|█████▏ | 5130/10000 [8:06:10<7:27:53, 5.52s/it][2025-06-19 21:35:54,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:35:54,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.92 | bwd_microstep: 3377.77 | bwd_inner_microstep: 3376.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 21:35:54,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.92 | bwd: 3377.78 | bwd_inner: 3376.98 | bwd_allreduce: 0.76 | step: 6.65 51%|█████▏ | 5131/10000 [8:06:15<7:28:39, 5.53s/it] {'loss': 0.0093, 'grad_norm': 0.5343098640441895, 'learning_rate': 2.0123071925157735e-05, 'epoch': 5.13} 51%|█████▏ | 5131/10000 [8:06:15<7:28:39, 5.53s/it][2025-06-19 21:36:00,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:36:00,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.73 | bwd_microstep: 3367.31 | bwd_inner_microstep: 3366.34 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.03 [2025-06-19 21:36:00,505] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.73 | bwd: 3367.32 | bwd_inner: 3366.34 | bwd_allreduce: 0.94 | step: 7.03 51%|█████▏ | 5132/10000 [8:06:21<7:28:33, 5.53s/it] {'loss': 0.0052, 'grad_norm': 0.6102306842803955, 'learning_rate': 2.0116594530832468e-05, 'epoch': 5.13} 51%|█████▏ | 5132/10000 [8:06:21<7:28:33, 5.53s/it][2025-06-19 21:36:05,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:36:05,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.78 | bwd_microstep: 3319.19 | bwd_inner_microstep: 3318.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 21:36:05,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.78 | bwd: 3319.21 | bwd_inner: 3318.40 | bwd_allreduce: 0.77 | step: 6.65 51%|█████▏ | 5133/10000 [8:06:26<7:27:00, 5.51s/it] {'loss': 0.0252, 'grad_norm': 3.661266565322876, 'learning_rate': 2.0110117124276978e-05, 'epoch': 5.13} 51%|█████▏ | 5133/10000 [8:06:26<7:27:00, 5.51s/it][2025-06-19 21:36:11,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 21:36:11,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.50 | bwd_microstep: 3329.88 | bwd_inner_microstep: 3328.81 | bwd_allreduce_microstep: 1.00 | step_microstep: 8.18 [2025-06-19 21:36:11,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.50 | bwd: 3329.91 | bwd_inner: 3328.81 | bwd_allreduce: 1.03 | step: 8.19 51%|█████▏ | 5134/10000 [8:06:32<7:26:03, 5.50s/it] {'loss': 0.0336, 'grad_norm': 3.5929954051971436, 'learning_rate': 2.0103639706170716e-05, 'epoch': 5.13} 51%|█████▏ | 5134/10000 [8:06:32<7:26:03, 5.50s/it][2025-06-19 21:36:16,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 
21:36:16,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.71 | bwd_microstep: 3330.22 | bwd_inner_microstep: 3329.33 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.88 [2025-06-19 21:36:16,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.71 | bwd: 3330.23 | bwd_inner: 3329.33 | bwd_allreduce: 0.85 | step: 6.88 51%|█████▏ | 5135/10000 [8:06:37<7:25:41, 5.50s/it] {'loss': 0.0029, 'grad_norm': 0.26630112528800964, 'learning_rate': 2.0097162277193125e-05, 'epoch': 5.13} 51%|█████▏ | 5135/10000 [8:06:37<7:25:41, 5.50s/it][2025-06-19 21:36:22,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.74 [2025-06-19 21:36:22,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.41 | bwd_microstep: 3319.20 | bwd_inner_microstep: 3318.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 21:36:22,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.41 | bwd: 3319.22 | bwd_inner: 3318.43 | bwd_allreduce: 0.75 | step: 6.56 51%|█████▏ | 5136/10000 [8:06:43<7:24:58, 5.49s/it] {'loss': 0.0245, 'grad_norm': 2.1697285175323486, 'learning_rate': 2.0090684838023664e-05, 'epoch': 5.14} 51%|█████▏ | 5136/10000 [8:06:43<7:24:58, 5.49s/it][2025-06-19 21:36:27,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:36:27,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.67 | bwd_microstep: 3386.19 | bwd_inner_microstep: 3385.12 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.44 [2025-06-19 21:36:27,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.67 | bwd: 3386.21 | bwd_inner: 3385.12 | bwd_allreduce: 1.03 | step: 7.44 51%|█████▏ | 5137/10000 [8:06:48<7:26:34, 5.51s/it] {'loss': 0.0008, 'grad_norm': 0.13537177443504333, 'learning_rate': 2.008420738934178e-05, 
'epoch': 5.14} 51%|█████▏ | 5137/10000 [8:06:48<7:26:34, 5.51s/it][2025-06-19 21:36:33,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:36:33,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.12 | bwd_microstep: 3378.35 | bwd_inner_microstep: 3377.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 21:36:33,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.12 | bwd: 3378.36 | bwd_inner: 3377.56 | bwd_allreduce: 0.76 | step: 6.77 51%|█████▏ | 5138/10000 [8:06:54<7:27:32, 5.52s/it] {'loss': 0.0059, 'grad_norm': 0.7911497950553894, 'learning_rate': 2.0077729931826937e-05, 'epoch': 5.14} 51%|█████▏ | 5138/10000 [8:06:54<7:27:32, 5.52s/it][2025-06-19 21:36:39,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:36:39,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.40 | bwd_microstep: 3378.45 | bwd_inner_microstep: 3377.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 21:36:39,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.40 | bwd: 3378.47 | bwd_inner: 3377.66 | bwd_allreduce: 0.76 | step: 6.85 51%|█████▏ | 5139/10000 [8:06:59<7:28:09, 5.53s/it] {'loss': 0.0037, 'grad_norm': 0.6714186072349548, 'learning_rate': 2.0071252466158583e-05, 'epoch': 5.14} 51%|█████▏ | 5139/10000 [8:06:59<7:28:09, 5.53s/it][2025-06-19 21:36:44,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:36:44,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.57 | bwd_microstep: 3369.71 | bwd_inner_microstep: 3368.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 21:36:44,609] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2128.57 | bwd: 3369.72 | bwd_inner: 3368.93 | bwd_allreduce: 0.75 | step: 6.56 51%|█████▏ | 5140/10000 [8:07:05<7:28:08, 5.53s/it] {'loss': 0.0025, 'grad_norm': 0.26194146275520325, 'learning_rate': 2.006477499301618e-05, 'epoch': 5.14} 51%|█████▏ | 5140/10000 [8:07:05<7:28:08, 5.53s/it][2025-06-19 21:36:50,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:36:50,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.90 | bwd_microstep: 3385.08 | bwd_inner_microstep: 3384.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 21:36:50,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.90 | bwd: 3385.09 | bwd_inner: 3384.29 | bwd_allreduce: 0.75 | step: 6.59 51%|█████▏ | 5141/10000 [8:07:10<7:28:45, 5.54s/it] {'loss': 0.0228, 'grad_norm': 1.7324692010879517, 'learning_rate': 2.0058297513079172e-05, 'epoch': 5.14} 51%|█████▏ | 5141/10000 [8:07:10<7:28:45, 5.54s/it][2025-06-19 21:36:55,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 21:36:55,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.41 | bwd_microstep: 3376.48 | bwd_inner_microstep: 3375.45 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.28 [2025-06-19 21:36:55,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.41 | bwd: 3376.50 | bwd_inner: 3375.45 | bwd_allreduce: 1.00 | step: 7.29 51%|█████▏ | 5142/10000 [8:07:16<7:29:05, 5.55s/it] {'loss': 0.0067, 'grad_norm': 1.268980860710144, 'learning_rate': 2.0051820027027037e-05, 'epoch': 5.14} 51%|█████▏ | 5142/10000 [8:07:16<7:29:05, 5.55s/it][2025-06-19 21:37:01,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:37:01,300] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.22 | bwd_microstep: 3385.82 | bwd_inner_microstep: 3385.00 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.81 [2025-06-19 21:37:01,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.22 | bwd: 3385.84 | bwd_inner: 3385.00 | bwd_allreduce: 0.79 | step: 6.81 51%|█████▏ | 5143/10000 [8:07:22<7:29:38, 5.55s/it] {'loss': 0.0446, 'grad_norm': 4.380929946899414, 'learning_rate': 2.004534253553921e-05, 'epoch': 5.14} 51%|█████▏ | 5143/10000 [8:07:22<7:29:38, 5.55s/it][2025-06-19 21:37:06,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:37:06,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.74 | bwd_microstep: 3326.47 | bwd_inner_microstep: 3325.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 21:37:06,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.74 | bwd: 3326.49 | bwd_inner: 3325.67 | bwd_allreduce: 0.77 | step: 6.75 51%|█████▏ | 5144/10000 [8:07:27<7:27:47, 5.53s/it] {'loss': 0.0041, 'grad_norm': 0.4721914529800415, 'learning_rate': 2.003886503929517e-05, 'epoch': 5.14} 51%|█████▏ | 5144/10000 [8:07:27<7:27:47, 5.53s/it][2025-06-19 21:37:12,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:37:12,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.22 | bwd_microstep: 3332.43 | bwd_inner_microstep: 3331.64 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 21:37:12,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.22 | bwd: 3332.44 | bwd_inner: 3331.64 | bwd_allreduce: 0.75 | step: 6.69 51%|█████▏ | 5145/10000 [8:07:33<7:26:25, 5.52s/it] {'loss': 0.0056, 'grad_norm': 1.5923480987548828, 'learning_rate': 2.003238753897436e-05, 'epoch': 5.14} 51%|█████▏ | 
5145/10000 [8:07:33<7:26:25, 5.52s/it][2025-06-19 21:37:17,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:37:17,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.40 | bwd_microstep: 3323.33 | bwd_inner_microstep: 3322.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.97 [2025-06-19 21:37:17,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.40 | bwd: 3323.34 | bwd_inner: 3322.55 | bwd_allreduce: 0.75 | step: 6.98 51%|█████▏ | 5146/10000 [8:07:38<7:25:09, 5.50s/it] {'loss': 0.0067, 'grad_norm': 0.9123309254646301, 'learning_rate': 2.0025910035256254e-05, 'epoch': 5.15} 51%|█████▏ | 5146/10000 [8:07:38<7:25:09, 5.50s/it][2025-06-19 21:37:23,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:37:23,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.93 | bwd_microstep: 3328.19 | bwd_inner_microstep: 3327.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 21:37:23,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.93 | bwd: 3328.21 | bwd_inner: 3327.39 | bwd_allreduce: 0.77 | step: 6.68 51%|█████▏ | 5147/10000 [8:07:44<7:24:21, 5.49s/it] {'loss': 0.0042, 'grad_norm': 0.479447603225708, 'learning_rate': 2.0019432528820305e-05, 'epoch': 5.15} 51%|█████▏ | 5147/10000 [8:07:44<7:24:21, 5.49s/it][2025-06-19 21:37:28,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:37:28,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.30 | bwd_microstep: 3325.82 | bwd_inner_microstep: 3325.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 21:37:28,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.30 | bwd: 
3325.84 | bwd_inner: 3325.04 | bwd_allreduce: 0.75 | step: 6.54 51%|█████▏ | 5148/10000 [8:07:49<7:23:53, 5.49s/it] {'loss': 0.0056, 'grad_norm': 0.5175116658210754, 'learning_rate': 2.001295502034597e-05, 'epoch': 5.15} 51%|█████▏ | 5148/10000 [8:07:49<7:23:53, 5.49s/it][2025-06-19 21:37:34,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:37:34,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.48 | bwd_microstep: 3318.51 | bwd_inner_microstep: 3317.66 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.08 [2025-06-19 21:37:34,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.48 | bwd: 3318.53 | bwd_inner: 3317.66 | bwd_allreduce: 0.81 | step: 7.09 51%|█████▏ | 5149/10000 [8:07:54<7:23:13, 5.48s/it] {'loss': 0.002, 'grad_norm': 0.40860891342163086, 'learning_rate': 2.000647751051272e-05, 'epoch': 5.15} 51%|█████▏ | 5149/10000 [8:07:54<7:23:13, 5.48s/it][2025-06-19 21:37:39,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:37:39,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.64 | bwd_microstep: 3327.70 | bwd_inner_microstep: 3326.91 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-19 21:37:39,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.64 | bwd: 3327.71 | bwd_inner: 3326.91 | bwd_allreduce: 0.76 | step: 6.74 52%|█████▏ | 5150/10000 [8:08:00<7:22:56, 5.48s/it] {'loss': 0.008, 'grad_norm': 1.7654685974121094, 'learning_rate': 2e-05, 'epoch': 5.15} 52%|█████▏ | 5150/10000 [8:08:00<7:22:56, 5.48s/it][2025-06-19 21:37:45,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:37:45,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2139.32 | bwd_microstep: 3376.85 | bwd_inner_microstep: 3375.91 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.02 [2025-06-19 21:37:45,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.32 | bwd: 3376.87 | bwd_inner: 3375.91 | bwd_allreduce: 0.92 | step: 7.03 52%|█████▏ | 5151/10000 [8:08:05<7:24:40, 5.50s/it] {'loss': 0.0246, 'grad_norm': 1.5652921199798584, 'learning_rate': 1.9993522489487287e-05, 'epoch': 5.15} 52%|█████▏ | 5151/10000 [8:08:05<7:24:40, 5.50s/it][2025-06-19 21:37:50,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:37:50,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.23 | bwd_microstep: 3334.80 | bwd_inner_microstep: 3333.84 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.51 [2025-06-19 21:37:50,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.23 | bwd: 3334.81 | bwd_inner: 3333.84 | bwd_allreduce: 0.93 | step: 7.52 52%|█████▏ | 5152/10000 [8:08:11<7:24:15, 5.50s/it] {'loss': 0.072, 'grad_norm': 5.692753314971924, 'learning_rate': 1.998704497965403e-05, 'epoch': 5.15} 52%|█████▏ | 5152/10000 [8:08:11<7:24:15, 5.50s/it][2025-06-19 21:37:56,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:37:56,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.61 | bwd_microstep: 3376.53 | bwd_inner_microstep: 3375.71 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.01 [2025-06-19 21:37:56,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.61 | bwd: 3376.54 | bwd_inner: 3375.71 | bwd_allreduce: 0.79 | step: 7.02 52%|█████▏ | 5153/10000 [8:08:17<7:25:27, 5.51s/it] {'loss': 0.0014, 'grad_norm': 0.16992928087711334, 'learning_rate': 1.99805674711797e-05, 'epoch': 5.15} 52%|█████▏ | 5153/10000 [8:08:17<7:25:27, 5.51s/it][2025-06-19 
21:38:01,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:38:01,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.95 | bwd_microstep: 3375.47 | bwd_inner_microstep: 3374.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 21:38:01,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.95 | bwd: 3375.48 | bwd_inner: 3374.68 | bwd_allreduce: 0.76 | step: 6.81 52%|█████▏ | 5154/10000 [8:08:22<7:26:10, 5.52s/it] {'loss': 0.0017, 'grad_norm': 0.2873726487159729, 'learning_rate': 1.997408996474375e-05, 'epoch': 5.15} 52%|█████▏ | 5154/10000 [8:08:22<7:26:10, 5.52s/it][2025-06-19 21:38:07,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:38:07,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.75 | bwd_microstep: 3322.70 | bwd_inner_microstep: 3321.77 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.09 [2025-06-19 21:38:07,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.75 | bwd: 3322.72 | bwd_inner: 3321.77 | bwd_allreduce: 0.90 | step: 7.10 52%|█████▏ | 5155/10000 [8:08:28<7:25:01, 5.51s/it] {'loss': 0.0012, 'grad_norm': 0.17684103548526764, 'learning_rate': 1.996761246102564e-05, 'epoch': 5.16} 52%|█████▏ | 5155/10000 [8:08:28<7:25:01, 5.51s/it][2025-06-19 21:38:12,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:38:12,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.24 | bwd_microstep: 3377.87 | bwd_inner_microstep: 3377.05 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.32 [2025-06-19 21:38:12,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.24 | bwd: 3377.89 | bwd_inner: 3377.05 | bwd_allreduce: 0.80 | 
step: 7.32 52%|█████▏ | 5156/10000 [8:08:33<7:26:01, 5.52s/it] {'loss': 0.0219, 'grad_norm': 2.461519479751587, 'learning_rate': 1.9961134960704834e-05, 'epoch': 5.16} 52%|█████▏ | 5156/10000 [8:08:33<7:26:01, 5.52s/it][2025-06-19 21:38:18,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:38:18,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.98 | bwd_microstep: 3325.48 | bwd_inner_microstep: 3324.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 21:38:18,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.98 | bwd: 3325.50 | bwd_inner: 3324.69 | bwd_allreduce: 0.76 | step: 6.70 52%|█████▏ | 5157/10000 [8:08:39<7:24:34, 5.51s/it] {'loss': 0.0015, 'grad_norm': 0.352936327457428, 'learning_rate': 1.995465746446079e-05, 'epoch': 5.16} 52%|█████▏ | 5157/10000 [8:08:39<7:24:34, 5.51s/it][2025-06-19 21:38:23,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:38:23,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.10 | bwd_microstep: 3324.52 | bwd_inner_microstep: 3323.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-19 21:38:23,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.10 | bwd: 3324.53 | bwd_inner: 3323.73 | bwd_allreduce: 0.76 | step: 7.02 52%|█████▏ | 5158/10000 [8:08:44<7:23:30, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.2862941324710846, 'learning_rate': 1.9948179972972976e-05, 'epoch': 5.16} 52%|█████▏ | 5158/10000 [8:08:44<7:23:30, 5.50s/it][2025-06-19 21:38:29,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:38:29,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.03 | bwd_microstep: 3378.95 | 
bwd_inner_microstep: 3378.09 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.88 [2025-06-19 21:38:29,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.03 | bwd: 3378.96 | bwd_inner: 3378.09 | bwd_allreduce: 0.83 | step: 6.88 52%|█████▏ | 5159/10000 [8:08:50<7:24:35, 5.51s/it] {'loss': 0.0029, 'grad_norm': 0.41265878081321716, 'learning_rate': 1.994170248692083e-05, 'epoch': 5.16} 52%|█████▏ | 5159/10000 [8:08:50<7:24:35, 5.51s/it][2025-06-19 21:38:34,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:38:34,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.48 | bwd_microstep: 3330.95 | bwd_inner_microstep: 3330.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 21:38:34,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.48 | bwd: 3330.97 | bwd_inner: 3330.17 | bwd_allreduce: 0.76 | step: 6.58 52%|█████▏ | 5160/10000 [8:08:55<7:23:41, 5.50s/it] {'loss': 0.0504, 'grad_norm': 6.126983642578125, 'learning_rate': 1.993522500698383e-05, 'epoch': 5.16} 52%|█████▏ | 5160/10000 [8:08:55<7:23:41, 5.50s/it][2025-06-19 21:38:40,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:38:40,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.63 | bwd_microstep: 3374.21 | bwd_inner_microstep: 3373.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 21:38:40,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.63 | bwd: 3374.23 | bwd_inner: 3373.43 | bwd_allreduce: 0.76 | step: 6.79 52%|█████▏ | 5161/10000 [8:09:01<7:24:35, 5.51s/it] {'loss': 0.0032, 'grad_norm': 0.48519182205200195, 'learning_rate': 1.992874753384142e-05, 'epoch': 5.16} 52%|█████▏ | 5161/10000 [8:09:01<7:24:35, 5.51s/it][2025-06-19 21:38:45,766] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:38:45,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.80 | bwd_microstep: 3320.33 | bwd_inner_microstep: 3319.34 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.44 [2025-06-19 21:38:45,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.80 | bwd: 3320.34 | bwd_inner: 3319.34 | bwd_allreduce: 0.95 | step: 7.44 52%|█████▏ | 5162/10000 [8:09:06<7:23:20, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.3745526671409607, 'learning_rate': 1.9922270068173066e-05, 'epoch': 5.16} 52%|█████▏ | 5162/10000 [8:09:06<7:23:20, 5.50s/it][2025-06-19 21:38:51,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:38:51,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.05 | bwd_microstep: 3317.34 | bwd_inner_microstep: 3316.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 21:38:51,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.05 | bwd: 3317.35 | bwd_inner: 3316.54 | bwd_allreduce: 0.76 | step: 6.80 52%|█████▏ | 5163/10000 [8:09:12<7:22:26, 5.49s/it] {'loss': 0.0035, 'grad_norm': 0.3973959684371948, 'learning_rate': 1.9915792610658223e-05, 'epoch': 5.16} 52%|█████▏ | 5163/10000 [8:09:12<7:22:26, 5.49s/it][2025-06-19 21:38:56,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:38:56,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.06 | bwd_microstep: 3326.78 | bwd_inner_microstep: 3325.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 21:38:56,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.06 | bwd: 3326.79 | bwd_inner: 3325.99 | bwd_allreduce: 0.76 | step: 6.65 52%|█████▏ 
| 5164/10000 [8:09:17<7:22:02, 5.48s/it] {'loss': 0.0022, 'grad_norm': 0.3213501572608948, 'learning_rate': 1.9909315161976342e-05, 'epoch': 5.16} 52%|█████▏ | 5164/10000 [8:09:17<7:22:02, 5.48s/it][2025-06-19 21:39:02,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:39:02,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.07 | bwd_microstep: 3330.93 | bwd_inner_microstep: 3330.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 21:39:02,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.07 | bwd: 3330.95 | bwd_inner: 3330.13 | bwd_allreduce: 0.77 | step: 6.96 52%|█████▏ | 5165/10000 [8:09:22<7:21:46, 5.48s/it] {'loss': 0.008, 'grad_norm': 1.2408992052078247, 'learning_rate': 1.990283772280688e-05, 'epoch': 5.17} 52%|█████▏ | 5165/10000 [8:09:22<7:21:46, 5.48s/it][2025-06-19 21:39:07,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:39:07,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.05 | bwd_microstep: 3323.80 | bwd_inner_microstep: 3323.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 21:39:07,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.05 | bwd: 3323.82 | bwd_inner: 3323.01 | bwd_allreduce: 0.77 | step: 6.92 52%|█████▏ | 5166/10000 [8:09:28<7:21:25, 5.48s/it] {'loss': 0.0332, 'grad_norm': 2.308898687362671, 'learning_rate': 1.9896360293829287e-05, 'epoch': 5.17} 52%|█████▏ | 5166/10000 [8:09:28<7:21:25, 5.48s/it][2025-06-19 21:39:13,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:39:13,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.92 | bwd_microstep: 3330.72 | bwd_inner_microstep: 3329.93 
| bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 21:39:13,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.92 | bwd: 3330.74 | bwd_inner: 3329.93 | bwd_allreduce: 0.76 | step: 6.64 52%|█████▏ | 5167/10000 [8:09:33<7:21:07, 5.48s/it] {'loss': 0.0006, 'grad_norm': 0.07039862126111984, 'learning_rate': 1.9889882875723022e-05, 'epoch': 5.17} 52%|█████▏ | 5167/10000 [8:09:33<7:21:07, 5.48s/it][2025-06-19 21:39:18,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:39:18,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.93 | bwd_microstep: 3321.53 | bwd_inner_microstep: 3320.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 21:39:18,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.93 | bwd: 3321.55 | bwd_inner: 3320.75 | bwd_allreduce: 0.76 | step: 6.63 52%|█████▏ | 5168/10000 [8:09:39<7:20:47, 5.47s/it] {'loss': 0.0029, 'grad_norm': 0.43665385246276855, 'learning_rate': 1.9883405469167542e-05, 'epoch': 5.17} 52%|█████▏ | 5168/10000 [8:09:39<7:20:47, 5.47s/it][2025-06-19 21:39:24,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.83 [2025-06-19 21:39:24,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.53 | bwd_microstep: 3366.16 | bwd_inner_microstep: 3365.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 21:39:24,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.53 | bwd: 3366.17 | bwd_inner: 3365.36 | bwd_allreduce: 0.78 | step: 7.11 52%|█████▏ | 5169/10000 [8:09:44<7:22:08, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.034342605620622635, 'learning_rate': 1.9876928074842275e-05, 'epoch': 5.17} 52%|█████▏ | 5169/10000 [8:09:44<7:22:08, 5.49s/it][2025-06-19 21:39:29,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:39:29,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.89 | bwd_microstep: 3323.54 | bwd_inner_microstep: 3322.60 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.27 [2025-06-19 21:39:29,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.89 | bwd: 3323.56 | bwd_inner: 3322.60 | bwd_allreduce: 0.92 | step: 7.27 52%|█████▏ | 5170/10000 [8:09:50<7:21:41, 5.49s/it] {'loss': 0.0037, 'grad_norm': 0.5883161425590515, 'learning_rate': 1.9870450693426682e-05, 'epoch': 5.17} 52%|█████▏ | 5170/10000 [8:09:50<7:21:41, 5.49s/it][2025-06-19 21:39:35,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:39:35,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.56 | bwd_microstep: 3316.54 | bwd_inner_microstep: 3315.56 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.28 [2025-06-19 21:39:35,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.56 | bwd: 3316.56 | bwd_inner: 3315.56 | bwd_allreduce: 0.94 | step: 7.28 52%|█████▏ | 5171/10000 [8:09:55<7:21:06, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.08923602104187012, 'learning_rate': 1.986397332560021e-05, 'epoch': 5.17} 52%|█████▏ | 5171/10000 [8:09:55<7:21:06, 5.48s/it][2025-06-19 21:39:40,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:39:40,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.60 | bwd_microstep: 3364.37 | bwd_inner_microstep: 3363.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.94 [2025-06-19 21:39:40,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.60 | bwd: 3364.38 | bwd_inner: 3363.58 | bwd_allreduce: 0.76 | step: 6.94 52%|█████▏ | 5172/10000 [8:10:01<7:22:20, 5.50s/it] 
{'loss': 0.0001, 'grad_norm': 0.00796011183410883, 'learning_rate': 1.9857495972042305e-05, 'epoch': 5.17} 52%|█████▏ | 5172/10000 [8:10:01<7:22:20, 5.50s/it][2025-06-19 21:39:46,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:39:46,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.31 | bwd_microstep: 3372.11 | bwd_inner_microstep: 3371.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 21:39:46,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.31 | bwd: 3372.13 | bwd_inner: 3371.31 | bwd_allreduce: 0.77 | step: 7.03 52%|█████▏ | 5173/10000 [8:10:06<7:23:23, 5.51s/it] {'loss': 0.0762, 'grad_norm': 3.839205503463745, 'learning_rate': 1.9851018633432417e-05, 'epoch': 5.17} 52%|█████▏ | 5173/10000 [8:10:06<7:23:23, 5.51s/it][2025-06-19 21:39:51,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.74 [2025-06-19 21:39:51,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.56 | bwd_microstep: 3328.89 | bwd_inner_microstep: 3327.91 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.11 [2025-06-19 21:39:51,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.56 | bwd: 3328.91 | bwd_inner: 3327.91 | bwd_allreduce: 0.95 | step: 7.11 52%|█████▏ | 5174/10000 [8:10:12<7:22:37, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.0596163310110569, 'learning_rate': 1.9844541310449977e-05, 'epoch': 5.17} 52%|█████▏ | 5174/10000 [8:10:12<7:22:37, 5.50s/it][2025-06-19 21:39:57,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:39:57,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.30 | bwd_microstep: 3320.16 | bwd_inner_microstep: 3319.35 | bwd_allreduce_microstep: 0.77 | 
step_microstep: 7.30 [2025-06-19 21:39:57,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.30 | bwd: 3320.18 | bwd_inner: 3319.35 | bwd_allreduce: 0.78 | step: 7.30 52%|█████▏ | 5175/10000 [8:10:17<7:21:33, 5.49s/it] {'loss': 0.0009, 'grad_norm': 0.12499106675386429, 'learning_rate': 1.983806400377443e-05, 'epoch': 5.17} 52%|█████▏ | 5175/10000 [8:10:17<7:21:33, 5.49s/it][2025-06-19 21:40:02,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:40:02,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.56 | bwd_microstep: 3322.68 | bwd_inner_microstep: 3321.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 21:40:02,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.56 | bwd: 3322.70 | bwd_inner: 3321.87 | bwd_allreduce: 0.78 | step: 6.76 52%|█████▏ | 5176/10000 [8:10:23<7:21:00, 5.49s/it] {'loss': 0.0079, 'grad_norm': 1.225508689880371, 'learning_rate': 1.983158671408522e-05, 'epoch': 5.18} 52%|█████▏ | 5176/10000 [8:10:23<7:21:00, 5.49s/it][2025-06-19 21:40:08,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:40:08,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.15 | bwd_microstep: 3376.24 | bwd_inner_microstep: 3375.39 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.49 [2025-06-19 21:40:08,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.15 | bwd: 3376.26 | bwd_inner: 3375.39 | bwd_allreduce: 0.82 | step: 7.49 52%|█████▏ | 5177/10000 [8:10:28<7:22:34, 5.51s/it] {'loss': 0.1252, 'grad_norm': 2.9427382946014404, 'learning_rate': 1.9825109442061785e-05, 'epoch': 5.18} 52%|█████▏ | 5177/10000 [8:10:28<7:22:34, 5.51s/it][2025-06-19 21:40:13,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:40:13,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.38 | bwd_microstep: 3320.75 | bwd_inner_microstep: 3319.94 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.06 [2025-06-19 21:40:13,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.38 | bwd: 3320.77 | bwd_inner: 3319.94 | bwd_allreduce: 0.78 | step: 7.06 52%|█████▏ | 5178/10000 [8:10:34<7:21:28, 5.49s/it] {'loss': 0.0008, 'grad_norm': 0.1056690663099289, 'learning_rate': 1.981863218838356e-05, 'epoch': 5.18} 52%|█████▏ | 5178/10000 [8:10:34<7:21:28, 5.49s/it][2025-06-19 21:40:19,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:40:19,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.75 | bwd_microstep: 3320.45 | bwd_inner_microstep: 3319.52 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.43 [2025-06-19 21:40:19,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.75 | bwd: 3320.47 | bwd_inner: 3319.52 | bwd_allreduce: 0.90 | step: 7.43 52%|█████▏ | 5179/10000 [8:10:39<7:21:00, 5.49s/it] {'loss': 0.0066, 'grad_norm': 1.3385257720947266, 'learning_rate': 1.981215495372997e-05, 'epoch': 5.18} 52%|█████▏ | 5179/10000 [8:10:39<7:21:00, 5.49s/it][2025-06-19 21:40:24,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.71 | optimizer_step: 2.73 [2025-06-19 21:40:24,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.88 | bwd_microstep: 3325.86 | bwd_inner_microstep: 3324.76 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.13 [2025-06-19 21:40:24,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.88 | bwd: 3325.87 | bwd_inner: 3324.76 | bwd_allreduce: 1.06 | step: 7.13 52%|█████▏ | 5180/10000 [8:10:45<7:20:41, 5.49s/it] {'loss': 0.0025, 'grad_norm': 
0.3052096664905548, 'learning_rate': 1.980567773878046e-05, 'epoch': 5.18} 52%|█████▏ | 5180/10000 [8:10:45<7:20:41, 5.49s/it][2025-06-19 21:40:30,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:40:30,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.91 | bwd_microstep: 3328.26 | bwd_inner_microstep: 3327.43 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.31 [2025-06-19 21:40:30,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.91 | bwd: 3328.27 | bwd_inner: 3327.43 | bwd_allreduce: 0.80 | step: 7.31 52%|█████▏ | 5181/10000 [8:10:50<7:20:30, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.0062411450780928135, 'learning_rate': 1.9799200544214445e-05, 'epoch': 5.18} 52%|█████▏ | 5181/10000 [8:10:50<7:20:30, 5.48s/it][2025-06-19 21:40:35,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:40:35,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.37 | bwd_microstep: 3380.82 | bwd_inner_microstep: 3380.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 21:40:35,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.37 | bwd: 3380.83 | bwd_inner: 3380.01 | bwd_allreduce: 0.78 | step: 7.19 52%|█████▏ | 5182/10000 [8:10:56<7:22:14, 5.51s/it] {'loss': 0.0063, 'grad_norm': 0.7548460364341736, 'learning_rate': 1.9792723370711363e-05, 'epoch': 5.18} 52%|█████▏ | 5182/10000 [8:10:56<7:22:14, 5.51s/it][2025-06-19 21:40:41,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:40:41,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.60 | bwd_microstep: 3367.44 | bwd_inner_microstep: 3366.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 
21:40:41,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.60 | bwd: 3367.46 | bwd_inner: 3366.65 | bwd_allreduce: 0.76 | step: 6.83 52%|█████▏ | 5183/10000 [8:11:01<7:22:52, 5.52s/it] {'loss': 0.0013, 'grad_norm': 0.15786485373973846, 'learning_rate': 1.9786246218950637e-05, 'epoch': 5.18} 52%|█████▏ | 5183/10000 [8:11:01<7:22:52, 5.52s/it][2025-06-19 21:40:46,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:40:46,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.75 | bwd_microstep: 3318.39 | bwd_inner_microstep: 3317.57 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.17 [2025-06-19 21:40:46,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.75 | bwd: 3318.40 | bwd_inner: 3317.57 | bwd_allreduce: 0.78 | step: 7.17 52%|█████▏ | 5184/10000 [8:11:07<7:21:39, 5.50s/it] {'loss': 0.009, 'grad_norm': 1.1869471073150635, 'learning_rate': 1.97797690896117e-05, 'epoch': 5.18} 52%|█████▏ | 5184/10000 [8:11:07<7:21:39, 5.50s/it][2025-06-19 21:40:52,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:40:52,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.50 | bwd_microstep: 3369.13 | bwd_inner_microstep: 3368.32 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-19 21:40:52,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.50 | bwd: 3369.14 | bwd_inner: 3368.32 | bwd_allreduce: 0.78 | step: 6.85 52%|█████▏ | 5185/10000 [8:11:12<7:22:24, 5.51s/it] {'loss': 0.0006, 'grad_norm': 0.057528894394636154, 'learning_rate': 1.9773291983373953e-05, 'epoch': 5.18} 52%|█████▏ | 5185/10000 [8:11:12<7:22:24, 5.51s/it][2025-06-19 21:40:57,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-19 21:40:57,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.37 | bwd_microstep: 3318.93 | bwd_inner_microstep: 3318.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 21:40:57,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.38 | bwd: 3318.94 | bwd_inner: 3318.14 | bwd_allreduce: 0.76 | step: 6.65 52%|█████▏ | 5186/10000 [8:11:18<7:21:03, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.09598423540592194, 'learning_rate': 1.976681490091683e-05, 'epoch': 5.19} 52%|█████▏ | 5186/10000 [8:11:18<7:21:03, 5.50s/it][2025-06-19 21:41:03,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:41:03,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.80 | bwd_microstep: 3323.57 | bwd_inner_microstep: 3322.63 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.33 [2025-06-19 21:41:03,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.80 | bwd: 3323.58 | bwd_inner: 3322.63 | bwd_allreduce: 0.90 | step: 7.33 52%|█████▏ | 5187/10000 [8:11:23<7:20:37, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.4068971872329712, 'learning_rate': 1.9760337842919737e-05, 'epoch': 5.19} 52%|█████▏ | 5187/10000 [8:11:23<7:20:37, 5.49s/it][2025-06-19 21:41:08,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:41:08,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.89 | bwd_microstep: 3319.58 | bwd_inner_microstep: 3318.72 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.18 [2025-06-19 21:41:08,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.89 | bwd: 3319.59 | bwd_inner: 3318.72 | bwd_allreduce: 0.83 | step: 7.18 52%|█████▏ | 5188/10000 [8:11:29<7:19:50, 5.48s/it] {'loss': 0.0087, 'grad_norm': 1.0821536779403687, 
'learning_rate': 1.9753860810062095e-05, 'epoch': 5.19} 52%|█████▏ | 5188/10000 [8:11:29<7:19:50, 5.48s/it][2025-06-19 21:41:14,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 21:41:14,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.62 | bwd_microstep: 3367.13 | bwd_inner_microstep: 3366.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 21:41:14,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.62 | bwd: 3367.15 | bwd_inner: 3366.35 | bwd_allreduce: 0.75 | step: 6.53 52%|█████▏ | 5189/10000 [8:11:34<7:20:52, 5.50s/it] {'loss': 0.0009, 'grad_norm': 0.11910152435302734, 'learning_rate': 1.9747383803023314e-05, 'epoch': 5.19} 52%|█████▏ | 5189/10000 [8:11:34<7:20:52, 5.50s/it][2025-06-19 21:41:19,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:41:19,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.89 | bwd_microstep: 3372.16 | bwd_inner_microstep: 3371.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 21:41:19,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.89 | bwd: 3372.18 | bwd_inner: 3371.38 | bwd_allreduce: 0.75 | step: 6.60 52%|█████▏ | 5190/10000 [8:11:40<7:21:53, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.06839160621166229, 'learning_rate': 1.9740906822482797e-05, 'epoch': 5.19} 52%|█████▏ | 5190/10000 [8:11:40<7:21:53, 5.51s/it][2025-06-19 21:41:25,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:41:25,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.38 | bwd_microstep: 3327.51 | bwd_inner_microstep: 3326.60 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.89 [2025-06-19 21:41:25,085] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.38 | bwd: 3327.52 | bwd_inner: 3326.60 | bwd_allreduce: 0.87 | step: 6.89 52%|█████▏ | 5191/10000 [8:11:45<7:20:50, 5.50s/it] {'loss': 0.1381, 'grad_norm': 3.274294137954712, 'learning_rate': 1.973442986911995e-05, 'epoch': 5.19} 52%|█████▏ | 5191/10000 [8:11:45<7:20:50, 5.50s/it][2025-06-19 21:41:30,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 21:41:30,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.16 | bwd_microstep: 3360.46 | bwd_inner_microstep: 3359.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 21:41:30,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.16 | bwd: 3360.47 | bwd_inner: 3359.68 | bwd_allreduce: 0.75 | step: 6.54 52%|█████▏ | 5192/10000 [8:11:51<7:21:14, 5.51s/it] {'loss': 0.0011, 'grad_norm': 0.15961503982543945, 'learning_rate': 1.972795294361418e-05, 'epoch': 5.19} 52%|█████▏ | 5192/10000 [8:11:51<7:21:14, 5.51s/it][2025-06-19 21:41:36,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:41:36,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.51 | bwd_microstep: 3376.60 | bwd_inner_microstep: 3375.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 21:41:36,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.52 | bwd: 3376.62 | bwd_inner: 3375.81 | bwd_allreduce: 0.76 | step: 6.77 52%|█████▏ | 5193/10000 [8:11:56<7:22:01, 5.52s/it] {'loss': 0.0183, 'grad_norm': 2.3527472019195557, 'learning_rate': 1.972147604664488e-05, 'epoch': 5.19} 52%|█████▏ | 5193/10000 [8:11:56<7:22:01, 5.52s/it][2025-06-19 21:41:41,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 
[2025-06-19 21:41:41,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.26 | bwd_microstep: 3318.17 | bwd_inner_microstep: 3317.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 21:41:41,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.26 | bwd: 3318.18 | bwd_inner: 3317.36 | bwd_allreduce: 0.78 | step: 6.94 52%|█████▏ | 5194/10000 [8:12:02<7:20:38, 5.50s/it] {'loss': 0.0055, 'grad_norm': 0.7990207672119141, 'learning_rate': 1.9714999178891462e-05, 'epoch': 5.19} 52%|█████▏ | 5194/10000 [8:12:02<7:20:38, 5.50s/it][2025-06-19 21:41:47,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:41:47,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.02 | bwd_microstep: 3324.14 | bwd_inner_microstep: 3323.26 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.85 [2025-06-19 21:41:47,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.02 | bwd: 3324.16 | bwd_inner: 3323.26 | bwd_allreduce: 0.86 | step: 6.85 52%|█████▏ | 5195/10000 [8:12:07<7:19:42, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.052313730120658875, 'learning_rate': 1.9708522341033296e-05, 'epoch': 5.2} 52%|█████▏ | 5195/10000 [8:12:07<7:19:42, 5.49s/it][2025-06-19 21:41:52,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:41:52,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.08 | bwd_microstep: 3319.29 | bwd_inner_microstep: 3318.39 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.85 [2025-06-19 21:41:52,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.08 | bwd: 3319.30 | bwd_inner: 3318.39 | bwd_allreduce: 0.86 | step: 6.85 52%|█████▏ | 5196/10000 [8:12:13<7:19:00, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004937098361551762, 'learning_rate': 
1.9702045533749784e-05, 'epoch': 5.2} 52%|█████▏ | 5196/10000 [8:12:13<7:19:00, 5.48s/it][2025-06-19 21:41:58,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 21:41:58,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.11 | bwd_microstep: 3318.22 | bwd_inner_microstep: 3317.15 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.60 [2025-06-19 21:41:58,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.11 | bwd: 3318.24 | bwd_inner: 3317.15 | bwd_allreduce: 1.04 | step: 7.61 52%|█████▏ | 5197/10000 [8:12:18<7:18:26, 5.48s/it] {'loss': 0.0009, 'grad_norm': 0.12717504799365997, 'learning_rate': 1.9695568757720317e-05, 'epoch': 5.2} 52%|█████▏ | 5197/10000 [8:12:18<7:18:26, 5.48s/it][2025-06-19 21:42:03,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:42:03,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.73 | bwd_microstep: 3317.76 | bwd_inner_microstep: 3316.97 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 21:42:03,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.73 | bwd: 3317.77 | bwd_inner: 3316.97 | bwd_allreduce: 0.75 | step: 6.57 52%|█████▏ | 5198/10000 [8:12:24<7:18:02, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.02461935020983219, 'learning_rate': 1.968909201362427e-05, 'epoch': 5.2} 52%|█████▏ | 5198/10000 [8:12:24<7:18:02, 5.47s/it][2025-06-19 21:42:08,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:42:08,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.26 | bwd_microstep: 3318.95 | bwd_inner_microstep: 3318.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 21:42:08,939] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.26 | bwd: 3318.97 | bwd_inner: 3318.17 | bwd_allreduce: 0.75 | step: 6.64 52%|█████▏ | 5199/10000 [8:12:29<7:17:49, 5.47s/it] {'loss': 0.0005, 'grad_norm': 0.0611090324819088, 'learning_rate': 1.968261530214103e-05, 'epoch': 5.2} 52%|█████▏ | 5199/10000 [8:12:29<7:17:49, 5.47s/it][2025-06-19 21:42:14,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:42:14,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.16 | bwd_microstep: 3315.92 | bwd_inner_microstep: 3315.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.12 [2025-06-19 21:42:14,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.16 | bwd: 3315.94 | bwd_inner: 3315.13 | bwd_allreduce: 0.77 | step: 7.12 52%|█████▏ | 5200/10000 [8:12:35<7:17:26, 5.47s/it] {'loss': 0.0005, 'grad_norm': 0.06862955540418625, 'learning_rate': 1.967613862394997e-05, 'epoch': 5.2} 52%|█████▏ | 5200/10000 [8:12:35<7:17:26, 5.47s/it][2025-06-19 21:42:19,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:42:19,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.69 | bwd_microstep: 3364.02 | bwd_inner_microstep: 3363.15 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.83 [2025-06-19 21:42:19,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.69 | bwd: 3364.04 | bwd_inner: 3363.15 | bwd_allreduce: 0.85 | step: 6.84 52%|█████▏ | 5201/10000 [8:12:40<7:18:39, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.05805744603276253, 'learning_rate': 1.9669661979730466e-05, 'epoch': 5.2} 52%|█████▏ | 5201/10000 [8:12:40<7:18:39, 5.48s/it][2025-06-19 21:42:25,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 
21:42:25,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.84 | bwd_microstep: 3372.29 | bwd_inner_microstep: 3371.45 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.87 [2025-06-19 21:42:25,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.84 | bwd: 3372.31 | bwd_inner: 3371.45 | bwd_allreduce: 0.80 | step: 6.87 52%|█████▏ | 5202/10000 [8:12:46<7:19:58, 5.50s/it] {'loss': 0.0174, 'grad_norm': 1.7441319227218628, 'learning_rate': 1.966318537016189e-05, 'epoch': 5.2} 52%|█████▏ | 5202/10000 [8:12:46<7:19:58, 5.50s/it][2025-06-19 21:42:30,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:42:30,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.82 | bwd_microstep: 3323.66 | bwd_inner_microstep: 3322.74 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.02 [2025-06-19 21:42:30,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.83 | bwd: 3323.67 | bwd_inner: 3322.74 | bwd_allreduce: 0.89 | step: 7.03 52%|█████▏ | 5203/10000 [8:12:51<7:18:55, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.04674911126494408, 'learning_rate': 1.9656708795923602e-05, 'epoch': 5.2} 52%|█████▏ | 5203/10000 [8:12:51<7:18:55, 5.49s/it][2025-06-19 21:42:36,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:42:36,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.64 | bwd_microstep: 3315.33 | bwd_inner_microstep: 3314.23 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.40 [2025-06-19 21:42:36,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.64 | bwd: 3315.36 | bwd_inner: 3314.23 | bwd_allreduce: 1.06 | step: 7.40 52%|█████▏ | 5204/10000 [8:12:57<7:18:12, 5.48s/it] {'loss': 0.1253, 'grad_norm': 3.068453311920166, 'learning_rate': 1.9650232257694976e-05, 
'epoch': 5.2} 52%|█████▏ | 5204/10000 [8:12:57<7:18:12, 5.48s/it][2025-06-19 21:42:41,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:42:41,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.42 | bwd_microstep: 3311.02 | bwd_inner_microstep: 3310.10 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.21 [2025-06-19 21:42:41,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.42 | bwd: 3311.03 | bwd_inner: 3310.10 | bwd_allreduce: 0.89 | step: 7.21 52%|█████▏ | 5205/10000 [8:13:02<7:17:34, 5.48s/it] {'loss': 0.0021, 'grad_norm': 0.2162497490644455, 'learning_rate': 1.9643755756155355e-05, 'epoch': 5.21} 52%|█████▏ | 5205/10000 [8:13:02<7:17:34, 5.48s/it][2025-06-19 21:42:47,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:42:47,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.10 | bwd_microstep: 3323.81 | bwd_inner_microstep: 3322.89 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.86 [2025-06-19 21:42:47,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.10 | bwd: 3323.82 | bwd_inner: 3322.89 | bwd_allreduce: 0.88 | step: 6.86 52%|█████▏ | 5206/10000 [8:13:08<7:17:26, 5.47s/it] {'loss': 0.0007, 'grad_norm': 0.05859125405550003, 'learning_rate': 1.9637279291984104e-05, 'epoch': 5.21} 52%|█████▏ | 5206/10000 [8:13:08<7:17:26, 5.47s/it][2025-06-19 21:42:52,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:42:52,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.51 | bwd_microstep: 3318.11 | bwd_inner_microstep: 3317.11 | bwd_allreduce_microstep: 0.93 | step_microstep: 6.88 [2025-06-19 21:42:52,781] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2102.51 | bwd: 3318.13 | bwd_inner: 3317.11 | bwd_allreduce: 0.96 | step: 6.88 52%|█████▏ | 5207/10000 [8:13:13<7:16:59, 5.47s/it] {'loss': 0.0021, 'grad_norm': 0.23042850196361542, 'learning_rate': 1.963080286586057e-05, 'epoch': 5.21} 52%|█████▏ | 5207/10000 [8:13:13<7:16:59, 5.47s/it][2025-06-19 21:42:58,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:42:58,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.86 | bwd_microstep: 3313.65 | bwd_inner_microstep: 3312.82 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.01 [2025-06-19 21:42:58,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.86 | bwd: 3313.67 | bwd_inner: 3312.82 | bwd_allreduce: 0.81 | step: 7.01 52%|█████▏ | 5208/10000 [8:13:19<7:16:40, 5.47s/it] {'loss': 0.0538, 'grad_norm': 2.396812677383423, 'learning_rate': 1.9624326478464103e-05, 'epoch': 5.21} 52%|█████▏ | 5208/10000 [8:13:19<7:16:40, 5.47s/it][2025-06-19 21:43:03,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:43:03,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.65 | bwd_microstep: 3311.36 | bwd_inner_microstep: 3310.49 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.11 [2025-06-19 21:43:03,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.65 | bwd: 3311.38 | bwd_inner: 3310.49 | bwd_allreduce: 0.84 | step: 7.11 52%|█████▏ | 5209/10000 [8:13:24<7:16:09, 5.46s/it] {'loss': 0.0013, 'grad_norm': 0.35144513845443726, 'learning_rate': 1.9617850130474045e-05, 'epoch': 5.21} 52%|█████▏ | 5209/10000 [8:13:24<7:16:09, 5.46s/it][2025-06-19 21:43:09,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:43:09,151] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.48 | bwd_microstep: 3317.91 | bwd_inner_microstep: 3317.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 21:43:09,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.48 | bwd: 3317.93 | bwd_inner: 3317.11 | bwd_allreduce: 0.77 | step: 6.95 52%|█████▏ | 5210/10000 [8:13:29<7:15:57, 5.46s/it] {'loss': 0.0078, 'grad_norm': 2.142254114151001, 'learning_rate': 1.9611373822569734e-05, 'epoch': 5.21} 52%|█████▏ | 5210/10000 [8:13:29<7:15:57, 5.46s/it][2025-06-19 21:43:14,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:43:14,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.08 | bwd_microstep: 3309.71 | bwd_inner_microstep: 3308.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 21:43:14,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.08 | bwd: 3309.72 | bwd_inner: 3308.92 | bwd_allreduce: 0.76 | step: 6.70 52%|█████▏ | 5211/10000 [8:13:35<7:15:24, 5.46s/it] {'loss': 0.0004, 'grad_norm': 0.03673446923494339, 'learning_rate': 1.9604897555430506e-05, 'epoch': 5.21} 52%|█████▏ | 5211/10000 [8:13:35<7:15:24, 5.46s/it][2025-06-19 21:43:20,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:43:20,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.90 | bwd_microstep: 3365.64 | bwd_inner_microstep: 3364.84 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 21:43:20,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.90 | bwd: 3365.65 | bwd_inner: 3364.84 | bwd_allreduce: 0.76 | step: 6.95 52%|█████▏ | 5212/10000 [8:13:40<7:16:56, 5.48s/it] {'loss': 0.0337, 'grad_norm': 3.0181400775909424, 'learning_rate': 1.959842132973569e-05, 'epoch': 5.21} 
52%|█████▏ | 5212/10000 [8:13:40<7:16:56, 5.48s/it][2025-06-19 21:43:25,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:43:25,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.94 | bwd_microstep: 3310.18 | bwd_inner_microstep: 3309.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-19 21:43:25,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.94 | bwd: 3310.19 | bwd_inner: 3309.39 | bwd_allreduce: 0.76 | step: 6.81 52%|█████▏ | 5213/10000 [8:13:46<7:16:19, 5.47s/it] {'loss': 0.0127, 'grad_norm': 2.0267691612243652, 'learning_rate': 1.959194514616461e-05, 'epoch': 5.21} 52%|█████▏ | 5213/10000 [8:13:46<7:16:19, 5.47s/it][2025-06-19 21:43:31,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:43:31,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.74 | bwd_microstep: 3372.05 | bwd_inner_microstep: 3371.21 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.22 [2025-06-19 21:43:31,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.74 | bwd: 3372.06 | bwd_inner: 3371.21 | bwd_allreduce: 0.81 | step: 7.23 52%|█████▏ | 5214/10000 [8:13:51<7:17:54, 5.49s/it] {'loss': 0.0009, 'grad_norm': 0.19122470915317535, 'learning_rate': 1.9585469005396602e-05, 'epoch': 5.21} 52%|█████▏ | 5214/10000 [8:13:51<7:17:54, 5.49s/it][2025-06-19 21:43:36,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:43:36,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.52 | bwd_microstep: 3317.44 | bwd_inner_microstep: 3316.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 21:43:36,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2101.52 | bwd: 3317.45 | bwd_inner: 3316.64 | bwd_allreduce: 0.76 | step: 6.96 52%|█████▏ | 5215/10000 [8:13:57<7:17:06, 5.48s/it] {'loss': 0.0021, 'grad_norm': 0.4701214134693146, 'learning_rate': 1.9578992908110963e-05, 'epoch': 5.21} 52%|█████▏ | 5215/10000 [8:13:57<7:17:06, 5.48s/it][2025-06-19 21:43:42,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:43:42,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.60 | bwd_microstep: 3324.08 | bwd_inner_microstep: 3323.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 21:43:42,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.60 | bwd: 3324.10 | bwd_inner: 3323.29 | bwd_allreduce: 0.76 | step: 6.71 52%|█████▏ | 5216/10000 [8:14:02<7:16:52, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.024407830089330673, 'learning_rate': 1.957251685498701e-05, 'epoch': 5.22} 52%|█████▏ | 5216/10000 [8:14:02<7:16:52, 5.48s/it][2025-06-19 21:43:47,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 21:43:47,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.25 | bwd_microstep: 3364.88 | bwd_inner_microstep: 3363.94 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.83 [2025-06-19 21:43:47,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.25 | bwd: 3364.90 | bwd_inner: 3363.94 | bwd_allreduce: 0.92 | step: 7.84 52%|█████▏ | 5217/10000 [8:14:08<7:18:01, 5.49s/it] {'loss': 0.0085, 'grad_norm': 1.0086274147033691, 'learning_rate': 1.956604084670406e-05, 'epoch': 5.22} 52%|█████▏ | 5217/10000 [8:14:08<7:18:01, 5.49s/it][2025-06-19 21:43:53,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:43:53,031] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2104.10 | bwd_microstep: 3312.32 | bwd_inner_microstep: 3311.45 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.89 [2025-06-19 21:43:53,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.10 | bwd: 3312.33 | bwd_inner: 3311.45 | bwd_allreduce: 0.84 | step: 6.89 52%|█████▏ | 5218/10000 [8:14:13<7:17:02, 5.48s/it] {'loss': 0.0033, 'grad_norm': 0.9337301850318909, 'learning_rate': 1.9559564883941405e-05, 'epoch': 5.22} 52%|█████▏ | 5218/10000 [8:14:13<7:17:02, 5.48s/it][2025-06-19 21:43:58,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:43:58,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.93 | bwd_microstep: 3364.03 | bwd_inner_microstep: 3363.21 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.85 [2025-06-19 21:43:58,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.93 | bwd: 3364.05 | bwd_inner: 3363.21 | bwd_allreduce: 0.80 | step: 6.85 52%|█████▏ | 5219/10000 [8:14:19<7:17:55, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.008965428918600082, 'learning_rate': 1.9553088967378363e-05, 'epoch': 5.22} 52%|█████▏ | 5219/10000 [8:14:19<7:17:55, 5.50s/it][2025-06-19 21:44:04,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:44:04,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.55 | bwd_microstep: 3326.16 | bwd_inner_microstep: 3325.25 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.04 [2025-06-19 21:44:04,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.55 | bwd: 3326.18 | bwd_inner: 3325.25 | bwd_allreduce: 0.88 | step: 7.04 52%|█████▏ | 5220/10000 [8:14:24<7:17:27, 5.49s/it] {'loss': 0.0431, 'grad_norm': 3.0863051414489746, 'learning_rate': 1.9546613097694204e-05, 'epoch': 5.22} 52%|█████▏ | 5220/10000 
[8:14:24<7:17:27, 5.49s/it][2025-06-19 21:44:09,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:44:09,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.00 | bwd_microstep: 3315.16 | bwd_inner_microstep: 3314.35 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-19 21:44:09,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.00 | bwd: 3315.18 | bwd_inner: 3314.35 | bwd_allreduce: 0.78 | step: 7.18 52%|█████▏ | 5221/10000 [8:14:30<7:16:44, 5.48s/it] {'loss': 0.0435, 'grad_norm': 3.316859483718872, 'learning_rate': 1.9540137275568228e-05, 'epoch': 5.22} 52%|█████▏ | 5221/10000 [8:14:30<7:16:44, 5.48s/it][2025-06-19 21:44:15,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:44:15,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.26 | bwd_microstep: 3362.90 | bwd_inner_microstep: 3361.98 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.96 [2025-06-19 21:44:15,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.26 | bwd: 3362.91 | bwd_inner: 3361.98 | bwd_allreduce: 0.89 | step: 6.96 52%|█████▏ | 5222/10000 [8:14:35<7:17:38, 5.50s/it] {'loss': 0.0292, 'grad_norm': 4.498079776763916, 'learning_rate': 1.9533661501679714e-05, 'epoch': 5.22} 52%|█████▏ | 5222/10000 [8:14:35<7:17:38, 5.50s/it][2025-06-19 21:44:20,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:44:20,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.55 | bwd_microstep: 3373.01 | bwd_inner_microstep: 3372.07 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.02 [2025-06-19 21:44:20,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.55 | bwd: 3373.02 | 
bwd_inner: 3372.07 | bwd_allreduce: 0.91 | step: 7.02 52%|█████▏ | 5223/10000 [8:14:41<7:18:29, 5.51s/it] {'loss': 0.0029, 'grad_norm': 0.4854179918766022, 'learning_rate': 1.952718577670795e-05, 'epoch': 5.22} 52%|█████▏ | 5223/10000 [8:14:41<7:18:29, 5.51s/it][2025-06-19 21:44:26,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:44:26,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.83 | bwd_microstep: 3307.08 | bwd_inner_microstep: 3306.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 21:44:26,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.83 | bwd: 3307.09 | bwd_inner: 3306.30 | bwd_allreduce: 0.75 | step: 6.62 52%|█████▏ | 5224/10000 [8:14:46<7:16:53, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.39784717559814453, 'learning_rate': 1.9520710101332203e-05, 'epoch': 5.22} 52%|█████▏ | 5224/10000 [8:14:46<7:16:53, 5.49s/it][2025-06-19 21:44:31,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:44:31,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.63 | bwd_microstep: 3313.53 | bwd_inner_microstep: 3312.72 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 21:44:31,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.63 | bwd: 3313.55 | bwd_inner: 3312.72 | bwd_allreduce: 0.78 | step: 7.23 52%|█████▏ | 5225/10000 [8:14:52<7:15:56, 5.48s/it] {'loss': 0.0174, 'grad_norm': 6.136165618896484, 'learning_rate': 1.9514234476231742e-05, 'epoch': 5.22} 52%|█████▏ | 5225/10000 [8:14:52<7:15:56, 5.48s/it][2025-06-19 21:44:36,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:44:36,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2108.69 | bwd_microstep: 3320.10 | bwd_inner_microstep: 3319.27 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.02 [2025-06-19 21:44:36,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.69 | bwd: 3320.12 | bwd_inner: 3319.27 | bwd_allreduce: 0.79 | step: 7.02 52%|█████▏ | 5226/10000 [8:14:57<7:15:42, 5.48s/it] {'loss': 0.0021, 'grad_norm': 0.32706889510154724, 'learning_rate': 1.9507758902085828e-05, 'epoch': 5.23} 52%|█████▏ | 5226/10000 [8:14:57<7:15:42, 5.48s/it][2025-06-19 21:44:42,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:44:42,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.86 | bwd_microstep: 3315.75 | bwd_inner_microstep: 3314.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 21:44:42,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.86 | bwd: 3315.76 | bwd_inner: 3314.96 | bwd_allreduce: 0.76 | step: 6.53 52%|█████▏ | 5227/10000 [8:15:03<7:15:03, 5.47s/it] {'loss': 0.0525, 'grad_norm': 4.983245372772217, 'learning_rate': 1.9501283379573718e-05, 'epoch': 5.23} 52%|█████▏ | 5227/10000 [8:15:03<7:15:03, 5.47s/it][2025-06-19 21:44:47,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:44:47,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.71 | bwd_microstep: 3324.03 | bwd_inner_microstep: 3323.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 21:44:47,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.71 | bwd: 3324.05 | bwd_inner: 3323.25 | bwd_allreduce: 0.75 | step: 6.68 52%|█████▏ | 5228/10000 [8:15:08<7:14:58, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.03584377095103264, 'learning_rate': 1.9494807909374668e-05, 'epoch': 5.23} 52%|█████▏ | 5228/10000 [8:15:08<7:14:58, 
5.47s/it][2025-06-19 21:44:53,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:44:53,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.44 | bwd_microstep: 3314.94 | bwd_inner_microstep: 3314.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 21:44:53,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.44 | bwd: 3314.95 | bwd_inner: 3314.16 | bwd_allreduce: 0.75 | step: 6.77 52%|█████▏ | 5229/10000 [8:15:14<7:14:29, 5.46s/it] {'loss': 0.0083, 'grad_norm': 1.3233715295791626, 'learning_rate': 1.9488332492167934e-05, 'epoch': 5.23} 52%|█████▏ | 5229/10000 [8:15:14<7:14:29, 5.46s/it][2025-06-19 21:44:58,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:44:58,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.04 | bwd_microstep: 3325.65 | bwd_inner_microstep: 3324.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 21:44:58,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.04 | bwd: 3325.66 | bwd_inner: 3324.85 | bwd_allreduce: 0.78 | step: 7.20 52%|█████▏ | 5230/10000 [8:15:19<7:14:35, 5.47s/it] {'loss': 0.0121, 'grad_norm': 2.090668201446533, 'learning_rate': 1.9481857128632737e-05, 'epoch': 5.23} 52%|█████▏ | 5230/10000 [8:15:19<7:14:35, 5.47s/it][2025-06-19 21:45:04,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.74 [2025-06-19 21:45:04,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.77 | bwd_microstep: 3316.75 | bwd_inner_microstep: 3315.85 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.96 [2025-06-19 21:45:04,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.77 | bwd: 3316.77 | bwd_inner: 3315.85 | 
bwd_allreduce: 0.87 | step: 6.98 52%|█████▏ | 5231/10000 [8:15:25<7:14:19, 5.46s/it] {'loss': 0.0206, 'grad_norm': 2.412752151489258, 'learning_rate': 1.9475381819448325e-05, 'epoch': 5.23} 52%|█████▏ | 5231/10000 [8:15:25<7:14:19, 5.46s/it][2025-06-19 21:45:09,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:45:09,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.94 | bwd_microstep: 3314.98 | bwd_inner_microstep: 3314.16 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.30 [2025-06-19 21:45:09,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.94 | bwd: 3315.00 | bwd_inner: 3314.16 | bwd_allreduce: 0.79 | step: 7.30 52%|█████▏ | 5232/10000 [8:15:30<7:14:13, 5.46s/it] {'loss': 0.0329, 'grad_norm': 2.993756055831909, 'learning_rate': 1.9468906565293925e-05, 'epoch': 5.23} 52%|█████▏ | 5232/10000 [8:15:30<7:14:13, 5.46s/it][2025-06-19 21:45:15,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:45:15,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.91 | bwd_microstep: 3365.04 | bwd_inner_microstep: 3364.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 21:45:15,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.91 | bwd: 3365.06 | bwd_inner: 3364.24 | bwd_allreduce: 0.77 | step: 6.70 52%|█████▏ | 5233/10000 [8:15:36<7:16:14, 5.49s/it] {'loss': 0.0178, 'grad_norm': 2.3857316970825195, 'learning_rate': 1.9462431366848757e-05, 'epoch': 5.23} 52%|█████▏ | 5233/10000 [8:15:36<7:16:14, 5.49s/it][2025-06-19 21:45:20,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:45:20,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.03 | 
bwd_microstep: 3326.92 | bwd_inner_microstep: 3326.10 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.36 [2025-06-19 21:45:20,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.03 | bwd: 3326.93 | bwd_inner: 3326.10 | bwd_allreduce: 0.79 | step: 7.36 52%|█████▏ | 5234/10000 [8:15:41<7:15:46, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.023765699937939644, 'learning_rate': 1.9455956224792052e-05, 'epoch': 5.23} 52%|█████▏ | 5234/10000 [8:15:41<7:15:46, 5.49s/it][2025-06-19 21:45:26,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:45:26,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.32 | bwd_microstep: 3315.93 | bwd_inner_microstep: 3315.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 21:45:26,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.32 | bwd: 3315.94 | bwd_inner: 3315.14 | bwd_allreduce: 0.76 | step: 6.58 52%|█████▏ | 5235/10000 [8:15:46<7:15:03, 5.48s/it] {'loss': 0.0666, 'grad_norm': 4.71543550491333, 'learning_rate': 1.9449481139803007e-05, 'epoch': 5.24} 52%|█████▏ | 5235/10000 [8:15:46<7:15:03, 5.48s/it][2025-06-19 21:45:31,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:45:31,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.18 | bwd_microstep: 3324.87 | bwd_inner_microstep: 3324.05 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.43 [2025-06-19 21:45:31,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.18 | bwd: 3324.89 | bwd_inner: 3324.05 | bwd_allreduce: 0.78 | step: 7.43 52%|█████▏ | 5236/10000 [8:15:52<7:14:46, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.10067956894636154, 'learning_rate': 1.9443006112560836e-05, 'epoch': 5.24} 52%|█████▏ | 5236/10000 [8:15:52<7:14:46, 5.48s/it][2025-06-19 21:45:37,119] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:45:37,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.62 | bwd_microstep: 3320.64 | bwd_inner_microstep: 3319.75 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.12 [2025-06-19 21:45:37,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.62 | bwd: 3320.65 | bwd_inner: 3319.75 | bwd_allreduce: 0.86 | step: 7.12 52%|█████▏ | 5237/10000 [8:15:57<7:14:21, 5.47s/it] {'loss': 0.2204, 'grad_norm': 6.673093795776367, 'learning_rate': 1.943653114374474e-05, 'epoch': 5.24} 52%|█████▏ | 5237/10000 [8:15:57<7:14:21, 5.47s/it][2025-06-19 21:45:42,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:45:42,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.55 | bwd_microstep: 3319.51 | bwd_inner_microstep: 3318.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 21:45:42,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.55 | bwd: 3319.52 | bwd_inner: 3318.72 | bwd_allreduce: 0.76 | step: 6.77 52%|█████▏ | 5238/10000 [8:16:03<7:14:04, 5.47s/it] {'loss': 0.0065, 'grad_norm': 1.1597390174865723, 'learning_rate': 1.943005623403391e-05, 'epoch': 5.24} 52%|█████▏ | 5238/10000 [8:16:03<7:14:04, 5.47s/it][2025-06-19 21:45:48,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:45:48,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.70 | bwd_microstep: 3315.45 | bwd_inner_microstep: 3314.62 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.93 [2025-06-19 21:45:48,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.70 | bwd: 3315.47 | bwd_inner: 3314.62 | bwd_allreduce: 0.80 | step: 6.93 
52%|█████▏ | 5239/10000 [8:16:08<7:13:54, 5.47s/it] {'loss': 0.0426, 'grad_norm': 2.097280263900757, 'learning_rate': 1.9423581384107547e-05, 'epoch': 5.24} 52%|█████▏ | 5239/10000 [8:16:08<7:13:54, 5.47s/it][2025-06-19 21:45:53,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:45:53,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.97 | bwd_microstep: 3321.54 | bwd_inner_microstep: 3320.59 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.73 [2025-06-19 21:45:53,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.97 | bwd: 3321.55 | bwd_inner: 3320.59 | bwd_allreduce: 0.92 | step: 7.74 52%|█████▏ | 5240/10000 [8:16:14<7:14:00, 5.47s/it] {'loss': 0.0044, 'grad_norm': 0.49896660447120667, 'learning_rate': 1.9417106594644814e-05, 'epoch': 5.24} 52%|█████▏ | 5240/10000 [8:16:14<7:14:00, 5.47s/it][2025-06-19 21:45:58,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:45:58,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.41 | bwd_microstep: 3318.75 | bwd_inner_microstep: 3317.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 21:45:58,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.41 | bwd: 3318.77 | bwd_inner: 3317.96 | bwd_allreduce: 0.76 | step: 6.77 52%|█████▏ | 5241/10000 [8:16:19<7:13:47, 5.47s/it] {'loss': 0.0025, 'grad_norm': 0.32305285334587097, 'learning_rate': 1.941063186632489e-05, 'epoch': 5.24} 52%|█████▏ | 5241/10000 [8:16:19<7:13:47, 5.47s/it][2025-06-19 21:46:04,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:46:04,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.06 | bwd_microstep: 3327.14 | 
bwd_inner_microstep: 3326.18 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.53 [2025-06-19 21:46:04,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.06 | bwd: 3327.16 | bwd_inner: 3326.18 | bwd_allreduce: 0.93 | step: 7.54 52%|█████▏ | 5242/10000 [8:16:25<7:13:48, 5.47s/it] {'loss': 0.0022, 'grad_norm': 0.4844355881214142, 'learning_rate': 1.940415719982695e-05, 'epoch': 5.24} 52%|█████▏ | 5242/10000 [8:16:25<7:13:48, 5.47s/it][2025-06-19 21:46:09,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:46:09,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.94 | bwd_microstep: 3319.94 | bwd_inner_microstep: 3319.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 21:46:09,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.94 | bwd: 3319.96 | bwd_inner: 3319.15 | bwd_allreduce: 0.76 | step: 6.75 52%|█████▏ | 5243/10000 [8:16:30<7:13:41, 5.47s/it] {'loss': 0.0025, 'grad_norm': 0.2443421632051468, 'learning_rate': 1.9397682595830154e-05, 'epoch': 5.24} 52%|█████▏ | 5243/10000 [8:16:30<7:13:41, 5.47s/it][2025-06-19 21:46:15,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:46:15,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.42 | bwd_microstep: 3318.52 | bwd_inner_microstep: 3317.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 21:46:15,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.42 | bwd: 3318.54 | bwd_inner: 3317.74 | bwd_allreduce: 0.75 | step: 6.61 52%|█████▏ | 5244/10000 [8:16:36<7:13:21, 5.47s/it] {'loss': 0.0006, 'grad_norm': 0.0565871000289917, 'learning_rate': 1.939120805501366e-05, 'epoch': 5.24} 52%|█████▏ | 5244/10000 [8:16:36<7:13:21, 5.47s/it][2025-06-19 21:46:20,875] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.90 [2025-06-19 21:46:20,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.05 | bwd_microstep: 3328.18 | bwd_inner_microstep: 3327.33 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.92 [2025-06-19 21:46:20,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.05 | bwd: 3328.20 | bwd_inner: 3327.33 | bwd_allreduce: 0.81 | step: 7.92 52%|█████▏ | 5245/10000 [8:16:41<7:13:42, 5.47s/it] {'loss': 0.0053, 'grad_norm': 0.8687425851821899, 'learning_rate': 1.9384733578056616e-05, 'epoch': 5.25} 52%|█████▏ | 5245/10000 [8:16:41<7:13:42, 5.47s/it][2025-06-19 21:46:26,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-19 21:46:26,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.54 | bwd_microstep: 3331.62 | bwd_inner_microstep: 3330.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 21:46:26,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.54 | bwd: 3331.64 | bwd_inner: 3330.84 | bwd_allreduce: 0.76 | step: 6.78 52%|█████▏ | 5246/10000 [8:16:47<7:13:59, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.0689239650964737, 'learning_rate': 1.9378259165638162e-05, 'epoch': 5.25} 52%|█████▏ | 5246/10000 [8:16:47<7:13:59, 5.48s/it][2025-06-19 21:46:31,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:46:31,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.23 | bwd_microstep: 3327.11 | bwd_inner_microstep: 3326.22 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.12 [2025-06-19 21:46:31,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.23 | bwd: 3327.13 | bwd_inner: 3326.22 | bwd_allreduce: 0.85 | step: 7.13 
52%|█████▏ | 5247/10000 [8:16:52<7:13:53, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.06553734093904495, 'learning_rate': 1.9371784818437436e-05, 'epoch': 5.25} 52%|█████▏ | 5247/10000 [8:16:52<7:13:53, 5.48s/it][2025-06-19 21:46:37,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:46:37,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.42 | bwd_microstep: 3332.65 | bwd_inner_microstep: 3331.84 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 21:46:37,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.42 | bwd: 3332.67 | bwd_inner: 3331.84 | bwd_allreduce: 0.78 | step: 7.21 52%|█████▏ | 5248/10000 [8:16:58<7:14:03, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.05122866854071617, 'learning_rate': 1.9365310537133566e-05, 'epoch': 5.25} 52%|█████▏ | 5248/10000 [8:16:58<7:14:03, 5.48s/it][2025-06-19 21:46:42,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:46:42,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.28 | bwd_microstep: 3340.31 | bwd_inner_microstep: 3339.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 21:46:42,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.28 | bwd: 3340.33 | bwd_inner: 3339.52 | bwd_allreduce: 0.76 | step: 6.81 52%|█████▏ | 5249/10000 [8:17:03<7:14:10, 5.48s/it] {'loss': 0.0038, 'grad_norm': 0.6914318203926086, 'learning_rate': 1.9358836322405677e-05, 'epoch': 5.25} 52%|█████▏ | 5249/10000 [8:17:03<7:14:10, 5.48s/it][2025-06-19 21:46:48,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:46:48,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.76 | bwd_microstep: 3328.13 | 
bwd_inner_microstep: 3327.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 21:46:48,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.76 | bwd: 3328.14 | bwd_inner: 3327.33 | bwd_allreduce: 0.77 | step: 6.66 52%|█████▎ | 5250/10000 [8:17:09<7:13:58, 5.48s/it] {'loss': 0.0126, 'grad_norm': 2.283235549926758, 'learning_rate': 1.9352362174932887e-05, 'epoch': 5.25} 52%|█████▎ | 5250/10000 [8:17:09<7:13:58, 5.48s/it][2025-06-19 21:46:53,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:46:53,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.05 | bwd_microstep: 3328.86 | bwd_inner_microstep: 3327.97 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.23 [2025-06-19 21:46:53,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.05 | bwd: 3328.88 | bwd_inner: 3327.97 | bwd_allreduce: 0.85 | step: 7.23 53%|█████▎ | 5251/10000 [8:17:14<7:13:42, 5.48s/it] {'loss': 0.0014, 'grad_norm': 0.1011597216129303, 'learning_rate': 1.9345888095394295e-05, 'epoch': 5.25} 53%|█████▎ | 5251/10000 [8:17:14<7:13:42, 5.48s/it][2025-06-19 21:46:59,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:46:59,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.80 | bwd_microstep: 3327.76 | bwd_inner_microstep: 3326.92 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.36 [2025-06-19 21:46:59,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.81 | bwd: 3327.77 | bwd_inner: 3326.92 | bwd_allreduce: 0.80 | step: 7.36 53%|█████▎ | 5252/10000 [8:17:20<7:13:41, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.09282281249761581, 'learning_rate': 1.9339414084469004e-05, 'epoch': 5.25} 53%|█████▎ | 5252/10000 [8:17:20<7:13:41, 5.48s/it][2025-06-19 21:47:04,815] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:47:04,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.65 | bwd_microstep: 3378.93 | bwd_inner_microstep: 3378.12 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-19 21:47:04,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.65 | bwd: 3378.95 | bwd_inner: 3378.12 | bwd_allreduce: 0.78 | step: 7.18 53%|█████▎ | 5253/10000 [8:17:25<7:15:27, 5.50s/it] {'loss': 0.0006, 'grad_norm': 0.09407571703195572, 'learning_rate': 1.9332940142836106e-05, 'epoch': 5.25} 53%|█████▎ | 5253/10000 [8:17:25<7:15:27, 5.50s/it][2025-06-19 21:47:10,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:47:10,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.80 | bwd_microstep: 3330.57 | bwd_inner_microstep: 3329.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 21:47:10,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.80 | bwd: 3330.58 | bwd_inner: 3329.77 | bwd_allreduce: 0.77 | step: 6.98 53%|█████▎ | 5254/10000 [8:17:31<7:14:39, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.003976619802415371, 'learning_rate': 1.9326466271174693e-05, 'epoch': 5.25} 53%|█████▎ | 5254/10000 [8:17:31<7:14:39, 5.49s/it][2025-06-19 21:47:15,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:47:15,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.09 | bwd_microstep: 3376.76 | bwd_inner_microstep: 3375.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.75 [2025-06-19 21:47:15,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.09 | bwd: 3376.78 | bwd_inner: 3375.95 | bwd_allreduce: 0.78 | step: 6.75 
53%|█████▎ | 5255/10000 [8:17:36<7:15:52, 5.51s/it] {'loss': 0.0008, 'grad_norm': 0.07565658539533615, 'learning_rate': 1.931999247016385e-05, 'epoch': 5.25} 53%|█████▎ | 5255/10000 [8:17:36<7:15:52, 5.51s/it][2025-06-19 21:47:21,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:47:21,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.71 | bwd_microstep: 3332.10 | bwd_inner_microstep: 3331.30 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 21:47:21,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.71 | bwd: 3332.12 | bwd_inner: 3331.30 | bwd_allreduce: 0.77 | step: 7.14 53%|█████▎ | 5256/10000 [8:17:42<7:15:07, 5.50s/it] {'loss': 0.0052, 'grad_norm': 1.0946316719055176, 'learning_rate': 1.9313518740482636e-05, 'epoch': 5.26} 53%|█████▎ | 5256/10000 [8:17:42<7:15:07, 5.50s/it][2025-06-19 21:47:26,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:47:26,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.38 | bwd_microstep: 3338.15 | bwd_inner_microstep: 3337.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 21:47:26,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.38 | bwd: 3338.16 | bwd_inner: 3337.36 | bwd_allreduce: 0.76 | step: 6.64 53%|█████▎ | 5257/10000 [8:17:47<7:14:34, 5.50s/it] {'loss': 0.0019, 'grad_norm': 0.21170905232429504, 'learning_rate': 1.930704508281012e-05, 'epoch': 5.26} 53%|█████▎ | 5257/10000 [8:17:47<7:14:34, 5.50s/it][2025-06-19 21:47:32,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:47:32,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.28 | bwd_microstep: 3328.80 | 
bwd_inner_microstep: 3327.93 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.84 [2025-06-19 21:47:32,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.28 | bwd: 3328.81 | bwd_inner: 3327.93 | bwd_allreduce: 0.84 | step: 6.84 53%|█████▎ | 5258/10000 [8:17:53<7:14:02, 5.49s/it] {'loss': 0.006, 'grad_norm': 0.7918748259544373, 'learning_rate': 1.9300571497825356e-05, 'epoch': 5.26} 53%|█████▎ | 5258/10000 [8:17:53<7:14:02, 5.49s/it][2025-06-19 21:47:37,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 21:47:37,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.77 | bwd_microstep: 3341.02 | bwd_inner_microstep: 3339.91 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.88 [2025-06-19 21:47:37,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.77 | bwd: 3341.05 | bwd_inner: 3339.91 | bwd_allreduce: 1.07 | step: 7.89 53%|█████▎ | 5259/10000 [8:17:58<7:14:11, 5.49s/it] {'loss': 0.0044, 'grad_norm': 0.7270030379295349, 'learning_rate': 1.92940979862074e-05, 'epoch': 5.26} 53%|█████▎ | 5259/10000 [8:17:58<7:14:11, 5.49s/it][2025-06-19 21:47:43,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 21:47:43,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.28 | bwd_microstep: 3333.94 | bwd_inner_microstep: 3332.86 | bwd_allreduce_microstep: 1.02 | step_microstep: 8.55 [2025-06-19 21:47:43,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.28 | bwd: 3333.95 | bwd_inner: 3332.86 | bwd_allreduce: 1.04 | step: 8.55 53%|█████▎ | 5260/10000 [8:18:04<7:14:13, 5.50s/it] {'loss': 0.061, 'grad_norm': 4.137311935424805, 'learning_rate': 1.928762454863529e-05, 'epoch': 5.26} 53%|█████▎ | 5260/10000 [8:18:04<7:14:13, 5.50s/it][2025-06-19 21:47:48,783] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:47:48,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.22 | bwd_microstep: 3325.86 | bwd_inner_microstep: 3324.90 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.40 [2025-06-19 21:47:48,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.22 | bwd: 3325.88 | bwd_inner: 3324.90 | bwd_allreduce: 0.93 | step: 7.41 53%|█████▎ | 5261/10000 [8:18:09<7:14:03, 5.50s/it] {'loss': 0.0086, 'grad_norm': 0.9362318515777588, 'learning_rate': 1.9281151185788054e-05, 'epoch': 5.26} 53%|█████▎ | 5261/10000 [8:18:09<7:14:03, 5.50s/it][2025-06-19 21:47:54,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:47:54,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.84 | bwd_microstep: 3388.57 | bwd_inner_microstep: 3387.66 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.35 [2025-06-19 21:47:54,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.84 | bwd: 3388.59 | bwd_inner: 3387.66 | bwd_allreduce: 0.88 | step: 7.35 53%|█████▎ | 5262/10000 [8:18:15<7:15:46, 5.52s/it] {'loss': 0.0071, 'grad_norm': 1.3867721557617188, 'learning_rate': 1.9274677898344723e-05, 'epoch': 5.26} 53%|█████▎ | 5262/10000 [8:18:15<7:15:46, 5.52s/it][2025-06-19 21:47:59,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:47:59,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.04 | bwd_microstep: 3329.65 | bwd_inner_microstep: 3328.83 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.21 [2025-06-19 21:47:59,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.04 | bwd: 3329.66 | bwd_inner: 3328.83 | bwd_allreduce: 0.79 | step: 7.21 53%|█████▎ | 5263/10000 
[8:18:20<7:14:55, 5.51s/it] {'loss': 0.0006, 'grad_norm': 0.03598164767026901, 'learning_rate': 1.9268204686984314e-05, 'epoch': 5.26} 53%|█████▎ | 5263/10000 [8:18:20<7:14:55, 5.51s/it][2025-06-19 21:48:05,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:48:05,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.23 | bwd_microstep: 3377.52 | bwd_inner_microstep: 3376.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 21:48:05,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.23 | bwd: 3377.53 | bwd_inner: 3376.73 | bwd_allreduce: 0.76 | step: 6.75 53%|█████▎ | 5264/10000 [8:18:26<7:15:43, 5.52s/it] {'loss': 0.0005, 'grad_norm': 0.0808756947517395, 'learning_rate': 1.9261731552385835e-05, 'epoch': 5.26} 53%|█████▎ | 5264/10000 [8:18:26<7:15:43, 5.52s/it][2025-06-19 21:48:10,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:48:10,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.36 | bwd_microstep: 3323.38 | bwd_inner_microstep: 3322.58 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.73 [2025-06-19 21:48:10,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.36 | bwd: 3323.39 | bwd_inner: 3322.58 | bwd_allreduce: 0.77 | step: 7.54 53%|█████▎ | 5265/10000 [8:18:31<7:14:40, 5.51s/it] {'loss': 0.0022, 'grad_norm': 0.29079893231391907, 'learning_rate': 1.9255258495228302e-05, 'epoch': 5.26} 53%|█████▎ | 5265/10000 [8:18:31<7:14:40, 5.51s/it][2025-06-19 21:48:16,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:48:16,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.27 | bwd_microstep: 3323.21 | bwd_inner_microstep: 3322.16 | 
bwd_allreduce_microstep: 0.99 | step_microstep: 7.15 [2025-06-19 21:48:16,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.27 | bwd: 3323.22 | bwd_inner: 3322.16 | bwd_allreduce: 1.02 | step: 7.15 53%|█████▎ | 5266/10000 [8:18:37<7:13:38, 5.50s/it] {'loss': 0.0548, 'grad_norm': 4.542107582092285, 'learning_rate': 1.9248785516190687e-05, 'epoch': 5.27} 53%|█████▎ | 5266/10000 [8:18:37<7:13:38, 5.50s/it][2025-06-19 21:48:21,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:48:21,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.84 | bwd_microstep: 3380.47 | bwd_inner_microstep: 3379.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 21:48:21,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.84 | bwd: 3380.49 | bwd_inner: 3379.68 | bwd_allreduce: 0.76 | step: 6.72 53%|█████▎ | 5267/10000 [8:18:42<7:15:18, 5.52s/it] {'loss': 0.0006, 'grad_norm': 0.09354919195175171, 'learning_rate': 1.9242312615951985e-05, 'epoch': 5.27} 53%|█████▎ | 5267/10000 [8:18:42<7:15:18, 5.52s/it][2025-06-19 21:48:27,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:48:27,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.06 | bwd_microstep: 3370.48 | bwd_inner_microstep: 3369.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.12 [2025-06-19 21:48:27,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.06 | bwd: 3370.49 | bwd_inner: 3369.69 | bwd_allreduce: 0.77 | step: 7.12 53%|█████▎ | 5268/10000 [8:18:48<7:15:51, 5.53s/it] {'loss': 0.0009, 'grad_norm': 0.13693681359291077, 'learning_rate': 1.9235839795191173e-05, 'epoch': 5.27} 53%|█████▎ | 5268/10000 [8:18:48<7:15:51, 5.53s/it][2025-06-19 21:48:33,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.75 | optimizer_step: 2.73 [2025-06-19 21:48:33,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.66 | bwd_microstep: 3407.48 | bwd_inner_microstep: 3406.57 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.84 [2025-06-19 21:48:33,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.66 | bwd: 3407.50 | bwd_inner: 3406.57 | bwd_allreduce: 0.87 | step: 7.84 53%|█████▎ | 5269/10000 [8:18:53<7:17:15, 5.55s/it] {'loss': 0.0079, 'grad_norm': 1.0447945594787598, 'learning_rate': 1.9229367054587217e-05, 'epoch': 5.27} 53%|█████▎ | 5269/10000 [8:18:53<7:17:15, 5.55s/it][2025-06-19 21:48:38,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:48:38,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.96 | bwd_microstep: 3323.93 | bwd_inner_microstep: 3323.04 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.99 [2025-06-19 21:48:38,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.96 | bwd: 3323.95 | bwd_inner: 3323.04 | bwd_allreduce: 0.86 | step: 6.99 53%|█████▎ | 5270/10000 [8:18:59<7:15:56, 5.53s/it] {'loss': 0.0004, 'grad_norm': 0.06278746575117111, 'learning_rate': 1.9222894394819088e-05, 'epoch': 5.27} 53%|█████▎ | 5270/10000 [8:18:59<7:15:56, 5.53s/it][2025-06-19 21:48:43,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:48:43,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.43 | bwd_microstep: 3317.85 | bwd_inner_microstep: 3316.97 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.24 [2025-06-19 21:48:43,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.43 | bwd: 3317.87 | bwd_inner: 3316.97 | bwd_allreduce: 0.85 | step: 7.25 53%|█████▎ | 5271/10000 [8:19:04<7:14:13, 5.51s/it] 
{'loss': 0.1009, 'grad_norm': 5.395644664764404, 'learning_rate': 1.9216421816565723e-05, 'epoch': 5.27} 53%|█████▎ | 5271/10000 [8:19:04<7:14:13, 5.51s/it][2025-06-19 21:48:49,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:48:49,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.53 | bwd_microstep: 3407.09 | bwd_inner_microstep: 3406.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 21:48:49,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.53 | bwd: 3407.10 | bwd_inner: 3406.30 | bwd_allreduce: 0.76 | step: 6.67 53%|█████▎ | 5272/10000 [8:19:10<7:15:59, 5.53s/it] {'loss': 0.0174, 'grad_norm': 3.618011236190796, 'learning_rate': 1.9209949320506077e-05, 'epoch': 5.27} 53%|█████▎ | 5272/10000 [8:19:10<7:15:59, 5.53s/it][2025-06-19 21:48:55,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:48:55,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.02 | bwd_microstep: 3327.71 | bwd_inner_microstep: 3326.65 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.88 [2025-06-19 21:48:55,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.02 | bwd: 3327.73 | bwd_inner: 3326.65 | bwd_allreduce: 1.02 | step: 7.88 53%|█████▎ | 5273/10000 [8:19:15<7:14:47, 5.52s/it] {'loss': 0.0038, 'grad_norm': 0.7588223218917847, 'learning_rate': 1.9203476907319075e-05, 'epoch': 5.27} 53%|█████▎ | 5273/10000 [8:19:15<7:14:47, 5.52s/it][2025-06-19 21:49:00,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 21:49:00,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.10 | bwd_microstep: 3373.76 | bwd_inner_microstep: 3372.81 | bwd_allreduce_microstep: 0.90 | 
step_microstep: 7.27 [2025-06-19 21:49:00,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.10 | bwd: 3373.78 | bwd_inner: 3372.81 | bwd_allreduce: 0.92 | step: 7.27 53%|█████▎ | 5274/10000 [8:19:21<7:15:22, 5.53s/it] {'loss': 0.0006, 'grad_norm': 0.14710469543933868, 'learning_rate': 1.9197004577683653e-05, 'epoch': 5.27} 53%|█████▎ | 5274/10000 [8:19:21<7:15:22, 5.53s/it][2025-06-19 21:49:06,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:49:06,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.57 | bwd_microstep: 3402.55 | bwd_inner_microstep: 3401.74 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 21:49:06,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.57 | bwd: 3402.56 | bwd_inner: 3401.74 | bwd_allreduce: 0.78 | step: 7.19 53%|█████▎ | 5275/10000 [8:19:26<7:16:35, 5.54s/it] {'loss': 0.1334, 'grad_norm': 5.677093505859375, 'learning_rate': 1.9190532332278733e-05, 'epoch': 5.28} 53%|█████▎ | 5275/10000 [8:19:27<7:16:35, 5.54s/it][2025-06-19 21:49:11,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:49:11,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.96 | bwd_microstep: 3323.39 | bwd_inner_microstep: 3322.57 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.39 [2025-06-19 21:49:11,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.96 | bwd: 3323.40 | bwd_inner: 3322.57 | bwd_allreduce: 0.79 | step: 7.39 53%|█████▎ | 5276/10000 [8:19:32<7:15:05, 5.53s/it] {'loss': 0.0431, 'grad_norm': 5.12135648727417, 'learning_rate': 1.9184060171783203e-05, 'epoch': 5.28} 53%|█████▎ | 5276/10000 [8:19:32<7:15:05, 5.53s/it][2025-06-19 21:49:17,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:49:17,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.68 | bwd_microstep: 3399.17 | bwd_inner_microstep: 3398.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 21:49:17,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.68 | bwd: 3399.19 | bwd_inner: 3398.38 | bwd_allreduce: 0.76 | step: 6.80 53%|█████▎ | 5277/10000 [8:19:38<7:16:18, 5.54s/it] {'loss': 0.0033, 'grad_norm': 0.7834377288818359, 'learning_rate': 1.9177588096875976e-05, 'epoch': 5.28} 53%|█████▎ | 5277/10000 [8:19:38<7:16:18, 5.54s/it][2025-06-19 21:49:22,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:49:22,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.16 | bwd_microstep: 3319.63 | bwd_inner_microstep: 3318.77 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.64 [2025-06-19 21:49:22,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.16 | bwd: 3319.65 | bwd_inner: 3318.77 | bwd_allreduce: 0.83 | step: 7.64 53%|█████▎ | 5278/10000 [8:19:43<7:14:23, 5.52s/it] {'loss': 0.0013, 'grad_norm': 0.30331090092658997, 'learning_rate': 1.917111610823594e-05, 'epoch': 5.28} 53%|█████▎ | 5278/10000 [8:19:43<7:14:23, 5.52s/it][2025-06-19 21:49:28,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:49:28,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.18 | bwd_microstep: 3328.42 | bwd_inner_microstep: 3327.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 21:49:28,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.18 | bwd: 3328.43 | bwd_inner: 3327.62 | bwd_allreduce: 0.77 | step: 6.69 53%|█████▎ | 5279/10000 [8:19:49<7:13:17, 5.51s/it] {'loss': 0.0005, 'grad_norm': 
0.038024406880140305, 'learning_rate': 1.9164644206541977e-05, 'epoch': 5.28} 53%|█████▎ | 5279/10000 [8:19:49<7:13:17, 5.51s/it][2025-06-19 21:49:33,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:49:33,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.12 | bwd_microstep: 3329.58 | bwd_inner_microstep: 3328.76 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.30 [2025-06-19 21:49:33,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.12 | bwd: 3329.59 | bwd_inner: 3328.76 | bwd_allreduce: 0.78 | step: 7.30 53%|█████▎ | 5280/10000 [8:19:54<7:12:31, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.5827098488807678, 'learning_rate': 1.9158172392472966e-05, 'epoch': 5.28} 53%|█████▎ | 5280/10000 [8:19:54<7:12:31, 5.50s/it][2025-06-19 21:49:39,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:49:39,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.49 | bwd_microstep: 3320.23 | bwd_inner_microstep: 3319.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 21:49:39,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.49 | bwd: 3320.25 | bwd_inner: 3319.42 | bwd_allreduce: 0.78 | step: 6.83 53%|█████▎ | 5281/10000 [8:19:59<7:11:41, 5.49s/it] {'loss': 0.0007, 'grad_norm': 0.13624951243400574, 'learning_rate': 1.915170066670776e-05, 'epoch': 5.28} 53%|█████▎ | 5281/10000 [8:19:59<7:11:41, 5.49s/it][2025-06-19 21:49:44,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:49:44,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.22 | bwd_microstep: 3322.82 | bwd_inner_microstep: 3322.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.03 [2025-06-19 
21:49:44,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.22 | bwd: 3322.84 | bwd_inner: 3322.03 | bwd_allreduce: 0.76 | step: 7.03 53%|█████▎ | 5282/10000 [8:20:05<7:11:11, 5.48s/it] {'loss': 0.0119, 'grad_norm': 1.9907395839691162, 'learning_rate': 1.9145229029925214e-05, 'epoch': 5.28} 53%|█████▎ | 5282/10000 [8:20:05<7:11:11, 5.48s/it][2025-06-19 21:49:50,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 21:49:50,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.73 | bwd_microstep: 3320.20 | bwd_inner_microstep: 3319.31 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.50 [2025-06-19 21:49:50,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.73 | bwd: 3320.22 | bwd_inner: 3319.31 | bwd_allreduce: 0.86 | step: 7.51 53%|█████▎ | 5283/10000 [8:20:10<7:10:44, 5.48s/it] {'loss': 0.0244, 'grad_norm': 4.006170272827148, 'learning_rate': 1.9138757482804175e-05, 'epoch': 5.28} 53%|█████▎ | 5283/10000 [8:20:10<7:10:44, 5.48s/it][2025-06-19 21:49:55,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:49:55,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.62 | bwd_microstep: 3318.93 | bwd_inner_microstep: 3318.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 21:49:55,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.62 | bwd: 3318.95 | bwd_inner: 3318.13 | bwd_allreduce: 0.77 | step: 6.82 53%|█████▎ | 5284/10000 [8:20:16<7:10:28, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.00596972880885005, 'learning_rate': 1.913228602602348e-05, 'epoch': 5.28} 53%|█████▎ | 5284/10000 [8:20:16<7:10:28, 5.48s/it][2025-06-19 21:50:01,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 
2.72 [2025-06-19 21:50:01,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.72 | bwd_microstep: 3367.74 | bwd_inner_microstep: 3366.84 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.95 [2025-06-19 21:50:01,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.72 | bwd: 3367.76 | bwd_inner: 3366.84 | bwd_allreduce: 0.88 | step: 6.95 53%|█████▎ | 5285/10000 [8:20:21<7:11:40, 5.49s/it] {'loss': 0.0119, 'grad_norm': 3.1442043781280518, 'learning_rate': 1.9125814660261963e-05, 'epoch': 5.29} 53%|█████▎ | 5285/10000 [8:20:21<7:11:40, 5.49s/it][2025-06-19 21:50:06,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:50:06,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.43 | bwd_microstep: 3370.51 | bwd_inner_microstep: 3369.71 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 21:50:06,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.43 | bwd: 3370.53 | bwd_inner: 3369.71 | bwd_allreduce: 0.78 | step: 7.18 53%|█████▎ | 5286/10000 [8:20:27<7:12:41, 5.51s/it] {'loss': 0.0035, 'grad_norm': 0.49024447798728943, 'learning_rate': 1.911934338619842e-05, 'epoch': 5.29} 53%|█████▎ | 5286/10000 [8:20:27<7:12:41, 5.51s/it][2025-06-19 21:50:12,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:50:12,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.87 | bwd_microstep: 3370.82 | bwd_inner_microstep: 3370.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 21:50:12,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.87 | bwd: 3370.83 | bwd_inner: 3370.04 | bwd_allreduce: 0.76 | step: 6.70 53%|█████▎ | 5287/10000 [8:20:32<7:13:15, 5.52s/it] {'loss': 0.0186, 'grad_norm': 2.2223212718963623, 'learning_rate': 
1.9112872204511668e-05, 'epoch': 5.29} 53%|█████▎ | 5287/10000 [8:20:32<7:13:15, 5.52s/it][2025-06-19 21:50:17,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:50:17,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.91 | bwd_microstep: 3366.25 | bwd_inner_microstep: 3365.25 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.76 [2025-06-19 21:50:17,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.91 | bwd: 3366.27 | bwd_inner: 3365.25 | bwd_allreduce: 0.97 | step: 7.76 53%|█████▎ | 5288/10000 [8:20:38<7:13:42, 5.52s/it] {'loss': 0.1259, 'grad_norm': 8.480264663696289, 'learning_rate': 1.9106401115880507e-05, 'epoch': 5.29} 53%|█████▎ | 5288/10000 [8:20:38<7:13:42, 5.52s/it][2025-06-19 21:50:23,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 21:50:23,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.34 | bwd_microstep: 3379.57 | bwd_inner_microstep: 3378.56 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.49 [2025-06-19 21:50:23,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.34 | bwd: 3379.59 | bwd_inner: 3378.56 | bwd_allreduce: 0.98 | step: 7.49 53%|█████▎ | 5289/10000 [8:20:44<7:14:18, 5.53s/it] {'loss': 0.0008, 'grad_norm': 0.06776086986064911, 'learning_rate': 1.9099930120983724e-05, 'epoch': 5.29} 53%|█████▎ | 5289/10000 [8:20:44<7:14:18, 5.53s/it][2025-06-19 21:50:28,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:50:28,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.88 | bwd_microstep: 3323.36 | bwd_inner_microstep: 3322.56 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-19 21:50:28,735] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.88 | bwd: 3323.38 | bwd_inner: 3322.56 | bwd_allreduce: 0.78 | step: 7.16 53%|█████▎ | 5290/10000 [8:20:49<7:12:48, 5.51s/it] {'loss': 0.3507, 'grad_norm': 9.314658164978027, 'learning_rate': 1.90934592205001e-05, 'epoch': 5.29} 53%|█████▎ | 5290/10000 [8:20:49<7:12:48, 5.51s/it][2025-06-19 21:50:34,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:50:34,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.14 | bwd_microstep: 3371.51 | bwd_inner_microstep: 3370.50 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.12 [2025-06-19 21:50:34,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.14 | bwd: 3371.52 | bwd_inner: 3370.50 | bwd_allreduce: 0.98 | step: 7.13 53%|█████▎ | 5291/10000 [8:20:55<7:13:16, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.02947342023253441, 'learning_rate': 1.9086988415108387e-05, 'epoch': 5.29} 53%|█████▎ | 5291/10000 [8:20:55<7:13:16, 5.52s/it][2025-06-19 21:50:39,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:50:39,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.60 | bwd_microstep: 3316.83 | bwd_inner_microstep: 3316.01 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.26 [2025-06-19 21:50:39,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.60 | bwd: 3316.84 | bwd_inner: 3316.01 | bwd_allreduce: 0.79 | step: 7.26 53%|█████▎ | 5292/10000 [8:21:00<7:11:48, 5.50s/it] {'loss': 0.0177, 'grad_norm': 1.5643079280853271, 'learning_rate': 1.9080517705487353e-05, 'epoch': 5.29} 53%|█████▎ | 5292/10000 [8:21:00<7:11:48, 5.50s/it][2025-06-19 21:50:45,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 
21:50:45,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.88 | bwd_microstep: 3314.30 | bwd_inner_microstep: 3313.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 21:50:45,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.88 | bwd: 3314.32 | bwd_inner: 3313.51 | bwd_allreduce: 0.77 | step: 6.79 53%|█████▎ | 5293/10000 [8:21:05<7:10:35, 5.49s/it] {'loss': 0.0033, 'grad_norm': 0.6662566065788269, 'learning_rate': 1.9074047092315745e-05, 'epoch': 5.29} 53%|█████▎ | 5293/10000 [8:21:05<7:10:35, 5.49s/it][2025-06-19 21:50:50,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:50:50,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.84 | bwd_microstep: 3315.25 | bwd_inner_microstep: 3314.45 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 21:50:50,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.84 | bwd: 3315.27 | bwd_inner: 3314.45 | bwd_allreduce: 0.78 | step: 6.97 53%|█████▎ | 5294/10000 [8:21:11<7:09:49, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.09531908482313156, 'learning_rate': 1.9067576576272297e-05, 'epoch': 5.29} 53%|█████▎ | 5294/10000 [8:21:11<7:09:49, 5.48s/it][2025-06-19 21:50:56,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 21:50:56,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.26 | bwd_microstep: 3314.92 | bwd_inner_microstep: 3313.95 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.34 [2025-06-19 21:50:56,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.26 | bwd: 3314.94 | bwd_inner: 3313.95 | bwd_allreduce: 0.94 | step: 7.34 53%|█████▎ | 5295/10000 [8:21:16<7:09:03, 5.47s/it] {'loss': 0.0032, 'grad_norm': 0.561968207359314, 'learning_rate': 1.9061106158035744e-05, 
'epoch': 5.29} 53%|█████▎ | 5295/10000 [8:21:16<7:09:03, 5.47s/it][2025-06-19 21:51:01,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:51:01,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.11 | bwd_microstep: 3323.14 | bwd_inner_microstep: 3322.29 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.95 [2025-06-19 21:51:01,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.11 | bwd: 3323.15 | bwd_inner: 3322.29 | bwd_allreduce: 0.82 | step: 6.95 53%|█████▎ | 5296/10000 [8:21:22<7:08:51, 5.47s/it] {'loss': 0.0017, 'grad_norm': 0.2728888988494873, 'learning_rate': 1.9054635838284796e-05, 'epoch': 5.3} 53%|█████▎ | 5296/10000 [8:21:22<7:08:51, 5.47s/it][2025-06-19 21:51:07,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:51:07,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.37 | bwd_microstep: 3362.32 | bwd_inner_microstep: 3361.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 21:51:07,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.37 | bwd: 3362.33 | bwd_inner: 3361.53 | bwd_allreduce: 0.76 | step: 6.65 53%|█████▎ | 5297/10000 [8:21:27<7:10:00, 5.49s/it] {'loss': 0.0054, 'grad_norm': 0.9226633310317993, 'learning_rate': 1.904816561769816e-05, 'epoch': 5.3} 53%|█████▎ | 5297/10000 [8:21:27<7:10:00, 5.49s/it][2025-06-19 21:51:12,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:51:12,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.01 | bwd_microstep: 3367.08 | bwd_inner_microstep: 3366.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 21:51:12,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd: 2133.01 | bwd: 3367.10 | bwd_inner: 3366.29 | bwd_allreduce: 0.76 | step: 6.98 53%|█████▎ | 5298/10000 [8:21:33<7:11:06, 5.50s/it] {'loss': 0.0011, 'grad_norm': 0.15734897553920746, 'learning_rate': 1.9041695496954532e-05, 'epoch': 5.3} 53%|█████▎ | 5298/10000 [8:21:33<7:11:06, 5.50s/it][2025-06-19 21:51:18,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:51:18,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.55 | bwd_microstep: 3322.18 | bwd_inner_microstep: 3321.25 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.32 [2025-06-19 21:51:18,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.55 | bwd: 3322.19 | bwd_inner: 3321.25 | bwd_allreduce: 0.90 | step: 7.33 53%|█████▎ | 5299/10000 [8:21:38<7:10:21, 5.49s/it] {'loss': 0.0322, 'grad_norm': 3.195702314376831, 'learning_rate': 1.9035225476732598e-05, 'epoch': 5.3} 53%|█████▎ | 5299/10000 [8:21:38<7:10:21, 5.49s/it][2025-06-19 21:51:23,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:51:23,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.18 | bwd_microstep: 3320.42 | bwd_inner_microstep: 3319.61 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 21:51:23,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.18 | bwd: 3320.43 | bwd_inner: 3319.61 | bwd_allreduce: 0.78 | step: 7.22 53%|█████▎ | 5300/10000 [8:21:44<7:09:39, 5.49s/it] {'loss': 0.0051, 'grad_norm': 1.0018287897109985, 'learning_rate': 1.9028755557711043e-05, 'epoch': 5.3} 53%|█████▎ | 5300/10000 [8:21:44<7:09:39, 5.49s/it][2025-06-19 21:51:29,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:51:29,039] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2106.04 | bwd_microstep: 3323.43 | bwd_inner_microstep: 3322.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 21:51:29,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.04 | bwd: 3323.44 | bwd_inner: 3322.63 | bwd_allreduce: 0.76 | step: 6.73 53%|█████▎ | 5301/10000 [8:21:49<7:09:12, 5.48s/it] {'loss': 0.0099, 'grad_norm': 1.0401418209075928, 'learning_rate': 1.9022285740568514e-05, 'epoch': 5.3} 53%|█████▎ | 5301/10000 [8:21:49<7:09:12, 5.48s/it][2025-06-19 21:51:34,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:51:34,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.14 | bwd_microstep: 3315.46 | bwd_inner_microstep: 3314.65 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.26 [2025-06-19 21:51:34,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.14 | bwd: 3315.47 | bwd_inner: 3314.65 | bwd_allreduce: 0.78 | step: 7.26 53%|█████▎ | 5302/10000 [8:21:55<7:08:35, 5.47s/it] {'loss': 0.004, 'grad_norm': 0.5043461322784424, 'learning_rate': 1.901581602598367e-05, 'epoch': 5.3} 53%|█████▎ | 5302/10000 [8:21:55<7:08:35, 5.47s/it][2025-06-19 21:51:40,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:51:40,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.81 | bwd_microstep: 3373.47 | bwd_inner_microstep: 3372.53 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.23 [2025-06-19 21:51:40,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.81 | bwd: 3373.49 | bwd_inner: 3372.53 | bwd_allreduce: 0.91 | step: 7.23 53%|█████▎ | 5303/10000 [8:22:00<7:10:04, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.007268378045409918, 'learning_rate': 1.900934641463516e-05, 'epoch': 5.3} 53%|█████▎ | 5303/10000 
[8:22:00<7:10:04, 5.49s/it][2025-06-19 21:51:45,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:51:45,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.49 | bwd_microstep: 3317.68 | bwd_inner_microstep: 3316.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 21:51:45,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.49 | bwd: 3317.70 | bwd_inner: 3316.89 | bwd_allreduce: 0.76 | step: 6.68 53%|█████▎ | 5304/10000 [8:22:06<7:09:40, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.01732729747891426, 'learning_rate': 1.9002876907201616e-05, 'epoch': 5.3} 53%|█████▎ | 5304/10000 [8:22:06<7:09:40, 5.49s/it][2025-06-19 21:51:50,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:51:50,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.64 | bwd_microstep: 3316.08 | bwd_inner_microstep: 3315.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 21:51:50,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.64 | bwd: 3316.09 | bwd_inner: 3315.28 | bwd_allreduce: 0.77 | step: 7.10 53%|█████▎ | 5305/10000 [8:22:11<7:08:55, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.010801000520586967, 'learning_rate': 1.8996407504361654e-05, 'epoch': 5.3} 53%|█████▎ | 5305/10000 [8:22:11<7:08:55, 5.48s/it][2025-06-19 21:51:56,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.72 [2025-06-19 21:51:56,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.30 | bwd_microstep: 3313.57 | bwd_inner_microstep: 3312.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 21:51:56,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.30 | bwd: 3313.59 | 
bwd_inner: 3312.79 | bwd_allreduce: 0.75 | step: 6.73 53%|█████▎ | 5306/10000 [8:22:17<7:08:07, 5.47s/it] {'loss': 0.0005, 'grad_norm': 0.08531294763088226, 'learning_rate': 1.8989938206793892e-05, 'epoch': 5.31} 53%|█████▎ | 5306/10000 [8:22:17<7:08:07, 5.47s/it][2025-06-19 21:52:01,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:52:01,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.90 | bwd_microstep: 3314.04 | bwd_inner_microstep: 3313.25 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.03 [2025-06-19 21:52:01,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.91 | bwd: 3314.05 | bwd_inner: 3313.25 | bwd_allreduce: 0.76 | step: 7.04 53%|█████▎ | 5307/10000 [8:22:22<7:07:36, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.022805728018283844, 'learning_rate': 1.898346901517692e-05, 'epoch': 5.31} 53%|█████▎ | 5307/10000 [8:22:22<7:07:36, 5.47s/it][2025-06-19 21:52:07,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:52:07,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.62 | bwd_microstep: 3366.08 | bwd_inner_microstep: 3365.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 21:52:07,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.62 | bwd: 3366.09 | bwd_inner: 3365.30 | bwd_allreduce: 0.75 | step: 6.54 53%|█████▎ | 5308/10000 [8:22:28<7:09:08, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.010938003659248352, 'learning_rate': 1.8976999930189328e-05, 'epoch': 5.31} 53%|█████▎ | 5308/10000 [8:22:28<7:09:08, 5.49s/it][2025-06-19 21:52:12,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:52:12,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2104.70 | bwd_microstep: 3311.10 | bwd_inner_microstep: 3310.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 21:52:12,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.70 | bwd: 3311.12 | bwd_inner: 3310.32 | bwd_allreduce: 0.75 | step: 6.57 53%|█████▎ | 5309/10000 [8:22:33<7:08:12, 5.48s/it] {'loss': 0.0032, 'grad_norm': 0.5193213224411011, 'learning_rate': 1.8970530952509697e-05, 'epoch': 5.31} 53%|█████▎ | 5309/10000 [8:22:33<7:08:12, 5.48s/it][2025-06-19 21:52:18,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:52:18,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.74 | bwd_microstep: 3365.09 | bwd_inner_microstep: 3364.12 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.23 [2025-06-19 21:52:18,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.74 | bwd: 3365.11 | bwd_inner: 3364.12 | bwd_allreduce: 0.94 | step: 7.23 53%|█████▎ | 5310/10000 [8:22:39<7:09:18, 5.49s/it] {'loss': 0.0061, 'grad_norm': 1.129644513130188, 'learning_rate': 1.8964062082816593e-05, 'epoch': 5.31} 53%|█████▎ | 5310/10000 [8:22:39<7:09:18, 5.49s/it][2025-06-19 21:52:23,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:52:23,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.88 | bwd_microstep: 3316.83 | bwd_inner_microstep: 3316.00 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.40 [2025-06-19 21:52:23,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.88 | bwd: 3316.85 | bwd_inner: 3316.00 | bwd_allreduce: 0.80 | step: 7.41 53%|█████▎ | 5311/10000 [8:22:44<7:08:39, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.014453002251684666, 'learning_rate': 1.8957593321788576e-05, 'epoch': 5.31} 53%|█████▎ | 5311/10000 [8:22:44<7:08:39, 
5.49s/it][2025-06-19 21:52:29,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:52:29,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.72 | bwd_microstep: 3317.07 | bwd_inner_microstep: 3316.00 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.39 [2025-06-19 21:52:29,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.72 | bwd: 3317.09 | bwd_inner: 3316.00 | bwd_allreduce: 1.03 | step: 7.41 53%|█████▎ | 5312/10000 [8:22:50<7:08:17, 5.48s/it] {'loss': 0.0035, 'grad_norm': 0.6059486269950867, 'learning_rate': 1.895112467010417e-05, 'epoch': 5.31} 53%|█████▎ | 5312/10000 [8:22:50<7:08:17, 5.48s/it][2025-06-19 21:52:34,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:52:34,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.10 | bwd_microstep: 3313.15 | bwd_inner_microstep: 3312.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 21:52:34,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.10 | bwd: 3313.16 | bwd_inner: 3312.34 | bwd_allreduce: 0.77 | step: 7.08 53%|█████▎ | 5313/10000 [8:22:55<7:07:38, 5.47s/it] {'loss': 0.0197, 'grad_norm': 2.7812812328338623, 'learning_rate': 1.894465612844192e-05, 'epoch': 5.31} 53%|█████▎ | 5313/10000 [8:22:55<7:07:38, 5.47s/it][2025-06-19 21:52:40,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 21:52:40,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.75 | bwd_microstep: 3364.98 | bwd_inner_microstep: 3363.93 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.20 [2025-06-19 21:52:40,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.75 | bwd: 3365.00 | bwd_inner: 3363.93 | 
bwd_allreduce: 1.02 | step: 7.20 53%|█████▎ | 5314/10000 [8:23:01<7:08:56, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.011914148926734924, 'learning_rate': 1.8938187697480345e-05, 'epoch': 5.31} 53%|█████▎ | 5314/10000 [8:23:01<7:08:56, 5.49s/it][2025-06-19 21:52:45,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:52:45,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.15 | bwd_microstep: 3360.23 | bwd_inner_microstep: 3359.31 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.12 [2025-06-19 21:52:45,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.15 | bwd: 3360.25 | bwd_inner: 3359.31 | bwd_allreduce: 0.88 | step: 7.12 53%|█████▎ | 5315/10000 [8:23:06<7:09:41, 5.50s/it] {'loss': 0.0443, 'grad_norm': 4.107802867889404, 'learning_rate': 1.8931719377897955e-05, 'epoch': 5.32} 53%|█████▎ | 5315/10000 [8:23:06<7:09:41, 5.50s/it][2025-06-19 21:52:51,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:52:51,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.48 | bwd_microstep: 3309.16 | bwd_inner_microstep: 3308.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 21:52:51,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.48 | bwd: 3309.17 | bwd_inner: 3308.35 | bwd_allreduce: 0.78 | step: 7.22 53%|█████▎ | 5316/10000 [8:23:12<7:08:29, 5.49s/it] {'loss': 0.0008, 'grad_norm': 0.11672534048557281, 'learning_rate': 1.8925251170373243e-05, 'epoch': 5.32} 53%|█████▎ | 5316/10000 [8:23:12<7:08:29, 5.49s/it][2025-06-19 21:52:56,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:52:56,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.45 | 
bwd_microstep: 3318.07 | bwd_inner_microstep: 3317.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 21:52:56,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.45 | bwd: 3318.08 | bwd_inner: 3317.28 | bwd_allreduce: 0.76 | step: 6.74 53%|█████▎ | 5317/10000 [8:23:17<7:07:49, 5.48s/it] {'loss': 0.0419, 'grad_norm': 3.9842417240142822, 'learning_rate': 1.8918783075584693e-05, 'epoch': 5.32} 53%|█████▎ | 5317/10000 [8:23:17<7:07:49, 5.48s/it][2025-06-19 21:53:02,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:53:02,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.76 | bwd_microstep: 3316.53 | bwd_inner_microstep: 3315.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 21:53:02,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.76 | bwd: 3316.54 | bwd_inner: 3315.73 | bwd_allreduce: 0.77 | step: 6.68 53%|█████▎ | 5318/10000 [8:23:23<7:07:16, 5.48s/it] {'loss': 0.0022, 'grad_norm': 0.3351678252220154, 'learning_rate': 1.891231509421078e-05, 'epoch': 5.32} 53%|█████▎ | 5318/10000 [8:23:23<7:07:16, 5.48s/it][2025-06-19 21:53:07,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:53:07,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.85 | bwd_microstep: 3320.52 | bwd_inner_microstep: 3319.47 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.20 [2025-06-19 21:53:07,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.85 | bwd: 3320.54 | bwd_inner: 3319.47 | bwd_allreduce: 1.02 | step: 7.19 53%|█████▎ | 5319/10000 [8:23:28<7:06:57, 5.47s/it] {'loss': 0.0119, 'grad_norm': 2.0831916332244873, 'learning_rate': 1.890584722692997e-05, 'epoch': 5.32} 53%|█████▎ | 5319/10000 [8:23:28<7:06:57, 5.47s/it][2025-06-19 21:53:13,231] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:53:13,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.87 | bwd_microstep: 3362.36 | bwd_inner_microstep: 3361.55 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 21:53:13,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.87 | bwd: 3362.37 | bwd_inner: 3361.55 | bwd_allreduce: 0.78 | step: 7.21 53%|█████▎ | 5320/10000 [8:23:34<7:08:02, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.033855509012937546, 'learning_rate': 1.8899379474420704e-05, 'epoch': 5.32} 53%|█████▎ | 5320/10000 [8:23:34<7:08:02, 5.49s/it][2025-06-19 21:53:18,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:53:18,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.92 | bwd_microstep: 3320.16 | bwd_inner_microstep: 3319.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 21:53:18,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.92 | bwd: 3320.17 | bwd_inner: 3319.35 | bwd_allreduce: 0.77 | step: 6.80 53%|█████▎ | 5321/10000 [8:23:39<7:07:20, 5.48s/it] {'loss': 0.0127, 'grad_norm': 1.2995737791061401, 'learning_rate': 1.889291183736143e-05, 'epoch': 5.32} 53%|█████▎ | 5321/10000 [8:23:39<7:07:20, 5.48s/it][2025-06-19 21:53:24,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 21:53:24,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.75 | bwd_microstep: 3365.57 | bwd_inner_microstep: 3364.61 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.55 [2025-06-19 21:53:24,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.75 | bwd: 3365.59 | bwd_inner: 3364.61 | bwd_allreduce: 0.92 | step: 7.55 
53%|█████▎ | 5322/10000 [8:23:45<7:08:22, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.016211124137043953, 'learning_rate': 1.8886444316430557e-05, 'epoch': 5.32} 53%|█████▎ | 5322/10000 [8:23:45<7:08:22, 5.49s/it][2025-06-19 21:53:29,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:53:29,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.15 | bwd_microstep: 3368.20 | bwd_inner_microstep: 3367.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.70 [2025-06-19 21:53:29,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.15 | bwd: 3368.21 | bwd_inner: 3367.40 | bwd_allreduce: 0.77 | step: 6.71 53%|█████▎ | 5323/10000 [8:23:50<7:09:10, 5.51s/it] {'loss': 0.0433, 'grad_norm': 6.816671371459961, 'learning_rate': 1.887997691230651e-05, 'epoch': 5.32} 53%|█████▎ | 5323/10000 [8:23:50<7:09:10, 5.51s/it][2025-06-19 21:53:35,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:53:35,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.22 | bwd_microstep: 3358.14 | bwd_inner_microstep: 3357.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 21:53:35,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.22 | bwd: 3358.16 | bwd_inner: 3357.34 | bwd_allreduce: 0.78 | step: 6.96 53%|█████▎ | 5324/10000 [8:23:56<7:09:27, 5.51s/it] {'loss': 0.0009, 'grad_norm': 0.23722508549690247, 'learning_rate': 1.8873509625667686e-05, 'epoch': 5.32} 53%|█████▎ | 5324/10000 [8:23:56<7:09:27, 5.51s/it][2025-06-19 21:53:40,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:53:40,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.98 | bwd_microstep: 3303.83 | 
bwd_inner_microstep: 3302.99 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.26 [2025-06-19 21:53:40,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.98 | bwd: 3303.85 | bwd_inner: 3302.99 | bwd_allreduce: 0.81 | step: 7.26 53%|█████▎ | 5325/10000 [8:24:01<7:07:54, 5.49s/it] {'loss': 0.0199, 'grad_norm': 3.4179649353027344, 'learning_rate': 1.8867042457192473e-05, 'epoch': 5.33} 53%|█████▎ | 5325/10000 [8:24:01<7:07:54, 5.49s/it][2025-06-19 21:53:46,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:53:46,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.32 | bwd_microstep: 3365.40 | bwd_inner_microstep: 3364.57 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.85 [2025-06-19 21:53:46,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.32 | bwd: 3365.42 | bwd_inner: 3364.57 | bwd_allreduce: 0.81 | step: 6.85 53%|█████▎ | 5326/10000 [8:24:07<7:08:53, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.037575315684080124, 'learning_rate': 1.886057540755926e-05, 'epoch': 5.33} 53%|█████▎ | 5326/10000 [8:24:07<7:08:53, 5.51s/it][2025-06-19 21:53:51,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:53:51,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.96 | bwd_microstep: 3310.18 | bwd_inner_microstep: 3309.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 21:53:51,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.96 | bwd: 3310.19 | bwd_inner: 3309.38 | bwd_allreduce: 0.77 | step: 6.58 53%|█████▎ | 5327/10000 [8:24:12<7:07:36, 5.49s/it] {'loss': 0.0425, 'grad_norm': 3.101163625717163, 'learning_rate': 1.8854108477446385e-05, 'epoch': 5.33} 53%|█████▎ | 5327/10000 [8:24:12<7:07:36, 5.49s/it][2025-06-19 21:53:57,239] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:53:57,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.91 | bwd_microstep: 3357.10 | bwd_inner_microstep: 3356.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 21:53:57,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.91 | bwd: 3357.12 | bwd_inner: 3356.32 | bwd_allreduce: 0.76 | step: 6.66 53%|█████▎ | 5328/10000 [8:24:18<7:08:15, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.012139874510467052, 'learning_rate': 1.8847641667532216e-05, 'epoch': 5.33} 53%|█████▎ | 5328/10000 [8:24:18<7:08:15, 5.50s/it][2025-06-19 21:54:02,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:54:02,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.09 | bwd_microstep: 3357.23 | bwd_inner_microstep: 3356.28 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.94 [2025-06-19 21:54:02,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.09 | bwd: 3357.25 | bwd_inner: 3356.28 | bwd_allreduce: 0.92 | step: 7.95 53%|█████▎ | 5329/10000 [8:24:23<7:08:42, 5.51s/it] {'loss': 0.0009, 'grad_norm': 0.1668434590101242, 'learning_rate': 1.8841174978495087e-05, 'epoch': 5.33} 53%|█████▎ | 5329/10000 [8:24:23<7:08:42, 5.51s/it][2025-06-19 21:54:08,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:54:08,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.23 | bwd_microstep: 3372.25 | bwd_inner_microstep: 3371.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 21:54:08,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.23 | bwd: 3372.26 | bwd_inner: 3371.46 | bwd_allreduce: 0.76 | step: 6.66 
53%|█████▎ | 5330/10000 [8:24:29<7:09:24, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.00680069113150239, 'learning_rate': 1.8834708411013325e-05, 'epoch': 5.33} 53%|█████▎ | 5330/10000 [8:24:29<7:09:24, 5.52s/it][h264 @ 0x412bc580] Reference 5 >= 5 [h264 @ 0x412bc580] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x42936100] left block unavailable for requested intra mode [h264 @ 0x42936100] error while decoding MB 0 25, bytestream 45493 [2025-06-19 21:54:13,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:54:13,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.24 | bwd_microstep: 3392.23 | bwd_inner_microstep: 3391.36 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.90 [2025-06-19 21:54:13,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.24 | bwd: 3392.25 | bwd_inner: 3391.36 | bwd_allreduce: 0.85 | step: 6.90 53%|█████▎ | 5331/10000 [8:24:34<7:10:33, 5.53s/it] {'loss': 0.3436, 'grad_norm': 9.271410942077637, 'learning_rate': 1.8828241965765244e-05, 'epoch': 5.33} 53%|█████▎ | 5331/10000 [8:24:34<7:10:33, 5.53s/it][h264 @ 0x42896a40] Reference 5 >= 5 [h264 @ 0x42896a40] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x42896a40] left block unavailable for requested intra mode [h264 @ 0x42896a40] error while decoding MB 0 25, bytestream 45493 [2025-06-19 21:54:19,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:54:19,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.72 | bwd_microstep: 3307.04 | bwd_inner_microstep: 3306.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 21:54:19,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.72 | bwd: 3307.06 | bwd_inner: 3306.24 | bwd_allreduce: 0.77 | step: 7.05 53%|█████▎ | 5332/10000 
[8:24:40<7:08:30, 5.51s/it] {'loss': 0.0105, 'grad_norm': 1.5996633768081665, 'learning_rate': 1.8821775643429142e-05, 'epoch': 5.33} 53%|█████▎ | 5332/10000 [8:24:40<7:08:30, 5.51s/it][2025-06-19 21:54:24,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:54:24,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.25 | bwd_microstep: 3360.11 | bwd_inner_microstep: 3359.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 21:54:24,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.25 | bwd: 3360.13 | bwd_inner: 3359.33 | bwd_allreduce: 0.75 | step: 6.64 53%|█████▎ | 5333/10000 [8:24:45<7:08:44, 5.51s/it] {'loss': 0.0244, 'grad_norm': 2.6259377002716064, 'learning_rate': 1.8815309444683306e-05, 'epoch': 5.33} 53%|█████▎ | 5333/10000 [8:24:45<7:08:44, 5.51s/it][2025-06-19 21:54:30,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:54:30,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.49 | bwd_microstep: 3367.94 | bwd_inner_microstep: 3367.06 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.96 [2025-06-19 21:54:30,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.49 | bwd: 3367.96 | bwd_inner: 3367.06 | bwd_allreduce: 0.84 | step: 6.97 53%|█████▎ | 5334/10000 [8:24:51<7:09:08, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.04966103285551071, 'learning_rate': 1.880884337020601e-05, 'epoch': 5.33} 53%|█████▎ | 5334/10000 [8:24:51<7:09:08, 5.52s/it][2025-06-19 21:54:35,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:54:35,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.45 | bwd_microstep: 3305.79 | bwd_inner_microstep: 3304.89 | 
bwd_allreduce_microstep: 0.85 | step_microstep: 6.94 [2025-06-19 21:54:35,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.45 | bwd: 3305.80 | bwd_inner: 3304.89 | bwd_allreduce: 0.87 | step: 6.95 53%|█████▎ | 5335/10000 [8:24:56<7:07:21, 5.50s/it] {'loss': 0.0276, 'grad_norm': 3.2333192825317383, 'learning_rate': 1.8802377420675513e-05, 'epoch': 5.33} 53%|█████▎ | 5335/10000 [8:24:56<7:07:21, 5.50s/it][2025-06-19 21:54:41,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:54:41,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.42 | bwd_microstep: 3360.01 | bwd_inner_microstep: 3359.11 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.06 [2025-06-19 21:54:41,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.42 | bwd: 3360.03 | bwd_inner: 3359.11 | bwd_allreduce: 0.88 | step: 7.06 53%|█████▎ | 5336/10000 [8:25:02<7:07:52, 5.50s/it] {'loss': 0.0247, 'grad_norm': 2.0999274253845215, 'learning_rate': 1.879591159677008e-05, 'epoch': 5.34} 53%|█████▎ | 5336/10000 [8:25:02<7:07:52, 5.50s/it][2025-06-19 21:54:46,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 21:54:46,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2094.51 | bwd_microstep: 3306.35 | bwd_inner_microstep: 3305.28 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.47 [2025-06-19 21:54:46,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2094.51 | bwd: 3306.37 | bwd_inner: 3305.28 | bwd_allreduce: 1.03 | step: 7.48 53%|█████▎ | 5337/10000 [8:25:07<7:06:19, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.03870705887675285, 'learning_rate': 1.8789445899167923e-05, 'epoch': 5.34} 53%|█████▎ | 5337/10000 [8:25:07<7:06:19, 5.49s/it][2025-06-19 21:54:52,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:54:52,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.03 | bwd_microstep: 3326.14 | bwd_inner_microstep: 3325.32 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.38 [2025-06-19 21:54:52,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.03 | bwd: 3326.15 | bwd_inner: 3325.32 | bwd_allreduce: 0.78 | step: 7.38 53%|█████▎ | 5338/10000 [8:25:13<7:05:53, 5.48s/it] {'loss': 0.2246, 'grad_norm': 3.412097454071045, 'learning_rate': 1.878298032854727e-05, 'epoch': 5.34} 53%|█████▎ | 5338/10000 [8:25:13<7:05:53, 5.48s/it][2025-06-19 21:54:57,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 21:54:57,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.57 | bwd_microstep: 3362.71 | bwd_inner_microstep: 3361.62 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.41 [2025-06-19 21:54:57,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.57 | bwd: 3362.73 | bwd_inner: 3361.62 | bwd_allreduce: 1.05 | step: 7.41 53%|█████▎ | 5339/10000 [8:25:18<7:06:57, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.158768892288208, 'learning_rate': 1.8776514885586333e-05, 'epoch': 5.34} 53%|█████▎ | 5339/10000 [8:25:18<7:06:57, 5.50s/it][2025-06-19 21:55:03,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:55:03,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.28 | bwd_microstep: 3366.93 | bwd_inner_microstep: 3366.11 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.30 [2025-06-19 21:55:03,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.28 | bwd: 3366.94 | bwd_inner: 3366.11 | bwd_allreduce: 0.79 | step: 7.30 53%|█████▎ | 5340/10000 [8:25:24<7:07:38, 5.51s/it] 
{'loss': 0.0328, 'grad_norm': 2.417304754257202, 'learning_rate': 1.8770049570963306e-05, 'epoch': 5.34} 53%|█████▎ | 5340/10000 [8:25:24<7:07:38, 5.51s/it][2025-06-19 21:55:08,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:55:08,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.00 | bwd_microstep: 3367.26 | bwd_inner_microstep: 3366.45 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 21:55:08,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.00 | bwd: 3367.28 | bwd_inner: 3366.45 | bwd_allreduce: 0.78 | step: 7.23 53%|█████▎ | 5341/10000 [8:25:29<7:08:17, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.027598239481449127, 'learning_rate': 1.8763584385356376e-05, 'epoch': 5.34} 53%|█████▎ | 5341/10000 [8:25:29<7:08:17, 5.52s/it][2025-06-19 21:55:14,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:55:14,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.12 | bwd_microstep: 3358.06 | bwd_inner_microstep: 3357.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 21:55:14,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.12 | bwd: 3358.08 | bwd_inner: 3357.27 | bwd_allreduce: 0.76 | step: 6.62 53%|█████▎ | 5342/10000 [8:25:35<7:08:12, 5.52s/it] {'loss': 0.0178, 'grad_norm': 1.9267362356185913, 'learning_rate': 1.87571193294437e-05, 'epoch': 5.34} 53%|█████▎ | 5342/10000 [8:25:35<7:08:12, 5.52s/it][2025-06-19 21:55:19,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 21:55:19,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.29 | bwd_microstep: 3314.41 | bwd_inner_microstep: 3313.33 | bwd_allreduce_microstep: 1.01 | 
step_microstep: 7.72 [2025-06-19 21:55:19,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.29 | bwd: 3314.44 | bwd_inner: 3313.33 | bwd_allreduce: 1.04 | step: 7.73 53%|█████▎ | 5343/10000 [8:25:40<7:06:41, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.009626636281609535, 'learning_rate': 1.875065440390344e-05, 'epoch': 5.34} 53%|█████▎ | 5343/10000 [8:25:40<7:06:41, 5.50s/it][2025-06-19 21:55:25,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:55:25,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.13 | bwd_microstep: 3359.99 | bwd_inner_microstep: 3359.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 21:55:25,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.13 | bwd: 3360.01 | bwd_inner: 3359.20 | bwd_allreduce: 0.76 | step: 6.78 53%|█████▎ | 5344/10000 [8:25:46<7:07:14, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.006078833248466253, 'learning_rate': 1.8744189609413733e-05, 'epoch': 5.34} 53%|█████▎ | 5344/10000 [8:25:46<7:07:14, 5.51s/it][2025-06-19 21:55:30,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.76 [2025-06-19 21:55:30,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.29 | bwd_microstep: 3308.97 | bwd_inner_microstep: 3308.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 8.04 [2025-06-19 21:55:30,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.29 | bwd: 3308.99 | bwd_inner: 3308.16 | bwd_allreduce: 0.78 | step: 8.04 53%|█████▎ | 5345/10000 [8:25:51<7:05:45, 5.49s/it] {'loss': 0.0264, 'grad_norm': 1.9177578687667847, 'learning_rate': 1.873772494665271e-05, 'epoch': 5.34} 53%|█████▎ | 5345/10000 [8:25:51<7:05:45, 5.49s/it][2025-06-19 21:55:36,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:55:36,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.58 | bwd_microstep: 3314.78 | bwd_inner_microstep: 3313.95 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.80 [2025-06-19 21:55:36,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.58 | bwd: 3314.79 | bwd_inner: 3313.95 | bwd_allreduce: 0.79 | step: 6.80 53%|█████▎ | 5346/10000 [8:25:57<7:04:51, 5.48s/it] {'loss': 0.0058, 'grad_norm': 0.40728074312210083, 'learning_rate': 1.8731260416298488e-05, 'epoch': 5.35} 53%|█████▎ | 5346/10000 [8:25:57<7:04:51, 5.48s/it][2025-06-19 21:55:41,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:55:41,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.93 | bwd_microstep: 3308.06 | bwd_inner_microstep: 3307.24 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.32 [2025-06-19 21:55:41,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.93 | bwd: 3308.07 | bwd_inner: 3307.24 | bwd_allreduce: 0.78 | step: 7.32 53%|█████▎ | 5347/10000 [8:26:02<7:04:06, 5.47s/it] {'loss': 0.0006, 'grad_norm': 0.10448235273361206, 'learning_rate': 1.872479601902915e-05, 'epoch': 5.35} 53%|█████▎ | 5347/10000 [8:26:02<7:04:06, 5.47s/it][2025-06-19 21:55:47,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:55:47,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.46 | bwd_microstep: 3316.09 | bwd_inner_microstep: 3314.98 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.69 [2025-06-19 21:55:47,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.46 | bwd: 3316.12 | bwd_inner: 3314.98 | bwd_allreduce: 1.07 | step: 7.69 53%|█████▎ | 5348/10000 [8:26:07<7:04:01, 5.47s/it] {'loss': 0.0095, 'grad_norm': 
0.8834191560745239, 'learning_rate': 1.8718331755522792e-05, 'epoch': 5.35} 53%|█████▎ | 5348/10000 [8:26:07<7:04:01, 5.47s/it][2025-06-19 21:55:52,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:55:52,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.43 | bwd_microstep: 3315.55 | bwd_inner_microstep: 3314.74 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-19 21:55:52,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.43 | bwd: 3315.57 | bwd_inner: 3314.74 | bwd_allreduce: 0.79 | step: 7.31 53%|█████▎ | 5349/10000 [8:26:13<7:03:59, 5.47s/it] {'loss': 0.0033, 'grad_norm': 0.36508703231811523, 'learning_rate': 1.8711867626457486e-05, 'epoch': 5.35} 53%|█████▎ | 5349/10000 [8:26:13<7:03:59, 5.47s/it][2025-06-19 21:55:58,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:55:58,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.87 | bwd_microstep: 3371.70 | bwd_inner_microstep: 3370.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.50 [2025-06-19 21:55:58,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.87 | bwd: 3371.71 | bwd_inner: 3370.89 | bwd_allreduce: 0.77 | step: 7.50 54%|█████▎ | 5350/10000 [8:26:18<7:05:25, 5.49s/it] {'loss': 0.0042, 'grad_norm': 0.9647418260574341, 'learning_rate': 1.8705403632511286e-05, 'epoch': 5.35} 54%|█████▎ | 5350/10000 [8:26:18<7:05:25, 5.49s/it][2025-06-19 21:56:03,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:56:03,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.11 | bwd_microstep: 3365.55 | bwd_inner_microstep: 3364.74 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 
21:56:03,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.11 | bwd: 3365.57 | bwd_inner: 3364.74 | bwd_allreduce: 0.78 | step: 7.08 54%|█████▎ | 5351/10000 [8:26:24<7:06:09, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.08477933704853058, 'learning_rate': 1.8698939774362245e-05, 'epoch': 5.35} 54%|█████▎ | 5351/10000 [8:26:24<7:06:09, 5.50s/it][2025-06-19 21:56:09,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:56:09,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.37 | bwd_microstep: 3314.83 | bwd_inner_microstep: 3314.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-19 21:56:09,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.37 | bwd: 3314.84 | bwd_inner: 3314.03 | bwd_allreduce: 0.77 | step: 6.97 54%|█████▎ | 5352/10000 [8:26:29<7:05:01, 5.49s/it] {'loss': 0.0012, 'grad_norm': 0.10281488299369812, 'learning_rate': 1.8692476052688372e-05, 'epoch': 5.35} 54%|█████▎ | 5352/10000 [8:26:29<7:05:01, 5.49s/it][2025-06-19 21:56:14,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:56:14,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.85 | bwd_microstep: 3320.63 | bwd_inner_microstep: 3319.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.69 [2025-06-19 21:56:14,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.85 | bwd: 3320.65 | bwd_inner: 3319.83 | bwd_allreduce: 0.77 | step: 6.69 54%|█████▎ | 5353/10000 [8:26:35<7:04:22, 5.48s/it] {'loss': 0.0072, 'grad_norm': 1.110512137413025, 'learning_rate': 1.8686012468167697e-05, 'epoch': 5.35} 54%|█████▎ | 5353/10000 [8:26:35<7:04:22, 5.48s/it][2025-06-19 21:56:20,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-19 21:56:20,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.07 | bwd_microstep: 3337.95 | bwd_inner_microstep: 3337.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 21:56:20,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.07 | bwd: 3337.96 | bwd_inner: 3337.14 | bwd_allreduce: 0.78 | step: 6.95 54%|█████▎ | 5354/10000 [8:26:40<7:04:19, 5.48s/it] {'loss': 0.0085, 'grad_norm': 0.56552654504776, 'learning_rate': 1.8679549021478215e-05, 'epoch': 5.35} 54%|█████▎ | 5354/10000 [8:26:40<7:04:19, 5.48s/it][2025-06-19 21:56:25,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:56:25,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.66 | bwd_microstep: 3320.98 | bwd_inner_microstep: 3320.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-19 21:56:25,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.66 | bwd: 3321.00 | bwd_inner: 3320.18 | bwd_allreduce: 0.77 | step: 7.00 54%|█████▎ | 5355/10000 [8:26:46<7:03:45, 5.47s/it] {'loss': 0.0036, 'grad_norm': 0.3754465579986572, 'learning_rate': 1.8673085713297915e-05, 'epoch': 5.36} 54%|█████▎ | 5355/10000 [8:26:46<7:03:45, 5.47s/it][2025-06-19 21:56:31,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:56:31,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.05 | bwd_microstep: 3367.28 | bwd_inner_microstep: 3366.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 21:56:31,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.05 | bwd: 3367.29 | bwd_inner: 3366.50 | bwd_allreduce: 0.75 | step: 6.67 54%|█████▎ | 5356/10000 [8:26:51<7:05:03, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.00653995992615819, 
'learning_rate': 1.866662254430477e-05, 'epoch': 5.36} 54%|█████▎ | 5356/10000 [8:26:51<7:05:03, 5.49s/it][2025-06-19 21:56:36,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 21:56:36,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.66 | bwd_microstep: 3320.00 | bwd_inner_microstep: 3319.02 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.63 [2025-06-19 21:56:36,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.66 | bwd: 3320.02 | bwd_inner: 3319.02 | bwd_allreduce: 0.95 | step: 7.63 54%|█████▎ | 5357/10000 [8:26:57<7:04:15, 5.48s/it] {'loss': 0.0011, 'grad_norm': 0.10531152039766312, 'learning_rate': 1.866015951517672e-05, 'epoch': 5.36} 54%|█████▎ | 5357/10000 [8:26:57<7:04:15, 5.48s/it][2025-06-19 21:56:42,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:56:42,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.02 | bwd_microstep: 3321.31 | bwd_inner_microstep: 3320.49 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.19 [2025-06-19 21:56:42,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.02 | bwd: 3321.33 | bwd_inner: 3320.49 | bwd_allreduce: 0.79 | step: 7.19 54%|█████▎ | 5358/10000 [8:27:02<7:03:52, 5.48s/it] {'loss': 0.0129, 'grad_norm': 1.5246057510375977, 'learning_rate': 1.865369662659172e-05, 'epoch': 5.36} 54%|█████▎ | 5358/10000 [8:27:02<7:03:52, 5.48s/it][2025-06-19 21:56:47,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:56:47,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.97 | bwd_microstep: 3401.27 | bwd_inner_microstep: 3400.38 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.96 [2025-06-19 21:56:47,614] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.97 | bwd: 3401.29 | bwd_inner: 3400.38 | bwd_allreduce: 0.86 | step: 6.96 54%|█████▎ | 5359/10000 [8:27:08<7:06:19, 5.51s/it] {'loss': 0.0245, 'grad_norm': 3.244792938232422, 'learning_rate': 1.86472338792277e-05, 'epoch': 5.36} 54%|█████▎ | 5359/10000 [8:27:08<7:06:19, 5.51s/it][2025-06-19 21:56:53,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:56:53,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.79 | bwd_microstep: 3324.66 | bwd_inner_microstep: 3323.71 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.24 [2025-06-19 21:56:53,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.79 | bwd: 3324.67 | bwd_inner: 3323.71 | bwd_allreduce: 0.92 | step: 7.25 54%|█████▎ | 5360/10000 [8:27:13<7:05:18, 5.50s/it] {'loss': 0.0102, 'grad_norm': 0.8351026773452759, 'learning_rate': 1.8640771273762567e-05, 'epoch': 5.36} 54%|█████▎ | 5360/10000 [8:27:13<7:05:18, 5.50s/it][2025-06-19 21:56:58,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:56:58,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.75 | bwd_microstep: 3371.97 | bwd_inner_microstep: 3371.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 21:56:58,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.75 | bwd: 3371.98 | bwd_inner: 3371.18 | bwd_allreduce: 0.76 | step: 6.62 54%|█████▎ | 5361/10000 [8:27:19<7:06:23, 5.51s/it] {'loss': 0.0017, 'grad_norm': 0.23068684339523315, 'learning_rate': 1.8634308810874226e-05, 'epoch': 5.36} 54%|█████▎ | 5361/10000 [8:27:19<7:06:23, 5.51s/it][2025-06-19 21:57:04,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 
21:57:04,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.51 | bwd_microstep: 3323.21 | bwd_inner_microstep: 3322.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.91 [2025-06-19 21:57:04,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.51 | bwd: 3323.22 | bwd_inner: 3322.38 | bwd_allreduce: 0.79 | step: 6.92 54%|█████▎ | 5362/10000 [8:27:24<7:05:16, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.06554821878671646, 'learning_rate': 1.8627846491240543e-05, 'epoch': 5.36} 54%|█████▎ | 5362/10000 [8:27:24<7:05:16, 5.50s/it][2025-06-19 21:57:09,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:57:09,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.08 | bwd_microstep: 3376.26 | bwd_inner_microstep: 3375.30 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.64 [2025-06-19 21:57:09,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.08 | bwd: 3376.27 | bwd_inner: 3375.30 | bwd_allreduce: 0.93 | step: 7.64 54%|█████▎ | 5363/10000 [8:27:30<7:06:24, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.04877254366874695, 'learning_rate': 1.8621384315539397e-05, 'epoch': 5.36} 54%|█████▎ | 5363/10000 [8:27:30<7:06:24, 5.52s/it][2025-06-19 21:57:15,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 21:57:15,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.18 | bwd_microstep: 3322.88 | bwd_inner_microstep: 3321.98 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.89 [2025-06-19 21:57:15,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.18 | bwd: 3322.89 | bwd_inner: 3321.98 | bwd_allreduce: 0.87 | step: 6.89 54%|█████▎ | 5364/10000 [8:27:35<7:05:12, 5.50s/it] {'loss': 0.0131, 'grad_norm': 1.6920585632324219, 'learning_rate': 1.8614922284448636e-05, 
'epoch': 5.36} 54%|█████▎ | 5364/10000 [8:27:35<7:05:12, 5.50s/it][2025-06-19 21:57:20,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:57:20,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.88 | bwd_microstep: 3382.31 | bwd_inner_microstep: 3381.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 21:57:20,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.88 | bwd: 3382.33 | bwd_inner: 3381.51 | bwd_allreduce: 0.78 | step: 6.81 54%|█████▎ | 5365/10000 [8:27:41<7:06:26, 5.52s/it] {'loss': 0.0004, 'grad_norm': 0.05053344741463661, 'learning_rate': 1.86084603986461e-05, 'epoch': 5.37} 54%|█████▎ | 5365/10000 [8:27:41<7:06:26, 5.52s/it][2025-06-19 21:57:26,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:57:26,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.25 | bwd_microstep: 3375.57 | bwd_inner_microstep: 3374.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-19 21:57:26,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.25 | bwd: 3375.58 | bwd_inner: 3374.77 | bwd_allreduce: 0.77 | step: 6.86 54%|█████▎ | 5366/10000 [8:27:47<7:07:10, 5.53s/it] {'loss': 0.0016, 'grad_norm': 0.1900026798248291, 'learning_rate': 1.860199865880961e-05, 'epoch': 5.37} 54%|█████▎ | 5366/10000 [8:27:47<7:07:10, 5.53s/it][2025-06-19 21:57:31,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:57:31,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.66 | bwd_microstep: 3329.75 | bwd_inner_microstep: 3328.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 21:57:31,734] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2115.66 | bwd: 3329.77 | bwd_inner: 3328.95 | bwd_allreduce: 0.78 | step: 7.13 54%|█████▎ | 5367/10000 [8:27:52<7:06:04, 5.52s/it] {'loss': 0.0841, 'grad_norm': 3.206340789794922, 'learning_rate': 1.859553706561698e-05, 'epoch': 5.37} 54%|█████▎ | 5367/10000 [8:27:52<7:06:04, 5.52s/it][2025-06-19 21:57:37,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:57:37,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.87 | bwd_microstep: 3323.27 | bwd_inner_microstep: 3322.38 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.23 [2025-06-19 21:57:37,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.87 | bwd: 3323.29 | bwd_inner: 3322.38 | bwd_allreduce: 0.86 | step: 7.24 54%|█████▎ | 5368/10000 [8:27:57<7:04:48, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.022435307502746582, 'learning_rate': 1.858907561974598e-05, 'epoch': 5.37} 54%|█████▎ | 5368/10000 [8:27:58<7:04:48, 5.50s/it][2025-06-19 21:57:42,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 21:57:42,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.06 | bwd_microstep: 3327.01 | bwd_inner_microstep: 3325.84 | bwd_allreduce_microstep: 1.11 | step_microstep: 8.09 [2025-06-19 21:57:42,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.06 | bwd: 3327.04 | bwd_inner: 3325.84 | bwd_allreduce: 1.14 | step: 8.09 54%|█████▎ | 5369/10000 [8:28:03<7:04:21, 5.50s/it] {'loss': 0.0648, 'grad_norm': 2.989288091659546, 'learning_rate': 1.858261432187441e-05, 'epoch': 5.37} 54%|█████▎ | 5369/10000 [8:28:03<7:04:21, 5.50s/it][2025-06-19 21:57:48,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:57:48,175] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.66 | bwd_microstep: 3326.08 | bwd_inner_microstep: 3325.14 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.99 [2025-06-19 21:57:48,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.66 | bwd: 3326.10 | bwd_inner: 3325.14 | bwd_allreduce: 0.92 | step: 6.99 54%|█████▎ | 5370/10000 [8:28:08<7:03:59, 5.49s/it] {'loss': 0.0018, 'grad_norm': 0.22415997087955475, 'learning_rate': 1.857615317268001e-05, 'epoch': 5.37} 54%|█████▎ | 5370/10000 [8:28:08<7:03:59, 5.49s/it][2025-06-19 21:57:53,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 21:57:53,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.90 | bwd_microstep: 3334.13 | bwd_inner_microstep: 3333.11 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.18 [2025-06-19 21:57:53,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.90 | bwd: 3334.14 | bwd_inner: 3333.11 | bwd_allreduce: 0.98 | step: 7.19 54%|█████▎ | 5371/10000 [8:28:14<7:03:34, 5.49s/it] {'loss': 0.1831, 'grad_norm': 11.871440887451172, 'learning_rate': 1.856969217284054e-05, 'epoch': 5.37} 54%|█████▎ | 5371/10000 [8:28:14<7:03:34, 5.49s/it][2025-06-19 21:57:59,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:57:59,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.66 | bwd_microstep: 3330.82 | bwd_inner_microstep: 3330.02 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-19 21:57:59,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.67 | bwd: 3330.83 | bwd_inner: 3330.02 | bwd_allreduce: 0.77 | step: 6.98 54%|█████▎ | 5372/10000 [8:28:19<7:03:14, 5.49s/it] {'loss': 0.0062, 'grad_norm': 0.8179737329483032, 'learning_rate': 1.856323132303373e-05, 'epoch': 5.37} 54%|█████▎ 
| 5372/10000 [8:28:19<7:03:14, 5.49s/it][2025-06-19 21:58:04,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:58:04,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.73 | bwd_microstep: 3328.00 | bwd_inner_microstep: 3327.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 21:58:04,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.73 | bwd: 3328.01 | bwd_inner: 3327.20 | bwd_allreduce: 0.76 | step: 6.72 54%|█████▎ | 5373/10000 [8:28:25<7:02:51, 5.48s/it] {'loss': 0.0043, 'grad_norm': 0.7081115245819092, 'learning_rate': 1.8556770623937276e-05, 'epoch': 5.37} 54%|█████▎ | 5373/10000 [8:28:25<7:02:51, 5.48s/it][2025-06-19 21:58:10,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:58:10,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.18 | bwd_microstep: 3321.08 | bwd_inner_microstep: 3320.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 21:58:10,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.18 | bwd: 3321.10 | bwd_inner: 3320.28 | bwd_allreduce: 0.77 | step: 7.14 54%|█████▎ | 5374/10000 [8:28:30<7:02:29, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.020253853872418404, 'learning_rate': 1.8550310076228884e-05, 'epoch': 5.37} 54%|█████▎ | 5374/10000 [8:28:30<7:02:29, 5.48s/it][2025-06-19 21:58:15,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:58:15,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.60 | bwd_microstep: 3320.21 | bwd_inner_microstep: 3319.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 21:58:15,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.60 | 
bwd: 3320.22 | bwd_inner: 3319.41 | bwd_allreduce: 0.76 | step: 6.67 54%|█████▍ | 5375/10000 [8:28:36<7:02:04, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.01960826851427555, 'learning_rate': 1.854384968058624e-05, 'epoch': 5.38} 54%|█████▍ | 5375/10000 [8:28:36<7:02:04, 5.48s/it][2025-06-19 21:58:21,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-19 21:58:21,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.41 | bwd_microstep: 3378.50 | bwd_inner_microstep: 3377.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 21:58:21,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.41 | bwd: 3378.52 | bwd_inner: 3377.70 | bwd_allreduce: 0.77 | step: 6.78 54%|█████▍ | 5376/10000 [8:28:41<7:03:47, 5.50s/it] {'loss': 0.0177, 'grad_norm': 1.9569600820541382, 'learning_rate': 1.8537389437687006e-05, 'epoch': 5.38} 54%|█████▍ | 5376/10000 [8:28:41<7:03:47, 5.50s/it][2025-06-19 21:58:26,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:58:26,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.89 | bwd_microstep: 3331.07 | bwd_inner_microstep: 3330.25 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.25 [2025-06-19 21:58:26,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.89 | bwd: 3331.08 | bwd_inner: 3330.25 | bwd_allreduce: 0.78 | step: 7.26 54%|█████▍ | 5377/10000 [8:28:47<7:03:32, 5.50s/it] {'loss': 0.0006, 'grad_norm': 0.11825504153966904, 'learning_rate': 1.8530929348208836e-05, 'epoch': 5.38} 54%|█████▍ | 5377/10000 [8:28:47<7:03:32, 5.50s/it][2025-06-19 21:58:32,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:58:32,146] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2136.97 | bwd_microstep: 3377.81 | bwd_inner_microstep: 3377.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-19 21:58:32,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.97 | bwd: 3377.82 | bwd_inner: 3377.00 | bwd_allreduce: 0.78 | step: 6.85 54%|█████▍ | 5378/10000 [8:28:52<7:04:46, 5.51s/it] {'loss': 0.0248, 'grad_norm': 3.774052143096924, 'learning_rate': 1.8524469412829354e-05, 'epoch': 5.38} 54%|█████▍ | 5378/10000 [8:28:52<7:04:46, 5.51s/it][2025-06-19 21:58:37,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 21:58:37,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.41 | bwd_microstep: 3331.80 | bwd_inner_microstep: 3330.89 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.87 [2025-06-19 21:58:37,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.41 | bwd: 3331.83 | bwd_inner: 3330.89 | bwd_allreduce: 0.86 | step: 7.86 54%|█████▍ | 5379/10000 [8:28:58<7:04:15, 5.51s/it] {'loss': 0.0008, 'grad_norm': 0.06446714699268341, 'learning_rate': 1.8518009632226184e-05, 'epoch': 5.38} 54%|█████▍ | 5379/10000 [8:28:58<7:04:15, 5.51s/it][2025-06-19 21:58:43,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:58:43,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.65 | bwd_microstep: 3324.72 | bwd_inner_microstep: 3323.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 21:58:43,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.65 | bwd: 3324.73 | bwd_inner: 3323.93 | bwd_allreduce: 0.76 | step: 6.82 54%|█████▍ | 5380/10000 [8:29:03<7:03:22, 5.50s/it] {'loss': 0.0028, 'grad_norm': 0.31972283124923706, 'learning_rate': 1.8511550007076924e-05, 'epoch': 5.38} 54%|█████▍ | 5380/10000 [8:29:03<7:03:22, 
5.50s/it][2025-06-19 21:58:48,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:58:48,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.93 | bwd_microstep: 3374.43 | bwd_inner_microstep: 3373.58 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.84 [2025-06-19 21:58:48,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.93 | bwd: 3374.44 | bwd_inner: 3373.58 | bwd_allreduce: 0.82 | step: 6.84 54%|█████▍ | 5381/10000 [8:29:09<7:04:27, 5.51s/it] {'loss': 0.0014, 'grad_norm': 0.1557728797197342, 'learning_rate': 1.850509053805916e-05, 'epoch': 5.38} 54%|█████▍ | 5381/10000 [8:29:09<7:04:27, 5.51s/it][2025-06-19 21:58:54,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 21:58:54,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.83 | bwd_microstep: 3369.23 | bwd_inner_microstep: 3368.07 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.64 [2025-06-19 21:58:54,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.83 | bwd: 3369.26 | bwd_inner: 3368.07 | bwd_allreduce: 1.12 | step: 7.64 54%|█████▍ | 5382/10000 [8:29:15<7:05:09, 5.52s/it] {'loss': 0.0022, 'grad_norm': 0.4724171459674835, 'learning_rate': 1.849863122585047e-05, 'epoch': 5.38} 54%|█████▍ | 5382/10000 [8:29:15<7:05:09, 5.52s/it][2025-06-19 21:58:59,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:58:59,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.73 | bwd_microstep: 3323.75 | bwd_inner_microstep: 3322.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 21:58:59,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.73 | bwd: 3323.76 | bwd_inner: 3322.95 | 
bwd_allreduce: 0.78 | step: 6.96 54%|█████▍ | 5383/10000 [8:29:20<7:04:07, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.039238445460796356, 'learning_rate': 1.8492172071128388e-05, 'epoch': 5.38} 54%|█████▍ | 5383/10000 [8:29:20<7:04:07, 5.51s/it][2025-06-19 21:59:05,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 21:59:05,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.04 | bwd_microstep: 3369.03 | bwd_inner_microstep: 3368.10 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.03 [2025-06-19 21:59:05,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.03 | bwd: 3369.04 | bwd_inner: 3368.10 | bwd_allreduce: 0.90 | step: 7.03 54%|█████▍ | 5384/10000 [8:29:26<7:04:50, 5.52s/it] {'loss': 0.0646, 'grad_norm': 6.142223834991455, 'learning_rate': 1.8485713074570453e-05, 'epoch': 5.38} 54%|█████▍ | 5384/10000 [8:29:26<7:04:50, 5.52s/it][2025-06-19 21:59:10,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 21:59:10,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.19 | bwd_microstep: 3375.88 | bwd_inner_microstep: 3375.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 21:59:10,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.19 | bwd: 3375.90 | bwd_inner: 3375.09 | bwd_allreduce: 0.76 | step: 6.68 54%|█████▍ | 5385/10000 [8:29:31<7:05:27, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.0126488097012043, 'learning_rate': 1.8479254236854195e-05, 'epoch': 5.38} 54%|█████▍ | 5385/10000 [8:29:31<7:05:27, 5.53s/it][2025-06-19 21:59:16,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 21:59:16,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.17 | 
bwd_microstep: 3333.45 | bwd_inner_microstep: 3332.62 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.38 [2025-06-19 21:59:16,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.17 | bwd: 3333.46 | bwd_inner: 3332.62 | bwd_allreduce: 0.79 | step: 7.38 54%|█████▍ | 5386/10000 [8:29:37<7:04:17, 5.52s/it] {'loss': 0.0179, 'grad_norm': 2.916093349456787, 'learning_rate': 1.8472795558657108e-05, 'epoch': 5.39} 54%|█████▍ | 5386/10000 [8:29:37<7:04:17, 5.52s/it][2025-06-19 21:59:21,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:59:21,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.33 | bwd_microstep: 3321.33 | bwd_inner_microstep: 3320.51 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.77 [2025-06-19 21:59:21,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.33 | bwd: 3321.34 | bwd_inner: 3320.51 | bwd_allreduce: 0.79 | step: 6.78 54%|█████▍ | 5387/10000 [8:29:42<7:03:02, 5.50s/it] {'loss': 0.0172, 'grad_norm': 1.810326337814331, 'learning_rate': 1.846633704065668e-05, 'epoch': 5.39} 54%|█████▍ | 5387/10000 [8:29:42<7:03:02, 5.50s/it][2025-06-19 21:59:27,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 21:59:27,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.90 | bwd_microstep: 3377.59 | bwd_inner_microstep: 3376.61 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.65 [2025-06-19 21:59:27,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.90 | bwd: 3377.61 | bwd_inner: 3376.61 | bwd_allreduce: 0.96 | step: 7.65 54%|█████▍ | 5388/10000 [8:29:48<7:03:56, 5.52s/it] {'loss': 0.0006, 'grad_norm': 0.05163548141717911, 'learning_rate': 1.8459878683530376e-05, 'epoch': 5.39} 54%|█████▍ | 5388/10000 [8:29:48<7:03:56, 5.52s/it][2025-06-19 21:59:32,775] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 21:59:32,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.29 | bwd_microstep: 3325.12 | bwd_inner_microstep: 3324.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 21:59:32,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.29 | bwd: 3325.13 | bwd_inner: 3324.33 | bwd_allreduce: 0.76 | step: 6.74 54%|█████▍ | 5389/10000 [8:29:53<7:03:03, 5.50s/it] {'loss': 0.0204, 'grad_norm': 1.7468469142913818, 'learning_rate': 1.845342048795565e-05, 'epoch': 5.39} 54%|█████▍ | 5389/10000 [8:29:53<7:03:03, 5.50s/it][2025-06-19 21:59:38,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 21:59:38,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.87 | bwd_microstep: 3374.69 | bwd_inner_microstep: 3373.82 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.86 [2025-06-19 21:59:38,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.88 | bwd: 3374.71 | bwd_inner: 3373.82 | bwd_allreduce: 0.84 | step: 6.86 54%|█████▍ | 5390/10000 [8:29:59<7:04:00, 5.52s/it] {'loss': 0.0053, 'grad_norm': 0.564896285533905, 'learning_rate': 1.844696245460994e-05, 'epoch': 5.39} 54%|█████▍ | 5390/10000 [8:29:59<7:04:00, 5.52s/it][2025-06-19 21:59:43,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 21:59:43,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.81 | bwd_microstep: 3329.82 | bwd_inner_microstep: 3329.01 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.16 [2025-06-19 21:59:43,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.81 | bwd: 3329.84 | bwd_inner: 3329.01 | bwd_allreduce: 0.78 | step: 7.17 
54%|█████▍ | 5391/10000 [8:30:04<7:03:19, 5.51s/it] {'loss': 0.0132, 'grad_norm': 1.6224637031555176, 'learning_rate': 1.8440504584170656e-05, 'epoch': 5.39} 54%|█████▍ | 5391/10000 [8:30:04<7:03:19, 5.51s/it][2025-06-19 21:59:49,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 21:59:49,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.82 | bwd_microstep: 3321.59 | bwd_inner_microstep: 3320.74 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.07 [2025-06-19 21:59:49,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.82 | bwd: 3321.62 | bwd_inner: 3320.74 | bwd_allreduce: 0.81 | step: 7.07 54%|█████▍ | 5392/10000 [8:30:10<7:02:23, 5.50s/it] {'loss': 0.0006, 'grad_norm': 0.06850902736186981, 'learning_rate': 1.843404687731521e-05, 'epoch': 5.39} 54%|█████▍ | 5392/10000 [8:30:10<7:02:23, 5.50s/it][2025-06-19 21:59:54,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 21:59:54,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.09 | bwd_microstep: 3373.51 | bwd_inner_microstep: 3372.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 21:59:54,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.09 | bwd: 3373.53 | bwd_inner: 3372.72 | bwd_allreduce: 0.76 | step: 6.65 54%|█████▍ | 5393/10000 [8:30:15<7:03:24, 5.51s/it] {'loss': 0.0033, 'grad_norm': 0.6300173401832581, 'learning_rate': 1.8427589334720976e-05, 'epoch': 5.39} 54%|█████▍ | 5393/10000 [8:30:15<7:03:24, 5.51s/it][2025-06-19 22:00:00,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:00:00,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.07 | bwd_microstep: 3322.74 | 
bwd_inner_microstep: 3321.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 22:00:00,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.07 | bwd: 3322.76 | bwd_inner: 3321.95 | bwd_allreduce: 0.76 | step: 6.68 54%|█████▍ | 5394/10000 [8:30:21<7:02:21, 5.50s/it] {'loss': 0.0127, 'grad_norm': 2.237119674682617, 'learning_rate': 1.842113195706532e-05, 'epoch': 5.39} 54%|█████▍ | 5394/10000 [8:30:21<7:02:21, 5.50s/it][2025-06-19 22:00:05,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:00:05,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.29 | bwd_microstep: 3364.25 | bwd_inner_microstep: 3363.43 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.50 [2025-06-19 22:00:05,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.29 | bwd: 3364.26 | bwd_inner: 3363.43 | bwd_allreduce: 0.79 | step: 7.50 54%|█████▍ | 5395/10000 [8:30:26<7:03:02, 5.51s/it] {'loss': 0.0013, 'grad_norm': 0.10577773302793503, 'learning_rate': 1.8414674745025593e-05, 'epoch': 5.39} 54%|█████▍ | 5395/10000 [8:30:26<7:03:02, 5.51s/it][2025-06-19 22:00:11,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:00:11,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.08 | bwd_microstep: 3369.69 | bwd_inner_microstep: 3368.88 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 22:00:11,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.08 | bwd: 3369.71 | bwd_inner: 3368.88 | bwd_allreduce: 0.78 | step: 7.00 54%|█████▍ | 5396/10000 [8:30:32<7:03:31, 5.52s/it] {'loss': 0.0004, 'grad_norm': 0.0499013289809227, 'learning_rate': 1.840821769927913e-05, 'epoch': 5.4} 54%|█████▍ | 5396/10000 [8:30:32<7:03:31, 5.52s/it][2025-06-19 22:00:16,921] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:00:16,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.97 | bwd_microstep: 3369.15 | bwd_inner_microstep: 3368.13 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.51 [2025-06-19 22:00:16,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.97 | bwd: 3369.17 | bwd_inner: 3368.13 | bwd_allreduce: 0.98 | step: 7.51 54%|█████▍ | 5397/10000 [8:30:37<7:03:53, 5.53s/it] {'loss': 0.038, 'grad_norm': 3.154853582382202, 'learning_rate': 1.8401760820503253e-05, 'epoch': 5.4} 54%|█████▍ | 5397/10000 [8:30:37<7:03:53, 5.53s/it][2025-06-19 22:00:22,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:00:22,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.76 | bwd_microstep: 3318.85 | bwd_inner_microstep: 3318.04 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.85 [2025-06-19 22:00:22,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.76 | bwd: 3318.87 | bwd_inner: 3318.04 | bwd_allreduce: 0.78 | step: 6.85 54%|█████▍ | 5398/10000 [8:30:43<7:02:23, 5.51s/it] {'loss': 0.0071, 'grad_norm': 0.9664172530174255, 'learning_rate': 1.8395304109375238e-05, 'epoch': 5.4} 54%|█████▍ | 5398/10000 [8:30:43<7:02:23, 5.51s/it][2025-06-19 22:00:27,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.72 [2025-06-19 22:00:27,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.39 | bwd_microstep: 3369.21 | bwd_inner_microstep: 3368.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.34 [2025-06-19 22:00:27,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.39 | bwd: 3369.22 | bwd_inner: 3368.40 | bwd_allreduce: 0.78 | step: 7.34 54%|█████▍ | 
5399/10000 [8:30:48<7:03:05, 5.52s/it] {'loss': 0.0007, 'grad_norm': 0.060116469860076904, 'learning_rate': 1.838884756657237e-05, 'epoch': 5.4} 54%|█████▍ | 5399/10000 [8:30:48<7:03:05, 5.52s/it][2025-06-19 22:00:33,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.88 [2025-06-19 22:00:33,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.25 | bwd_microstep: 3323.73 | bwd_inner_microstep: 3322.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.91 [2025-06-19 22:00:33,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.25 | bwd: 3323.74 | bwd_inner: 3322.94 | bwd_allreduce: 0.76 | step: 6.92 54%|█████▍ | 5400/10000 [8:30:54<7:02:00, 5.50s/it] {'loss': 0.0017, 'grad_norm': 0.1650223731994629, 'learning_rate': 1.838239119277192e-05, 'epoch': 5.4} 54%|█████▍ | 5400/10000 [8:30:54<7:02:00, 5.50s/it][2025-06-19 22:00:38,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:00:38,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.55 | bwd_microstep: 3319.40 | bwd_inner_microstep: 3318.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-19 22:00:38,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.55 | bwd: 3319.42 | bwd_inner: 3318.60 | bwd_allreduce: 0.77 | step: 7.02 54%|█████▍ | 5401/10000 [8:30:59<7:01:02, 5.49s/it] {'loss': 0.0679, 'grad_norm': 6.55330753326416, 'learning_rate': 1.8375934988651124e-05, 'epoch': 5.4} 54%|█████▍ | 5401/10000 [8:30:59<7:01:02, 5.49s/it][2025-06-19 22:00:44,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:00:44,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.01 | bwd_microstep: 3366.60 | bwd_inner_microstep: 3365.62 | 
bwd_allreduce_microstep: 0.92 | step_microstep: 7.11 [2025-06-19 22:00:44,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.02 | bwd: 3366.62 | bwd_inner: 3365.62 | bwd_allreduce: 0.94 | step: 7.11 54%|█████▍ | 5402/10000 [8:31:05<7:01:46, 5.50s/it] {'loss': 0.0045, 'grad_norm': 0.4930969476699829, 'learning_rate': 1.8369478954887215e-05, 'epoch': 5.4} 54%|█████▍ | 5402/10000 [8:31:05<7:01:46, 5.50s/it][2025-06-19 22:00:49,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:00:49,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.39 | bwd_microstep: 3364.75 | bwd_inner_microstep: 3363.94 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 22:00:49,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.40 | bwd: 3364.76 | bwd_inner: 3363.94 | bwd_allreduce: 0.78 | step: 7.30 54%|█████▍ | 5403/10000 [8:31:10<7:02:21, 5.51s/it] {'loss': 0.0012, 'grad_norm': 0.08230355381965637, 'learning_rate': 1.836302309215739e-05, 'epoch': 5.4} 54%|█████▍ | 5403/10000 [8:31:10<7:02:21, 5.51s/it][2025-06-19 22:00:55,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:00:55,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.04 | bwd_microstep: 3378.07 | bwd_inner_microstep: 3377.10 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.35 [2025-06-19 22:00:55,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.04 | bwd: 3378.09 | bwd_inner: 3377.10 | bwd_allreduce: 0.94 | step: 7.35 54%|█████▍ | 5404/10000 [8:31:16<7:03:08, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.05419967323541641, 'learning_rate': 1.835656740113885e-05, 'epoch': 5.4} 54%|█████▍ | 5404/10000 [8:31:16<7:03:08, 5.52s/it][2025-06-19 22:01:00,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.84 | optimizer_step: 2.89 [2025-06-19 22:01:00,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.60 | bwd_microstep: 3323.54 | bwd_inner_microstep: 3322.74 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.62 [2025-06-19 22:01:00,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.60 | bwd: 3323.56 | bwd_inner: 3322.74 | bwd_allreduce: 0.78 | step: 7.63 54%|█████▍ | 5405/10000 [8:31:21<7:01:53, 5.51s/it] {'loss': 0.0014, 'grad_norm': 0.35127463936805725, 'learning_rate': 1.835011188250876e-05, 'epoch': 5.41} 54%|█████▍ | 5405/10000 [8:31:21<7:01:53, 5.51s/it][2025-06-19 22:01:06,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:01:06,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.12 | bwd_microstep: 3362.90 | bwd_inner_microstep: 3361.93 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.17 [2025-06-19 22:01:06,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.12 | bwd: 3362.91 | bwd_inner: 3361.93 | bwd_allreduce: 0.94 | step: 7.17 54%|█████▍ | 5406/10000 [8:31:27<7:02:13, 5.51s/it] {'loss': 0.1006, 'grad_norm': 6.493162155151367, 'learning_rate': 1.8343656536944273e-05, 'epoch': 5.41} 54%|█████▍ | 5406/10000 [8:31:27<7:02:13, 5.51s/it][2025-06-19 22:01:12,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:01:12,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.55 | bwd_microstep: 3366.54 | bwd_inner_microstep: 3365.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 22:01:12,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.55 | bwd: 3366.55 | bwd_inner: 3365.75 | bwd_allreduce: 0.76 | step: 6.72 54%|█████▍ | 5407/10000 [8:31:32<7:02:41, 5.52s/it] 
{'loss': 0.0016, 'grad_norm': 0.20334264636039734, 'learning_rate': 1.833720136512254e-05, 'epoch': 5.41} 54%|█████▍ | 5407/10000 [8:31:32<7:02:41, 5.52s/it][2025-06-19 22:01:17,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:01:17,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.99 | bwd_microstep: 3397.50 | bwd_inner_microstep: 3396.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 22:01:17,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.99 | bwd: 3397.52 | bwd_inner: 3396.70 | bwd_allreduce: 0.77 | step: 6.90 54%|█████▍ | 5408/10000 [8:31:38<7:03:49, 5.54s/it] {'loss': 0.0022, 'grad_norm': 0.234193816781044, 'learning_rate': 1.8330746367720658e-05, 'epoch': 5.41} 54%|█████▍ | 5408/10000 [8:31:38<7:03:49, 5.54s/it][2025-06-19 22:01:23,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:01:23,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.15 | bwd_microstep: 3364.80 | bwd_inner_microstep: 3364.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-19 22:01:23,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.15 | bwd: 3364.81 | bwd_inner: 3364.00 | bwd_allreduce: 0.78 | step: 6.93 54%|█████▍ | 5409/10000 [8:31:43<7:03:46, 5.54s/it] {'loss': 0.0427, 'grad_norm': 2.22041916847229, 'learning_rate': 1.8324291545415735e-05, 'epoch': 5.41} 54%|█████▍ | 5409/10000 [8:31:43<7:03:46, 5.54s/it][2025-06-19 22:01:28,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:01:28,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.76 | bwd_microstep: 3319.39 | bwd_inner_microstep: 3318.60 | bwd_allreduce_microstep: 0.75 | 
step_microstep: 6.71 [2025-06-19 22:01:28,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.76 | bwd: 3319.41 | bwd_inner: 3318.60 | bwd_allreduce: 0.77 | step: 6.71 54%|█████▍ | 5410/10000 [8:31:49<7:01:53, 5.51s/it] {'loss': 0.0265, 'grad_norm': 2.0708954334259033, 'learning_rate': 1.831783689888485e-05, 'epoch': 5.41} 54%|█████▍ | 5410/10000 [8:31:49<7:01:53, 5.51s/it][2025-06-19 22:01:34,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:01:34,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.80 | bwd_microstep: 3358.12 | bwd_inner_microstep: 3357.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 22:01:34,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.80 | bwd: 3358.13 | bwd_inner: 3357.31 | bwd_allreduce: 0.78 | step: 7.10 54%|█████▍ | 5411/10000 [8:31:54<7:01:55, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.026409057900309563, 'learning_rate': 1.8311382428805066e-05, 'epoch': 5.41} 54%|█████▍ | 5411/10000 [8:31:54<7:01:55, 5.52s/it][2025-06-19 22:01:39,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.77 [2025-06-19 22:01:39,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.14 | bwd_microstep: 3321.31 | bwd_inner_microstep: 3320.33 | bwd_allreduce_microstep: 0.94 | step_microstep: 6.75 [2025-06-19 22:01:39,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.14 | bwd: 3321.33 | bwd_inner: 3320.33 | bwd_allreduce: 0.96 | step: 6.76 54%|█████▍ | 5412/10000 [8:32:00<7:00:35, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.048322588205337524, 'learning_rate': 1.8304928135853434e-05, 'epoch': 5.41} 54%|█████▍ | 5412/10000 [8:32:00<7:00:35, 5.50s/it][2025-06-19 22:01:45,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:01:45,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.08 | bwd_microstep: 3322.82 | bwd_inner_microstep: 3322.00 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.83 [2025-06-19 22:01:45,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.08 | bwd: 3322.83 | bwd_inner: 3322.00 | bwd_allreduce: 0.79 | step: 6.84 54%|█████▍ | 5413/10000 [8:32:05<6:59:53, 5.49s/it] {'loss': 0.0006, 'grad_norm': 0.09381522238254547, 'learning_rate': 1.829847402070697e-05, 'epoch': 5.41} 54%|█████▍ | 5413/10000 [8:32:05<6:59:53, 5.49s/it][2025-06-19 22:01:50,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:01:50,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.21 | bwd_microstep: 3318.81 | bwd_inner_microstep: 3318.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 22:01:50,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.21 | bwd: 3318.82 | bwd_inner: 3318.01 | bwd_allreduce: 0.76 | step: 6.98 54%|█████▍ | 5414/10000 [8:32:11<6:59:19, 5.49s/it] {'loss': 0.0024, 'grad_norm': 0.26577991247177124, 'learning_rate': 1.8292020084042682e-05, 'epoch': 5.41} 54%|█████▍ | 5414/10000 [8:32:11<6:59:19, 5.49s/it][2025-06-19 22:01:55,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:01:55,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.09 | bwd_microstep: 3313.67 | bwd_inner_microstep: 3312.89 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 22:01:55,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.09 | bwd: 3313.68 | bwd_inner: 3312.89 | bwd_allreduce: 0.75 | step: 6.56 54%|█████▍ | 5415/10000 [8:32:16<6:58:24, 5.48s/it] {'loss': 0.0034, 'grad_norm': 
0.5112165212631226, 'learning_rate': 1.8285566326537564e-05, 'epoch': 5.42} 54%|█████▍ | 5415/10000 [8:32:16<6:58:24, 5.48s/it][2025-06-19 22:02:01,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:02:01,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.76 | bwd_microstep: 3367.45 | bwd_inner_microstep: 3366.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 22:02:01,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.76 | bwd: 3367.46 | bwd_inner: 3366.66 | bwd_allreduce: 0.76 | step: 6.84 54%|█████▍ | 5416/10000 [8:32:22<6:59:34, 5.49s/it] {'loss': 0.0042, 'grad_norm': 0.6820607781410217, 'learning_rate': 1.8279112748868577e-05, 'epoch': 5.42} 54%|█████▍ | 5416/10000 [8:32:22<6:59:34, 5.49s/it][2025-06-19 22:02:07,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 22:02:07,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.73 | bwd_microstep: 3370.83 | bwd_inner_microstep: 3369.98 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.63 [2025-06-19 22:02:07,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.73 | bwd: 3370.85 | bwd_inner: 3369.98 | bwd_allreduce: 0.82 | step: 7.63 54%|█████▍ | 5417/10000 [8:32:27<7:00:46, 5.51s/it] {'loss': 0.0071, 'grad_norm': 7.547612190246582, 'learning_rate': 1.8272659351712688e-05, 'epoch': 5.42} 54%|█████▍ | 5417/10000 [8:32:27<7:00:46, 5.51s/it][2025-06-19 22:02:12,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.82 [2025-06-19 22:02:12,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.47 | bwd_microstep: 3391.53 | bwd_inner_microstep: 3390.72 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.31 [2025-06-19 
22:02:12,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.47 | bwd: 3391.54 | bwd_inner: 3390.72 | bwd_allreduce: 0.78 | step: 7.31 54%|█████▍ | 5418/10000 [8:32:33<7:01:58, 5.53s/it] {'loss': 0.0002, 'grad_norm': 0.023562612012028694, 'learning_rate': 1.8266206135746806e-05, 'epoch': 5.42} 54%|█████▍ | 5418/10000 [8:32:33<7:01:58, 5.53s/it][2025-06-19 22:02:18,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:02:18,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.60 | bwd_microstep: 3374.54 | bwd_inner_microstep: 3373.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 22:02:18,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.60 | bwd: 3374.55 | bwd_inner: 3373.75 | bwd_allreduce: 0.76 | step: 6.82 54%|█████▍ | 5419/10000 [8:32:38<7:02:14, 5.53s/it] {'loss': 0.0005, 'grad_norm': 0.09837831556797028, 'learning_rate': 1.8259753101647852e-05, 'epoch': 5.42} 54%|█████▍ | 5419/10000 [8:32:38<7:02:14, 5.53s/it][2025-06-19 22:02:23,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 22:02:23,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.30 | bwd_microstep: 3313.84 | bwd_inner_microstep: 3312.83 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.33 [2025-06-19 22:02:23,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.30 | bwd: 3313.87 | bwd_inner: 3312.83 | bwd_allreduce: 0.97 | step: 7.33 54%|█████▍ | 5420/10000 [8:32:44<7:00:30, 5.51s/it] {'loss': 0.0533, 'grad_norm': 2.6340317726135254, 'learning_rate': 1.8253300250092722e-05, 'epoch': 5.42} 54%|█████▍ | 5420/10000 [8:32:44<7:00:30, 5.51s/it][2025-06-19 22:02:29,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.85 [2025-06-19 22:02:29,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.70 | bwd_microstep: 3314.37 | bwd_inner_microstep: 3313.41 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.71 [2025-06-19 22:02:29,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.70 | bwd: 3314.39 | bwd_inner: 3313.41 | bwd_allreduce: 0.92 | step: 7.71 54%|█████▍ | 5421/10000 [8:32:49<6:59:20, 5.49s/it] {'loss': 0.008, 'grad_norm': 1.3111982345581055, 'learning_rate': 1.8246847581758293e-05, 'epoch': 5.42} 54%|█████▍ | 5421/10000 [8:32:49<6:59:20, 5.49s/it][2025-06-19 22:02:34,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:02:34,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.93 | bwd_microstep: 3362.24 | bwd_inner_microstep: 3361.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 22:02:34,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.93 | bwd: 3362.25 | bwd_inner: 3361.43 | bwd_allreduce: 0.78 | step: 6.97 54%|█████▍ | 5422/10000 [8:32:55<7:00:10, 5.51s/it] {'loss': 0.0091, 'grad_norm': 1.369225263595581, 'learning_rate': 1.8240395097321414e-05, 'epoch': 5.42} 54%|█████▍ | 5422/10000 [8:32:55<7:00:10, 5.51s/it][2025-06-19 22:02:40,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.73 | optimizer_step: 2.73 [2025-06-19 22:02:40,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.67 | bwd_microstep: 3314.33 | bwd_inner_microstep: 3313.43 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.40 [2025-06-19 22:02:40,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.67 | bwd: 3314.34 | bwd_inner: 3313.43 | bwd_allreduce: 0.87 | step: 7.40 54%|█████▍ | 5423/10000 [8:33:00<6:58:57, 5.49s/it] {'loss': 0.0034, 'grad_norm': 0.698149561882019, 
'learning_rate': 1.8233942797458924e-05, 'epoch': 5.42} 54%|█████▍ | 5423/10000 [8:33:00<6:58:57, 5.49s/it][2025-06-19 22:02:45,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:02:45,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.57 | bwd_microstep: 3311.26 | bwd_inner_microstep: 3310.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 22:02:45,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.57 | bwd: 3311.27 | bwd_inner: 3310.47 | bwd_allreduce: 0.76 | step: 6.64 54%|█████▍ | 5424/10000 [8:33:06<6:57:55, 5.48s/it] {'loss': 0.0536, 'grad_norm': 7.762650012969971, 'learning_rate': 1.8227490682847632e-05, 'epoch': 5.42} 54%|█████▍ | 5424/10000 [8:33:06<6:57:55, 5.48s/it][2025-06-19 22:02:51,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:02:51,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.56 | bwd_microstep: 3367.67 | bwd_inner_microstep: 3366.66 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.42 [2025-06-19 22:02:51,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.56 | bwd: 3367.69 | bwd_inner: 3366.66 | bwd_allreduce: 0.98 | step: 7.42 54%|█████▍ | 5425/10000 [8:33:11<6:58:59, 5.49s/it] {'loss': 0.0023, 'grad_norm': 0.3090662658214569, 'learning_rate': 1.8221038754164348e-05, 'epoch': 5.42} 54%|█████▍ | 5425/10000 [8:33:11<6:58:59, 5.49s/it][2025-06-19 22:02:56,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:02:56,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.65 | bwd_microstep: 3313.31 | bwd_inner_microstep: 3312.47 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.14 [2025-06-19 22:02:56,515] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.65 | bwd: 3313.33 | bwd_inner: 3312.47 | bwd_allreduce: 0.80 | step: 7.15 54%|█████▍ | 5426/10000 [8:33:17<6:58:06, 5.48s/it] {'loss': 0.0244, 'grad_norm': 3.8255014419555664, 'learning_rate': 1.8214587012085837e-05, 'epoch': 5.43} 54%|█████▍ | 5426/10000 [8:33:17<6:58:06, 5.48s/it][2025-06-19 22:03:01,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:03:01,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.71 | bwd_microstep: 3308.34 | bwd_inner_microstep: 3307.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 22:03:01,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.71 | bwd: 3308.36 | bwd_inner: 3307.56 | bwd_allreduce: 0.75 | step: 6.66 54%|█████▍ | 5427/10000 [8:33:22<6:57:12, 5.47s/it] {'loss': 0.0017, 'grad_norm': 0.27437323331832886, 'learning_rate': 1.820813545728886e-05, 'epoch': 5.43} 54%|█████▍ | 5427/10000 [8:33:22<6:57:12, 5.47s/it][2025-06-19 22:03:07,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:03:07,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.47 | bwd_microstep: 3358.82 | bwd_inner_microstep: 3358.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 22:03:07,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.47 | bwd: 3358.83 | bwd_inner: 3358.03 | bwd_allreduce: 0.75 | step: 6.64 54%|█████▍ | 5428/10000 [8:33:28<6:58:12, 5.49s/it] {'loss': 0.0176, 'grad_norm': 2.5398905277252197, 'learning_rate': 1.8201684090450168e-05, 'epoch': 5.43} 54%|█████▍ | 5428/10000 [8:33:28<6:58:12, 5.49s/it][2025-06-19 22:03:13,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 
22:03:13,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.27 | bwd_microstep: 3368.84 | bwd_inner_microstep: 3367.80 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.32 [2025-06-19 22:03:13,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.27 | bwd: 3368.86 | bwd_inner: 3367.80 | bwd_allreduce: 0.99 | step: 7.32 54%|█████▍ | 5429/10000 [8:33:33<6:59:15, 5.50s/it] {'loss': 0.1501, 'grad_norm': 9.089829444885254, 'learning_rate': 1.819523291224646e-05, 'epoch': 5.43} 54%|█████▍ | 5429/10000 [8:33:33<6:59:15, 5.50s/it][2025-06-19 22:03:18,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:03:18,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.72 | bwd_microstep: 3314.45 | bwd_inner_microstep: 3313.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.87 [2025-06-19 22:03:18,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.72 | bwd: 3314.46 | bwd_inner: 3313.66 | bwd_allreduce: 0.76 | step: 6.87 54%|█████▍ | 5430/10000 [8:33:39<6:58:10, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.036408860236406326, 'learning_rate': 1.8188781923354447e-05, 'epoch': 5.43} 54%|█████▍ | 5430/10000 [8:33:39<6:58:10, 5.49s/it][2025-06-19 22:03:23,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:03:23,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.57 | bwd_microstep: 3308.85 | bwd_inner_microstep: 3308.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 22:03:23,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.57 | bwd: 3308.86 | bwd_inner: 3308.06 | bwd_allreduce: 0.77 | step: 6.64 54%|█████▍ | 5431/10000 [8:33:44<6:57:01, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.028320187702775, 'learning_rate': 1.8182331124450802e-05, 
'epoch': 5.43} 54%|█████▍ | 5431/10000 [8:33:44<6:57:01, 5.48s/it][2025-06-19 22:03:29,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:03:29,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.36 | bwd_microstep: 3314.99 | bwd_inner_microstep: 3314.21 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.56 [2025-06-19 22:03:29,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.36 | bwd: 3315.00 | bwd_inner: 3314.21 | bwd_allreduce: 0.75 | step: 6.57 54%|█████▍ | 5432/10000 [8:33:50<6:56:26, 5.47s/it] {'loss': 0.0037, 'grad_norm': 0.3660646677017212, 'learning_rate': 1.8175880516212187e-05, 'epoch': 5.43} 54%|█████▍ | 5432/10000 [8:33:50<6:56:26, 5.47s/it][2025-06-19 22:03:34,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:03:34,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.03 | bwd_microstep: 3309.01 | bwd_inner_microstep: 3308.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 22:03:34,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.03 | bwd: 3309.02 | bwd_inner: 3308.22 | bwd_allreduce: 0.76 | step: 6.67 54%|█████▍ | 5433/10000 [8:33:55<6:55:49, 5.46s/it] {'loss': 0.0106, 'grad_norm': 0.7521923780441284, 'learning_rate': 1.8169430099315245e-05, 'epoch': 5.43} 54%|█████▍ | 5433/10000 [8:33:55<6:55:49, 5.46s/it][2025-06-19 22:03:40,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.91 [2025-06-19 22:03:40,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.42 | bwd_microstep: 3309.36 | bwd_inner_microstep: 3308.56 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 22:03:40,287] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2106.42 | bwd: 3309.38 | bwd_inner: 3308.56 | bwd_allreduce: 0.78 | step: 7.14 54%|█████▍ | 5434/10000 [8:34:01<6:55:32, 5.46s/it] {'loss': 0.0051, 'grad_norm': 0.6017529368400574, 'learning_rate': 1.8162979874436585e-05, 'epoch': 5.43} 54%|█████▍ | 5434/10000 [8:34:01<6:55:32, 5.46s/it][2025-06-19 22:03:45,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:03:45,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.59 | bwd_microstep: 3307.75 | bwd_inner_microstep: 3306.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 22:03:45,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.59 | bwd: 3307.76 | bwd_inner: 3306.95 | bwd_allreduce: 0.77 | step: 6.90 54%|█████▍ | 5435/10000 [8:34:06<6:55:21, 5.46s/it] {'loss': 0.1521, 'grad_norm': 7.7331624031066895, 'learning_rate': 1.815652984225281e-05, 'epoch': 5.43} 54%|█████▍ | 5435/10000 [8:34:06<6:55:21, 5.46s/it][2025-06-19 22:03:51,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:03:51,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.60 | bwd_microstep: 3376.27 | bwd_inner_microstep: 3375.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 22:03:51,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.60 | bwd: 3376.29 | bwd_inner: 3375.49 | bwd_allreduce: 0.76 | step: 6.64 54%|█████▍ | 5436/10000 [8:34:12<6:57:23, 5.49s/it] {'loss': 0.0079, 'grad_norm': 1.0838515758514404, 'learning_rate': 1.8150080003440495e-05, 'epoch': 5.44} 54%|█████▍ | 5436/10000 [8:34:12<6:57:23, 5.49s/it][2025-06-19 22:03:56,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:03:56,749] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.13 | bwd_microstep: 3311.18 | bwd_inner_microstep: 3310.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 22:03:56,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.13 | bwd: 3311.20 | bwd_inner: 3310.40 | bwd_allreduce: 0.76 | step: 6.65 54%|█████▍ | 5437/10000 [8:34:17<6:56:30, 5.48s/it] {'loss': 0.0033, 'grad_norm': 0.32808640599250793, 'learning_rate': 1.8143630358676202e-05, 'epoch': 5.44} 54%|█████▍ | 5437/10000 [8:34:17<6:56:30, 5.48s/it][2025-06-19 22:04:02,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:04:02,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.26 | bwd_microstep: 3310.28 | bwd_inner_microstep: 3309.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 22:04:02,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.26 | bwd: 3310.30 | bwd_inner: 3309.50 | bwd_allreduce: 0.76 | step: 6.62 54%|█████▍ | 5438/10000 [8:34:22<6:55:48, 5.47s/it] {'loss': 0.0174, 'grad_norm': 3.69551420211792, 'learning_rate': 1.813718090863648e-05, 'epoch': 5.44} 54%|█████▍ | 5438/10000 [8:34:22<6:55:48, 5.47s/it][2025-06-19 22:04:07,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:04:07,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.06 | bwd_microstep: 3354.09 | bwd_inner_microstep: 3353.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-19 22:04:07,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.06 | bwd: 3354.10 | bwd_inner: 3353.28 | bwd_allreduce: 0.77 | step: 7.02 54%|█████▍ | 5439/10000 [8:34:28<6:56:40, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.08832062035799026, 'learning_rate': 1.8130731653997825e-05, 'epoch': 5.44} 
54%|█████▍ | 5439/10000 [8:34:28<6:56:40, 5.48s/it][2025-06-19 22:04:13,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:04:13,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.43 | bwd_microstep: 3312.32 | bwd_inner_microstep: 3311.25 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.44 [2025-06-19 22:04:13,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.43 | bwd: 3312.33 | bwd_inner: 3311.25 | bwd_allreduce: 1.03 | step: 7.44 54%|█████▍ | 5440/10000 [8:34:33<6:55:55, 5.47s/it] {'loss': 0.0079, 'grad_norm': 2.7419819831848145, 'learning_rate': 1.8124282595436745e-05, 'epoch': 5.44} 54%|█████▍ | 5440/10000 [8:34:33<6:55:55, 5.47s/it][2025-06-19 22:04:18,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:04:18,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.57 | bwd_microstep: 3316.18 | bwd_inner_microstep: 3315.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 22:04:18,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.57 | bwd: 3316.19 | bwd_inner: 3315.38 | bwd_allreduce: 0.77 | step: 6.67 54%|█████▍ | 5441/10000 [8:34:39<6:55:45, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.019631123170256615, 'learning_rate': 1.8117833733629715e-05, 'epoch': 5.44} 54%|█████▍ | 5441/10000 [8:34:39<6:55:45, 5.47s/it][2025-06-19 22:04:24,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:04:24,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.19 | bwd_microstep: 3362.78 | bwd_inner_microstep: 3361.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 22:04:24,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2121.19 | bwd: 3362.79 | bwd_inner: 3361.98 | bwd_allreduce: 0.77 | step: 6.74 54%|█████▍ | 5442/10000 [8:34:44<6:56:52, 5.49s/it] {'loss': 0.0051, 'grad_norm': 1.0402495861053467, 'learning_rate': 1.811138506925319e-05, 'epoch': 5.44} 54%|█████▍ | 5442/10000 [8:34:44<6:56:52, 5.49s/it][2025-06-19 22:04:29,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:04:29,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.13 | bwd_microstep: 3313.39 | bwd_inner_microstep: 3312.58 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 22:04:29,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.13 | bwd: 3313.41 | bwd_inner: 3312.58 | bwd_allreduce: 0.78 | step: 7.19 54%|█████▍ | 5443/10000 [8:34:50<6:55:55, 5.48s/it] {'loss': 0.0245, 'grad_norm': 2.3336076736450195, 'learning_rate': 1.8104936602983618e-05, 'epoch': 5.44} 54%|█████▍ | 5443/10000 [8:34:50<6:55:55, 5.48s/it][2025-06-19 22:04:35,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:04:35,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.04 | bwd_microstep: 3367.03 | bwd_inner_microstep: 3366.07 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.12 [2025-06-19 22:04:35,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.04 | bwd: 3367.05 | bwd_inner: 3366.07 | bwd_allreduce: 0.93 | step: 7.13 54%|█████▍ | 5444/10000 [8:34:55<6:56:55, 5.49s/it] {'loss': 0.002, 'grad_norm': 0.4061192274093628, 'learning_rate': 1.8098488335497387e-05, 'epoch': 5.44} 54%|█████▍ | 5444/10000 [8:34:55<6:56:55, 5.49s/it][2025-06-19 22:04:40,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:04:40,604] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2108.46 | bwd_microstep: 3324.33 | bwd_inner_microstep: 3323.20 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.60 [2025-06-19 22:04:40,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.46 | bwd: 3324.36 | bwd_inner: 3323.20 | bwd_allreduce: 1.09 | step: 7.59 54%|█████▍ | 5445/10000 [8:35:01<6:56:29, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.257577508687973, 'learning_rate': 1.809204026747091e-05, 'epoch': 5.45} 54%|█████▍ | 5445/10000 [8:35:01<6:56:29, 5.49s/it][2025-06-19 22:04:46,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:04:46,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.12 | bwd_microstep: 3359.35 | bwd_inner_microstep: 3358.55 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 22:04:46,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.12 | bwd: 3359.37 | bwd_inner: 3358.55 | bwd_allreduce: 0.77 | step: 6.80 54%|█████▍ | 5446/10000 [8:35:06<6:57:23, 5.50s/it] {'loss': 0.0036, 'grad_norm': 0.35781803727149963, 'learning_rate': 1.808559239958055e-05, 'epoch': 5.45} 54%|█████▍ | 5446/10000 [8:35:06<6:57:23, 5.50s/it][2025-06-19 22:04:51,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:04:51,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.86 | bwd_microstep: 3317.18 | bwd_inner_microstep: 3316.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-19 22:04:51,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.86 | bwd: 3317.20 | bwd_inner: 3316.38 | bwd_allreduce: 0.77 | step: 6.85 54%|█████▍ | 5447/10000 [8:35:12<6:56:30, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.006417813710868359, 'learning_rate': 1.8079144732502666e-05, 'epoch': 5.45} 54%|█████▍ | 5447/10000 [8:35:12<6:56:30, 
5.49s/it][2025-06-19 22:04:57,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:04:57,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.10 | bwd_microstep: 3313.45 | bwd_inner_microstep: 3312.48 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.26 [2025-06-19 22:04:57,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.10 | bwd: 3313.47 | bwd_inner: 3312.48 | bwd_allreduce: 0.94 | step: 7.26 54%|█████▍ | 5448/10000 [8:35:17<6:55:49, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.0431755892932415, 'learning_rate': 1.8072697266913582e-05, 'epoch': 5.45} 54%|█████▍ | 5448/10000 [8:35:17<6:55:49, 5.48s/it][2025-06-19 22:05:02,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:05:02,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.81 | bwd_microstep: 3363.23 | bwd_inner_microstep: 3362.39 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.95 [2025-06-19 22:05:02,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.81 | bwd: 3363.25 | bwd_inner: 3362.39 | bwd_allreduce: 0.81 | step: 6.95 54%|█████▍ | 5449/10000 [8:35:23<6:56:51, 5.50s/it] {'loss': 0.0079, 'grad_norm': 1.370842456817627, 'learning_rate': 1.8066250003489613e-05, 'epoch': 5.45} 54%|█████▍ | 5449/10000 [8:35:23<6:56:51, 5.50s/it][2025-06-19 22:05:08,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:05:08,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.86 | bwd_microstep: 3314.01 | bwd_inner_microstep: 3313.05 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.11 [2025-06-19 22:05:08,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.86 | bwd: 3314.02 | bwd_inner: 3313.05 | 
bwd_allreduce: 0.93 | step: 7.11 55%|█████▍ | 5450/10000 [8:35:28<6:56:01, 5.49s/it] {'loss': 0.0008, 'grad_norm': 0.2100183069705963, 'learning_rate': 1.805980294290704e-05, 'epoch': 5.45} 55%|█████▍ | 5450/10000 [8:35:28<6:56:01, 5.49s/it][2025-06-19 22:05:13,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:05:13,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.74 | bwd_microstep: 3362.29 | bwd_inner_microstep: 3361.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-19 22:05:13,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.74 | bwd: 3362.31 | bwd_inner: 3361.50 | bwd_allreduce: 0.77 | step: 7.02 55%|█████▍ | 5451/10000 [8:35:34<6:56:58, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.20338544249534607, 'learning_rate': 1.8053356085842136e-05, 'epoch': 5.45} 55%|█████▍ | 5451/10000 [8:35:34<6:56:58, 5.50s/it][2025-06-19 22:05:19,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:05:19,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.53 | bwd_microstep: 3326.83 | bwd_inner_microstep: 3326.03 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-19 22:05:19,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.53 | bwd: 3326.84 | bwd_inner: 3326.03 | bwd_allreduce: 0.77 | step: 6.83 55%|█████▍ | 5452/10000 [8:35:39<6:56:14, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.05608488619327545, 'learning_rate': 1.8046909432971143e-05, 'epoch': 5.45} 55%|█████▍ | 5452/10000 [8:35:39<6:56:14, 5.49s/it][2025-06-19 22:05:24,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 22:05:24,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.50 | 
bwd_microstep: 3374.83 | bwd_inner_microstep: 3373.57 | bwd_allreduce_microstep: 1.19 | step_microstep: 7.73 [2025-06-19 22:05:24,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.50 | bwd: 3374.86 | bwd_inner: 3373.57 | bwd_allreduce: 1.22 | step: 7.73 55%|█████▍ | 5453/10000 [8:35:45<6:57:24, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.023982994258403778, 'learning_rate': 1.804046298497029e-05, 'epoch': 5.45} 55%|█████▍ | 5453/10000 [8:35:45<6:57:24, 5.51s/it][2025-06-19 22:05:30,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:05:30,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.17 | bwd_microstep: 3320.54 | bwd_inner_microstep: 3319.56 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.23 [2025-06-19 22:05:30,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.17 | bwd: 3320.55 | bwd_inner: 3319.56 | bwd_allreduce: 0.94 | step: 7.24 55%|█████▍ | 5454/10000 [8:35:50<6:56:25, 5.50s/it] {'loss': 0.1651, 'grad_norm': 4.972741603851318, 'learning_rate': 1.8034016742515774e-05, 'epoch': 5.45} 55%|█████▍ | 5454/10000 [8:35:50<6:56:25, 5.50s/it][2025-06-19 22:05:35,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:05:35,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.38 | bwd_microstep: 3310.53 | bwd_inner_microstep: 3309.61 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.31 [2025-06-19 22:05:35,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.38 | bwd: 3310.55 | bwd_inner: 3309.61 | bwd_allreduce: 0.89 | step: 7.31 55%|█████▍ | 5455/10000 [8:35:56<6:55:26, 5.48s/it] {'loss': 0.0035, 'grad_norm': 0.5892428159713745, 'learning_rate': 1.8027570706283774e-05, 'epoch': 5.46} 55%|█████▍ | 5455/10000 [8:35:56<6:55:26, 5.48s/it][2025-06-19 22:05:40,996] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:05:40,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.15 | bwd_microstep: 3312.82 | bwd_inner_microstep: 3312.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 22:05:40,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.15 | bwd: 3312.83 | bwd_inner: 3312.03 | bwd_allreduce: 0.76 | step: 6.67 55%|█████▍ | 5456/10000 [8:36:01<6:54:53, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.028389040380716324, 'learning_rate': 1.802112487695045e-05, 'epoch': 5.46} 55%|█████▍ | 5456/10000 [8:36:01<6:54:53, 5.48s/it][2025-06-19 22:05:46,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:05:46,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.58 | bwd_microstep: 3396.74 | bwd_inner_microstep: 3395.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 22:05:46,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.58 | bwd: 3396.75 | bwd_inner: 3395.95 | bwd_allreduce: 0.76 | step: 6.60 55%|█████▍ | 5457/10000 [8:36:07<6:56:55, 5.51s/it] {'loss': 0.0179, 'grad_norm': 2.0039172172546387, 'learning_rate': 1.8014679255191943e-05, 'epoch': 5.46} 55%|█████▍ | 5457/10000 [8:36:07<6:56:55, 5.51s/it][2025-06-19 22:05:52,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:05:52,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.91 | bwd_microstep: 3374.40 | bwd_inner_microstep: 3373.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 22:05:52,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.91 | bwd: 3374.41 | bwd_inner: 3373.61 | bwd_allreduce: 0.76 | step: 6.62 
55%|█████▍ | 5458/10000 [8:36:12<6:57:30, 5.52s/it] {'loss': 0.0004, 'grad_norm': 0.05029807239770889, 'learning_rate': 1.8008233841684376e-05, 'epoch': 5.46} 55%|█████▍ | 5458/10000 [8:36:12<6:57:30, 5.52s/it][2025-06-19 22:05:57,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:05:57,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.72 | bwd_microstep: 3329.91 | bwd_inner_microstep: 3329.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 22:05:57,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.72 | bwd: 3329.93 | bwd_inner: 3329.11 | bwd_allreduce: 0.78 | step: 7.05 55%|█████▍ | 5459/10000 [8:36:18<6:56:23, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.03353547304868698, 'learning_rate': 1.800178863710383e-05, 'epoch': 5.46} 55%|█████▍ | 5459/10000 [8:36:18<6:56:23, 5.50s/it][2025-06-19 22:06:03,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:06:03,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.98 | bwd_microstep: 3323.14 | bwd_inner_microstep: 3322.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-19 22:06:03,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.98 | bwd: 3323.15 | bwd_inner: 3322.35 | bwd_allreduce: 0.76 | step: 6.80 55%|█████▍ | 5460/10000 [8:36:23<6:55:25, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.4083103537559509, 'learning_rate': 1.799534364212638e-05, 'epoch': 5.46} 55%|█████▍ | 5460/10000 [8:36:23<6:55:25, 5.49s/it][2025-06-19 22:06:08,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 22:06:08,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.91 | bwd_microstep: 3375.35 | 
bwd_inner_microstep: 3374.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 22:06:08,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.91 | bwd: 3375.36 | bwd_inner: 3374.56 | bwd_allreduce: 0.76 | step: 6.60 55%|█████▍ | 5461/10000 [8:36:29<6:56:34, 5.51s/it] {'loss': 0.0013, 'grad_norm': 0.14220483601093292, 'learning_rate': 1.7988898857428082e-05, 'epoch': 5.46} 55%|█████▍ | 5461/10000 [8:36:29<6:56:34, 5.51s/it][2025-06-19 22:06:14,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:06:14,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.99 | bwd_microstep: 3367.50 | bwd_inner_microstep: 3366.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 22:06:14,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.99 | bwd: 3367.51 | bwd_inner: 3366.71 | bwd_allreduce: 0.76 | step: 6.73 55%|█████▍ | 5462/10000 [8:36:34<6:56:58, 5.51s/it] {'loss': 0.0015, 'grad_norm': 0.3588564693927765, 'learning_rate': 1.7982454283684955e-05, 'epoch': 5.46} 55%|█████▍ | 5462/10000 [8:36:34<6:56:58, 5.51s/it][2025-06-19 22:06:19,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:06:19,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.88 | bwd_microstep: 3323.06 | bwd_inner_microstep: 3322.02 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.42 [2025-06-19 22:06:19,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.88 | bwd: 3323.07 | bwd_inner: 3322.02 | bwd_allreduce: 1.01 | step: 7.44 55%|█████▍ | 5463/10000 [8:36:40<6:55:59, 5.50s/it] {'loss': 0.064, 'grad_norm': 4.596227645874023, 'learning_rate': 1.7976009921573015e-05, 'epoch': 5.46} 55%|█████▍ | 5463/10000 [8:36:40<6:55:59, 5.50s/it][2025-06-19 22:06:25,078] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 22:06:25,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.00 | bwd_microstep: 3330.14 | bwd_inner_microstep: 3329.00 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.28 [2025-06-19 22:06:25,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.00 | bwd: 3330.17 | bwd_inner: 3329.00 | bwd_allreduce: 1.10 | step: 8.28 55%|█████▍ | 5464/10000 [8:36:45<6:55:53, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.05942521616816521, 'learning_rate': 1.7969565771768238e-05, 'epoch': 5.46} 55%|█████▍ | 5464/10000 [8:36:45<6:55:53, 5.50s/it][2025-06-19 22:06:30,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:06:30,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.27 | bwd_microstep: 3328.32 | bwd_inner_microstep: 3327.51 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.89 [2025-06-19 22:06:30,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.27 | bwd: 3328.34 | bwd_inner: 3327.51 | bwd_allreduce: 0.78 | step: 6.89 55%|█████▍ | 5465/10000 [8:36:51<6:55:30, 5.50s/it] {'loss': 0.0014, 'grad_norm': 0.24983985722064972, 'learning_rate': 1.7963121834946585e-05, 'epoch': 5.46} 55%|█████▍ | 5465/10000 [8:36:51<6:55:30, 5.50s/it][2025-06-19 22:06:36,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:06:36,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.74 | bwd_microstep: 3342.57 | bwd_inner_microstep: 3341.59 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.24 [2025-06-19 22:06:36,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.74 | bwd: 3342.58 | bwd_inner: 3341.59 | bwd_allreduce: 0.95 | step: 7.24 
55%|█████▍ | 5466/10000 [8:36:56<6:55:25, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.17704050242900848, 'learning_rate': 1.7956678111784002e-05, 'epoch': 5.47} 55%|█████▍ | 5466/10000 [8:36:56<6:55:25, 5.50s/it][2025-06-19 22:06:41,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:06:41,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.57 | bwd_microstep: 3327.83 | bwd_inner_microstep: 3327.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 22:06:41,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.57 | bwd: 3327.85 | bwd_inner: 3327.05 | bwd_allreduce: 0.76 | step: 6.56 55%|█████▍ | 5467/10000 [8:37:02<6:54:57, 5.49s/it] {'loss': 0.1179, 'grad_norm': 5.6442413330078125, 'learning_rate': 1.79502346029564e-05, 'epoch': 5.47} 55%|█████▍ | 5467/10000 [8:37:02<6:54:57, 5.49s/it][2025-06-19 22:06:47,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:06:47,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.22 | bwd_microstep: 3329.62 | bwd_inner_microstep: 3328.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 22:06:47,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.22 | bwd: 3329.63 | bwd_inner: 3328.83 | bwd_allreduce: 0.76 | step: 6.72 55%|█████▍ | 5468/10000 [8:37:07<6:54:24, 5.49s/it] {'loss': 0.0286, 'grad_norm': 3.306621789932251, 'learning_rate': 1.7943791309139685e-05, 'epoch': 5.47} 55%|█████▍ | 5468/10000 [8:37:07<6:54:24, 5.49s/it][2025-06-19 22:06:52,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:06:52,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.84 | bwd_microstep: 3377.27 | 
bwd_inner_microstep: 3376.37 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.95 [2025-06-19 22:06:52,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.84 | bwd: 3377.28 | bwd_inner: 3376.37 | bwd_allreduce: 0.87 | step: 6.95 55%|█████▍ | 5469/10000 [8:37:13<6:55:48, 5.51s/it] {'loss': 0.0061, 'grad_norm': 1.2460739612579346, 'learning_rate': 1.7937348231009706e-05, 'epoch': 5.47} 55%|█████▍ | 5469/10000 [8:37:13<6:55:48, 5.51s/it][2025-06-19 22:06:58,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:06:58,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.61 | bwd_microstep: 3328.36 | bwd_inner_microstep: 3327.50 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.40 [2025-06-19 22:06:58,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.61 | bwd: 3328.37 | bwd_inner: 3327.50 | bwd_allreduce: 0.84 | step: 7.41 55%|█████▍ | 5470/10000 [8:37:18<6:55:19, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.00497419573366642, 'learning_rate': 1.7930905369242328e-05, 'epoch': 5.47} 55%|█████▍ | 5470/10000 [8:37:18<6:55:19, 5.50s/it][2025-06-19 22:07:03,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:07:03,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.63 | bwd_microstep: 3376.48 | bwd_inner_microstep: 3375.53 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.12 [2025-06-19 22:07:03,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.63 | bwd: 3376.50 | bwd_inner: 3375.53 | bwd_allreduce: 0.92 | step: 7.12 55%|█████▍ | 5471/10000 [8:37:24<6:56:26, 5.52s/it] {'loss': 0.0024, 'grad_norm': 0.29822781682014465, 'learning_rate': 1.7924462724513376e-05, 'epoch': 5.47} 55%|█████▍ | 5471/10000 [8:37:24<6:56:26, 5.52s/it][2025-06-19 22:07:09,099] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:07:09,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.46 | bwd_microstep: 3326.11 | bwd_inner_microstep: 3325.20 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.98 [2025-06-19 22:07:09,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.46 | bwd: 3326.13 | bwd_inner: 3325.20 | bwd_allreduce: 0.89 | step: 6.98 55%|█████▍ | 5472/10000 [8:37:29<6:55:29, 5.51s/it] {'loss': 0.0186, 'grad_norm': 3.4821181297302246, 'learning_rate': 1.7918020297498647e-05, 'epoch': 5.47} 55%|█████▍ | 5472/10000 [8:37:29<6:55:29, 5.51s/it][2025-06-19 22:07:14,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:07:14,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.65 | bwd_microstep: 3372.92 | bwd_inner_microstep: 3372.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 22:07:14,652] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.65 | bwd: 3372.93 | bwd_inner: 3372.14 | bwd_allreduce: 0.75 | step: 6.58 55%|█████▍ | 5473/10000 [8:37:35<6:56:26, 5.52s/it] {'loss': 0.0008, 'grad_norm': 0.1533563733100891, 'learning_rate': 1.7911578088873932e-05, 'epoch': 5.47} 55%|█████▍ | 5473/10000 [8:37:35<6:56:26, 5.52s/it][2025-06-19 22:07:20,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:07:20,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.18 | bwd_microstep: 3386.23 | bwd_inner_microstep: 3385.41 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.07 [2025-06-19 22:07:20,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.18 | bwd: 3386.24 | bwd_inner: 3385.41 | bwd_allreduce: 0.79 | step: 7.07 
55%|█████▍ | 5474/10000 [8:37:41<6:57:19, 5.53s/it] {'loss': 0.0005, 'grad_norm': 0.050935324281454086, 'learning_rate': 1.7905136099314976e-05, 'epoch': 5.47} 55%|█████▍ | 5474/10000 [8:37:41<6:57:19, 5.53s/it][2025-06-19 22:07:25,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:07:25,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.02 | bwd_microstep: 3325.32 | bwd_inner_microstep: 3324.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 22:07:25,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.02 | bwd: 3325.34 | bwd_inner: 3324.50 | bwd_allreduce: 0.78 | step: 6.86 55%|█████▍ | 5475/10000 [8:37:46<6:56:03, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.020602161064743996, 'learning_rate': 1.7898694329497523e-05, 'epoch': 5.47} 55%|█████▍ | 5475/10000 [8:37:46<6:56:03, 5.52s/it][2025-06-19 22:07:31,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:07:31,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.02 | bwd_microstep: 3344.05 | bwd_inner_microstep: 3343.00 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.43 [2025-06-19 22:07:31,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.02 | bwd: 3344.07 | bwd_inner: 3343.00 | bwd_allreduce: 1.01 | step: 7.44 55%|█████▍ | 5476/10000 [8:37:51<6:55:39, 5.51s/it] {'loss': 0.0169, 'grad_norm': 2.6030101776123047, 'learning_rate': 1.789225278009728e-05, 'epoch': 5.48} 55%|█████▍ | 5476/10000 [8:37:51<6:55:39, 5.51s/it][2025-06-19 22:07:36,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:07:36,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.40 | bwd_microstep: 3323.17 | 
bwd_inner_microstep: 3322.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 22:07:36,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.40 | bwd: 3323.18 | bwd_inner: 3322.38 | bwd_allreduce: 0.76 | step: 6.76 55%|█████▍ | 5477/10000 [8:37:57<6:54:43, 5.50s/it] {'loss': 0.0013, 'grad_norm': 0.22758451104164124, 'learning_rate': 1.788581145178994e-05, 'epoch': 5.48} 55%|█████▍ | 5477/10000 [8:37:57<6:54:43, 5.50s/it][2025-06-19 22:07:42,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 22:07:42,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.94 | bwd_microstep: 3385.50 | bwd_inner_microstep: 3384.29 | bwd_allreduce_microstep: 1.16 | step_microstep: 7.28 [2025-06-19 22:07:42,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.94 | bwd: 3385.52 | bwd_inner: 3384.29 | bwd_allreduce: 1.18 | step: 7.28 55%|█████▍ | 5478/10000 [8:38:03<6:55:47, 5.52s/it] {'loss': 0.0427, 'grad_norm': 6.041800022125244, 'learning_rate': 1.7879370345251176e-05, 'epoch': 5.48} 55%|█████▍ | 5478/10000 [8:38:03<6:55:47, 5.52s/it][2025-06-19 22:07:47,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:07:47,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.76 | bwd_microstep: 3340.45 | bwd_inner_microstep: 3339.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 22:07:47,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.76 | bwd: 3340.46 | bwd_inner: 3339.66 | bwd_allreduce: 0.76 | step: 6.66 55%|█████▍ | 5479/10000 [8:38:08<6:55:18, 5.51s/it] {'loss': 0.0333, 'grad_norm': 2.558192014694214, 'learning_rate': 1.787292946115661e-05, 'epoch': 5.48} 55%|█████▍ | 5479/10000 [8:38:08<6:55:18, 5.51s/it][2025-06-19 22:07:53,224] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:07:53,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.89 | bwd_microstep: 3349.18 | bwd_inner_microstep: 3348.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 22:07:53,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.89 | bwd: 3349.20 | bwd_inner: 3348.38 | bwd_allreduce: 0.78 | step: 7.14 55%|█████▍ | 5480/10000 [8:38:14<6:54:56, 5.51s/it] {'loss': 0.1007, 'grad_norm': 5.470917701721191, 'learning_rate': 1.7866488800181874e-05, 'epoch': 5.48} 55%|█████▍ | 5480/10000 [8:38:14<6:54:56, 5.51s/it][2025-06-19 22:07:58,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:07:58,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.57 | bwd_microstep: 3380.95 | bwd_inner_microstep: 3380.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 22:07:58,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.57 | bwd: 3380.96 | bwd_inner: 3380.16 | bwd_allreduce: 0.76 | step: 6.65 55%|█████▍ | 5481/10000 [8:38:19<6:55:48, 5.52s/it] {'loss': 0.0775, 'grad_norm': 3.6591734886169434, 'learning_rate': 1.786004836300256e-05, 'epoch': 5.48} 55%|█████▍ | 5481/10000 [8:38:19<6:55:48, 5.52s/it][2025-06-19 22:08:04,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:08:04,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.52 | bwd_microstep: 3342.29 | bwd_inner_microstep: 3341.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 22:08:04,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.52 | bwd: 3342.31 | bwd_inner: 3341.50 | bwd_allreduce: 0.76 | step: 6.77 55%|█████▍ 
| 5482/10000 [8:38:25<6:54:58, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.01890760287642479, 'learning_rate': 1.785360815029424e-05, 'epoch': 5.48} 55%|█████▍ | 5482/10000 [8:38:25<6:54:58, 5.51s/it][2025-06-19 22:08:09,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:08:09,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.88 | bwd_microstep: 3375.75 | bwd_inner_microstep: 3374.89 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.26 [2025-06-19 22:08:09,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.88 | bwd: 3375.77 | bwd_inner: 3374.89 | bwd_allreduce: 0.82 | step: 7.25 55%|█████▍ | 5483/10000 [8:38:30<6:55:42, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.03264593333005905, 'learning_rate': 1.7847168162732468e-05, 'epoch': 5.48} 55%|█████▍ | 5483/10000 [8:38:30<6:55:42, 5.52s/it][2025-06-19 22:08:15,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:08:15,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.88 | bwd_microstep: 3378.06 | bwd_inner_microstep: 3377.25 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 22:08:15,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.88 | bwd: 3378.08 | bwd_inner: 3377.25 | bwd_allreduce: 0.78 | step: 6.76 55%|█████▍ | 5484/10000 [8:38:36<6:56:08, 5.53s/it] {'loss': 0.0024, 'grad_norm': 0.28656014800071716, 'learning_rate': 1.784072840099276e-05, 'epoch': 5.48} 55%|█████▍ | 5484/10000 [8:38:36<6:56:08, 5.53s/it][2025-06-19 22:08:20,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.72 [2025-06-19 22:08:20,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.03 | bwd_microstep: 3384.36 | bwd_inner_microstep: 
3383.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.72 [2025-06-19 22:08:20,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.03 | bwd: 3384.38 | bwd_inner: 3383.57 | bwd_allreduce: 0.76 | step: 7.72 55%|█████▍ | 5485/10000 [8:38:41<6:56:40, 5.54s/it] {'loss': 0.0033, 'grad_norm': 0.5334572792053223, 'learning_rate': 1.783428886575062e-05, 'epoch': 5.49} 55%|█████▍ | 5485/10000 [8:38:41<6:56:40, 5.54s/it][2025-06-19 22:08:26,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:08:26,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.77 | bwd_microstep: 3328.10 | bwd_inner_microstep: 3327.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 22:08:26,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.77 | bwd: 3328.11 | bwd_inner: 3327.31 | bwd_allreduce: 0.76 | step: 6.62 55%|█████▍ | 5486/10000 [8:38:47<6:55:29, 5.52s/it] {'loss': 0.0013, 'grad_norm': 0.11336799710988998, 'learning_rate': 1.782784955768153e-05, 'epoch': 5.49} 55%|█████▍ | 5486/10000 [8:38:47<6:55:29, 5.52s/it][2025-06-19 22:08:31,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:08:31,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.65 | bwd_microstep: 3336.40 | bwd_inner_microstep: 3335.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 22:08:31,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.65 | bwd: 3336.41 | bwd_inner: 3335.60 | bwd_allreduce: 0.77 | step: 7.03 55%|█████▍ | 5487/10000 [8:38:52<6:54:39, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.02852221392095089, 'learning_rate': 1.7821410477460935e-05, 'epoch': 5.49} 55%|█████▍ | 5487/10000 [8:38:52<6:54:39, 5.51s/it][2025-06-19 22:08:37,374] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:08:37,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.93 | bwd_microstep: 3329.39 | bwd_inner_microstep: 3328.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-19 22:08:37,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.93 | bwd: 3329.40 | bwd_inner: 3328.59 | bwd_allreduce: 0.77 | step: 6.98 55%|█████▍ | 5488/10000 [8:38:58<6:53:54, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.03916836902499199, 'learning_rate': 1.7814971625764274e-05, 'epoch': 5.49} 55%|█████▍ | 5488/10000 [8:38:58<6:53:54, 5.50s/it][2025-06-19 22:08:42,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:08:42,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.26 | bwd_microstep: 3324.86 | bwd_inner_microstep: 3323.90 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.47 [2025-06-19 22:08:42,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.26 | bwd: 3324.88 | bwd_inner: 3323.90 | bwd_allreduce: 0.93 | step: 7.47 55%|█████▍ | 5489/10000 [8:39:03<6:53:21, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.015182506293058395, 'learning_rate': 1.7808533003266956e-05, 'epoch': 5.49} 55%|█████▍ | 5489/10000 [8:39:03<6:53:21, 5.50s/it][2025-06-19 22:08:48,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:08:48,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.39 | bwd_microstep: 3376.73 | bwd_inner_microstep: 3375.92 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.89 [2025-06-19 22:08:48,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.39 | bwd: 3376.74 | bwd_inner: 3375.92 | bwd_allreduce: 0.78 | step: 6.89 55%|█████▍ | 5490/10000 [8:39:09<6:54:34, 
5.52s/it] {'loss': 0.0334, 'grad_norm': 3.0559253692626953, 'learning_rate': 1.7802094610644345e-05, 'epoch': 5.49} 55%|█████▍ | 5490/10000 [8:39:09<6:54:34, 5.52s/it][2025-06-19 22:08:53,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:08:53,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.82 | bwd_microstep: 3332.88 | bwd_inner_microstep: 3332.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 22:08:53,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.82 | bwd: 3332.89 | bwd_inner: 3332.10 | bwd_allreduce: 0.75 | step: 6.60 55%|█████▍ | 5491/10000 [8:39:14<6:53:42, 5.51s/it] {'loss': 0.0426, 'grad_norm': 5.618199825286865, 'learning_rate': 1.779565644857181e-05, 'epoch': 5.49} 55%|█████▍ | 5491/10000 [8:39:14<6:53:42, 5.51s/it][2025-06-19 22:08:59,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:08:59,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.54 | bwd_microstep: 3370.81 | bwd_inner_microstep: 3370.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 22:08:59,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.54 | bwd: 3370.83 | bwd_inner: 3370.02 | bwd_allreduce: 0.77 | step: 6.73 55%|█████▍ | 5492/10000 [8:39:20<6:54:25, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.013594444841146469, 'learning_rate': 1.778921851772468e-05, 'epoch': 5.49} 55%|█████▍ | 5492/10000 [8:39:20<6:54:25, 5.52s/it][2025-06-19 22:09:04,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:09:04,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.51 | bwd_microstep: 3323.75 | bwd_inner_microstep: 3322.92 | bwd_allreduce_microstep: 
0.79 | step_microstep: 6.61 [2025-06-19 22:09:04,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.51 | bwd: 3323.77 | bwd_inner: 3322.92 | bwd_allreduce: 0.80 | step: 6.61 55%|█████▍ | 5493/10000 [8:39:25<6:53:20, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.06280257552862167, 'learning_rate': 1.7782780818778266e-05, 'epoch': 5.49} 55%|█████▍ | 5493/10000 [8:39:25<6:53:20, 5.50s/it][2025-06-19 22:09:10,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:09:10,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.84 | bwd_microstep: 3381.94 | bwd_inner_microstep: 3381.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 22:09:10,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.84 | bwd: 3381.95 | bwd_inner: 3381.15 | bwd_allreduce: 0.76 | step: 6.68 55%|█████▍ | 5494/10000 [8:39:31<6:54:16, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.030024293810129166, 'learning_rate': 1.7776343352407867e-05, 'epoch': 5.49} 55%|█████▍ | 5494/10000 [8:39:31<6:54:16, 5.52s/it][2025-06-19 22:09:15,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:09:15,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.30 | bwd_microstep: 3331.62 | bwd_inner_microstep: 3330.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 22:09:15,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.30 | bwd: 3331.64 | bwd_inner: 3330.83 | bwd_allreduce: 0.76 | step: 6.71 55%|█████▍ | 5495/10000 [8:39:36<6:53:15, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.5824575424194336, 'learning_rate': 1.776990611928872e-05, 'epoch': 5.5} 55%|█████▍ | 5495/10000 [8:39:36<6:53:15, 5.50s/it][2025-06-19 22:09:21,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 22:09:21,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.33 | bwd_microstep: 3333.88 | bwd_inner_microstep: 3332.79 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.65 [2025-06-19 22:09:21,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.34 | bwd: 3333.90 | bwd_inner: 3332.79 | bwd_allreduce: 1.05 | step: 7.66 55%|█████▍ | 5496/10000 [8:39:42<6:52:52, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.05006831884384155, 'learning_rate': 1.7763469120096074e-05, 'epoch': 5.5} 55%|█████▍ | 5496/10000 [8:39:42<6:52:52, 5.50s/it][2025-06-19 22:09:27,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:09:27,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.48 | bwd_microstep: 3403.46 | bwd_inner_microstep: 3402.61 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.04 [2025-06-19 22:09:27,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.48 | bwd: 3403.47 | bwd_inner: 3402.61 | bwd_allreduce: 0.82 | step: 7.04 55%|█████▍ | 5497/10000 [8:39:47<6:54:37, 5.52s/it] {'loss': 0.055, 'grad_norm': 3.450610637664795, 'learning_rate': 1.7757032355505132e-05, 'epoch': 5.5} 55%|█████▍ | 5497/10000 [8:39:47<6:54:37, 5.52s/it][2025-06-19 22:09:32,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:09:32,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.32 | bwd_microstep: 3326.32 | bwd_inner_microstep: 3325.51 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.05 [2025-06-19 22:09:32,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.32 | bwd: 3326.33 | bwd_inner: 3325.51 | bwd_allreduce: 0.78 | step: 7.05 55%|█████▍ | 5498/10000 [8:39:53<6:53:45, 5.51s/it] {'loss': 0.0006, 'grad_norm': 
0.12394696474075317, 'learning_rate': 1.7750595826191087e-05, 'epoch': 5.5} 55%|█████▍ | 5498/10000 [8:39:53<6:53:45, 5.51s/it][2025-06-19 22:09:38,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:09:38,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.22 | bwd_microstep: 3367.11 | bwd_inner_microstep: 3366.08 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.17 [2025-06-19 22:09:38,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.22 | bwd: 3367.13 | bwd_inner: 3366.08 | bwd_allreduce: 1.00 | step: 7.16 55%|█████▍ | 5499/10000 [8:39:58<6:56:43, 5.56s/it] {'loss': 0.0001, 'grad_norm': 0.02700686827301979, 'learning_rate': 1.7744159532829108e-05, 'epoch': 5.5} 55%|█████▍ | 5499/10000 [8:39:58<6:56:43, 5.56s/it][2025-06-19 22:09:43,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:09:43,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.88 | bwd_microstep: 3371.08 | bwd_inner_microstep: 3370.12 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.13 [2025-06-19 22:09:43,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.88 | bwd: 3371.10 | bwd_inner: 3370.12 | bwd_allreduce: 0.93 | step: 7.14 55%|█████▌ | 5500/10000 [8:40:04<6:56:39, 5.56s/it] {'loss': 0.0003, 'grad_norm': 0.03891816735267639, 'learning_rate': 1.7737723476094317e-05, 'epoch': 5.5} 55%|█████▌ | 5500/10000 [8:40:04<6:56:39, 5.56s/it][2025-06-19 22:09:49,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 22:09:49,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.68 | bwd_microstep: 3376.06 | bwd_inner_microstep: 3374.90 | bwd_allreduce_microstep: 1.10 | step_microstep: 8.23 [2025-06-19 
22:09:49,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.68 | bwd: 3376.08 | bwd_inner: 3374.90 | bwd_allreduce: 1.13 | step: 8.24 55%|█████▌ | 5501/10000 [8:40:10<6:56:35, 5.56s/it] {'loss': 0.0001, 'grad_norm': 0.006477846298366785, 'learning_rate': 1.7731287656661834e-05, 'epoch': 5.5} 55%|█████▌ | 5501/10000 [8:40:10<6:56:35, 5.56s/it][2025-06-19 22:09:54,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:09:54,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.76 | bwd_microstep: 3321.02 | bwd_inner_microstep: 3320.14 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.29 [2025-06-19 22:09:54,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.76 | bwd: 3321.04 | bwd_inner: 3320.14 | bwd_allreduce: 0.85 | step: 7.29 55%|█████▌ | 5502/10000 [8:40:15<6:54:50, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.014929606579244137, 'learning_rate': 1.7724852075206747e-05, 'epoch': 5.5} 55%|█████▌ | 5502/10000 [8:40:15<6:54:50, 5.53s/it][2025-06-19 22:10:00,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:10:00,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.63 | bwd_microstep: 3315.99 | bwd_inner_microstep: 3315.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 22:10:00,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.63 | bwd: 3316.00 | bwd_inner: 3315.19 | bwd_allreduce: 0.77 | step: 6.82 55%|█████▌ | 5503/10000 [8:40:21<6:53:14, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.01057793851941824, 'learning_rate': 1.7718416732404117e-05, 'epoch': 5.5} 55%|█████▌ | 5503/10000 [8:40:21<6:53:14, 5.51s/it][2025-06-19 22:10:05,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | 
optimizer_step: 2.73 [2025-06-19 22:10:05,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.10 | bwd_microstep: 3366.87 | bwd_inner_microstep: 3366.04 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.79 [2025-06-19 22:10:05,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.10 | bwd: 3366.88 | bwd_inner: 3366.04 | bwd_allreduce: 0.80 | step: 6.80 55%|█████▌ | 5504/10000 [8:40:26<6:53:33, 5.52s/it] {'loss': 0.0265, 'grad_norm': 3.241464138031006, 'learning_rate': 1.7711981628928994e-05, 'epoch': 5.5} 55%|█████▌ | 5504/10000 [8:40:26<6:53:33, 5.52s/it][2025-06-19 22:10:11,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.90 [2025-06-19 22:10:11,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.95 | bwd_microstep: 3369.41 | bwd_inner_microstep: 3368.58 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.49 [2025-06-19 22:10:11,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.95 | bwd: 3369.43 | bwd_inner: 3368.58 | bwd_allreduce: 0.81 | step: 7.49 55%|█████▌ | 5505/10000 [8:40:32<6:53:53, 5.52s/it] {'loss': 0.0021, 'grad_norm': 0.5976582169532776, 'learning_rate': 1.7705546765456367e-05, 'epoch': 5.5} 55%|█████▌ | 5505/10000 [8:40:32<6:53:53, 5.52s/it][2025-06-19 22:10:16,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 22:10:16,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.50 | bwd_microstep: 3317.09 | bwd_inner_microstep: 3316.02 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.59 [2025-06-19 22:10:16,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.50 | bwd: 3317.11 | bwd_inner: 3316.02 | bwd_allreduce: 1.04 | step: 7.60 55%|█████▌ | 5506/10000 [8:40:37<6:52:45, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.021725835278630257, 
'learning_rate': 1.7699112142661237e-05, 'epoch': 5.51} 55%|█████▌ | 5506/10000 [8:40:37<6:52:45, 5.51s/it][2025-06-19 22:10:22,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 22:10:22,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.67 | bwd_microstep: 3317.22 | bwd_inner_microstep: 3316.25 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.34 [2025-06-19 22:10:22,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.67 | bwd: 3317.23 | bwd_inner: 3316.25 | bwd_allreduce: 0.93 | step: 7.36 55%|█████▌ | 5507/10000 [8:40:43<6:51:45, 5.50s/it] {'loss': 0.0006, 'grad_norm': 0.09078019857406616, 'learning_rate': 1.7692677761218563e-05, 'epoch': 5.51} 55%|█████▌ | 5507/10000 [8:40:43<6:51:45, 5.50s/it][2025-06-19 22:10:27,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:10:27,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.93 | bwd_microstep: 3314.37 | bwd_inner_microstep: 3313.53 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.04 [2025-06-19 22:10:27,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.93 | bwd: 3314.39 | bwd_inner: 3313.53 | bwd_allreduce: 0.80 | step: 7.04 55%|█████▌ | 5508/10000 [8:40:48<6:51:00, 5.49s/it] {'loss': 0.0071, 'grad_norm': 0.7803462147712708, 'learning_rate': 1.7686243621803286e-05, 'epoch': 5.51} 55%|█████▌ | 5508/10000 [8:40:48<6:51:00, 5.49s/it][2025-06-19 22:10:33,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:10:33,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.22 | bwd_microstep: 3314.08 | bwd_inner_microstep: 3313.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 22:10:33,156] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.22 | bwd: 3314.10 | bwd_inner: 3313.28 | bwd_allreduce: 0.78 | step: 6.97 55%|█████▌ | 5509/10000 [8:40:53<6:50:16, 5.48s/it] {'loss': 0.0013, 'grad_norm': 0.4220573902130127, 'learning_rate': 1.7679809725090318e-05, 'epoch': 5.51} 55%|█████▌ | 5509/10000 [8:40:53<6:50:16, 5.48s/it][2025-06-19 22:10:38,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:10:38,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.09 | bwd_microstep: 3379.11 | bwd_inner_microstep: 3378.13 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.75 [2025-06-19 22:10:38,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.09 | bwd: 3379.14 | bwd_inner: 3378.13 | bwd_allreduce: 0.95 | step: 7.76 55%|█████▌ | 5510/10000 [8:40:59<6:51:43, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.0678417980670929, 'learning_rate': 1.7673376071754536e-05, 'epoch': 5.51} 55%|█████▌ | 5510/10000 [8:40:59<6:51:43, 5.50s/it][2025-06-19 22:10:44,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.79 [2025-06-19 22:10:44,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.80 | bwd_microstep: 3370.20 | bwd_inner_microstep: 3369.34 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.03 [2025-06-19 22:10:44,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.80 | bwd: 3370.22 | bwd_inner: 3369.34 | bwd_allreduce: 0.83 | step: 7.03 55%|█████▌ | 5511/10000 [8:41:05<6:52:51, 5.52s/it] {'loss': 0.0071, 'grad_norm': 1.660704255104065, 'learning_rate': 1.7666942662470807e-05, 'epoch': 5.51} 55%|█████▌ | 5511/10000 [8:41:05<6:52:51, 5.52s/it][2025-06-19 22:10:49,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 
22:10:49,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.80 | bwd_microstep: 3367.62 | bwd_inner_microstep: 3366.79 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.27 [2025-06-19 22:10:49,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.80 | bwd: 3367.64 | bwd_inner: 3366.79 | bwd_allreduce: 0.80 | step: 7.27 55%|█████▌ | 5512/10000 [8:41:10<6:53:21, 5.53s/it] {'loss': 0.0533, 'grad_norm': 5.638448238372803, 'learning_rate': 1.7660509497913965e-05, 'epoch': 5.51} 55%|█████▌ | 5512/10000 [8:41:10<6:53:21, 5.53s/it][2025-06-19 22:10:55,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:10:55,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.92 | bwd_microstep: 3364.40 | bwd_inner_microstep: 3363.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 22:10:55,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.92 | bwd: 3364.41 | bwd_inner: 3363.61 | bwd_allreduce: 0.76 | step: 6.63 55%|█████▌ | 5513/10000 [8:41:16<6:53:23, 5.53s/it] {'loss': 0.0013, 'grad_norm': 0.1231386736035347, 'learning_rate': 1.7654076578758817e-05, 'epoch': 5.51} 55%|█████▌ | 5513/10000 [8:41:16<6:53:23, 5.53s/it][2025-06-19 22:11:00,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:11:00,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.70 | bwd_microstep: 3314.49 | bwd_inner_microstep: 3313.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 22:11:00,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.70 | bwd: 3314.50 | bwd_inner: 3313.69 | bwd_allreduce: 0.77 | step: 6.79 55%|█████▌ | 5514/10000 [8:41:21<6:51:45, 5.51s/it] {'loss': 0.0252, 'grad_norm': 1.530025601387024, 'learning_rate': 1.7647643905680158e-05, 
'epoch': 5.51} 55%|█████▌ | 5514/10000 [8:41:21<6:51:45, 5.51s/it][2025-06-19 22:11:06,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:11:06,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.66 | bwd_microstep: 3323.63 | bwd_inner_microstep: 3322.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 22:11:06,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.66 | bwd: 3323.65 | bwd_inner: 3322.84 | bwd_allreduce: 0.76 | step: 6.69 55%|█████▌ | 5515/10000 [8:41:27<6:50:43, 5.49s/it] {'loss': 0.0126, 'grad_norm': 2.687234401702881, 'learning_rate': 1.7641211479352726e-05, 'epoch': 5.51} 55%|█████▌ | 5515/10000 [8:41:27<6:50:43, 5.49s/it][2025-06-19 22:11:11,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:11:11,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.09 | bwd_microstep: 3315.61 | bwd_inner_microstep: 3314.79 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-19 22:11:11,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.09 | bwd: 3315.63 | bwd_inner: 3314.79 | bwd_allreduce: 0.79 | step: 7.22 55%|█████▌ | 5516/10000 [8:41:32<6:49:55, 5.49s/it] {'loss': 0.0098, 'grad_norm': 1.440697193145752, 'learning_rate': 1.7634779300451262e-05, 'epoch': 5.52} 55%|█████▌ | 5516/10000 [8:41:32<6:49:55, 5.49s/it][2025-06-19 22:11:17,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 22:11:17,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.75 | bwd_microstep: 3363.09 | bwd_inner_microstep: 3362.26 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.31 [2025-06-19 22:11:17,254] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2121.75 | bwd: 3363.11 | bwd_inner: 3362.26 | bwd_allreduce: 0.79 | step: 7.31 55%|█████▌ | 5517/10000 [8:41:38<6:50:48, 5.50s/it] {'loss': 0.0649, 'grad_norm': 2.414055585861206, 'learning_rate': 1.7628347369650474e-05, 'epoch': 5.52} 55%|█████▌ | 5517/10000 [8:41:38<6:50:48, 5.50s/it][2025-06-19 22:11:22,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:11:22,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.83 | bwd_microstep: 3309.63 | bwd_inner_microstep: 3308.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 22:11:22,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.83 | bwd: 3309.64 | bwd_inner: 3308.83 | bwd_allreduce: 0.76 | step: 6.84 55%|█████▌ | 5518/10000 [8:41:43<6:49:59, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.18433253467082977, 'learning_rate': 1.762191568762504e-05, 'epoch': 5.52} 55%|█████▌ | 5518/10000 [8:41:43<6:49:59, 5.49s/it][2025-06-19 22:11:28,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:11:28,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.36 | bwd_microstep: 3381.83 | bwd_inner_microstep: 3381.01 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.17 [2025-06-19 22:11:28,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.36 | bwd: 3381.85 | bwd_inner: 3381.01 | bwd_allreduce: 0.79 | step: 7.17 55%|█████▌ | 5519/10000 [8:41:49<6:51:15, 5.51s/it] {'loss': 0.0121, 'grad_norm': 1.1987946033477783, 'learning_rate': 1.7615484255049617e-05, 'epoch': 5.52} 55%|█████▌ | 5519/10000 [8:41:49<6:51:15, 5.51s/it][2025-06-19 22:11:33,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:11:33,821] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.60 | bwd_microstep: 3375.97 | bwd_inner_microstep: 3375.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-19 22:11:33,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.60 | bwd: 3375.99 | bwd_inner: 3375.17 | bwd_allreduce: 0.78 | step: 6.82 55%|█████▌ | 5520/10000 [8:41:54<6:52:08, 5.52s/it] {'loss': 0.0052, 'grad_norm': 1.474905252456665, 'learning_rate': 1.7609053072598827e-05, 'epoch': 5.52} 55%|█████▌ | 5520/10000 [8:41:54<6:52:08, 5.52s/it][2025-06-19 22:11:39,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:11:39,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.94 | bwd_microstep: 3313.82 | bwd_inner_microstep: 3312.99 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.23 [2025-06-19 22:11:39,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.94 | bwd: 3313.83 | bwd_inner: 3312.99 | bwd_allreduce: 0.80 | step: 7.23 55%|█████▌ | 5521/10000 [8:42:00<6:50:42, 5.50s/it] {'loss': 0.0015, 'grad_norm': 0.3472503423690796, 'learning_rate': 1.760262214094727e-05, 'epoch': 5.52} 55%|█████▌ | 5521/10000 [8:42:00<6:50:42, 5.50s/it][2025-06-19 22:11:44,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:11:44,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.10 | bwd_microstep: 3324.52 | bwd_inner_microstep: 3323.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.28 [2025-06-19 22:11:44,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.10 | bwd: 3324.53 | bwd_inner: 3323.70 | bwd_allreduce: 0.78 | step: 7.28 55%|█████▌ | 5522/10000 [8:42:05<6:49:54, 5.49s/it] {'loss': 0.0044, 'grad_norm': 0.9121647477149963, 'learning_rate': 1.759619146076952e-05, 'epoch': 5.52} 55%|█████▌ 
| 5522/10000 [8:42:05<6:49:54, 5.49s/it][2025-06-19 22:11:50,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:11:50,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.51 | bwd_microstep: 3377.91 | bwd_inner_microstep: 3377.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 22:11:50,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.51 | bwd: 3377.92 | bwd_inner: 3377.11 | bwd_allreduce: 0.77 | step: 6.70 55%|█████▌ | 5523/10000 [8:42:11<6:50:54, 5.51s/it] {'loss': 0.0122, 'grad_norm': 0.9818816781044006, 'learning_rate': 1.7589761032740133e-05, 'epoch': 5.52} 55%|█████▌ | 5523/10000 [8:42:11<6:50:54, 5.51s/it][2025-06-19 22:11:55,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:11:55,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.25 | bwd_microstep: 3326.48 | bwd_inner_microstep: 3325.68 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-19 22:11:55,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.25 | bwd: 3326.49 | bwd_inner: 3325.68 | bwd_allreduce: 0.77 | step: 7.28 55%|█████▌ | 5524/10000 [8:42:16<6:49:57, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.02338200807571411, 'learning_rate': 1.758333085753364e-05, 'epoch': 5.52} 55%|█████▌ | 5524/10000 [8:42:16<6:49:57, 5.50s/it][2025-06-19 22:12:01,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:12:01,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.29 | bwd_microstep: 3318.25 | bwd_inner_microstep: 3317.18 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.39 [2025-06-19 22:12:01,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.29 | bwd: 
3318.28 | bwd_inner: 3317.18 | bwd_allreduce: 1.03 | step: 7.39 55%|█████▌ | 5525/10000 [8:42:22<6:49:01, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.04244200140237808, 'learning_rate': 1.757690093582451e-05, 'epoch': 5.53} 55%|█████▌ | 5525/10000 [8:42:22<6:49:01, 5.48s/it][2025-06-19 22:12:06,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:12:06,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.71 | bwd_microstep: 3325.55 | bwd_inner_microstep: 3324.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 22:12:06,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.71 | bwd: 3325.56 | bwd_inner: 3324.76 | bwd_allreduce: 0.76 | step: 6.65 55%|█████▌ | 5526/10000 [8:42:27<6:48:39, 5.48s/it] {'loss': 0.0119, 'grad_norm': 2.4596216678619385, 'learning_rate': 1.7570471268287225e-05, 'epoch': 5.53} 55%|█████▌ | 5526/10000 [8:42:27<6:48:39, 5.48s/it][2025-06-19 22:12:12,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:12:12,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.38 | bwd_microstep: 3323.18 | bwd_inner_microstep: 3322.21 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.03 [2025-06-19 22:12:12,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.38 | bwd: 3323.19 | bwd_inner: 3322.21 | bwd_allreduce: 0.94 | step: 7.04 55%|█████▌ | 5527/10000 [8:42:32<6:48:04, 5.47s/it] {'loss': 0.0151, 'grad_norm': 2.464473009109497, 'learning_rate': 1.756404185559623e-05, 'epoch': 5.53} 55%|█████▌ | 5527/10000 [8:42:32<6:48:04, 5.47s/it][2025-06-19 22:12:17,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:12:17,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2106.67 | bwd_microstep: 3315.58 | bwd_inner_microstep: 3314.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 22:12:17,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.67 | bwd: 3315.60 | bwd_inner: 3314.79 | bwd_allreduce: 0.76 | step: 6.85 55%|█████▌ | 5528/10000 [8:42:38<6:47:44, 5.47s/it] {'loss': 0.139, 'grad_norm': 4.9159698486328125, 'learning_rate': 1.7557612698425935e-05, 'epoch': 5.53} 55%|█████▌ | 5528/10000 [8:42:38<6:47:44, 5.47s/it][2025-06-19 22:12:23,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:12:23,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.86 | bwd_microstep: 3307.79 | bwd_inner_microstep: 3306.97 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.08 [2025-06-19 22:12:23,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.86 | bwd: 3307.80 | bwd_inner: 3306.97 | bwd_allreduce: 0.79 | step: 7.09 55%|█████▌ | 5529/10000 [8:42:43<6:47:12, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.007779822219163179, 'learning_rate': 1.7551183797450748e-05, 'epoch': 5.53} 55%|█████▌ | 5529/10000 [8:42:43<6:47:12, 5.46s/it][2025-06-19 22:12:28,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:12:28,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.34 | bwd_microstep: 3362.77 | bwd_inner_microstep: 3361.81 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.06 [2025-06-19 22:12:28,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.34 | bwd: 3362.78 | bwd_inner: 3361.82 | bwd_allreduce: 0.92 | step: 7.07 55%|█████▌ | 5530/10000 [8:42:49<6:48:26, 5.48s/it] {'loss': 0.004, 'grad_norm': 0.6306765079498291, 'learning_rate': 1.7544755153345004e-05, 'epoch': 5.53} 55%|█████▌ | 5530/10000 [8:42:49<6:48:26, 
5.48s/it][2025-06-19 22:12:34,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:12:34,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.90 | bwd_microstep: 3374.37 | bwd_inner_microstep: 3373.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 22:12:34,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.90 | bwd: 3374.39 | bwd_inner: 3373.58 | bwd_allreduce: 0.76 | step: 6.73 55%|█████▌ | 5531/10000 [8:42:54<6:49:37, 5.50s/it] {'loss': 0.0022, 'grad_norm': 0.47136181592941284, 'learning_rate': 1.753832676678305e-05, 'epoch': 5.53} 55%|█████▌ | 5531/10000 [8:42:54<6:49:37, 5.50s/it][2025-06-19 22:12:39,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:12:39,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.42 | bwd_microstep: 3374.77 | bwd_inner_microstep: 3373.96 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 22:12:39,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.42 | bwd: 3374.78 | bwd_inner: 3373.96 | bwd_allreduce: 0.78 | step: 7.14 55%|█████▌ | 5532/10000 [8:43:00<6:50:29, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.018411513417959213, 'learning_rate': 1.753189863843919e-05, 'epoch': 5.53} 55%|█████▌ | 5532/10000 [8:43:00<6:50:29, 5.51s/it][2025-06-19 22:12:45,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:12:45,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.33 | bwd_microstep: 3367.66 | bwd_inner_microstep: 3366.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 22:12:45,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.33 | bwd: 3367.67 | bwd_inner: 3366.88 | 
bwd_allreduce: 0.75 | step: 6.56 55%|█████▌ | 5533/10000 [8:43:06<6:51:06, 5.52s/it] {'loss': 0.0037, 'grad_norm': 1.0824499130249023, 'learning_rate': 1.7525470768987712e-05, 'epoch': 5.53} 55%|█████▌ | 5533/10000 [8:43:06<6:51:06, 5.52s/it][2025-06-19 22:12:50,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:12:50,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.76 | bwd_microstep: 3316.85 | bwd_inner_microstep: 3316.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 22:12:50,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.76 | bwd: 3316.87 | bwd_inner: 3316.06 | bwd_allreduce: 0.76 | step: 6.71 55%|█████▌ | 5534/10000 [8:43:11<6:49:37, 5.50s/it] {'loss': 0.0064, 'grad_norm': 1.3901925086975098, 'learning_rate': 1.7519043159102868e-05, 'epoch': 5.53} 55%|█████▌ | 5534/10000 [8:43:11<6:49:37, 5.50s/it][2025-06-19 22:12:56,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:12:56,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.09 | bwd_microstep: 3315.29 | bwd_inner_microstep: 3314.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 22:12:56,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.09 | bwd: 3315.30 | bwd_inner: 3314.49 | bwd_allreduce: 0.77 | step: 6.94 55%|█████▌ | 5535/10000 [8:43:16<6:48:41, 5.49s/it] {'loss': 0.001, 'grad_norm': 0.2497609555721283, 'learning_rate': 1.751261580945888e-05, 'epoch': 5.54} 55%|█████▌ | 5535/10000 [8:43:16<6:48:41, 5.49s/it][2025-06-19 22:13:01,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:13:01,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.05 | 
bwd_microstep: 3318.11 | bwd_inner_microstep: 3317.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 22:13:01,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.05 | bwd: 3318.13 | bwd_inner: 3317.32 | bwd_allreduce: 0.76 | step: 6.72 55%|█████▌ | 5536/10000 [8:43:22<6:47:55, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.04883825033903122, 'learning_rate': 1.7506188720729946e-05, 'epoch': 5.54} 55%|█████▌ | 5536/10000 [8:43:22<6:47:55, 5.48s/it][2025-06-19 22:13:07,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:13:07,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.18 | bwd_microstep: 3364.90 | bwd_inner_microstep: 3364.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 22:13:07,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.18 | bwd: 3364.91 | bwd_inner: 3364.11 | bwd_allreduce: 0.76 | step: 6.68 55%|█████▌ | 5537/10000 [8:43:27<6:48:47, 5.50s/it] {'loss': 0.1259, 'grad_norm': 4.455260753631592, 'learning_rate': 1.7499761893590243e-05, 'epoch': 5.54} 55%|█████▌ | 5537/10000 [8:43:27<6:48:47, 5.50s/it][2025-06-19 22:13:12,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:13:12,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.83 | bwd_microstep: 3365.81 | bwd_inner_microstep: 3364.86 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.07 [2025-06-19 22:13:12,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.83 | bwd: 3365.83 | bwd_inner: 3364.86 | bwd_allreduce: 0.93 | step: 7.07 55%|█████▌ | 5538/10000 [8:43:33<6:49:31, 5.51s/it] {'loss': 0.0213, 'grad_norm': 3.8019068241119385, 'learning_rate': 1.7493335328713913e-05, 'epoch': 5.54} 55%|█████▌ | 5538/10000 [8:43:33<6:49:31, 5.51s/it][2025-06-19 22:13:18,194] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:13:18,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.65 | bwd_microstep: 3366.85 | bwd_inner_microstep: 3365.87 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.57 [2025-06-19 22:13:18,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.66 | bwd: 3366.87 | bwd_inner: 3365.87 | bwd_allreduce: 0.96 | step: 7.57 55%|█████▌ | 5539/10000 [8:43:38<6:50:08, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.014895789325237274, 'learning_rate': 1.7486909026775082e-05, 'epoch': 5.54} 55%|█████▌ | 5539/10000 [8:43:38<6:50:08, 5.52s/it][2025-06-19 22:13:23,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:13:23,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2194.76 | bwd_microstep: 3373.16 | bwd_inner_microstep: 3372.11 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.33 [2025-06-19 22:13:23,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2194.76 | bwd: 3373.19 | bwd_inner: 3372.11 | bwd_allreduce: 1.01 | step: 7.34 55%|█████▌ | 5540/10000 [8:43:44<6:52:09, 5.54s/it] {'loss': 0.065, 'grad_norm': 7.782683372497559, 'learning_rate': 1.7480482988447822e-05, 'epoch': 5.54} 55%|█████▌ | 5540/10000 [8:43:44<6:52:09, 5.54s/it][2025-06-19 22:13:29,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.82 [2025-06-19 22:13:29,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2072.23 | bwd_microstep: 3250.03 | bwd_inner_microstep: 3249.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 22:13:29,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2072.24 | bwd: 3250.05 | bwd_inner: 3249.25 | bwd_allreduce: 0.75 | step: 6.66 
55%|█████▌ | 5541/10000 [8:43:49<6:47:58, 5.49s/it] {'loss': 0.0015, 'grad_norm': 0.33308878540992737, 'learning_rate': 1.7474057214406202e-05, 'epoch': 5.54} 55%|█████▌ | 5541/10000 [8:43:49<6:47:58, 5.49s/it][2025-06-19 22:13:34,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:13:34,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.25 | bwd_microstep: 3309.37 | bwd_inner_microstep: 3308.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 22:13:34,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.25 | bwd: 3309.39 | bwd_inner: 3308.57 | bwd_allreduce: 0.77 | step: 7.01 55%|█████▌ | 5542/10000 [8:43:55<6:47:02, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.01111743226647377, 'learning_rate': 1.746763170532426e-05, 'epoch': 5.54} 55%|█████▌ | 5542/10000 [8:43:55<6:47:02, 5.48s/it][2025-06-19 22:13:40,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:13:40,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.90 | bwd_microstep: 3389.92 | bwd_inner_microstep: 3389.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.91 [2025-06-19 22:13:40,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.90 | bwd: 3389.93 | bwd_inner: 3389.13 | bwd_allreduce: 0.76 | step: 6.92 55%|█████▌ | 5543/10000 [8:44:00<6:48:56, 5.51s/it] {'loss': 0.0245, 'grad_norm': 2.7802064418792725, 'learning_rate': 1.7461206461876e-05, 'epoch': 5.54} 55%|█████▌ | 5543/10000 [8:44:00<6:48:56, 5.51s/it][2025-06-19 22:13:45,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:13:45,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.95 | bwd_microstep: 3376.20 | 
bwd_inner_microstep: 3375.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.72 [2025-06-19 22:13:45,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.95 | bwd: 3376.21 | bwd_inner: 3375.38 | bwd_allreduce: 0.78 | step: 6.72 55%|█████▌ | 5544/10000 [8:44:06<6:49:39, 5.52s/it] {'loss': 0.0119, 'grad_norm': 1.579203724861145, 'learning_rate': 1.7454781484735396e-05, 'epoch': 5.54} 55%|█████▌ | 5544/10000 [8:44:06<6:49:39, 5.52s/it][2025-06-19 22:13:51,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:13:51,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.71 | bwd_microstep: 3308.10 | bwd_inner_microstep: 3307.14 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.15 [2025-06-19 22:13:51,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.71 | bwd: 3308.30 | bwd_inner: 3307.14 | bwd_allreduce: 0.93 | step: 7.15 55%|█████▌ | 5545/10000 [8:44:11<6:48:02, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.010660314001142979, 'learning_rate': 1.7448356774576404e-05, 'epoch': 5.54} 55%|█████▌ | 5545/10000 [8:44:11<6:48:02, 5.50s/it][2025-06-19 22:13:56,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 22:13:56,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.70 | bwd_microstep: 3360.55 | bwd_inner_microstep: 3359.44 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.96 [2025-06-19 22:13:56,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.70 | bwd: 3360.57 | bwd_inner: 3359.44 | bwd_allreduce: 1.08 | step: 7.97 55%|█████▌ | 5546/10000 [8:44:17<6:48:40, 5.51s/it] {'loss': 0.0125, 'grad_norm': 1.3114867210388184, 'learning_rate': 1.7441932332072942e-05, 'epoch': 5.55} 55%|█████▌ | 5546/10000 [8:44:17<6:48:40, 5.51s/it][2025-06-19 22:14:02,284] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:14:02,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.32 | bwd_microstep: 3398.09 | bwd_inner_microstep: 3397.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 22:14:02,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.32 | bwd: 3398.10 | bwd_inner: 3397.29 | bwd_allreduce: 0.77 | step: 7.00 55%|█████▌ | 5547/10000 [8:44:23<6:50:15, 5.53s/it] {'loss': 0.0045, 'grad_norm': 0.4636899530887604, 'learning_rate': 1.7435508157898903e-05, 'epoch': 5.55} 55%|█████▌ | 5547/10000 [8:44:23<6:50:15, 5.53s/it][2025-06-19 22:14:07,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:14:07,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.10 | bwd_microstep: 3390.94 | bwd_inner_microstep: 3390.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 22:14:07,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.10 | bwd: 3390.96 | bwd_inner: 3390.16 | bwd_allreduce: 0.76 | step: 6.70 55%|█████▌ | 5548/10000 [8:44:28<6:51:01, 5.54s/it] {'loss': 0.017, 'grad_norm': 2.4097654819488525, 'learning_rate': 1.742908425272816e-05, 'epoch': 5.55} 55%|█████▌ | 5548/10000 [8:44:28<6:51:01, 5.54s/it][2025-06-19 22:14:13,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:14:13,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.29 | bwd_microstep: 3320.17 | bwd_inner_microstep: 3319.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-19 22:14:13,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.29 | bwd: 3320.18 | bwd_inner: 3319.36 | bwd_allreduce: 0.78 | step: 7.28 55%|█████▌ 
| 5549/10000 [8:44:34<6:49:09, 5.52s/it] {'loss': 0.0106, 'grad_norm': 1.0751490592956543, 'learning_rate': 1.7422660617234547e-05, 'epoch': 5.55} 55%|█████▌ | 5549/10000 [8:44:34<6:49:09, 5.52s/it][2025-06-19 22:14:18,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.97 [2025-06-19 22:14:18,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.63 | bwd_microstep: 3361.41 | bwd_inner_microstep: 3360.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.19 [2025-06-19 22:14:18,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.63 | bwd: 3361.42 | bwd_inner: 3360.60 | bwd_allreduce: 0.77 | step: 7.19 56%|█████▌ | 5550/10000 [8:44:39<6:49:11, 5.52s/it] {'loss': 0.0081, 'grad_norm': 0.9499965310096741, 'learning_rate': 1.7416237252091878e-05, 'epoch': 5.55} 56%|█████▌ | 5550/10000 [8:44:39<6:49:11, 5.52s/it][2025-06-19 22:14:24,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:14:24,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.97 | bwd_microstep: 3317.23 | bwd_inner_microstep: 3316.37 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.06 [2025-06-19 22:14:24,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.97 | bwd: 3317.24 | bwd_inner: 3316.37 | bwd_allreduce: 0.82 | step: 7.07 56%|█████▌ | 5551/10000 [8:44:45<6:47:48, 5.50s/it] {'loss': 0.0023, 'grad_norm': 0.4623832404613495, 'learning_rate': 1.7409814157973924e-05, 'epoch': 5.55} 56%|█████▌ | 5551/10000 [8:44:45<6:47:48, 5.50s/it][2025-06-19 22:14:29,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:14:29,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.27 | bwd_microstep: 3362.79 | bwd_inner_microstep: 
3362.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-19 22:14:29,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.27 | bwd: 3362.80 | bwd_inner: 3362.00 | bwd_allreduce: 0.76 | step: 6.85 56%|█████▌ | 5552/10000 [8:44:50<6:48:31, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.012217904441058636, 'learning_rate': 1.7403391335554442e-05, 'epoch': 5.55} 56%|█████▌ | 5552/10000 [8:44:50<6:48:31, 5.51s/it][2025-06-19 22:14:35,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:14:35,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.80 | bwd_microstep: 3359.56 | bwd_inner_microstep: 3358.48 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.48 [2025-06-19 22:14:35,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.80 | bwd: 3359.60 | bwd_inner: 3358.48 | bwd_allreduce: 1.03 | step: 7.49 56%|█████▌ | 5553/10000 [8:44:56<6:49:02, 5.52s/it] {'loss': 0.002, 'grad_norm': 0.7124353647232056, 'learning_rate': 1.739696878550716e-05, 'epoch': 5.55} 56%|█████▌ | 5553/10000 [8:44:56<6:49:02, 5.52s/it][2025-06-19 22:14:40,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:14:40,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.65 | bwd_microstep: 3315.91 | bwd_inner_microstep: 3315.10 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-19 22:14:40,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.66 | bwd: 3315.93 | bwd_inner: 3315.10 | bwd_allreduce: 0.78 | step: 7.24 56%|█████▌ | 5554/10000 [8:45:01<6:47:35, 5.50s/it] {'loss': 0.0051, 'grad_norm': 2.060593843460083, 'learning_rate': 1.739054650850577e-05, 'epoch': 5.55} 56%|█████▌ | 5554/10000 [8:45:01<6:47:35, 5.50s/it][2025-06-19 22:14:46,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:14:46,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.91 | bwd_microstep: 3315.35 | bwd_inner_microstep: 3314.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 22:14:46,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.91 | bwd: 3315.36 | bwd_inner: 3314.56 | bwd_allreduce: 0.76 | step: 6.62 56%|█████▌ | 5555/10000 [8:45:07<6:46:22, 5.49s/it] {'loss': 0.0079, 'grad_norm': 1.381915807723999, 'learning_rate': 1.7384124505223946e-05, 'epoch': 5.55} 56%|█████▌ | 5555/10000 [8:45:07<6:46:22, 5.49s/it][2025-06-19 22:14:51,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:14:51,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.53 | bwd_microstep: 3321.76 | bwd_inner_microstep: 3320.96 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 22:14:51,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.53 | bwd: 3321.78 | bwd_inner: 3320.96 | bwd_allreduce: 0.78 | step: 7.10 56%|█████▌ | 5556/10000 [8:45:12<6:45:53, 5.48s/it] {'loss': 0.002, 'grad_norm': 0.3324698805809021, 'learning_rate': 1.7377702776335315e-05, 'epoch': 5.56} 56%|█████▌ | 5556/10000 [8:45:12<6:45:53, 5.48s/it][2025-06-19 22:14:57,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:14:57,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.31 | bwd_microstep: 3315.87 | bwd_inner_microstep: 3315.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 22:14:57,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.31 | bwd: 3315.89 | bwd_inner: 3315.08 | bwd_allreduce: 0.76 | step: 6.63 56%|█████▌ | 5557/10000 [8:45:17<6:45:17, 5.47s/it] 
{'loss': 0.0004, 'grad_norm': 0.03900980204343796, 'learning_rate': 1.7371281322513492e-05, 'epoch': 5.56} 56%|█████▌ | 5557/10000 [8:45:17<6:45:17, 5.47s/it][2025-06-19 22:15:02,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:15:02,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.43 | bwd_microstep: 3376.75 | bwd_inner_microstep: 3375.73 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.56 [2025-06-19 22:15:02,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.43 | bwd: 3376.77 | bwd_inner: 3375.73 | bwd_allreduce: 0.99 | step: 7.56 56%|█████▌ | 5558/10000 [8:45:23<6:46:47, 5.49s/it] {'loss': 0.0006, 'grad_norm': 0.11502672731876373, 'learning_rate': 1.7364860144432058e-05, 'epoch': 5.56} 56%|█████▌ | 5558/10000 [8:45:23<6:46:47, 5.49s/it][2025-06-19 22:15:08,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:15:08,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.11 | bwd_microstep: 3310.86 | bwd_inner_microstep: 3310.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 22:15:08,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.11 | bwd: 3310.87 | bwd_inner: 3310.06 | bwd_allreduce: 0.77 | step: 6.93 56%|█████▌ | 5559/10000 [8:45:28<6:45:45, 5.48s/it] {'loss': 0.0032, 'grad_norm': 0.624915599822998, 'learning_rate': 1.7358439242764565e-05, 'epoch': 5.56} 56%|█████▌ | 5559/10000 [8:45:28<6:45:45, 5.48s/it][2025-06-19 22:15:13,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:15:13,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.83 | bwd_microstep: 3329.66 | bwd_inner_microstep: 3328.51 | bwd_allreduce_microstep: 1.07 | 
step_microstep: 7.31 [2025-06-19 22:15:13,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.83 | bwd: 3329.68 | bwd_inner: 3328.51 | bwd_allreduce: 1.10 | step: 7.30 56%|█████▌ | 5560/10000 [8:45:34<6:45:28, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.03279512748122215, 'learning_rate': 1.7352018618184545e-05, 'epoch': 5.56} 56%|█████▌ | 5560/10000 [8:45:34<6:45:28, 5.48s/it][2025-06-19 22:15:19,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:15:19,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.25 | bwd_microstep: 3378.77 | bwd_inner_microstep: 3377.71 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.25 [2025-06-19 22:15:19,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.25 | bwd: 3378.79 | bwd_inner: 3377.71 | bwd_allreduce: 1.02 | step: 7.25 56%|█████▌ | 5561/10000 [8:45:40<6:47:00, 5.50s/it] {'loss': 0.0119, 'grad_norm': 1.3017278909683228, 'learning_rate': 1.734559827136547e-05, 'epoch': 5.56} 56%|█████▌ | 5561/10000 [8:45:40<6:47:00, 5.50s/it][2025-06-19 22:15:24,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:15:24,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.43 | bwd_microstep: 3379.56 | bwd_inner_microstep: 3378.73 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.35 [2025-06-19 22:15:24,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.43 | bwd: 3379.57 | bwd_inner: 3378.73 | bwd_allreduce: 0.79 | step: 7.35 56%|█████▌ | 5562/10000 [8:45:45<6:48:04, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.008772560395300388, 'learning_rate': 1.733917820298082e-05, 'epoch': 5.56} 56%|█████▌ | 5562/10000 [8:45:45<6:48:04, 5.52s/it][2025-06-19 22:15:30,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:15:30,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.24 | bwd_microstep: 3379.33 | bwd_inner_microstep: 3378.51 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.00 [2025-06-19 22:15:30,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.24 | bwd: 3379.35 | bwd_inner: 3378.51 | bwd_allreduce: 0.79 | step: 7.00 56%|█████▌ | 5563/10000 [8:45:51<6:48:45, 5.53s/it] {'loss': 0.002, 'grad_norm': 0.20002534985542297, 'learning_rate': 1.7332758413704023e-05, 'epoch': 5.56} 56%|█████▌ | 5563/10000 [8:45:51<6:48:45, 5.53s/it][2025-06-19 22:15:35,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:15:35,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.60 | bwd_microstep: 3326.41 | bwd_inner_microstep: 3325.59 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-19 22:15:35,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.60 | bwd: 3326.43 | bwd_inner: 3325.59 | bwd_allreduce: 0.79 | step: 7.13 56%|█████▌ | 5564/10000 [8:45:56<6:47:33, 5.51s/it] {'loss': 0.0094, 'grad_norm': 2.0299131870269775, 'learning_rate': 1.7326338904208496e-05, 'epoch': 5.56} 56%|█████▌ | 5564/10000 [8:45:56<6:47:33, 5.51s/it][2025-06-19 22:15:41,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:15:41,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.54 | bwd_microstep: 3324.62 | bwd_inner_microstep: 3323.81 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.96 [2025-06-19 22:15:41,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.54 | bwd: 3324.63 | bwd_inner: 3323.81 | bwd_allreduce: 0.78 | step: 6.96 56%|█████▌ | 5565/10000 [8:46:02<6:46:34, 5.50s/it] {'loss': 0.0004, 'grad_norm': 
0.09599640965461731, 'learning_rate': 1.7319919675167616e-05, 'epoch': 5.56} 56%|█████▌ | 5565/10000 [8:46:02<6:46:34, 5.50s/it][2025-06-19 22:15:46,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:15:46,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.97 | bwd_microstep: 3404.31 | bwd_inner_microstep: 3403.49 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.79 [2025-06-19 22:15:46,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.97 | bwd: 3404.33 | bwd_inner: 3403.49 | bwd_allreduce: 0.80 | step: 6.79 56%|█████▌ | 5566/10000 [8:46:07<6:48:32, 5.53s/it] {'loss': 0.0005, 'grad_norm': 0.14893685281276703, 'learning_rate': 1.7313500727254712e-05, 'epoch': 5.57} 56%|█████▌ | 5566/10000 [8:46:07<6:48:32, 5.53s/it][2025-06-19 22:15:52,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:15:52,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.19 | bwd_microstep: 3330.58 | bwd_inner_microstep: 3329.74 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.79 [2025-06-19 22:15:52,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.19 | bwd: 3330.60 | bwd_inner: 3329.74 | bwd_allreduce: 0.80 | step: 6.79 56%|█████▌ | 5567/10000 [8:46:13<6:47:30, 5.52s/it] {'loss': 0.0032, 'grad_norm': 0.6806870698928833, 'learning_rate': 1.7307082061143117e-05, 'epoch': 5.57} 56%|█████▌ | 5567/10000 [8:46:13<6:47:30, 5.52s/it][2025-06-19 22:15:57,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:15:57,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.40 | bwd_microstep: 3329.26 | bwd_inner_microstep: 3328.22 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.31 [2025-06-19 
22:15:57,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.40 | bwd: 3329.28 | bwd_inner: 3328.22 | bwd_allreduce: 1.01 | step: 8.32 56%|█████▌ | 5568/10000 [8:46:18<6:46:35, 5.50s/it] {'loss': 0.012, 'grad_norm': 2.8224313259124756, 'learning_rate': 1.7300663677506114e-05, 'epoch': 5.57} 56%|█████▌ | 5568/10000 [8:46:18<6:46:35, 5.50s/it][2025-06-19 22:16:03,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:16:03,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.91 | bwd_microstep: 3324.79 | bwd_inner_microstep: 3323.86 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.08 [2025-06-19 22:16:03,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.91 | bwd: 3324.81 | bwd_inner: 3323.86 | bwd_allreduce: 0.90 | step: 7.09 56%|█████▌ | 5569/10000 [8:46:24<6:45:57, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001500375452451408, 'learning_rate': 1.7294245577016964e-05, 'epoch': 5.57} 56%|█████▌ | 5569/10000 [8:46:24<6:45:57, 5.50s/it][2025-06-19 22:16:08,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:16:08,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.00 | bwd_microstep: 3331.31 | bwd_inner_microstep: 3330.44 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.14 [2025-06-19 22:16:08,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.00 | bwd: 3331.33 | bwd_inner: 3330.44 | bwd_allreduce: 0.82 | step: 7.14 56%|█████▌ | 5570/10000 [8:46:29<6:45:23, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.014410106465220451, 'learning_rate': 1.7287827760348898e-05, 'epoch': 5.57} 56%|█████▌ | 5570/10000 [8:46:29<6:45:23, 5.49s/it][2025-06-19 22:16:14,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | 
optimizer_step: 2.73 [2025-06-19 22:16:14,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.23 | bwd_microstep: 3372.18 | bwd_inner_microstep: 3371.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 22:16:14,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.23 | bwd: 3372.19 | bwd_inner: 3371.37 | bwd_allreduce: 0.78 | step: 7.20 56%|█████▌ | 5571/10000 [8:46:35<6:46:43, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.01678740605711937, 'learning_rate': 1.728141022817511e-05, 'epoch': 5.57} 56%|█████▌ | 5571/10000 [8:46:35<6:46:43, 5.51s/it][2025-06-19 22:16:19,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:16:19,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.99 | bwd_microstep: 3329.00 | bwd_inner_microstep: 3328.10 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.94 [2025-06-19 22:16:19,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.99 | bwd: 3329.01 | bwd_inner: 3328.10 | bwd_allreduce: 0.87 | step: 6.95 56%|█████▌ | 5572/10000 [8:46:40<6:46:05, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.013042467646300793, 'learning_rate': 1.727499298116877e-05, 'epoch': 5.57} 56%|█████▌ | 5572/10000 [8:46:40<6:46:05, 5.50s/it][2025-06-19 22:16:25,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:16:25,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.18 | bwd_microstep: 3373.10 | bwd_inner_microstep: 3372.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 22:16:25,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.18 | bwd: 3373.12 | bwd_inner: 3372.29 | bwd_allreduce: 0.78 | step: 6.97 56%|█████▌ | 5573/10000 [8:46:46<6:46:51, 5.51s/it] {'loss': 0.0013, 'grad_norm': 0.1381506472826004, 
'learning_rate': 1.726857602000302e-05, 'epoch': 5.57} 56%|█████▌ | 5573/10000 [8:46:46<6:46:51, 5.51s/it][2025-06-19 22:16:30,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:16:30,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.59 | bwd_microstep: 3379.55 | bwd_inner_microstep: 3378.64 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.25 [2025-06-19 22:16:30,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.59 | bwd: 3379.56 | bwd_inner: 3378.64 | bwd_allreduce: 0.88 | step: 7.25 56%|█████▌ | 5574/10000 [8:46:51<6:47:35, 5.53s/it] {'loss': 0.012, 'grad_norm': 2.0851502418518066, 'learning_rate': 1.7262159345350965e-05, 'epoch': 5.57} 56%|█████▌ | 5574/10000 [8:46:51<6:47:35, 5.53s/it][2025-06-19 22:16:36,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:16:36,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.45 | bwd_microstep: 3325.82 | bwd_inner_microstep: 3324.96 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.98 [2025-06-19 22:16:36,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.45 | bwd: 3325.83 | bwd_inner: 3324.96 | bwd_allreduce: 0.83 | step: 6.99 56%|█████▌ | 5575/10000 [8:46:57<6:46:28, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.004593926947563887, 'learning_rate': 1.72557429578857e-05, 'epoch': 5.58} 56%|█████▌ | 5575/10000 [8:46:57<6:46:28, 5.51s/it][2025-06-19 22:16:41,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:16:41,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.68 | bwd_microstep: 3326.61 | bwd_inner_microstep: 3325.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 22:16:41,878] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.68 | bwd: 3326.62 | bwd_inner: 3325.81 | bwd_allreduce: 0.77 | step: 6.80 56%|█████▌ | 5576/10000 [8:47:02<6:45:34, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.023921754211187363, 'learning_rate': 1.7249326858280256e-05, 'epoch': 5.58} 56%|█████▌ | 5576/10000 [8:47:02<6:45:34, 5.50s/it][2025-06-19 22:16:47,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:16:47,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.76 | bwd_microstep: 3328.32 | bwd_inner_microstep: 3327.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.32 [2025-06-19 22:16:47,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.76 | bwd: 3328.34 | bwd_inner: 3327.51 | bwd_allreduce: 0.78 | step: 7.32 56%|█████▌ | 5577/10000 [8:47:08<6:45:04, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.02503708377480507, 'learning_rate': 1.7242911047207654e-05, 'epoch': 5.58} 56%|█████▌ | 5577/10000 [8:47:08<6:45:04, 5.49s/it][2025-06-19 22:16:52,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:16:52,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.33 | bwd_microstep: 3377.60 | bwd_inner_microstep: 3376.79 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.05 [2025-06-19 22:16:52,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.33 | bwd: 3377.62 | bwd_inner: 3376.79 | bwd_allreduce: 0.78 | step: 7.05 56%|█████▌ | 5578/10000 [8:47:13<6:46:11, 5.51s/it] {'loss': 0.0171, 'grad_norm': 1.7922593355178833, 'learning_rate': 1.723649552534089e-05, 'epoch': 5.58} 56%|█████▌ | 5578/10000 [8:47:13<6:46:11, 5.51s/it][2025-06-19 22:16:58,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
22:16:58,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.04 | bwd_microstep: 3383.90 | bwd_inner_microstep: 3383.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 22:16:58,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.04 | bwd: 3383.92 | bwd_inner: 3383.11 | bwd_allreduce: 0.77 | step: 6.99 56%|█████▌ | 5579/10000 [8:47:19<6:47:01, 5.52s/it] {'loss': 0.0006, 'grad_norm': 0.05865911394357681, 'learning_rate': 1.723008029335292e-05, 'epoch': 5.58} 56%|█████▌ | 5579/10000 [8:47:19<6:47:01, 5.52s/it][2025-06-19 22:17:04,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:17:04,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.02 | bwd_microstep: 3375.41 | bwd_inner_microstep: 3374.40 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.17 [2025-06-19 22:17:04,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.02 | bwd: 3375.42 | bwd_inner: 3374.40 | bwd_allreduce: 0.97 | step: 7.17 56%|█████▌ | 5580/10000 [8:47:24<6:47:38, 5.53s/it] {'loss': 0.0175, 'grad_norm': 2.594327688217163, 'learning_rate': 1.722366535191667e-05, 'epoch': 5.58} 56%|█████▌ | 5580/10000 [8:47:24<6:47:38, 5.53s/it][2025-06-19 22:17:09,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:17:09,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.94 | bwd_microstep: 3382.72 | bwd_inner_microstep: 3381.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 22:17:09,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.94 | bwd: 3382.74 | bwd_inner: 3381.91 | bwd_allreduce: 0.78 | step: 7.20 56%|█████▌ | 5581/10000 [8:47:30<6:48:13, 5.54s/it] {'loss': 0.0119, 'grad_norm': 2.9415295124053955, 'learning_rate': 1.721725070170504e-05, 
'epoch': 5.58} 56%|█████▌ | 5581/10000 [8:47:30<6:48:13, 5.54s/it][2025-06-19 22:17:15,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:17:15,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.96 | bwd_microstep: 3326.67 | bwd_inner_microstep: 3325.88 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 22:17:15,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.96 | bwd: 3326.69 | bwd_inner: 3325.88 | bwd_allreduce: 0.77 | step: 6.78 56%|█████▌ | 5582/10000 [8:47:35<6:46:35, 5.52s/it] {'loss': 0.0245, 'grad_norm': 2.747903823852539, 'learning_rate': 1.7210836343390894e-05, 'epoch': 5.58} 56%|█████▌ | 5582/10000 [8:47:35<6:46:35, 5.52s/it][2025-06-19 22:17:20,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:17:20,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.52 | bwd_microstep: 3381.43 | bwd_inner_microstep: 3380.59 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.87 [2025-06-19 22:17:20,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.52 | bwd: 3381.45 | bwd_inner: 3380.59 | bwd_allreduce: 0.80 | step: 6.88 56%|█████▌ | 5583/10000 [8:47:41<6:47:12, 5.53s/it] {'loss': 0.0006, 'grad_norm': 0.08161282539367676, 'learning_rate': 1.7204422277647074e-05, 'epoch': 5.58} 56%|█████▌ | 5583/10000 [8:47:41<6:47:12, 5.53s/it][2025-06-19 22:17:26,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:17:26,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.97 | bwd_microstep: 3377.87 | bwd_inner_microstep: 3377.02 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.19 [2025-06-19 22:17:26,155] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2128.97 | bwd: 3377.89 | bwd_inner: 3377.02 | bwd_allreduce: 0.81 | step: 7.19 56%|█████▌ | 5584/10000 [8:47:46<6:47:29, 5.54s/it] {'loss': 0.0178, 'grad_norm': 3.4436635971069336, 'learning_rate': 1.7198008505146383e-05, 'epoch': 5.58} 56%|█████▌ | 5584/10000 [8:47:46<6:47:29, 5.54s/it][2025-06-19 22:17:31,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:17:31,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.51 | bwd_microstep: 3326.18 | bwd_inner_microstep: 3325.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 22:17:31,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.51 | bwd: 3326.20 | bwd_inner: 3325.39 | bwd_allreduce: 0.76 | step: 6.87 56%|█████▌ | 5585/10000 [8:47:52<6:46:14, 5.52s/it] {'loss': 0.0018, 'grad_norm': 0.2955440282821655, 'learning_rate': 1.7191595026561597e-05, 'epoch': 5.58} 56%|█████▌ | 5585/10000 [8:47:52<6:46:14, 5.52s/it][2025-06-19 22:17:37,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:17:37,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.56 | bwd_microstep: 3378.53 | bwd_inner_microstep: 3377.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-19 22:17:37,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.56 | bwd: 3378.54 | bwd_inner: 3377.72 | bwd_allreduce: 0.77 | step: 7.08 56%|█████▌ | 5586/10000 [8:47:57<6:46:47, 5.53s/it] {'loss': 0.014, 'grad_norm': 3.0428125858306885, 'learning_rate': 1.7185181842565456e-05, 'epoch': 5.59} 56%|█████▌ | 5586/10000 [8:47:57<6:46:47, 5.53s/it][2025-06-19 22:17:42,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:17:42,669] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.37 | bwd_microstep: 3325.26 | bwd_inner_microstep: 3324.15 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.30 [2025-06-19 22:17:42,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.37 | bwd: 3325.27 | bwd_inner: 3324.15 | bwd_allreduce: 1.07 | step: 7.30 56%|█████▌ | 5587/10000 [8:48:03<6:45:32, 5.51s/it] {'loss': 0.0051, 'grad_norm': 1.0519269704818726, 'learning_rate': 1.7178768953830674e-05, 'epoch': 5.59} 56%|█████▌ | 5587/10000 [8:48:03<6:45:32, 5.51s/it][2025-06-19 22:17:48,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 22:17:48,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.05 | bwd_microstep: 3367.52 | bwd_inner_microstep: 3366.42 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.59 [2025-06-19 22:17:48,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.05 | bwd: 3367.54 | bwd_inner: 3366.42 | bwd_allreduce: 1.06 | step: 7.59 56%|█████▌ | 5588/10000 [8:48:09<6:46:07, 5.52s/it] {'loss': 0.0175, 'grad_norm': 2.3176729679107666, 'learning_rate': 1.7172356361029936e-05, 'epoch': 5.59} 56%|█████▌ | 5588/10000 [8:48:09<6:46:07, 5.52s/it][2025-06-19 22:17:53,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-19 22:17:53,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.48 | bwd_microstep: 3383.90 | bwd_inner_microstep: 3383.09 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 22:17:53,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.48 | bwd: 3383.91 | bwd_inner: 3383.09 | bwd_allreduce: 0.78 | step: 7.21 56%|█████▌ | 5589/10000 [8:48:14<6:46:58, 5.54s/it] {'loss': 0.0008, 'grad_norm': 0.08951491117477417, 'learning_rate': 1.716594406483589e-05, 'epoch': 5.59} 
56%|█████▌ | 5589/10000 [8:48:14<6:46:58, 5.54s/it][2025-06-19 22:17:59,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:17:59,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.56 | bwd_microstep: 3330.14 | bwd_inner_microstep: 3329.21 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.00 [2025-06-19 22:17:59,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.56 | bwd: 3330.15 | bwd_inner: 3329.21 | bwd_allreduce: 0.89 | step: 7.01 56%|█████▌ | 5590/10000 [8:48:20<6:45:44, 5.52s/it] {'loss': 0.0174, 'grad_norm': 1.8838549852371216, 'learning_rate': 1.7159532065921164e-05, 'epoch': 5.59} 56%|█████▌ | 5590/10000 [8:48:20<6:45:44, 5.52s/it][2025-06-19 22:18:04,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:18:04,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.90 | bwd_microstep: 3327.69 | bwd_inner_microstep: 3326.71 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.36 [2025-06-19 22:18:04,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.90 | bwd: 3327.71 | bwd_inner: 3326.71 | bwd_allreduce: 0.95 | step: 7.37 56%|█████▌ | 5591/10000 [8:48:25<6:44:50, 5.51s/it] {'loss': 0.0036, 'grad_norm': 0.6321669220924377, 'learning_rate': 1.7153120364958336e-05, 'epoch': 5.59} 56%|█████▌ | 5591/10000 [8:48:25<6:44:50, 5.51s/it][2025-06-19 22:18:10,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:18:10,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.79 | bwd_microstep: 3384.09 | bwd_inner_microstep: 3383.28 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.04 [2025-06-19 22:18:10,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2133.79 | bwd: 3384.10 | bwd_inner: 3383.28 | bwd_allreduce: 0.78 | step: 7.05 56%|█████▌ | 5592/10000 [8:48:31<6:45:48, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.024608200415968895, 'learning_rate': 1.7146708962619973e-05, 'epoch': 5.59} 56%|█████▌ | 5592/10000 [8:48:31<6:45:48, 5.52s/it][2025-06-19 22:18:15,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:18:15,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.14 | bwd_microstep: 3318.53 | bwd_inner_microstep: 3317.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 22:18:15,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.14 | bwd: 3318.54 | bwd_inner: 3317.75 | bwd_allreduce: 0.75 | step: 6.54 56%|█████▌ | 5593/10000 [8:48:36<6:44:19, 5.50s/it] {'loss': 0.0016, 'grad_norm': 0.2277337908744812, 'learning_rate': 1.7140297859578594e-05, 'epoch': 5.59} 56%|█████▌ | 5593/10000 [8:48:36<6:44:19, 5.50s/it][2025-06-19 22:18:21,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.80 [2025-06-19 22:18:21,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.37 | bwd_microstep: 3380.20 | bwd_inner_microstep: 3379.21 | bwd_allreduce_microstep: 0.94 | step_microstep: 8.41 [2025-06-19 22:18:21,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.38 | bwd: 3380.22 | bwd_inner: 3379.21 | bwd_allreduce: 0.96 | step: 8.42 56%|█████▌ | 5594/10000 [8:48:42<6:45:24, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.005987565964460373, 'learning_rate': 1.7133887056506696e-05, 'epoch': 5.59} 56%|█████▌ | 5594/10000 [8:48:42<6:45:24, 5.52s/it][2025-06-19 22:18:26,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 22:18:26,873] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2130.85 | bwd_microstep: 3379.31 | bwd_inner_microstep: 3378.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 22:18:26,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.85 | bwd: 3379.32 | bwd_inner: 3378.53 | bwd_allreduce: 0.75 | step: 6.54 56%|█████▌ | 5595/10000 [8:48:47<6:45:56, 5.53s/it] {'loss': 0.0433, 'grad_norm': 5.189009189605713, 'learning_rate': 1.7127476554076752e-05, 'epoch': 5.59} 56%|█████▌ | 5595/10000 [8:48:47<6:45:56, 5.53s/it][2025-06-19 22:18:32,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:18:32,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.53 | bwd_microstep: 3370.27 | bwd_inner_microstep: 3369.46 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 22:18:32,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.53 | bwd: 3370.29 | bwd_inner: 3369.46 | bwd_allreduce: 0.78 | step: 7.00 56%|█████▌ | 5596/10000 [8:48:53<6:45:59, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.009822983294725418, 'learning_rate': 1.7121066352961175e-05, 'epoch': 5.6} 56%|█████▌ | 5596/10000 [8:48:53<6:45:59, 5.53s/it][2025-06-19 22:18:37,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:18:37,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.58 | bwd_microstep: 3329.34 | bwd_inner_microstep: 3328.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 22:18:37,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.58 | bwd: 3329.36 | bwd_inner: 3328.55 | bwd_allreduce: 0.76 | step: 6.66 56%|█████▌ | 5597/10000 [8:48:58<6:44:44, 5.52s/it] {'loss': 0.1055, 'grad_norm': 5.858606815338135, 'learning_rate': 1.7114656453832383e-05, 'epoch': 5.6} 56%|█████▌ | 5597/10000 
[8:48:58<6:44:44, 5.52s/it][2025-06-19 22:18:43,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:18:43,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.07 | bwd_microstep: 3321.28 | bwd_inner_microstep: 3320.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-19 22:18:43,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.07 | bwd: 3321.30 | bwd_inner: 3320.49 | bwd_allreduce: 0.77 | step: 7.09 56%|█████▌ | 5598/10000 [8:49:04<6:43:38, 5.50s/it] {'loss': 0.012, 'grad_norm': 2.04995059967041, 'learning_rate': 1.7108246857362733e-05, 'epoch': 5.6} 56%|█████▌ | 5598/10000 [8:49:04<6:43:38, 5.50s/it][2025-06-19 22:18:48,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 22:18:48,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.32 | bwd_microstep: 3320.87 | bwd_inner_microstep: 3319.81 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.50 [2025-06-19 22:18:48,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.32 | bwd: 3320.89 | bwd_inner: 3319.81 | bwd_allreduce: 1.02 | step: 7.50 56%|█████▌ | 5599/10000 [8:49:09<6:43:02, 5.49s/it] {'loss': 0.1129, 'grad_norm': 4.0723958015441895, 'learning_rate': 1.7101837564224568e-05, 'epoch': 5.6} 56%|█████▌ | 5599/10000 [8:49:09<6:43:02, 5.49s/it][2025-06-19 22:18:54,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:18:54,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.59 | bwd_microstep: 3322.94 | bwd_inner_microstep: 3321.96 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.10 [2025-06-19 22:18:54,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.59 | bwd: 3322.95 | 
bwd_inner: 3321.96 | bwd_allreduce: 0.95 | step: 7.10 56%|█████▌ | 5600/10000 [8:49:15<6:42:26, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.023300793021917343, 'learning_rate': 1.70954285750902e-05, 'epoch': 5.6} 56%|█████▌ | 5600/10000 [8:49:15<6:42:26, 5.49s/it][2025-06-19 22:18:59,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:18:59,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.03 | bwd_microstep: 3319.87 | bwd_inner_microstep: 3319.05 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.20 [2025-06-19 22:18:59,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.03 | bwd: 3319.88 | bwd_inner: 3319.05 | bwd_allreduce: 0.79 | step: 7.20 56%|█████▌ | 5601/10000 [8:49:20<6:41:50, 5.48s/it] {'loss': 0.0329, 'grad_norm': 4.896517753601074, 'learning_rate': 1.7089019890631885e-05, 'epoch': 5.6} 56%|█████▌ | 5601/10000 [8:49:20<6:41:50, 5.48s/it][2025-06-19 22:19:05,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:19:05,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.03 | bwd_microstep: 3322.10 | bwd_inner_microstep: 3321.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 22:19:05,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.03 | bwd: 3322.12 | bwd_inner: 3321.32 | bwd_allreduce: 0.75 | step: 6.60 56%|█████▌ | 5602/10000 [8:49:26<6:41:24, 5.48s/it] {'loss': 0.0103, 'grad_norm': 2.3039824962615967, 'learning_rate': 1.7082611511521877e-05, 'epoch': 5.6} 56%|█████▌ | 5602/10000 [8:49:26<6:41:24, 5.48s/it][2025-06-19 22:19:10,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:19:10,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2107.16 | bwd_microstep: 3320.99 | bwd_inner_microstep: 3320.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 22:19:10,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.16 | bwd: 3321.01 | bwd_inner: 3320.20 | bwd_allreduce: 0.76 | step: 6.67 56%|█████▌ | 5603/10000 [8:49:31<6:41:03, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.008923976682126522, 'learning_rate': 1.7076203438432377e-05, 'epoch': 5.6} 56%|█████▌ | 5603/10000 [8:49:31<6:41:03, 5.47s/it][2025-06-19 22:19:16,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:19:16,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.85 | bwd_microstep: 3378.76 | bwd_inner_microstep: 3377.76 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.81 [2025-06-19 22:19:16,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.85 | bwd: 3378.78 | bwd_inner: 3377.76 | bwd_allreduce: 0.97 | step: 7.82 56%|█████▌ | 5604/10000 [8:49:37<6:42:42, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.028569534420967102, 'learning_rate': 1.7069795672035568e-05, 'epoch': 5.6} 56%|█████▌ | 5604/10000 [8:49:37<6:42:42, 5.50s/it][2025-06-19 22:19:21,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:19:21,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.91 | bwd_microstep: 3319.06 | bwd_inner_microstep: 3318.18 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.06 [2025-06-19 22:19:21,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.91 | bwd: 3319.08 | bwd_inner: 3318.18 | bwd_allreduce: 0.85 | step: 7.06 56%|█████▌ | 5605/10000 [8:49:42<6:41:49, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.04709114879369736, 'learning_rate': 1.7063388213003593e-05, 'epoch': 5.61} 56%|█████▌ | 5605/10000 [8:49:42<6:41:49, 5.49s/it][2025-06-19 
22:19:27,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:19:27,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.73 | bwd_microstep: 3320.34 | bwd_inner_microstep: 3319.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 22:19:27,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.73 | bwd: 3320.35 | bwd_inner: 3319.55 | bwd_allreduce: 0.76 | step: 6.63 56%|█████▌ | 5606/10000 [8:49:47<6:41:05, 5.48s/it] {'loss': 0.0121, 'grad_norm': 2.2614541053771973, 'learning_rate': 1.705698106200857e-05, 'epoch': 5.61} 56%|█████▌ | 5606/10000 [8:49:47<6:41:05, 5.48s/it][2025-06-19 22:19:32,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:19:32,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.74 | bwd_microstep: 3369.53 | bwd_inner_microstep: 3368.57 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.66 [2025-06-19 22:19:32,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.74 | bwd: 3369.55 | bwd_inner: 3368.57 | bwd_allreduce: 0.93 | step: 7.67 56%|█████▌ | 5607/10000 [8:49:53<6:42:22, 5.50s/it] {'loss': 0.0174, 'grad_norm': 2.1301777362823486, 'learning_rate': 1.705057421972257e-05, 'epoch': 5.61} 56%|█████▌ | 5607/10000 [8:49:53<6:42:22, 5.50s/it][2025-06-19 22:19:38,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:19:38,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.01 | bwd_microstep: 3313.87 | bwd_inner_microstep: 3313.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 22:19:38,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.01 | bwd: 3313.88 | bwd_inner: 3313.08 | bwd_allreduce: 0.76 | 
step: 6.67 56%|█████▌ | 5608/10000 [8:49:58<6:41:32, 5.49s/it] {'loss': 0.0119, 'grad_norm': 0.9683175086975098, 'learning_rate': 1.7044167686817646e-05, 'epoch': 5.61} 56%|█████▌ | 5608/10000 [8:49:58<6:41:32, 5.49s/it][2025-06-19 22:19:43,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:19:43,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.99 | bwd_microstep: 3314.99 | bwd_inner_microstep: 3314.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 22:19:43,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.99 | bwd: 3315.00 | bwd_inner: 3314.19 | bwd_allreduce: 0.77 | step: 6.94 56%|█████▌ | 5609/10000 [8:50:04<6:40:55, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0015866183675825596, 'learning_rate': 1.7037761463965815e-05, 'epoch': 5.61} 56%|█████▌ | 5609/10000 [8:50:04<6:40:55, 5.48s/it][2025-06-19 22:19:49,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:19:49,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.28 | bwd_microstep: 3309.94 | bwd_inner_microstep: 3309.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 22:19:49,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.28 | bwd: 3309.95 | bwd_inner: 3309.13 | bwd_allreduce: 0.78 | step: 7.00 56%|█████▌ | 5610/10000 [8:50:09<6:40:17, 5.47s/it] {'loss': 0.0005, 'grad_norm': 0.10107942670583725, 'learning_rate': 1.7031355551839056e-05, 'epoch': 5.61} 56%|█████▌ | 5610/10000 [8:50:09<6:40:17, 5.47s/it][2025-06-19 22:19:54,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:19:54,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.66 | bwd_microstep: 3366.15 | 
bwd_inner_microstep: 3365.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 22:19:54,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.66 | bwd: 3366.16 | bwd_inner: 3365.35 | bwd_allreduce: 0.76 | step: 6.71 56%|█████▌ | 5611/10000 [8:50:15<6:41:45, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.041032057255506516, 'learning_rate': 1.702494995110933e-05, 'epoch': 5.61} 56%|█████▌ | 5611/10000 [8:50:15<6:41:45, 5.49s/it][2025-06-19 22:20:00,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:20:00,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.43 | bwd_microstep: 3316.54 | bwd_inner_microstep: 3315.74 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-19 22:20:00,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.44 | bwd: 3316.55 | bwd_inner: 3315.74 | bwd_allreduce: 0.78 | step: 6.77 56%|█████▌ | 5612/10000 [8:50:20<6:40:56, 5.48s/it] {'loss': 0.012, 'grad_norm': 2.8865246772766113, 'learning_rate': 1.701854466244854e-05, 'epoch': 5.61} 56%|█████▌ | 5612/10000 [8:50:20<6:40:56, 5.48s/it][2025-06-19 22:20:05,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:20:05,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.54 | bwd_microstep: 3372.78 | bwd_inner_microstep: 3371.82 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.45 [2025-06-19 22:20:05,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.54 | bwd: 3372.80 | bwd_inner: 3371.82 | bwd_allreduce: 0.93 | step: 7.45 56%|█████▌ | 5613/10000 [8:50:26<6:42:06, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.3876722753047943, 'learning_rate': 1.7012139686528575e-05, 'epoch': 5.61} 56%|█████▌ | 5613/10000 [8:50:26<6:42:06, 5.50s/it][2025-06-19 22:20:11,095] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:20:11,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.70 | bwd_microstep: 3321.20 | bwd_inner_microstep: 3320.25 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.48 [2025-06-19 22:20:11,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.70 | bwd: 3321.21 | bwd_inner: 3320.25 | bwd_allreduce: 0.91 | step: 7.48 56%|█████▌ | 5614/10000 [8:50:31<6:41:22, 5.49s/it] {'loss': 0.0006, 'grad_norm': 0.06481316685676575, 'learning_rate': 1.700573502402129e-05, 'epoch': 5.61} 56%|█████▌ | 5614/10000 [8:50:31<6:41:22, 5.49s/it][2025-06-19 22:20:16,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:20:16,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.83 | bwd_microstep: 3317.55 | bwd_inner_microstep: 3316.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-19 22:20:16,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.83 | bwd: 3317.56 | bwd_inner: 3316.75 | bwd_allreduce: 0.77 | step: 7.08 56%|█████▌ | 5615/10000 [8:50:37<6:40:46, 5.48s/it] {'loss': 0.0009, 'grad_norm': 0.1656164973974228, 'learning_rate': 1.6999330675598507e-05, 'epoch': 5.62} 56%|█████▌ | 5615/10000 [8:50:37<6:40:46, 5.48s/it][2025-06-19 22:20:22,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:20:22,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.21 | bwd_microstep: 3309.34 | bwd_inner_microstep: 3308.43 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.74 [2025-06-19 22:20:22,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.21 | bwd: 3309.36 | bwd_inner: 3308.43 | bwd_allreduce: 0.88 | step: 7.75 
56%|█████▌ | 5616/10000 [8:50:42<6:40:01, 5.47s/it] {'loss': 0.0006, 'grad_norm': 0.1661982536315918, 'learning_rate': 1.699292664193201e-05, 'epoch': 5.62} 56%|█████▌ | 5616/10000 [8:50:42<6:40:01, 5.47s/it][2025-06-19 22:20:27,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:20:27,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.81 | bwd_microstep: 3317.81 | bwd_inner_microstep: 3317.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 22:20:27,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.81 | bwd: 3317.83 | bwd_inner: 3317.02 | bwd_allreduce: 0.76 | step: 6.84 56%|█████▌ | 5617/10000 [8:50:48<6:39:43, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.020516827702522278, 'learning_rate': 1.698652292369355e-05, 'epoch': 5.62} 56%|█████▌ | 5617/10000 [8:50:48<6:39:43, 5.47s/it][2025-06-19 22:20:32,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:20:32,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.90 | bwd_microstep: 3320.16 | bwd_inner_microstep: 3319.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.16 [2025-06-19 22:20:32,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.90 | bwd: 3320.17 | bwd_inner: 3319.35 | bwd_allreduce: 0.77 | step: 7.16 56%|█████▌ | 5618/10000 [8:50:53<6:39:31, 5.47s/it] {'loss': 0.0004, 'grad_norm': 0.05638584494590759, 'learning_rate': 1.698011952155485e-05, 'epoch': 5.62} 56%|█████▌ | 5618/10000 [8:50:53<6:39:31, 5.47s/it][2025-06-19 22:20:38,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:20:38,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.23 | bwd_microstep: 3362.94 | 
bwd_inner_microstep: 3362.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 22:20:38,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.23 | bwd: 3362.95 | bwd_inner: 3362.15 | bwd_allreduce: 0.76 | step: 6.69 56%|█████▌ | 5619/10000 [8:50:59<6:40:41, 5.49s/it] {'loss': 0.0054, 'grad_norm': 0.7998835444450378, 'learning_rate': 1.6973716436187595e-05, 'epoch': 5.62} 56%|█████▌ | 5619/10000 [8:50:59<6:40:41, 5.49s/it][2025-06-19 22:20:44,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:20:44,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.62 | bwd_microstep: 3363.83 | bwd_inner_microstep: 3362.85 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.40 [2025-06-19 22:20:44,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.62 | bwd: 3363.84 | bwd_inner: 3362.85 | bwd_allreduce: 0.95 | step: 7.40 56%|█████▌ | 5620/10000 [8:51:04<6:41:37, 5.50s/it] {'loss': 0.0013, 'grad_norm': 0.24846354126930237, 'learning_rate': 1.696731366826344e-05, 'epoch': 5.62} 56%|█████▌ | 5620/10000 [8:51:04<6:41:37, 5.50s/it][2025-06-19 22:20:49,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:20:49,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.10 | bwd_microstep: 3315.88 | bwd_inner_microstep: 3315.07 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 22:20:49,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.10 | bwd: 3315.89 | bwd_inner: 3315.07 | bwd_allreduce: 0.77 | step: 6.95 56%|█████▌ | 5621/10000 [8:51:10<6:40:46, 5.49s/it] {'loss': 0.0126, 'grad_norm': 0.8969090580940247, 'learning_rate': 1.6960911218454018e-05, 'epoch': 5.62} 56%|█████▌ | 5621/10000 [8:51:10<6:40:46, 5.49s/it][2025-06-19 22:20:54,946] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:20:54,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.89 | bwd_microstep: 3320.04 | bwd_inner_microstep: 3318.89 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.65 [2025-06-19 22:20:54,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.89 | bwd: 3320.06 | bwd_inner: 3318.89 | bwd_allreduce: 1.11 | step: 7.65 56%|█████▌ | 5622/10000 [8:51:15<6:40:12, 5.48s/it] {'loss': 0.0922, 'grad_norm': 8.828432083129883, 'learning_rate': 1.6954509087430893e-05, 'epoch': 5.62} 56%|█████▌ | 5622/10000 [8:51:15<6:40:12, 5.48s/it][2025-06-19 22:21:00,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:21:00,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.50 | bwd_microstep: 3368.12 | bwd_inner_microstep: 3367.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 22:21:00,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.50 | bwd: 3368.13 | bwd_inner: 3367.31 | bwd_allreduce: 0.77 | step: 6.95 56%|█████▌ | 5623/10000 [8:51:21<6:41:31, 5.50s/it] {'loss': 0.0014, 'grad_norm': 0.16797591745853424, 'learning_rate': 1.694810727586563e-05, 'epoch': 5.62} 56%|█████▌ | 5623/10000 [8:51:21<6:41:31, 5.50s/it][2025-06-19 22:21:05,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:21:05,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.62 | bwd_microstep: 3317.32 | bwd_inner_microstep: 3316.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 22:21:05,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.62 | bwd: 3317.33 | bwd_inner: 3316.52 | bwd_allreduce: 0.76 | step: 6.85 56%|█████▌ 
| 5624/10000 [8:51:26<6:40:22, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.40626809000968933, 'learning_rate': 1.694170578442975e-05, 'epoch': 5.62} 56%|█████▌ | 5624/10000 [8:51:26<6:40:22, 5.49s/it][2025-06-19 22:21:11,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:21:11,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.93 | bwd_microstep: 3309.59 | bwd_inner_microstep: 3308.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 22:21:11,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.93 | bwd: 3309.60 | bwd_inner: 3308.79 | bwd_allreduce: 0.77 | step: 6.83 56%|█████▋ | 5625/10000 [8:51:32<6:39:37, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.05658772587776184, 'learning_rate': 1.693530461379474e-05, 'epoch': 5.62} 56%|█████▋ | 5625/10000 [8:51:32<6:39:37, 5.48s/it][2025-06-19 22:21:16,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:21:16,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.35 | bwd_microstep: 3320.07 | bwd_inner_microstep: 3319.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 22:21:16,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.35 | bwd: 3320.08 | bwd_inner: 3319.28 | bwd_allreduce: 0.76 | step: 6.65 56%|█████▋ | 5626/10000 [8:51:37<6:39:17, 5.48s/it] {'loss': 0.0063, 'grad_norm': 1.062200903892517, 'learning_rate': 1.6928903764632055e-05, 'epoch': 5.63} 56%|█████▋ | 5626/10000 [8:51:37<6:39:17, 5.48s/it][2025-06-19 22:21:22,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:21:22,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.39 | bwd_microstep: 3368.09 | bwd_inner_microstep: 
3367.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 22:21:22,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.39 | bwd: 3368.10 | bwd_inner: 3367.29 | bwd_allreduce: 0.77 | step: 6.77 56%|█████▋ | 5627/10000 [8:51:43<6:40:21, 5.49s/it] {'loss': 0.0201, 'grad_norm': 4.292270660400391, 'learning_rate': 1.692250323761311e-05, 'epoch': 5.63} 56%|█████▋ | 5627/10000 [8:51:43<6:40:21, 5.49s/it][2025-06-19 22:21:27,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.78 [2025-06-19 22:21:27,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.27 | bwd_microstep: 3309.93 | bwd_inner_microstep: 3309.09 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.49 [2025-06-19 22:21:27,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.27 | bwd: 3309.95 | bwd_inner: 3309.09 | bwd_allreduce: 0.81 | step: 7.50 56%|█████▋ | 5628/10000 [8:51:48<6:39:31, 5.48s/it] {'loss': 0.0026, 'grad_norm': 0.4355924725532532, 'learning_rate': 1.6916103033409287e-05, 'epoch': 5.63} 56%|█████▋ | 5628/10000 [8:51:48<6:39:31, 5.48s/it][2025-06-19 22:21:33,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:21:33,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.11 | bwd_microstep: 3324.01 | bwd_inner_microstep: 3323.09 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.69 [2025-06-19 22:21:33,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.11 | bwd: 3324.03 | bwd_inner: 3323.09 | bwd_allreduce: 0.89 | step: 7.69 56%|█████▋ | 5629/10000 [8:51:54<6:39:00, 5.48s/it] {'loss': 0.0901, 'grad_norm': 4.034698009490967, 'learning_rate': 1.690970315269195e-05, 'epoch': 5.63} 56%|█████▋ | 5629/10000 [8:51:54<6:39:00, 5.48s/it][2025-06-19 22:21:38,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.70 | optimizer_step: 2.73 [2025-06-19 22:21:38,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.02 | bwd_microstep: 3319.21 | bwd_inner_microstep: 3318.25 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.04 [2025-06-19 22:21:38,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.02 | bwd: 3319.23 | bwd_inner: 3318.25 | bwd_allreduce: 0.93 | step: 7.05 56%|█████▋ | 5630/10000 [8:51:59<6:38:39, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.030871717259287834, 'learning_rate': 1.6903303596132406e-05, 'epoch': 5.63} 56%|█████▋ | 5630/10000 [8:51:59<6:38:39, 5.47s/it][2025-06-19 22:21:44,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.73 [2025-06-19 22:21:44,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.64 | bwd_microstep: 3328.81 | bwd_inner_microstep: 3328.00 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.51 [2025-06-19 22:21:44,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.64 | bwd: 3328.83 | bwd_inner: 3328.00 | bwd_allreduce: 0.78 | step: 7.50 56%|█████▋ | 5631/10000 [8:52:05<6:38:44, 5.48s/it] {'loss': 0.0051, 'grad_norm': 0.842634916305542, 'learning_rate': 1.6896904364401948e-05, 'epoch': 5.63} 56%|█████▋ | 5631/10000 [8:52:05<6:38:44, 5.48s/it][2025-06-19 22:21:49,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:21:49,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.88 | bwd_microstep: 3368.13 | bwd_inner_microstep: 3367.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 22:21:49,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.88 | bwd: 3368.15 | bwd_inner: 3367.33 | bwd_allreduce: 0.77 | step: 6.83 56%|█████▋ | 5632/10000 [8:52:10<6:39:48, 5.49s/it] 
{'loss': 0.0005, 'grad_norm': 0.08035524934530258, 'learning_rate': 1.6890505458171814e-05, 'epoch': 5.63} 56%|█████▋ | 5632/10000 [8:52:10<6:39:48, 5.49s/it][2025-06-19 22:21:55,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:21:55,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.07 | bwd_microstep: 3313.40 | bwd_inner_microstep: 3312.59 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 22:21:55,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.08 | bwd: 3313.42 | bwd_inner: 3312.59 | bwd_allreduce: 0.78 | step: 7.06 56%|█████▋ | 5633/10000 [8:52:16<6:38:49, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.00727501604706049, 'learning_rate': 1.6884106878113225e-05, 'epoch': 5.63} 56%|█████▋ | 5633/10000 [8:52:16<6:38:49, 5.48s/it][2025-06-19 22:22:00,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:22:00,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.62 | bwd_microstep: 3310.94 | bwd_inner_microstep: 3310.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-19 22:22:00,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.62 | bwd: 3310.95 | bwd_inner: 3310.13 | bwd_allreduce: 0.77 | step: 6.85 56%|█████▋ | 5634/10000 [8:52:21<6:38:10, 5.47s/it] {'loss': 0.0008, 'grad_norm': 0.09083247929811478, 'learning_rate': 1.6877708624897365e-05, 'epoch': 5.63} 56%|█████▋ | 5634/10000 [8:52:21<6:38:10, 5.47s/it][2025-06-19 22:22:06,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:22:06,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.61 | bwd_microstep: 3316.46 | bwd_inner_microstep: 3315.53 | bwd_allreduce_microstep: 0.87 | 
step_microstep: 6.93 [2025-06-19 22:22:06,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.61 | bwd: 3316.47 | bwd_inner: 3315.53 | bwd_allreduce: 0.89 | step: 6.93 56%|█████▋ | 5635/10000 [8:52:26<6:37:42, 5.47s/it] {'loss': 0.0329, 'grad_norm': 6.829148292541504, 'learning_rate': 1.687131069919538e-05, 'epoch': 5.63} 56%|█████▋ | 5635/10000 [8:52:26<6:37:42, 5.47s/it][2025-06-19 22:22:11,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:22:11,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.72 | bwd_microstep: 3366.25 | bwd_inner_microstep: 3365.19 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.86 [2025-06-19 22:22:11,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.72 | bwd: 3366.27 | bwd_inner: 3365.19 | bwd_allreduce: 1.02 | step: 7.86 56%|█████▋ | 5636/10000 [8:52:32<6:39:08, 5.49s/it] {'loss': 0.0016, 'grad_norm': 0.19893504679203033, 'learning_rate': 1.686491310167839e-05, 'epoch': 5.64} 56%|█████▋ | 5636/10000 [8:52:32<6:39:08, 5.49s/it][2025-06-19 22:22:17,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:22:17,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.15 | bwd_microstep: 3360.94 | bwd_inner_microstep: 3360.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 22:22:17,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.15 | bwd: 3360.96 | bwd_inner: 3360.16 | bwd_allreduce: 0.75 | step: 6.58 56%|█████▋ | 5637/10000 [8:52:38<6:40:01, 5.50s/it] {'loss': 0.0018, 'grad_norm': 0.5269841551780701, 'learning_rate': 1.6858515833017458e-05, 'epoch': 5.64} 56%|█████▋ | 5637/10000 [8:52:38<6:40:01, 5.50s/it][2025-06-19 22:22:22,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:22:22,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.66 | bwd_microstep: 3367.02 | bwd_inner_microstep: 3366.14 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.51 [2025-06-19 22:22:22,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.66 | bwd: 3367.03 | bwd_inner: 3366.14 | bwd_allreduce: 0.86 | step: 7.52 56%|█████▋ | 5638/10000 [8:52:43<6:40:33, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.028933029621839523, 'learning_rate': 1.6852118893883634e-05, 'epoch': 5.64} 56%|█████▋ | 5638/10000 [8:52:43<6:40:33, 5.51s/it][2025-06-19 22:22:28,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:22:28,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.36 | bwd_microstep: 3312.37 | bwd_inner_microstep: 3311.45 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.57 [2025-06-19 22:22:28,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.36 | bwd: 3312.38 | bwd_inner: 3311.45 | bwd_allreduce: 0.88 | step: 7.57 56%|█████▋ | 5639/10000 [8:52:49<6:39:36, 5.50s/it] {'loss': 0.0009, 'grad_norm': 0.13593924045562744, 'learning_rate': 1.684572228494793e-05, 'epoch': 5.64} 56%|█████▋ | 5639/10000 [8:52:49<6:39:36, 5.50s/it][2025-06-19 22:22:33,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:22:33,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.50 | bwd_microstep: 3372.22 | bwd_inner_microstep: 3371.26 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.24 [2025-06-19 22:22:33,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.50 | bwd: 3372.23 | bwd_inner: 3371.26 | bwd_allreduce: 0.93 | step: 7.24 56%|█████▋ | 5640/10000 [8:52:54<6:40:25, 5.51s/it] {'loss': 0.002, 'grad_norm': 
0.5050732493400574, 'learning_rate': 1.6839326006881314e-05, 'epoch': 5.64} 56%|█████▋ | 5640/10000 [8:52:54<6:40:25, 5.51s/it][2025-06-19 22:22:39,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:22:39,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.80 | bwd_microstep: 3325.16 | bwd_inner_microstep: 3324.21 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.11 [2025-06-19 22:22:39,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.80 | bwd: 3325.17 | bwd_inner: 3324.21 | bwd_allreduce: 0.91 | step: 7.12 56%|█████▋ | 5641/10000 [8:53:00<6:39:47, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.1084330752491951, 'learning_rate': 1.6832930060354738e-05, 'epoch': 5.64} 56%|█████▋ | 5641/10000 [8:53:00<6:39:47, 5.50s/it][2025-06-19 22:22:44,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:22:44,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.82 | bwd_microstep: 3315.93 | bwd_inner_microstep: 3315.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 22:22:44,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.82 | bwd: 3315.95 | bwd_inner: 3315.14 | bwd_allreduce: 0.76 | step: 7.03 56%|█████▋ | 5642/10000 [8:53:05<6:39:13, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.016438640654087067, 'learning_rate': 1.6826534446039095e-05, 'epoch': 5.64} 56%|█████▋ | 5642/10000 [8:53:05<6:39:13, 5.50s/it][2025-06-19 22:22:50,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:22:50,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.74 | bwd_microstep: 3313.04 | bwd_inner_microstep: 3312.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 
22:22:50,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.74 | bwd: 3313.05 | bwd_inner: 3312.25 | bwd_allreduce: 0.76 | step: 6.71 56%|█████▋ | 5643/10000 [8:53:11<6:38:12, 5.48s/it] {'loss': 0.0176, 'grad_norm': 3.3274035453796387, 'learning_rate': 1.682013916460526e-05, 'epoch': 5.64} 56%|█████▋ | 5643/10000 [8:53:11<6:38:12, 5.48s/it][2025-06-19 22:22:55,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:22:55,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.79 | bwd_microstep: 3364.90 | bwd_inner_microstep: 3364.02 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.76 [2025-06-19 22:22:55,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.79 | bwd: 3364.92 | bwd_inner: 3364.02 | bwd_allreduce: 0.86 | step: 6.77 56%|█████▋ | 5644/10000 [8:53:16<6:39:06, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.05742441862821579, 'learning_rate': 1.6813744216724067e-05, 'epoch': 5.64} 56%|█████▋ | 5644/10000 [8:53:16<6:39:06, 5.50s/it][2025-06-19 22:23:01,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:23:01,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.43 | bwd_microstep: 3310.20 | bwd_inner_microstep: 3309.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 22:23:01,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.44 | bwd: 3310.22 | bwd_inner: 3309.41 | bwd_allreduce: 0.76 | step: 6.78 56%|█████▋ | 5645/10000 [8:53:21<6:37:54, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.007207981310784817, 'learning_rate': 1.680734960306632e-05, 'epoch': 5.64} 56%|█████▋ | 5645/10000 [8:53:21<6:37:54, 5.48s/it][2025-06-19 22:23:06,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-19 22:23:06,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.26 | bwd_microstep: 3370.59 | bwd_inner_microstep: 3369.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 22:23:06,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.26 | bwd: 3370.60 | bwd_inner: 3369.80 | bwd_allreduce: 0.76 | step: 6.79 56%|█████▋ | 5646/10000 [8:53:27<6:39:00, 5.50s/it] {'loss': 0.0438, 'grad_norm': 3.253511667251587, 'learning_rate': 1.6800955324302794e-05, 'epoch': 5.65} 56%|█████▋ | 5646/10000 [8:53:27<6:39:00, 5.50s/it][2025-06-19 22:23:12,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.74 [2025-06-19 22:23:12,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.43 | bwd_microstep: 3307.10 | bwd_inner_microstep: 3306.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 22:23:12,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.43 | bwd: 3307.11 | bwd_inner: 3306.32 | bwd_allreduce: 0.75 | step: 6.66 56%|█████▋ | 5647/10000 [8:53:32<6:37:42, 5.48s/it] {'loss': 0.0033, 'grad_norm': 0.41453683376312256, 'learning_rate': 1.6794561381104192e-05, 'epoch': 5.65} 56%|█████▋ | 5647/10000 [8:53:32<6:37:42, 5.48s/it][2025-06-19 22:23:17,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:23:17,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.59 | bwd_microstep: 3308.76 | bwd_inner_microstep: 3307.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 22:23:17,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.59 | bwd: 3308.78 | bwd_inner: 3307.98 | bwd_allreduce: 0.76 | step: 6.60 56%|█████▋ | 5648/10000 [8:53:38<6:36:47, 5.47s/it] {'loss': 0.008, 'grad_norm': 1.3580679893493652, 
'learning_rate': 1.6788167774141228e-05, 'epoch': 5.65} 56%|█████▋ | 5648/10000 [8:53:38<6:36:47, 5.47s/it][2025-06-19 22:23:23,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:23:23,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.56 | bwd_microstep: 3314.20 | bwd_inner_microstep: 3313.31 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.86 [2025-06-19 22:23:23,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.56 | bwd: 3314.21 | bwd_inner: 3313.31 | bwd_allreduce: 0.86 | step: 6.86 56%|█████▋ | 5649/10000 [8:53:43<6:36:23, 5.47s/it] {'loss': 0.0008, 'grad_norm': 0.16457170248031616, 'learning_rate': 1.6781774504084558e-05, 'epoch': 5.65} 56%|█████▋ | 5649/10000 [8:53:43<6:36:23, 5.47s/it][2025-06-19 22:23:28,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:23:28,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.07 | bwd_microstep: 3311.20 | bwd_inner_microstep: 3310.27 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.90 [2025-06-19 22:23:28,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.07 | bwd: 3311.21 | bwd_inner: 3310.27 | bwd_allreduce: 0.90 | step: 6.90 56%|█████▋ | 5650/10000 [8:53:49<6:36:24, 5.47s/it] {'loss': 0.0121, 'grad_norm': 2.683966636657715, 'learning_rate': 1.6775381571604806e-05, 'epoch': 5.65} 56%|█████▋ | 5650/10000 [8:53:49<6:36:24, 5.47s/it][2025-06-19 22:23:33,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:23:33,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.01 | bwd_microstep: 3307.28 | bwd_inner_microstep: 3306.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 22:23:33,980] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.01 | bwd: 3307.29 | bwd_inner: 3306.49 | bwd_allreduce: 0.76 | step: 6.96 57%|█████▋ | 5651/10000 [8:53:54<6:35:56, 5.46s/it] {'loss': 0.0052, 'grad_norm': 1.1952406167984009, 'learning_rate': 1.6768988977372563e-05, 'epoch': 5.65} 57%|█████▋ | 5651/10000 [8:53:54<6:35:56, 5.46s/it][2025-06-19 22:23:39,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:23:39,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.26 | bwd_microstep: 3320.01 | bwd_inner_microstep: 3319.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 22:23:39,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.26 | bwd: 3320.03 | bwd_inner: 3319.22 | bwd_allreduce: 0.76 | step: 6.79 57%|█████▋ | 5652/10000 [8:54:00<6:35:53, 5.46s/it] {'loss': 0.0429, 'grad_norm': 5.186319828033447, 'learning_rate': 1.676259672205838e-05, 'epoch': 5.65} 57%|█████▋ | 5652/10000 [8:54:00<6:35:53, 5.46s/it][2025-06-19 22:23:44,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:23:44,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.36 | bwd_microstep: 3312.39 | bwd_inner_microstep: 3311.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 22:23:44,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.35 | bwd: 3312.40 | bwd_inner: 3311.61 | bwd_allreduce: 0.75 | step: 6.62 57%|█████▋ | 5653/10000 [8:54:05<6:35:29, 5.46s/it] {'loss': 0.0769, 'grad_norm': 5.20194149017334, 'learning_rate': 1.6756204806332775e-05, 'epoch': 5.65} 57%|█████▋ | 5653/10000 [8:54:05<6:35:29, 5.46s/it][2025-06-19 22:23:50,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 
22:23:50,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.73 | bwd_microstep: 3316.42 | bwd_inner_microstep: 3315.36 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.12 [2025-06-19 22:23:50,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.73 | bwd: 3316.44 | bwd_inner: 3315.36 | bwd_allreduce: 1.02 | step: 7.12 57%|█████▋ | 5654/10000 [8:54:11<6:35:19, 5.46s/it] {'loss': 0.0002, 'grad_norm': 0.066397525370121, 'learning_rate': 1.674981323086623e-05, 'epoch': 5.65} 57%|█████▋ | 5654/10000 [8:54:11<6:35:19, 5.46s/it][2025-06-19 22:23:55,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:23:55,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.45 | bwd_microstep: 3377.92 | bwd_inner_microstep: 3377.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 22:23:55,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.45 | bwd: 3377.93 | bwd_inner: 3377.13 | bwd_allreduce: 0.76 | step: 6.68 57%|█████▋ | 5655/10000 [8:54:16<6:37:14, 5.49s/it] {'loss': 0.0246, 'grad_norm': 3.2131593227386475, 'learning_rate': 1.674342199632919e-05, 'epoch': 5.66} 57%|█████▋ | 5655/10000 [8:54:16<6:37:14, 5.49s/it][2025-06-19 22:24:01,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 22:24:01,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.02 | bwd_microstep: 3318.32 | bwd_inner_microstep: 3317.32 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.23 [2025-06-19 22:24:01,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.03 | bwd: 3318.34 | bwd_inner: 3317.32 | bwd_allreduce: 0.97 | step: 7.23 57%|█████▋ | 5656/10000 [8:54:22<6:36:36, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.03929442912340164, 'learning_rate': 1.673703110339208e-05, 
'epoch': 5.66} 57%|█████▋ | 5656/10000 [8:54:22<6:36:36, 5.48s/it][2025-06-19 22:24:06,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:24:06,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.71 | bwd_microstep: 3403.72 | bwd_inner_microstep: 3402.90 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.17 [2025-06-19 22:24:06,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.71 | bwd: 3403.74 | bwd_inner: 3402.90 | bwd_allreduce: 0.79 | step: 7.17 57%|█████▋ | 5657/10000 [8:54:27<6:38:45, 5.51s/it] {'loss': 0.0205, 'grad_norm': 1.8814507722854614, 'learning_rate': 1.6730640552725256e-05, 'epoch': 5.66} 57%|█████▋ | 5657/10000 [8:54:27<6:38:45, 5.51s/it][2025-06-19 22:24:12,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:24:12,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.92 | bwd_microstep: 3317.49 | bwd_inner_microstep: 3316.50 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.19 [2025-06-19 22:24:12,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.92 | bwd: 3317.51 | bwd_inner: 3316.50 | bwd_allreduce: 0.96 | step: 7.19 57%|█████▋ | 5658/10000 [8:54:33<6:37:42, 5.50s/it] {'loss': 0.0645, 'grad_norm': 12.475399017333984, 'learning_rate': 1.672425034499906e-05, 'epoch': 5.66} 57%|█████▋ | 5658/10000 [8:54:33<6:37:42, 5.50s/it][2025-06-19 22:24:17,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:24:17,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.82 | bwd_microstep: 3375.95 | bwd_inner_microstep: 3375.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 22:24:17,945] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2122.82 | bwd: 3375.96 | bwd_inner: 3375.16 | bwd_allreduce: 0.76 | step: 6.67 57%|█████▋ | 5659/10000 [8:54:38<6:38:34, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.024444622918963432, 'learning_rate': 1.6717860480883805e-05, 'epoch': 5.66} 57%|█████▋ | 5659/10000 [8:54:38<6:38:34, 5.51s/it][2025-06-19 22:24:23,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:24:23,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.27 | bwd_microstep: 3324.00 | bwd_inner_microstep: 3323.05 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.00 [2025-06-19 22:24:23,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.27 | bwd: 3324.02 | bwd_inner: 3323.05 | bwd_allreduce: 0.92 | step: 7.01 57%|█████▋ | 5660/10000 [8:54:44<6:37:35, 5.50s/it] {'loss': 0.0122, 'grad_norm': 2.595264434814453, 'learning_rate': 1.6711470961049747e-05, 'epoch': 5.66} 57%|█████▋ | 5660/10000 [8:54:44<6:37:35, 5.50s/it][2025-06-19 22:24:28,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:24:28,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.02 | bwd_microstep: 3359.91 | bwd_inner_microstep: 3358.98 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.00 [2025-06-19 22:24:28,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.02 | bwd: 3359.92 | bwd_inner: 3358.98 | bwd_allreduce: 0.90 | step: 7.00 57%|█████▋ | 5661/10000 [8:54:49<6:38:19, 5.51s/it] {'loss': 0.0009, 'grad_norm': 0.16393908858299255, 'learning_rate': 1.670508178616713e-05, 'epoch': 5.66} 57%|█████▋ | 5661/10000 [8:54:49<6:38:19, 5.51s/it][2025-06-19 22:24:34,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:24:34,526] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.52 | bwd_microstep: 3398.46 | bwd_inner_microstep: 3397.57 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.00 [2025-06-19 22:24:34,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.52 | bwd: 3398.48 | bwd_inner: 3397.57 | bwd_allreduce: 0.86 | step: 7.00 57%|█████▋ | 5662/10000 [8:54:55<6:39:44, 5.53s/it] {'loss': 0.0008, 'grad_norm': 0.1483791172504425, 'learning_rate': 1.6698692956906136e-05, 'epoch': 5.66} 57%|█████▋ | 5662/10000 [8:54:55<6:39:44, 5.53s/it][2025-06-19 22:24:39,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:24:39,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.89 | bwd_microstep: 3315.36 | bwd_inner_microstep: 3314.48 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.43 [2025-06-19 22:24:39,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.89 | bwd: 3315.37 | bwd_inner: 3314.48 | bwd_allreduce: 0.85 | step: 7.43 57%|█████▋ | 5663/10000 [8:55:00<6:38:14, 5.51s/it] {'loss': 0.0611, 'grad_norm': 5.539938449859619, 'learning_rate': 1.669230447393693e-05, 'epoch': 5.66} 57%|█████▋ | 5663/10000 [8:55:00<6:38:14, 5.51s/it][2025-06-19 22:24:45,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.90 [2025-06-19 22:24:45,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.40 | bwd_microstep: 3365.34 | bwd_inner_microstep: 3364.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.89 [2025-06-19 22:24:45,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.40 | bwd: 3365.36 | bwd_inner: 3364.56 | bwd_allreduce: 0.76 | step: 6.90 57%|█████▋ | 5664/10000 [8:55:06<6:38:41, 5.52s/it] {'loss': 0.0051, 'grad_norm': 1.4902336597442627, 'learning_rate': 1.668591633792963e-05, 'epoch': 5.66} 57%|█████▋ 
| 5664/10000 [8:55:06<6:38:41, 5.52s/it][2025-06-19 22:24:50,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:24:50,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.20 | bwd_microstep: 3314.59 | bwd_inner_microstep: 3313.77 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.30 [2025-06-19 22:24:50,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.20 | bwd: 3314.61 | bwd_inner: 3313.77 | bwd_allreduce: 0.80 | step: 7.30 57%|█████▋ | 5665/10000 [8:55:11<6:37:26, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.027975400909781456, 'learning_rate': 1.667952854955433e-05, 'epoch': 5.67} 57%|█████▋ | 5665/10000 [8:55:11<6:37:26, 5.50s/it][2025-06-19 22:24:56,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:24:56,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.14 | bwd_microstep: 3367.88 | bwd_inner_microstep: 3366.96 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.00 [2025-06-19 22:24:56,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.14 | bwd: 3367.89 | bwd_inner: 3366.96 | bwd_allreduce: 0.89 | step: 7.01 57%|█████▋ | 5666/10000 [8:55:17<6:38:11, 5.51s/it] {'loss': 0.0037, 'grad_norm': 0.541282594203949, 'learning_rate': 1.6673141109481075e-05, 'epoch': 5.67} 57%|█████▋ | 5666/10000 [8:55:17<6:38:11, 5.51s/it][2025-06-19 22:25:01,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 22:25:01,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.53 | bwd_microstep: 3324.48 | bwd_inner_microstep: 3323.38 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.76 [2025-06-19 22:25:01,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.53 | bwd: 
3324.50 | bwd_inner: 3323.38 | bwd_allreduce: 1.06 | step: 7.78 57%|█████▋ | 5667/10000 [8:55:22<6:37:12, 5.50s/it] {'loss': 0.0029, 'grad_norm': 0.3228208124637604, 'learning_rate': 1.6666754018379878e-05, 'epoch': 5.67} 57%|█████▋ | 5667/10000 [8:55:22<6:37:12, 5.50s/it][2025-06-19 22:25:07,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:25:07,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.89 | bwd_microstep: 3398.65 | bwd_inner_microstep: 3397.83 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.24 [2025-06-19 22:25:07,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.89 | bwd: 3398.67 | bwd_inner: 3397.83 | bwd_allreduce: 0.79 | step: 7.24 57%|█████▋ | 5668/10000 [8:55:28<6:38:49, 5.52s/it] {'loss': 0.0024, 'grad_norm': 0.3322308659553528, 'learning_rate': 1.6660367276920712e-05, 'epoch': 5.67} 57%|█████▋ | 5668/10000 [8:55:28<6:38:49, 5.52s/it][2025-06-19 22:25:13,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:25:13,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.16 | bwd_microstep: 3317.17 | bwd_inner_microstep: 3316.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 22:25:13,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.16 | bwd: 3317.19 | bwd_inner: 3316.36 | bwd_allreduce: 0.78 | step: 7.00 57%|█████▋ | 5669/10000 [8:55:33<6:37:19, 5.50s/it] {'loss': 0.0079, 'grad_norm': 2.251741647720337, 'learning_rate': 1.665398088577352e-05, 'epoch': 5.67} 57%|█████▋ | 5669/10000 [8:55:33<6:37:19, 5.50s/it][2025-06-19 22:25:18,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:25:18,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2125.68 | bwd_microstep: 3363.06 | bwd_inner_microstep: 3362.19 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.17 [2025-06-19 22:25:18,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.68 | bwd: 3363.08 | bwd_inner: 3362.19 | bwd_allreduce: 0.84 | step: 7.17 57%|█████▋ | 5670/10000 [8:55:39<6:37:44, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.038729216903448105, 'learning_rate': 1.66475948456082e-05, 'epoch': 5.67} 57%|█████▋ | 5670/10000 [8:55:39<6:37:44, 5.51s/it][2025-06-19 22:25:24,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.74 [2025-06-19 22:25:24,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.69 | bwd_microstep: 3373.73 | bwd_inner_microstep: 3372.92 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.76 [2025-06-19 22:25:24,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.69 | bwd: 3373.75 | bwd_inner: 3372.92 | bwd_allreduce: 0.78 | step: 6.77 57%|█████▋ | 5671/10000 [8:55:44<6:38:22, 5.52s/it] {'loss': 0.0012, 'grad_norm': 0.2053915411233902, 'learning_rate': 1.6641209157094624e-05, 'epoch': 5.67} 57%|█████▋ | 5671/10000 [8:55:44<6:38:22, 5.52s/it][2025-06-19 22:25:29,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:25:29,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.03 | bwd_microstep: 3372.74 | bwd_inner_microstep: 3371.91 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.39 [2025-06-19 22:25:29,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.03 | bwd: 3372.76 | bwd_inner: 3371.91 | bwd_allreduce: 0.80 | step: 7.39 57%|█████▋ | 5672/10000 [8:55:50<6:38:42, 5.53s/it] {'loss': 0.0009, 'grad_norm': 0.2237328737974167, 'learning_rate': 1.6634823820902626e-05, 'epoch': 5.67} 57%|█████▋ | 5672/10000 [8:55:50<6:38:42, 
5.53s/it][2025-06-19 22:25:35,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:25:35,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.90 | bwd_microstep: 3326.08 | bwd_inner_microstep: 3325.12 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.43 [2025-06-19 22:25:35,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.90 | bwd: 3326.10 | bwd_inner: 3325.12 | bwd_allreduce: 0.93 | step: 7.43 57%|█████▋ | 5673/10000 [8:55:55<6:37:35, 5.51s/it] {'loss': 0.0052, 'grad_norm': 1.13737154006958, 'learning_rate': 1.662843883770198e-05, 'epoch': 5.67} 57%|█████▋ | 5673/10000 [8:55:55<6:37:35, 5.51s/it][2025-06-19 22:25:40,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:25:40,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.62 | bwd_microstep: 3325.96 | bwd_inner_microstep: 3325.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-19 22:25:40,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.62 | bwd: 3325.98 | bwd_inner: 3325.16 | bwd_allreduce: 0.77 | step: 6.99 57%|█████▋ | 5674/10000 [8:56:01<6:36:47, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.014661230146884918, 'learning_rate': 1.6622054208162455e-05, 'epoch': 5.67} 57%|█████▋ | 5674/10000 [8:56:01<6:36:47, 5.50s/it][2025-06-19 22:25:46,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:25:46,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.54 | bwd_microstep: 3327.96 | bwd_inner_microstep: 3327.03 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.24 [2025-06-19 22:25:46,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.54 | bwd: 3327.98 | bwd_inner: 3327.03 | 
bwd_allreduce: 0.90 | step: 7.24 57%|█████▋ | 5675/10000 [8:56:06<6:36:05, 5.49s/it] {'loss': 0.0407, 'grad_norm': 4.084364891052246, 'learning_rate': 1.661566993295377e-05, 'epoch': 5.67} 57%|█████▋ | 5675/10000 [8:56:06<6:36:05, 5.49s/it][2025-06-19 22:25:51,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.89 [2025-06-19 22:25:51,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.76 | bwd_microstep: 3318.01 | bwd_inner_microstep: 3317.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 22:25:51,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.76 | bwd: 3318.03 | bwd_inner: 3317.22 | bwd_allreduce: 0.76 | step: 6.96 57%|█████▋ | 5676/10000 [8:56:12<6:35:23, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.07050391286611557, 'learning_rate': 1.6609286012745595e-05, 'epoch': 5.68} 57%|█████▋ | 5676/10000 [8:56:12<6:35:23, 5.49s/it][2025-06-19 22:25:57,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:25:57,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.39 | bwd_microstep: 3368.92 | bwd_inner_microstep: 3368.00 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.10 [2025-06-19 22:25:57,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.39 | bwd: 3368.94 | bwd_inner: 3368.00 | bwd_allreduce: 0.89 | step: 7.10 57%|█████▋ | 5677/10000 [8:56:17<6:36:22, 5.50s/it] {'loss': 0.0055, 'grad_norm': 0.8803014159202576, 'learning_rate': 1.660290244820759e-05, 'epoch': 5.68} 57%|█████▋ | 5677/10000 [8:56:17<6:36:22, 5.50s/it][2025-06-19 22:26:02,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:26:02,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.41 | 
bwd_microstep: 3322.24 | bwd_inner_microstep: 3321.43 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-19 22:26:02,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.41 | bwd: 3322.26 | bwd_inner: 3321.43 | bwd_allreduce: 0.78 | step: 7.18 57%|█████▋ | 5678/10000 [8:56:23<6:35:50, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.020377418026328087, 'learning_rate': 1.659651924000934e-05, 'epoch': 5.68} 57%|█████▋ | 5678/10000 [8:56:23<6:35:50, 5.50s/it][2025-06-19 22:26:08,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:26:08,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.16 | bwd_microstep: 3342.72 | bwd_inner_microstep: 3341.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 22:26:08,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.16 | bwd: 3342.73 | bwd_inner: 3341.93 | bwd_allreduce: 0.76 | step: 6.78 57%|█████▋ | 5679/10000 [8:56:28<6:35:35, 5.49s/it] {'loss': 0.0006, 'grad_norm': 0.09108923375606537, 'learning_rate': 1.6590136388820436e-05, 'epoch': 5.68} 57%|█████▋ | 5679/10000 [8:56:28<6:35:35, 5.49s/it][2025-06-19 22:26:13,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:26:13,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.75 | bwd_microstep: 3377.15 | bwd_inner_microstep: 3376.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 22:26:13,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.75 | bwd: 3377.17 | bwd_inner: 3376.34 | bwd_allreduce: 0.78 | step: 7.11 57%|█████▋ | 5680/10000 [8:56:34<6:36:39, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.05359066650271416, 'learning_rate': 1.6583753895310393e-05, 'epoch': 5.68} 57%|█████▋ | 5680/10000 [8:56:34<6:36:39, 5.51s/it][2025-06-19 22:26:19,081] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:26:19,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.49 | bwd_microstep: 3329.47 | bwd_inner_microstep: 3328.66 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 22:26:19,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.49 | bwd: 3329.49 | bwd_inner: 3328.66 | bwd_allreduce: 0.78 | step: 7.14 57%|█████▋ | 5681/10000 [8:56:39<6:35:53, 5.50s/it] {'loss': 0.012, 'grad_norm': 1.985652208328247, 'learning_rate': 1.6577371760148714e-05, 'epoch': 5.68} 57%|█████▋ | 5681/10000 [8:56:39<6:35:53, 5.50s/it][2025-06-19 22:26:24,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:26:24,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.17 | bwd_microstep: 3323.05 | bwd_inner_microstep: 3322.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 22:26:24,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.17 | bwd: 3323.06 | bwd_inner: 3322.24 | bwd_allreduce: 0.78 | step: 6.83 57%|█████▋ | 5682/10000 [8:56:45<6:35:19, 5.49s/it] {'loss': 0.0123, 'grad_norm': 3.562579393386841, 'learning_rate': 1.6570989984004857e-05, 'epoch': 5.68} 57%|█████▋ | 5682/10000 [8:56:45<6:35:19, 5.49s/it][2025-06-19 22:26:30,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:26:30,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.54 | bwd_microstep: 3325.66 | bwd_inner_microstep: 3324.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 22:26:30,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.54 | bwd: 3325.68 | bwd_inner: 3324.85 | bwd_allreduce: 0.78 | step: 6.83 
57%|█████▋ | 5683/10000 [8:56:50<6:34:46, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.06920205056667328, 'learning_rate': 1.6564608567548232e-05, 'epoch': 5.68} 57%|█████▋ | 5683/10000 [8:56:50<6:34:46, 5.49s/it][2025-06-19 22:26:35,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:26:35,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.00 | bwd_microstep: 3375.35 | bwd_inner_microstep: 3374.39 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.68 [2025-06-19 22:26:35,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.00 | bwd: 3375.37 | bwd_inner: 3374.39 | bwd_allreduce: 0.93 | step: 7.69 57%|█████▋ | 5684/10000 [8:56:56<6:36:09, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.00675997044891119, 'learning_rate': 1.655822751144822e-05, 'epoch': 5.68} 57%|█████▋ | 5684/10000 [8:56:56<6:36:09, 5.51s/it][2025-06-19 22:26:41,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:26:41,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.31 | bwd_microstep: 3330.37 | bwd_inner_microstep: 3329.53 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.95 [2025-06-19 22:26:41,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.31 | bwd: 3330.38 | bwd_inner: 3329.53 | bwd_allreduce: 0.81 | step: 6.96 57%|█████▋ | 5685/10000 [8:57:01<6:35:43, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00015540033928118646, 'learning_rate': 1.6551846816374172e-05, 'epoch': 5.69} 57%|█████▋ | 5685/10000 [8:57:01<6:35:43, 5.50s/it][2025-06-19 22:26:46,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:26:46,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.95 | bwd_microstep: 3376.32 | 
bwd_inner_microstep: 3375.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 22:26:46,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.95 | bwd: 3376.34 | bwd_inner: 3375.52 | bwd_allreduce: 0.77 | step: 6.78 57%|█████▋ | 5686/10000 [8:57:07<6:36:46, 5.52s/it] {'loss': 0.0013, 'grad_norm': 0.18603895604610443, 'learning_rate': 1.6545466482995388e-05, 'epoch': 5.69} 57%|█████▋ | 5686/10000 [8:57:07<6:36:46, 5.52s/it][2025-06-19 22:26:52,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:26:52,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.09 | bwd_microstep: 3324.31 | bwd_inner_microstep: 3323.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-19 22:26:52,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.09 | bwd: 3324.32 | bwd_inner: 3323.50 | bwd_allreduce: 0.78 | step: 6.92 57%|█████▋ | 5687/10000 [8:57:12<6:35:37, 5.50s/it] {'loss': 0.0427, 'grad_norm': 7.646050453186035, 'learning_rate': 1.653908651198114e-05, 'epoch': 5.69} 57%|█████▋ | 5687/10000 [8:57:12<6:35:37, 5.50s/it][2025-06-19 22:26:57,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:26:57,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.49 | bwd_microstep: 3335.50 | bwd_inner_microstep: 3334.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 22:26:57,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.49 | bwd: 3335.51 | bwd_inner: 3334.70 | bwd_allreduce: 0.77 | step: 7.05 57%|█████▋ | 5688/10000 [8:57:18<6:35:02, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.03168685361742973, 'learning_rate': 1.653270690400065e-05, 'epoch': 5.69} 57%|█████▋ | 5688/10000 [8:57:18<6:35:02, 5.50s/it][2025-06-19 22:27:03,135] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:27:03,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.27 | bwd_microstep: 3376.89 | bwd_inner_microstep: 3376.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 22:27:03,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.27 | bwd: 3376.90 | bwd_inner: 3376.10 | bwd_allreduce: 0.76 | step: 6.77 57%|█████▋ | 5689/10000 [8:57:23<6:36:06, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.04712003469467163, 'learning_rate': 1.6526327659723116e-05, 'epoch': 5.69} 57%|█████▋ | 5689/10000 [8:57:23<6:36:06, 5.51s/it][2025-06-19 22:27:08,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:27:08,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.64 | bwd_microstep: 3326.78 | bwd_inner_microstep: 3325.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-19 22:27:08,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.64 | bwd: 3326.79 | bwd_inner: 3325.97 | bwd_allreduce: 0.77 | step: 6.94 57%|█████▋ | 5690/10000 [8:57:29<6:35:26, 5.51s/it] {'loss': 0.0026, 'grad_norm': 0.28757140040397644, 'learning_rate': 1.6519948779817683e-05, 'epoch': 5.69} 57%|█████▋ | 5690/10000 [8:57:29<6:35:26, 5.51s/it][2025-06-19 22:27:14,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:27:14,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.56 | bwd_microstep: 3373.98 | bwd_inner_microstep: 3373.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 22:27:14,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.56 | bwd: 3374.00 | bwd_inner: 3373.19 | bwd_allreduce: 0.77 | step: 6.98 
57%|█████▋ | 5691/10000 [8:57:34<6:36:08, 5.52s/it] {'loss': 0.0014, 'grad_norm': 0.482722669839859, 'learning_rate': 1.6513570264953476e-05, 'epoch': 5.69} 57%|█████▋ | 5691/10000 [8:57:34<6:36:08, 5.52s/it][2025-06-19 22:27:19,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.72 [2025-06-19 22:27:19,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.26 | bwd_microstep: 3375.78 | bwd_inner_microstep: 3374.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 22:27:19,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.26 | bwd: 3375.80 | bwd_inner: 3374.98 | bwd_allreduce: 0.77 | step: 7.05 57%|█████▋ | 5692/10000 [8:57:40<6:36:37, 5.52s/it] {'loss': 0.0535, 'grad_norm': 12.334731101989746, 'learning_rate': 1.6507192115799577e-05, 'epoch': 5.69} 57%|█████▋ | 5692/10000 [8:57:40<6:36:37, 5.52s/it][2025-06-19 22:27:25,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:27:25,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.26 | bwd_microstep: 3330.95 | bwd_inner_microstep: 3330.06 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.25 [2025-06-19 22:27:25,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.26 | bwd: 3330.97 | bwd_inner: 3330.06 | bwd_allreduce: 0.86 | step: 7.25 57%|█████▋ | 5693/10000 [8:57:45<6:35:35, 5.51s/it] {'loss': 0.0335, 'grad_norm': 4.222508907318115, 'learning_rate': 1.6500814333025004e-05, 'epoch': 5.69} 57%|█████▋ | 5693/10000 [8:57:45<6:35:35, 5.51s/it][2025-06-19 22:27:30,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:27:30,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.18 | bwd_microstep: 3377.98 | 
bwd_inner_microstep: 3377.16 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.82 [2025-06-19 22:27:30,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.18 | bwd: 3378.00 | bwd_inner: 3377.16 | bwd_allreduce: 0.80 | step: 6.82 57%|█████▋ | 5694/10000 [8:57:51<6:36:22, 5.52s/it] {'loss': 0.0012, 'grad_norm': 0.12543240189552307, 'learning_rate': 1.649443691729877e-05, 'epoch': 5.69} 57%|█████▋ | 5694/10000 [8:57:51<6:36:22, 5.52s/it][2025-06-19 22:27:36,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:27:36,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.37 | bwd_microstep: 3387.33 | bwd_inner_microstep: 3386.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 22:27:36,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.38 | bwd: 3387.34 | bwd_inner: 3386.53 | bwd_allreduce: 0.77 | step: 6.87 57%|█████▋ | 5695/10000 [8:57:57<6:37:04, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.008017175830900669, 'learning_rate': 1.6488059869289832e-05, 'epoch': 5.7} 57%|█████▋ | 5695/10000 [8:57:57<6:37:04, 5.53s/it][2025-06-19 22:27:41,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:27:41,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.98 | bwd_microstep: 3369.68 | bwd_inner_microstep: 3368.88 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 22:27:41,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.98 | bwd: 3369.69 | bwd_inner: 3368.88 | bwd_allreduce: 0.77 | step: 7.09 57%|█████▋ | 5696/10000 [8:58:02<6:37:12, 5.54s/it] {'loss': 0.0033, 'grad_norm': 0.41780686378479004, 'learning_rate': 1.6481683189667108e-05, 'epoch': 5.7} 57%|█████▋ | 5696/10000 [8:58:02<6:37:12, 5.54s/it][2025-06-19 22:27:47,406] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 22:27:47,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.30 | bwd_microstep: 3383.49 | bwd_inner_microstep: 3382.55 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.35 [2025-06-19 22:27:47,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.30 | bwd: 3383.51 | bwd_inner: 3382.55 | bwd_allreduce: 0.91 | step: 7.36 57%|█████▋ | 5697/10000 [8:58:08<6:37:43, 5.55s/it] {'loss': 0.0677, 'grad_norm': 2.7703938484191895, 'learning_rate': 1.64753068790995e-05, 'epoch': 5.7} 57%|█████▋ | 5697/10000 [8:58:08<6:37:43, 5.55s/it][2025-06-19 22:27:52,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.76 | optimizer_step: 2.73 [2025-06-19 22:27:52,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.19 | bwd_microstep: 3383.24 | bwd_inner_microstep: 3382.08 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.78 [2025-06-19 22:27:52,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.19 | bwd: 3383.26 | bwd_inner: 3382.08 | bwd_allreduce: 1.12 | step: 7.79 57%|█████▋ | 5698/10000 [8:58:13<6:38:05, 5.55s/it] {'loss': 0.0003, 'grad_norm': 0.04599195346236229, 'learning_rate': 1.6468930938255837e-05, 'epoch': 5.7} 57%|█████▋ | 5698/10000 [8:58:13<6:38:05, 5.55s/it][2025-06-19 22:27:58,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:27:58,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.18 | bwd_microstep: 3331.95 | bwd_inner_microstep: 3331.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 22:27:58,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.18 | bwd: 3331.96 | bwd_inner: 3331.16 | bwd_allreduce: 0.76 | step: 6.70 57%|█████▋ | 
5699/10000 [8:58:19<6:36:28, 5.53s/it] {'loss': 0.0539, 'grad_norm': 8.840449333190918, 'learning_rate': 1.6462555367804932e-05, 'epoch': 5.7} 57%|█████▋ | 5699/10000 [8:58:19<6:36:28, 5.53s/it][2025-06-19 22:28:03,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 22:28:03,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.76 | bwd_microstep: 3334.35 | bwd_inner_microstep: 3333.29 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.10 [2025-06-19 22:28:03,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.76 | bwd: 3334.38 | bwd_inner: 3333.29 | bwd_allreduce: 1.02 | step: 8.10 57%|█████▋ | 5700/10000 [8:58:24<6:35:22, 5.52s/it] {'loss': 0.0329, 'grad_norm': 7.243863582611084, 'learning_rate': 1.6456180168415546e-05, 'epoch': 5.7} 57%|█████▋ | 5700/10000 [8:58:24<6:35:22, 5.52s/it][2025-06-19 22:28:09,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:28:09,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.56 | bwd_microstep: 3375.43 | bwd_inner_microstep: 3374.54 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.32 [2025-06-19 22:28:09,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.56 | bwd: 3375.44 | bwd_inner: 3374.54 | bwd_allreduce: 0.86 | step: 7.32 57%|█████▋ | 5701/10000 [8:58:30<6:35:58, 5.53s/it] {'loss': 0.0034, 'grad_norm': 0.6408846974372864, 'learning_rate': 1.6449805340756415e-05, 'epoch': 5.7} 57%|█████▋ | 5701/10000 [8:58:30<6:35:58, 5.53s/it][2025-06-19 22:28:15,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 22:28:15,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.61 | bwd_microstep: 3382.08 | bwd_inner_microstep: 3380.96 | 
bwd_allreduce_microstep: 1.05 | step_microstep: 7.66 [2025-06-19 22:28:15,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.61 | bwd: 3382.10 | bwd_inner: 3380.96 | bwd_allreduce: 1.08 | step: 7.67 57%|█████▋ | 5702/10000 [8:58:35<6:36:51, 5.54s/it] {'loss': 0.0013, 'grad_norm': 0.41526496410369873, 'learning_rate': 1.6443430885496234e-05, 'epoch': 5.7} 57%|█████▋ | 5702/10000 [8:58:35<6:36:51, 5.54s/it][2025-06-19 22:28:20,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:28:20,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.67 | bwd_microstep: 3373.65 | bwd_inner_microstep: 3372.61 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.72 [2025-06-19 22:28:20,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.67 | bwd: 3373.67 | bwd_inner: 3372.61 | bwd_allreduce: 1.00 | step: 7.73 57%|█████▋ | 5703/10000 [8:58:41<6:37:00, 5.54s/it] {'loss': 0.0003, 'grad_norm': 0.04372262582182884, 'learning_rate': 1.643705680330364e-05, 'epoch': 5.7} 57%|█████▋ | 5703/10000 [8:58:41<6:37:00, 5.54s/it][2025-06-19 22:28:26,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:28:26,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.89 | bwd_microstep: 3319.96 | bwd_inner_microstep: 3319.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 22:28:26,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.89 | bwd: 3319.98 | bwd_inner: 3319.18 | bwd_allreduce: 0.75 | step: 6.63 57%|█████▋ | 5704/10000 [8:58:46<6:35:23, 5.52s/it] {'loss': 0.0765, 'grad_norm': 3.687479257583618, 'learning_rate': 1.6430683094847248e-05, 'epoch': 5.7} 57%|█████▋ | 5704/10000 [8:58:46<6:35:23, 5.52s/it][2025-06-19 22:28:31,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:28:31,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.61 | bwd_microstep: 3366.47 | bwd_inner_microstep: 3365.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 22:28:31,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.61 | bwd: 3366.49 | bwd_inner: 3365.69 | bwd_allreduce: 0.76 | step: 6.61 57%|█████▋ | 5705/10000 [8:58:52<6:35:28, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.033823464065790176, 'learning_rate': 1.6424309760795634e-05, 'epoch': 5.71} 57%|█████▋ | 5705/10000 [8:58:52<6:35:28, 5.52s/it][2025-06-19 22:28:37,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:28:37,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.82 | bwd_microstep: 3377.50 | bwd_inner_microstep: 3376.68 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.58 [2025-06-19 22:28:37,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.82 | bwd: 3377.51 | bwd_inner: 3376.68 | bwd_allreduce: 0.79 | step: 7.59 57%|█████▋ | 5706/10000 [8:58:57<6:36:02, 5.53s/it] {'loss': 0.0002, 'grad_norm': 0.028539132326841354, 'learning_rate': 1.6417936801817335e-05, 'epoch': 5.71} 57%|█████▋ | 5706/10000 [8:58:57<6:36:02, 5.53s/it][2025-06-19 22:28:42,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:28:42,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.60 | bwd_microstep: 3315.43 | bwd_inner_microstep: 3314.61 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.05 [2025-06-19 22:28:42,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.60 | bwd: 3315.45 | bwd_inner: 3314.61 | bwd_allreduce: 0.79 | step: 7.05 57%|█████▋ | 5707/10000 [8:59:03<6:34:27, 5.51s/it] 
{'loss': 0.0, 'grad_norm': 0.0025084405206143856, 'learning_rate': 1.641156421858085e-05, 'epoch': 5.71} 57%|█████▋ | 5707/10000 [8:59:03<6:34:27, 5.51s/it][2025-06-19 22:28:48,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:28:48,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.41 | bwd_microstep: 3373.94 | bwd_inner_microstep: 3373.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 22:28:48,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.41 | bwd: 3373.96 | bwd_inner: 3373.15 | bwd_allreduce: 0.77 | step: 6.99 57%|█████▋ | 5708/10000 [8:59:08<6:35:03, 5.52s/it] {'loss': 0.0677, 'grad_norm': 10.731776237487793, 'learning_rate': 1.6405192011754613e-05, 'epoch': 5.71} 57%|█████▋ | 5708/10000 [8:59:08<6:35:03, 5.52s/it][2025-06-19 22:28:53,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:28:53,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.63 | bwd_microstep: 3321.97 | bwd_inner_microstep: 3321.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-19 22:28:53,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.63 | bwd: 3321.99 | bwd_inner: 3321.17 | bwd_allreduce: 0.77 | step: 6.78 57%|█████▋ | 5709/10000 [8:59:14<6:33:47, 5.51s/it] {'loss': 0.2264, 'grad_norm': 14.75185489654541, 'learning_rate': 1.6398820182007046e-05, 'epoch': 5.71} 57%|█████▋ | 5709/10000 [8:59:14<6:33:47, 5.51s/it][2025-06-19 22:28:59,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:28:59,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.26 | bwd_microstep: 3324.29 | bwd_inner_microstep: 3323.39 | bwd_allreduce_microstep: 0.85 | 
step_microstep: 7.52 [2025-06-19 22:28:59,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.26 | bwd: 3324.31 | bwd_inner: 3323.39 | bwd_allreduce: 0.87 | step: 7.52 57%|█████▋ | 5710/10000 [8:59:19<6:33:02, 5.50s/it] {'loss': 0.0057, 'grad_norm': 0.7047826647758484, 'learning_rate': 1.6392448730006533e-05, 'epoch': 5.71} 57%|█████▋ | 5710/10000 [8:59:19<6:33:02, 5.50s/it][2025-06-19 22:29:04,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:29:04,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.82 | bwd_microstep: 3315.33 | bwd_inner_microstep: 3314.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 22:29:04,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.82 | bwd: 3315.34 | bwd_inner: 3314.54 | bwd_allreduce: 0.76 | step: 6.72 57%|█████▋ | 5711/10000 [8:59:25<6:32:12, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0014069583266973495, 'learning_rate': 1.6386077656421406e-05, 'epoch': 5.71} 57%|█████▋ | 5711/10000 [8:59:25<6:32:12, 5.49s/it][2025-06-19 22:29:10,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:29:10,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.41 | bwd_microstep: 3322.34 | bwd_inner_microstep: 3321.40 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.45 [2025-06-19 22:29:10,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.41 | bwd: 3322.36 | bwd_inner: 3321.40 | bwd_allreduce: 0.91 | step: 7.45 57%|█████▋ | 5712/10000 [8:59:30<6:31:46, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.013194827362895012, 'learning_rate': 1.6379706961919964e-05, 'epoch': 5.71} 57%|█████▋ | 5712/10000 [8:59:30<6:31:46, 5.48s/it][2025-06-19 22:29:15,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:29:15,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.38 | bwd_microstep: 3311.59 | bwd_inner_microstep: 3310.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 22:29:15,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.38 | bwd: 3311.60 | bwd_inner: 3310.81 | bwd_allreduce: 0.75 | step: 6.64 57%|█████▋ | 5713/10000 [8:59:36<6:31:15, 5.48s/it] {'loss': 0.0012, 'grad_norm': 0.2891918122768402, 'learning_rate': 1.637333664717045e-05, 'epoch': 5.71} 57%|█████▋ | 5713/10000 [8:59:36<6:31:15, 5.48s/it][2025-06-19 22:29:21,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:29:21,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.14 | bwd_microstep: 3374.44 | bwd_inner_microstep: 3373.54 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.84 [2025-06-19 22:29:21,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.14 | bwd: 3374.45 | bwd_inner: 3373.54 | bwd_allreduce: 0.88 | step: 6.84 57%|█████▋ | 5714/10000 [8:59:41<6:32:34, 5.50s/it] {'loss': 0.0577, 'grad_norm': 6.8726420402526855, 'learning_rate': 1.6366966712841093e-05, 'epoch': 5.71} 57%|█████▋ | 5714/10000 [8:59:41<6:32:34, 5.50s/it][2025-06-19 22:29:26,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:29:26,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.70 | bwd_microstep: 3316.18 | bwd_inner_microstep: 3315.34 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.38 [2025-06-19 22:29:26,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.70 | bwd: 3316.20 | bwd_inner: 3315.34 | bwd_allreduce: 0.81 | step: 7.38 57%|█████▋ | 5715/10000 [8:59:47<6:31:57, 5.49s/it] {'loss': 0.0773, 'grad_norm': 
6.290938377380371, 'learning_rate': 1.6360597159600068e-05, 'epoch': 5.71} 57%|█████▋ | 5715/10000 [8:59:47<6:31:57, 5.49s/it][2025-06-19 22:29:32,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:29:32,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.86 | bwd_microstep: 3373.37 | bwd_inner_microstep: 3372.46 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.93 [2025-06-19 22:29:32,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.86 | bwd: 3373.38 | bwd_inner: 3372.46 | bwd_allreduce: 0.88 | step: 6.94 57%|█████▋ | 5716/10000 [8:59:52<6:33:07, 5.51s/it] {'loss': 0.0056, 'grad_norm': 1.1000162363052368, 'learning_rate': 1.635422798811551e-05, 'epoch': 5.72} 57%|█████▋ | 5716/10000 [8:59:52<6:33:07, 5.51s/it][2025-06-19 22:29:37,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:29:37,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.09 | bwd_microstep: 3392.43 | bwd_inner_microstep: 3391.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 22:29:37,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.09 | bwd: 3392.45 | bwd_inner: 3391.64 | bwd_allreduce: 0.76 | step: 6.77 57%|█████▋ | 5717/10000 [8:59:58<6:34:24, 5.53s/it] {'loss': 0.0021, 'grad_norm': 0.9166085720062256, 'learning_rate': 1.6347859199055522e-05, 'epoch': 5.72} 57%|█████▋ | 5717/10000 [8:59:58<6:34:24, 5.53s/it][2025-06-19 22:29:43,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:29:43,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.01 | bwd_microstep: 3323.98 | bwd_inner_microstep: 3323.13 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.01 [2025-06-19 
22:29:43,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.01 | bwd: 3323.99 | bwd_inner: 3323.13 | bwd_allreduce: 0.81 | step: 7.02 57%|█████▋ | 5718/10000 [9:00:03<6:33:05, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.012501499615609646, 'learning_rate': 1.6341490793088145e-05, 'epoch': 5.72} 57%|█████▋ | 5718/10000 [9:00:03<6:33:05, 5.51s/it][2025-06-19 22:29:48,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.74 [2025-06-19 22:29:48,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.07 | bwd_microstep: 3371.03 | bwd_inner_microstep: 3370.18 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.31 [2025-06-19 22:29:48,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.07 | bwd: 3371.04 | bwd_inner: 3370.18 | bwd_allreduce: 0.81 | step: 7.31 57%|█████▋ | 5719/10000 [9:00:09<6:33:45, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.001448168302886188, 'learning_rate': 1.6335122770881408e-05, 'epoch': 5.72} 57%|█████▋ | 5719/10000 [9:00:09<6:33:45, 5.52s/it][2025-06-19 22:29:54,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:29:54,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.34 | bwd_microstep: 3321.38 | bwd_inner_microstep: 3320.45 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.03 [2025-06-19 22:29:54,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.35 | bwd: 3321.40 | bwd_inner: 3320.45 | bwd_allreduce: 0.89 | step: 7.05 57%|█████▋ | 5720/10000 [9:00:14<6:32:54, 5.51s/it] {'loss': 0.0017, 'grad_norm': 0.30988648533821106, 'learning_rate': 1.632875513310328e-05, 'epoch': 5.72} 57%|█████▋ | 5720/10000 [9:00:14<6:32:54, 5.51s/it][2025-06-19 22:29:59,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | 
optimizer_step: 2.73 [2025-06-19 22:29:59,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.40 | bwd_microstep: 3326.36 | bwd_inner_microstep: 3325.52 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.93 [2025-06-19 22:29:59,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.40 | bwd: 3326.38 | bwd_inner: 3325.52 | bwd_allreduce: 0.80 | step: 6.93 57%|█████▋ | 5721/10000 [9:00:20<6:32:09, 5.50s/it] {'loss': 0.0244, 'grad_norm': 8.750711441040039, 'learning_rate': 1.63223878804217e-05, 'epoch': 5.72} 57%|█████▋ | 5721/10000 [9:00:20<6:32:09, 5.50s/it][2025-06-19 22:30:05,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:30:05,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.53 | bwd_microstep: 3321.42 | bwd_inner_microstep: 3320.62 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 22:30:05,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.53 | bwd: 3321.44 | bwd_inner: 3320.62 | bwd_allreduce: 0.78 | step: 6.96 57%|█████▋ | 5722/10000 [9:00:25<6:31:16, 5.49s/it] {'loss': 0.0015, 'grad_norm': 0.2678981125354767, 'learning_rate': 1.6316021013504567e-05, 'epoch': 5.72} 57%|█████▋ | 5722/10000 [9:00:25<6:31:16, 5.49s/it][2025-06-19 22:30:10,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:30:10,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.88 | bwd_microstep: 3317.23 | bwd_inner_microstep: 3316.30 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.89 [2025-06-19 22:30:10,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.88 | bwd: 3317.24 | bwd_inner: 3316.30 | bwd_allreduce: 0.90 | step: 6.89 57%|█████▋ | 5723/10000 [9:00:31<6:30:44, 5.48s/it] {'loss': 0.0013, 'grad_norm': 0.2179747074842453, 
'learning_rate': 1.630965453301973e-05, 'epoch': 5.72} 57%|█████▋ | 5723/10000 [9:00:31<6:30:44, 5.48s/it][2025-06-19 22:30:16,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:30:16,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.15 | bwd_microstep: 3320.26 | bwd_inner_microstep: 3319.27 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.40 [2025-06-19 22:30:16,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.15 | bwd: 3320.28 | bwd_inner: 3319.27 | bwd_allreduce: 0.96 | step: 7.41 57%|█████▋ | 5724/10000 [9:00:36<6:30:30, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.007273102644830942, 'learning_rate': 1.6303288439635e-05, 'epoch': 5.72} 57%|█████▋ | 5724/10000 [9:00:36<6:30:30, 5.48s/it][2025-06-19 22:30:21,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:30:21,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.26 | bwd_microstep: 3322.20 | bwd_inner_microstep: 3321.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 22:30:21,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.26 | bwd: 3322.21 | bwd_inner: 3321.40 | bwd_allreduce: 0.76 | step: 6.83 57%|█████▋ | 5725/10000 [9:00:42<6:30:30, 5.48s/it] {'loss': 0.0029, 'grad_norm': 0.39319783449172974, 'learning_rate': 1.6296922734018162e-05, 'epoch': 5.72} 57%|█████▋ | 5725/10000 [9:00:42<6:30:30, 5.48s/it][2025-06-19 22:30:26,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:30:26,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.08 | bwd_microstep: 3314.95 | bwd_inner_microstep: 3314.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-19 22:30:26,963] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.08 | bwd: 3314.97 | bwd_inner: 3314.15 | bwd_allreduce: 0.77 | step: 7.15 57%|█████▋ | 5726/10000 [9:00:47<6:29:50, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0016205091960728168, 'learning_rate': 1.629055741683694e-05, 'epoch': 5.73} 57%|█████▋ | 5726/10000 [9:00:47<6:29:50, 5.47s/it][2025-06-19 22:30:32,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:30:32,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.85 | bwd_microstep: 3360.27 | bwd_inner_microstep: 3359.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 22:30:32,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.85 | bwd: 3360.28 | bwd_inner: 3359.47 | bwd_allreduce: 0.77 | step: 6.73 57%|█████▋ | 5727/10000 [9:00:53<6:30:48, 5.49s/it] {'loss': 0.1211, 'grad_norm': 13.508819580078125, 'learning_rate': 1.6284192488759028e-05, 'epoch': 5.73} 57%|█████▋ | 5727/10000 [9:00:53<6:30:48, 5.49s/it][2025-06-19 22:30:37,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:30:37,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.48 | bwd_microstep: 3321.57 | bwd_inner_microstep: 3320.60 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.41 [2025-06-19 22:30:37,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.48 | bwd: 3321.58 | bwd_inner: 3320.60 | bwd_allreduce: 0.93 | step: 7.41 57%|█████▋ | 5728/10000 [9:00:58<6:30:22, 5.48s/it] {'loss': 0.0119, 'grad_norm': 2.138026475906372, 'learning_rate': 1.6277827950452088e-05, 'epoch': 5.73} 57%|█████▋ | 5728/10000 [9:00:58<6:30:22, 5.48s/it][2025-06-19 22:30:43,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 
22:30:43,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.98 | bwd_microstep: 3317.88 | bwd_inner_microstep: 3316.87 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.74 [2025-06-19 22:30:43,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.98 | bwd: 3317.90 | bwd_inner: 3316.87 | bwd_allreduce: 0.98 | step: 7.75 57%|█████▋ | 5729/10000 [9:01:04<6:29:57, 5.48s/it] {'loss': 0.0006, 'grad_norm': 0.09360821545124054, 'learning_rate': 1.6271463802583708e-05, 'epoch': 5.73} 57%|█████▋ | 5729/10000 [9:01:04<6:29:57, 5.48s/it][2025-06-19 22:30:48,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:30:48,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.43 | bwd_microstep: 3324.69 | bwd_inner_microstep: 3323.87 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-19 22:30:48,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.43 | bwd: 3324.70 | bwd_inner: 3323.87 | bwd_allreduce: 0.79 | step: 7.23 57%|█████▋ | 5730/10000 [9:01:09<6:29:57, 5.48s/it] {'loss': 0.0012, 'grad_norm': 0.4533149003982544, 'learning_rate': 1.6265100045821475e-05, 'epoch': 5.73} 57%|█████▋ | 5730/10000 [9:01:09<6:29:57, 5.48s/it][2025-06-19 22:30:54,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 22:30:54,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.53 | bwd_microstep: 3319.15 | bwd_inner_microstep: 3318.37 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 22:30:54,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.53 | bwd: 3319.16 | bwd_inner: 3318.37 | bwd_allreduce: 0.75 | step: 6.64 57%|█████▋ | 5731/10000 [9:01:15<6:29:33, 5.48s/it] {'loss': 0.2409, 'grad_norm': 3.787288188934326, 'learning_rate': 1.625873668083291e-05, 
'epoch': 5.73} 57%|█████▋ | 5731/10000 [9:01:15<6:29:33, 5.48s/it][2025-06-19 22:30:59,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:30:59,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.91 | bwd_microstep: 3318.30 | bwd_inner_microstep: 3317.49 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 22:30:59,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.91 | bwd: 3318.32 | bwd_inner: 3317.49 | bwd_allreduce: 0.78 | step: 7.13 57%|█████▋ | 5732/10000 [9:01:20<6:29:04, 5.47s/it] {'loss': 0.0013, 'grad_norm': 0.25173503160476685, 'learning_rate': 1.6252373708285505e-05, 'epoch': 5.73} 57%|█████▋ | 5732/10000 [9:01:20<6:29:04, 5.47s/it][2025-06-19 22:31:05,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:31:05,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.99 | bwd_microstep: 3329.97 | bwd_inner_microstep: 3329.02 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.77 [2025-06-19 22:31:05,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.00 | bwd: 3329.99 | bwd_inner: 3329.02 | bwd_allreduce: 0.91 | step: 7.78 57%|█████▋ | 5733/10000 [9:01:26<6:29:13, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0012326115975156426, 'learning_rate': 1.624601112884671e-05, 'epoch': 5.73} 57%|█████▋ | 5733/10000 [9:01:26<6:29:13, 5.47s/it][2025-06-19 22:31:10,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:31:10,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.20 | bwd_microstep: 3327.19 | bwd_inner_microstep: 3326.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.12 [2025-06-19 22:31:10,790] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2112.20 | bwd: 3327.21 | bwd_inner: 3326.40 | bwd_allreduce: 0.77 | step: 7.13 57%|█████▋ | 5734/10000 [9:01:31<6:29:18, 5.48s/it] {'loss': 0.0009, 'grad_norm': 0.21164445579051971, 'learning_rate': 1.6239648943183924e-05, 'epoch': 5.73} 57%|█████▋ | 5734/10000 [9:01:31<6:29:18, 5.48s/it][2025-06-19 22:31:16,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:31:16,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.71 | bwd_microstep: 3365.08 | bwd_inner_microstep: 3364.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 22:31:16,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.71 | bwd: 3365.09 | bwd_inner: 3364.30 | bwd_allreduce: 0.75 | step: 6.63 57%|█████▋ | 5735/10000 [9:01:37<6:30:23, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.2705698311328888, 'learning_rate': 1.6233287151964508e-05, 'epoch': 5.74} 57%|█████▋ | 5735/10000 [9:01:37<6:30:23, 5.49s/it][2025-06-19 22:31:21,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:31:21,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.85 | bwd_microstep: 3323.53 | bwd_inner_microstep: 3322.61 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.47 [2025-06-19 22:31:21,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.84 | bwd: 3323.55 | bwd_inner: 3322.61 | bwd_allreduce: 0.90 | step: 7.47 57%|█████▋ | 5736/10000 [9:01:42<6:29:38, 5.48s/it] {'loss': 0.0765, 'grad_norm': 10.254438400268555, 'learning_rate': 1.6226925755855787e-05, 'epoch': 5.74} 57%|█████▋ | 5736/10000 [9:01:42<6:29:38, 5.48s/it][2025-06-19 22:31:27,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:31:27,236] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.30 | bwd_microstep: 3314.71 | bwd_inner_microstep: 3313.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 22:31:27,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.30 | bwd: 3314.73 | bwd_inner: 3313.93 | bwd_allreduce: 0.76 | step: 6.67 57%|█████▋ | 5737/10000 [9:01:48<6:28:54, 5.47s/it] {'loss': 0.0004, 'grad_norm': 0.058531343936920166, 'learning_rate': 1.6220564755525044e-05, 'epoch': 5.74} 57%|█████▋ | 5737/10000 [9:01:48<6:28:54, 5.47s/it][2025-06-19 22:31:32,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:31:32,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.06 | bwd_microstep: 3368.06 | bwd_inner_microstep: 3367.10 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.40 [2025-06-19 22:31:32,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.06 | bwd: 3368.08 | bwd_inner: 3367.10 | bwd_allreduce: 0.93 | step: 7.40 57%|█████▋ | 5738/10000 [9:01:53<6:30:21, 5.50s/it] {'loss': 0.0034, 'grad_norm': 0.7082996964454651, 'learning_rate': 1.621420415163952e-05, 'epoch': 5.74} 57%|█████▋ | 5738/10000 [9:01:53<6:30:21, 5.50s/it][2025-06-19 22:31:38,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:31:38,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.16 | bwd_microstep: 3312.45 | bwd_inner_microstep: 3311.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.83 [2025-06-19 22:31:38,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.16 | bwd: 3312.46 | bwd_inner: 3311.66 | bwd_allreduce: 0.76 | step: 6.83 57%|█████▋ | 5739/10000 [9:01:59<6:29:37, 5.49s/it] {'loss': 0.0091, 'grad_norm': 1.283003568649292, 'learning_rate': 1.620784394486641e-05, 'epoch': 5.74} 
57%|█████▋ | 5739/10000 [9:01:59<6:29:37, 5.49s/it][2025-06-19 22:31:43,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:31:43,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.05 | bwd_microstep: 3316.87 | bwd_inner_microstep: 3316.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 22:31:43,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.05 | bwd: 3316.89 | bwd_inner: 3316.09 | bwd_allreduce: 0.76 | step: 6.65 57%|█████▋ | 5740/10000 [9:02:04<6:28:49, 5.48s/it] {'loss': 0.0083, 'grad_norm': 0.744825541973114, 'learning_rate': 1.6201484135872864e-05, 'epoch': 5.74} 57%|█████▋ | 5740/10000 [9:02:04<6:28:49, 5.48s/it][2025-06-19 22:31:49,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:31:49,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.63 | bwd_microstep: 3313.99 | bwd_inner_microstep: 3313.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 22:31:49,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.63 | bwd: 3314.00 | bwd_inner: 3313.20 | bwd_allreduce: 0.76 | step: 6.66 57%|█████▋ | 5741/10000 [9:02:09<6:28:12, 5.47s/it] {'loss': 0.0007, 'grad_norm': 0.13269932568073273, 'learning_rate': 1.6195124725326004e-05, 'epoch': 5.74} 57%|█████▋ | 5741/10000 [9:02:09<6:28:12, 5.47s/it][2025-06-19 22:31:54,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:31:54,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.12 | bwd_microstep: 3367.68 | bwd_inner_microstep: 3366.88 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 22:31:54,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2124.12 | bwd: 3367.70 | bwd_inner: 3366.88 | bwd_allreduce: 0.77 | step: 7.06 57%|█████▋ | 5742/10000 [9:02:15<6:29:24, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.05360851436853409, 'learning_rate': 1.6188765713892904e-05, 'epoch': 5.74} 57%|█████▋ | 5742/10000 [9:02:15<6:29:24, 5.49s/it][2025-06-19 22:32:00,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:32:00,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.49 | bwd_microstep: 3326.81 | bwd_inner_microstep: 3326.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 22:32:00,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.49 | bwd: 3326.83 | bwd_inner: 3326.03 | bwd_allreduce: 0.76 | step: 6.68 57%|█████▋ | 5743/10000 [9:02:20<6:28:50, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.08164659142494202, 'learning_rate': 1.61824071022406e-05, 'epoch': 5.74} 57%|█████▋ | 5743/10000 [9:02:20<6:28:50, 5.48s/it][2025-06-19 22:32:05,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:32:05,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.01 | bwd_microstep: 3372.23 | bwd_inner_microstep: 3371.16 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.44 [2025-06-19 22:32:05,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.01 | bwd: 3372.25 | bwd_inner: 3371.16 | bwd_allreduce: 1.03 | step: 7.45 57%|█████▋ | 5744/10000 [9:02:26<6:30:10, 5.50s/it] {'loss': 0.0665, 'grad_norm': 8.803754806518555, 'learning_rate': 1.6176048891036065e-05, 'epoch': 5.74} 57%|█████▋ | 5744/10000 [9:02:26<6:30:10, 5.50s/it][2025-06-19 22:32:11,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:32:11,155] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2104.03 | bwd_microstep: 3315.93 | bwd_inner_microstep: 3315.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-19 22:32:11,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.03 | bwd: 3315.95 | bwd_inner: 3315.14 | bwd_allreduce: 0.76 | step: 6.98 57%|█████▋ | 5745/10000 [9:02:31<6:29:14, 5.49s/it] {'loss': 0.0006, 'grad_norm': 0.10306669026613235, 'learning_rate': 1.6169691080946254e-05, 'epoch': 5.75} 57%|█████▋ | 5745/10000 [9:02:31<6:29:14, 5.49s/it][2025-06-19 22:32:16,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:32:16,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.77 | bwd_microstep: 3308.91 | bwd_inner_microstep: 3307.92 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.28 [2025-06-19 22:32:16,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.77 | bwd: 3308.93 | bwd_inner: 3307.92 | bwd_allreduce: 0.96 | step: 7.28 57%|█████▋ | 5746/10000 [9:02:37<6:28:25, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.18168282508850098, 'learning_rate': 1.6163333672638067e-05, 'epoch': 5.75} 57%|█████▋ | 5746/10000 [9:02:37<6:28:25, 5.48s/it][2025-06-19 22:32:22,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:32:22,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.05 | bwd_microstep: 3314.68 | bwd_inner_microstep: 3313.69 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.18 [2025-06-19 22:32:22,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.05 | bwd: 3314.70 | bwd_inner: 3313.69 | bwd_allreduce: 0.96 | step: 7.18 57%|█████▋ | 5747/10000 [9:02:42<6:27:48, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0049986145459115505, 'learning_rate': 1.6156976666778375e-05, 'epoch': 5.75} 57%|█████▋ | 5747/10000 [9:02:42<6:27:48, 
5.47s/it][2025-06-19 22:32:27,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:32:27,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.74 | bwd_microstep: 3367.35 | bwd_inner_microstep: 3366.39 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.24 [2025-06-19 22:32:27,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.74 | bwd: 3367.37 | bwd_inner: 3366.39 | bwd_allreduce: 0.93 | step: 7.24 57%|█████▋ | 5748/10000 [9:02:48<6:29:18, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002345855813473463, 'learning_rate': 1.6150620064034e-05, 'epoch': 5.75} 57%|█████▋ | 5748/10000 [9:02:48<6:29:18, 5.49s/it][2025-06-19 22:32:33,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:32:33,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.62 | bwd_microstep: 3316.00 | bwd_inner_microstep: 3315.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-19 22:32:33,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.62 | bwd: 3316.02 | bwd_inner: 3315.21 | bwd_allreduce: 0.76 | step: 6.98 57%|█████▋ | 5749/10000 [9:02:53<6:28:26, 5.48s/it] {'loss': 0.0022, 'grad_norm': 0.3947087526321411, 'learning_rate': 1.614426386507171e-05, 'epoch': 5.75} 57%|█████▋ | 5749/10000 [9:02:53<6:28:26, 5.48s/it][2025-06-19 22:32:38,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:32:38,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.73 | bwd_microstep: 3365.27 | bwd_inner_microstep: 3364.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 22:32:38,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.73 | bwd: 3365.29 | bwd_inner: 3364.49 | 
bwd_allreduce: 0.75 | step: 6.61 57%|█████▊ | 5750/10000 [9:02:59<6:29:16, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.07573908567428589, 'learning_rate': 1.6137908070558243e-05, 'epoch': 5.75} 57%|█████▊ | 5750/10000 [9:02:59<6:29:16, 5.50s/it][2025-06-19 22:32:44,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 22:32:44,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.78 | bwd_microstep: 3394.72 | bwd_inner_microstep: 3393.73 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.84 [2025-06-19 22:32:44,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.78 | bwd: 3394.74 | bwd_inner: 3393.73 | bwd_allreduce: 0.96 | step: 7.85 58%|█████▊ | 5751/10000 [9:03:04<6:30:44, 5.52s/it] {'loss': 0.0574, 'grad_norm': 4.635526180267334, 'learning_rate': 1.6131552681160296e-05, 'epoch': 5.75} 58%|█████▊ | 5751/10000 [9:03:04<6:30:44, 5.52s/it][2025-06-19 22:32:49,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 22:32:49,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.19 | bwd_microstep: 3370.38 | bwd_inner_microstep: 3369.39 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.54 [2025-06-19 22:32:49,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.19 | bwd: 3370.39 | bwd_inner: 3369.39 | bwd_allreduce: 0.96 | step: 7.55 58%|█████▊ | 5752/10000 [9:03:10<6:31:06, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.015339800156652927, 'learning_rate': 1.6125197697544522e-05, 'epoch': 5.75} 58%|█████▊ | 5752/10000 [9:03:10<6:31:06, 5.52s/it][2025-06-19 22:32:55,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-19 22:32:55,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.36 | 
bwd_microstep: 3329.90 | bwd_inner_microstep: 3328.97 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.04 [2025-06-19 22:32:55,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.36 | bwd: 3329.91 | bwd_inner: 3328.97 | bwd_allreduce: 0.90 | step: 7.04 58%|█████▊ | 5753/10000 [9:03:15<6:30:05, 5.51s/it] {'loss': 0.0053, 'grad_norm': 0.6170664429664612, 'learning_rate': 1.611884312037753e-05, 'epoch': 5.75} 58%|█████▊ | 5753/10000 [9:03:15<6:30:05, 5.51s/it][2025-06-19 22:33:00,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:33:00,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.79 | bwd_microstep: 3355.46 | bwd_inner_microstep: 3354.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 22:33:00,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.79 | bwd: 3355.47 | bwd_inner: 3354.65 | bwd_allreduce: 0.78 | step: 7.14 58%|█████▊ | 5754/10000 [9:03:21<6:30:04, 5.51s/it] {'loss': 0.0883, 'grad_norm': 10.345049858093262, 'learning_rate': 1.6112488950325877e-05, 'epoch': 5.75} 58%|█████▊ | 5754/10000 [9:03:21<6:30:04, 5.51s/it][2025-06-19 22:33:06,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:33:06,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.14 | bwd_microstep: 3389.56 | bwd_inner_microstep: 3388.76 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 22:33:06,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.14 | bwd: 3389.57 | bwd_inner: 3388.76 | bwd_allreduce: 0.77 | step: 6.75 58%|█████▊ | 5755/10000 [9:03:27<6:31:10, 5.53s/it] {'loss': 0.0013, 'grad_norm': 0.42453324794769287, 'learning_rate': 1.6106135188056087e-05, 'epoch': 5.75} 58%|█████▊ | 5755/10000 [9:03:27<6:31:10, 5.53s/it][2025-06-19 22:33:11,795] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:33:11,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.35 | bwd_microstep: 3365.98 | bwd_inner_microstep: 3365.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 22:33:11,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.35 | bwd: 3366.00 | bwd_inner: 3365.17 | bwd_allreduce: 0.78 | step: 7.05 58%|█████▊ | 5756/10000 [9:03:32<6:31:08, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.023394698277115822, 'learning_rate': 1.6099781834234645e-05, 'epoch': 5.76} 58%|█████▊ | 5756/10000 [9:03:32<6:31:08, 5.53s/it][2025-06-19 22:33:17,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:33:17,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.85 | bwd_microstep: 3309.21 | bwd_inner_microstep: 3308.09 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.71 [2025-06-19 22:33:17,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.85 | bwd: 3309.23 | bwd_inner: 3308.09 | bwd_allreduce: 1.08 | step: 7.71 58%|█████▊ | 5757/10000 [9:03:38<6:29:48, 5.51s/it] {'loss': 0.0076, 'grad_norm': 0.83819979429245, 'learning_rate': 1.6093428889527988e-05, 'epoch': 5.76} 58%|█████▊ | 5757/10000 [9:03:38<6:29:48, 5.51s/it][2025-06-19 22:33:22,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:33:22,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.85 | bwd_microstep: 3367.71 | bwd_inner_microstep: 3366.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 22:33:22,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.85 | bwd: 3367.72 | bwd_inner: 3366.91 | bwd_allreduce: 0.77 | step: 6.63 
58%|█████▊ | 5758/10000 [9:03:43<6:30:19, 5.52s/it] {'loss': 0.0024, 'grad_norm': 0.44607505202293396, 'learning_rate': 1.608707635460251e-05, 'epoch': 5.76} 58%|█████▊ | 5758/10000 [9:03:43<6:30:19, 5.52s/it][2025-06-19 22:33:28,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:33:28,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.87 | bwd_microstep: 3313.22 | bwd_inner_microstep: 3312.42 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 22:33:28,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.87 | bwd: 3313.23 | bwd_inner: 3312.42 | bwd_allreduce: 0.77 | step: 6.82 58%|█████▊ | 5759/10000 [9:03:49<6:28:47, 5.50s/it] {'loss': 0.0034, 'grad_norm': 0.6887637376785278, 'learning_rate': 1.6080724230124563e-05, 'epoch': 5.76} 58%|█████▊ | 5759/10000 [9:03:49<6:28:47, 5.50s/it][2025-06-19 22:33:33,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 22:33:33,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.18 | bwd_microstep: 3311.07 | bwd_inner_microstep: 3310.04 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.74 [2025-06-19 22:33:33,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.18 | bwd: 3311.09 | bwd_inner: 3310.04 | bwd_allreduce: 0.99 | step: 7.74 58%|█████▊ | 5760/10000 [9:03:54<6:27:40, 5.49s/it] {'loss': 0.0011, 'grad_norm': 0.19667139649391174, 'learning_rate': 1.6074372516760452e-05, 'epoch': 5.76} 58%|█████▊ | 5760/10000 [9:03:54<6:27:40, 5.49s/it][2025-06-19 22:33:39,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:33:39,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.88 | bwd_microstep: 3367.70 | 
bwd_inner_microstep: 3366.74 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.70 [2025-06-19 22:33:39,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.88 | bwd: 3367.72 | bwd_inner: 3366.74 | bwd_allreduce: 0.94 | step: 7.70 58%|█████▊ | 5761/10000 [9:04:00<6:28:38, 5.50s/it] {'loss': 0.0244, 'grad_norm': 4.975554943084717, 'learning_rate': 1.6068021215176445e-05, 'epoch': 5.76} 58%|█████▊ | 5761/10000 [9:04:00<6:28:38, 5.50s/it][2025-06-19 22:33:44,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:33:44,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.72 | bwd_microstep: 3372.09 | bwd_inner_microstep: 3371.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 22:33:44,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.72 | bwd: 3372.11 | bwd_inner: 3371.29 | bwd_allreduce: 0.77 | step: 7.03 58%|█████▊ | 5762/10000 [9:04:05<6:29:34, 5.52s/it] {'loss': 0.0439, 'grad_norm': 9.561721801757812, 'learning_rate': 1.6061670326038764e-05, 'epoch': 5.76} 58%|█████▊ | 5762/10000 [9:04:05<6:29:34, 5.52s/it][2025-06-19 22:33:50,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:33:50,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.73 | bwd_microstep: 3315.08 | bwd_inner_microstep: 3314.07 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.82 [2025-06-19 22:33:50,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.73 | bwd: 3315.10 | bwd_inner: 3314.07 | bwd_allreduce: 0.98 | step: 7.83 58%|█████▊ | 5763/10000 [9:04:11<6:28:16, 5.50s/it] {'loss': 0.0032, 'grad_norm': 0.7591136693954468, 'learning_rate': 1.6055319850013596e-05, 'epoch': 5.76} 58%|█████▊ | 5763/10000 [9:04:11<6:28:16, 5.50s/it][2025-06-19 22:33:55,721] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:33:55,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.60 | bwd_microstep: 3318.30 | bwd_inner_microstep: 3317.49 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.81 [2025-06-19 22:33:55,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.60 | bwd: 3318.31 | bwd_inner: 3317.49 | bwd_allreduce: 0.79 | step: 6.81 58%|█████▊ | 5764/10000 [9:04:16<6:27:28, 5.49s/it] {'loss': 0.0025, 'grad_norm': 0.8531925082206726, 'learning_rate': 1.604896978776706e-05, 'epoch': 5.76} 58%|█████▊ | 5764/10000 [9:04:16<6:27:28, 5.49s/it][2025-06-19 22:34:01,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:34:01,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.02 | bwd_microstep: 3324.51 | bwd_inner_microstep: 3323.59 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.10 [2025-06-19 22:34:01,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.02 | bwd: 3324.53 | bwd_inner: 3323.59 | bwd_allreduce: 0.88 | step: 7.10 58%|█████▊ | 5765/10000 [9:04:21<6:26:56, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.04837336763739586, 'learning_rate': 1.6042620139965248e-05, 'epoch': 5.76} 58%|█████▊ | 5765/10000 [9:04:21<6:26:56, 5.48s/it][2025-06-19 22:34:06,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:34:06,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.62 | bwd_microstep: 3317.36 | bwd_inner_microstep: 3316.54 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 22:34:06,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.62 | bwd: 3317.37 | bwd_inner: 3316.54 | bwd_allreduce: 0.79 | step: 7.29 
58%|█████▊ | 5766/10000 [9:04:27<6:26:22, 5.48s/it] {'loss': 0.0007, 'grad_norm': 0.08246520161628723, 'learning_rate': 1.6036270907274224e-05, 'epoch': 5.77} 58%|█████▊ | 5766/10000 [9:04:27<6:26:22, 5.48s/it][2025-06-19 22:34:12,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:34:12,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.58 | bwd_microstep: 3370.28 | bwd_inner_microstep: 3369.47 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.26 [2025-06-19 22:34:12,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.58 | bwd: 3370.29 | bwd_inner: 3369.47 | bwd_allreduce: 0.78 | step: 7.27 58%|█████▊ | 5767/10000 [9:04:32<6:27:42, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.008822634816169739, 'learning_rate': 1.6029922090359985e-05, 'epoch': 5.77} 58%|█████▊ | 5767/10000 [9:04:32<6:27:42, 5.50s/it][2025-06-19 22:34:17,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:34:17,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.57 | bwd_microstep: 3317.23 | bwd_inner_microstep: 3316.22 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.75 [2025-06-19 22:34:17,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.57 | bwd: 3317.25 | bwd_inner: 3316.22 | bwd_allreduce: 0.98 | step: 7.74 58%|█████▊ | 5768/10000 [9:04:38<6:27:02, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.026563875377178192, 'learning_rate': 1.60235736898885e-05, 'epoch': 5.77} 58%|█████▊ | 5768/10000 [9:04:38<6:27:02, 5.49s/it][2025-06-19 22:34:23,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:34:23,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.91 | bwd_microstep: 3323.33 | 
bwd_inner_microstep: 3322.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 22:34:23,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.91 | bwd: 3323.35 | bwd_inner: 3322.55 | bwd_allreduce: 0.76 | step: 6.63 58%|█████▊ | 5769/10000 [9:04:43<6:26:28, 5.48s/it] {'loss': 0.0098, 'grad_norm': 1.43793523311615, 'learning_rate': 1.6017225706525673e-05, 'epoch': 5.77} 58%|█████▊ | 5769/10000 [9:04:43<6:26:28, 5.48s/it][2025-06-19 22:34:28,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:34:28,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.77 | bwd_microstep: 3371.64 | bwd_inner_microstep: 3370.71 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.52 [2025-06-19 22:34:28,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.77 | bwd: 3371.66 | bwd_inner: 3370.71 | bwd_allreduce: 0.90 | step: 7.52 58%|█████▊ | 5770/10000 [9:04:49<6:27:50, 5.50s/it] {'loss': 0.1141, 'grad_norm': 9.281034469604492, 'learning_rate': 1.6010878140937384e-05, 'epoch': 5.77} 58%|█████▊ | 5770/10000 [9:04:49<6:27:50, 5.50s/it][2025-06-19 22:34:34,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 22:34:34,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.29 | bwd_microstep: 3387.22 | bwd_inner_microstep: 3386.06 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.91 [2025-06-19 22:34:34,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.29 | bwd: 3387.24 | bwd_inner: 3386.06 | bwd_allreduce: 1.12 | step: 7.92 58%|█████▊ | 5771/10000 [9:04:55<6:29:15, 5.52s/it] {'loss': 0.0174, 'grad_norm': 3.113729476928711, 'learning_rate': 1.6004530993789467e-05, 'epoch': 5.77} 58%|█████▊ | 5771/10000 [9:04:55<6:29:15, 5.52s/it][2025-06-19 22:34:39,800] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:34:39,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.59 | bwd_microstep: 3370.38 | bwd_inner_microstep: 3369.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-19 22:34:39,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.59 | bwd: 3370.39 | bwd_inner: 3369.58 | bwd_allreduce: 0.77 | step: 7.09 58%|█████▊ | 5772/10000 [9:05:00<6:29:49, 5.53s/it] {'loss': 0.0008, 'grad_norm': 0.10982795059680939, 'learning_rate': 1.5998184265747702e-05, 'epoch': 5.77} 58%|█████▊ | 5772/10000 [9:05:00<6:29:49, 5.53s/it][2025-06-19 22:34:45,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:34:45,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.97 | bwd_microstep: 3315.08 | bwd_inner_microstep: 3314.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 22:34:45,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.97 | bwd: 3315.09 | bwd_inner: 3314.30 | bwd_allreduce: 0.75 | step: 6.63 58%|█████▊ | 5773/10000 [9:05:06<6:28:11, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.06284396350383759, 'learning_rate': 1.5991837957477837e-05, 'epoch': 5.77} 58%|█████▊ | 5773/10000 [9:05:06<6:28:11, 5.51s/it][2025-06-19 22:34:50,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:34:50,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.44 | bwd_microstep: 3331.03 | bwd_inner_microstep: 3329.94 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.86 [2025-06-19 22:34:50,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.44 | bwd: 3331.05 | bwd_inner: 3329.94 | bwd_allreduce: 1.05 | step: 7.87 58%|█████▊ | 5774/10000 
[9:05:11<6:27:42, 5.50s/it] {'loss': 0.1005, 'grad_norm': 6.990933895111084, 'learning_rate': 1.5985492069645566e-05, 'epoch': 5.77} 58%|█████▊ | 5774/10000 [9:05:11<6:27:42, 5.50s/it][2025-06-19 22:34:56,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:34:56,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.45 | bwd_microstep: 3371.56 | bwd_inner_microstep: 3370.61 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.29 [2025-06-19 22:34:56,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.44 | bwd: 3371.58 | bwd_inner: 3370.61 | bwd_allreduce: 0.92 | step: 7.28 58%|█████▊ | 5775/10000 [9:05:17<6:28:23, 5.52s/it] {'loss': 0.017, 'grad_norm': 2.901705026626587, 'learning_rate': 1.5979146602916548e-05, 'epoch': 5.78} 58%|█████▊ | 5775/10000 [9:05:17<6:28:23, 5.52s/it][2025-06-19 22:35:01,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:35:01,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.40 | bwd_microstep: 3369.86 | bwd_inner_microstep: 3369.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 22:35:01,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.40 | bwd: 3369.88 | bwd_inner: 3369.08 | bwd_allreduce: 0.76 | step: 6.59 58%|█████▊ | 5776/10000 [9:05:22<6:28:50, 5.52s/it] {'loss': 0.0007, 'grad_norm': 0.0716671422123909, 'learning_rate': 1.5972801557956384e-05, 'epoch': 5.78} 58%|█████▊ | 5776/10000 [9:05:22<6:28:50, 5.52s/it][2025-06-19 22:35:07,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 22:35:07,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.69 | bwd_microstep: 3319.93 | bwd_inner_microstep: 3318.92 | 
bwd_allreduce_microstep: 0.95 | step_microstep: 7.99 [2025-06-19 22:35:07,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.69 | bwd: 3319.95 | bwd_inner: 3318.92 | bwd_allreduce: 0.97 | step: 8.00 58%|█████▊ | 5777/10000 [9:05:28<6:27:37, 5.51s/it] {'loss': 0.0022, 'grad_norm': 0.5746263265609741, 'learning_rate': 1.596645693543065e-05, 'epoch': 5.78} 58%|█████▊ | 5777/10000 [9:05:28<6:27:37, 5.51s/it][2025-06-19 22:35:12,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:35:12,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.71 | bwd_microstep: 3331.64 | bwd_inner_microstep: 3330.74 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.85 [2025-06-19 22:35:12,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.71 | bwd: 3331.65 | bwd_inner: 3330.74 | bwd_allreduce: 0.87 | step: 6.85 58%|█████▊ | 5778/10000 [9:05:33<6:27:06, 5.50s/it] {'loss': 0.001, 'grad_norm': 0.18276740610599518, 'learning_rate': 1.596011273600487e-05, 'epoch': 5.78} 58%|█████▊ | 5778/10000 [9:05:33<6:27:06, 5.50s/it][2025-06-19 22:35:18,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:35:18,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.91 | bwd_microstep: 3322.70 | bwd_inner_microstep: 3321.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 22:35:18,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.91 | bwd: 3322.71 | bwd_inner: 3321.90 | bwd_allreduce: 0.76 | step: 6.69 58%|█████▊ | 5779/10000 [9:05:39<6:26:13, 5.49s/it] {'loss': 0.0737, 'grad_norm': 4.883432865142822, 'learning_rate': 1.5953768960344507e-05, 'epoch': 5.78} 58%|█████▊ | 5779/10000 [9:05:39<6:26:13, 5.49s/it][2025-06-19 22:35:23,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:35:23,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.93 | bwd_microstep: 3382.33 | bwd_inner_microstep: 3381.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-19 22:35:23,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.93 | bwd: 3382.35 | bwd_inner: 3381.54 | bwd_allreduce: 0.76 | step: 6.81 58%|█████▊ | 5780/10000 [9:05:44<6:27:24, 5.51s/it] {'loss': 0.0004, 'grad_norm': 0.07582323253154755, 'learning_rate': 1.5947425609115e-05, 'epoch': 5.78} 58%|█████▊ | 5780/10000 [9:05:44<6:27:24, 5.51s/it][2025-06-19 22:35:29,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:35:29,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.66 | bwd_microstep: 3373.35 | bwd_inner_microstep: 3372.40 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.44 [2025-06-19 22:35:29,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.66 | bwd: 3373.37 | bwd_inner: 3372.40 | bwd_allreduce: 0.92 | step: 7.44 58%|█████▊ | 5781/10000 [9:05:50<6:28:06, 5.52s/it] {'loss': 0.054, 'grad_norm': 6.645312309265137, 'learning_rate': 1.594108268298174e-05, 'epoch': 5.78} 58%|█████▊ | 5781/10000 [9:05:50<6:28:06, 5.52s/it][2025-06-19 22:35:34,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:35:34,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.28 | bwd_microstep: 3319.74 | bwd_inner_microstep: 3318.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 22:35:34,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.28 | bwd: 3319.75 | bwd_inner: 3318.95 | bwd_allreduce: 0.76 | step: 6.80 58%|█████▊ | 5782/10000 [9:05:55<6:26:57, 5.50s/it] {'loss': 
0.0013, 'grad_norm': 0.3125345706939697, 'learning_rate': 1.5934740182610066e-05, 'epoch': 5.78} 58%|█████▊ | 5782/10000 [9:05:55<6:26:57, 5.50s/it][2025-06-19 22:35:40,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:35:40,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.82 | bwd_microstep: 3318.49 | bwd_inner_microstep: 3317.53 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.19 [2025-06-19 22:35:40,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.82 | bwd: 3318.51 | bwd_inner: 3317.53 | bwd_allreduce: 0.93 | step: 7.19 58%|█████▊ | 5783/10000 [9:06:01<6:26:01, 5.49s/it] {'loss': 0.0883, 'grad_norm': 6.016563892364502, 'learning_rate': 1.5928398108665284e-05, 'epoch': 5.78} 58%|█████▊ | 5783/10000 [9:06:01<6:26:01, 5.49s/it][2025-06-19 22:35:45,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:35:45,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.63 | bwd_microstep: 3399.20 | bwd_inner_microstep: 3398.11 | bwd_allreduce_microstep: 1.04 | step_microstep: 8.09 [2025-06-19 22:35:45,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.63 | bwd: 3399.22 | bwd_inner: 3398.11 | bwd_allreduce: 1.06 | step: 8.10 58%|█████▊ | 5784/10000 [9:06:06<6:27:49, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0016532058361917734, 'learning_rate': 1.592205646181264e-05, 'epoch': 5.78} 58%|█████▊ | 5784/10000 [9:06:06<6:27:49, 5.52s/it][2025-06-19 22:35:51,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:35:51,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.35 | bwd_microstep: 3373.37 | bwd_inner_microstep: 3372.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 
6.65 [2025-06-19 22:35:51,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.35 | bwd: 3373.38 | bwd_inner: 3372.58 | bwd_allreduce: 0.76 | step: 6.65 58%|█████▊ | 5785/10000 [9:06:12<6:28:31, 5.53s/it] {'loss': 0.0015, 'grad_norm': 0.23114395141601562, 'learning_rate': 1.5915715242717347e-05, 'epoch': 5.79} 58%|█████▊ | 5785/10000 [9:06:12<6:28:31, 5.53s/it][2025-06-19 22:35:56,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:35:56,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.82 | bwd_microstep: 3369.70 | bwd_inner_microstep: 3368.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-19 22:35:56,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.82 | bwd: 3369.71 | bwd_inner: 3368.90 | bwd_allreduce: 0.77 | step: 6.89 58%|█████▊ | 5786/10000 [9:06:17<6:28:35, 5.53s/it] {'loss': 0.0027, 'grad_norm': 0.7664440870285034, 'learning_rate': 1.590937445204457e-05, 'epoch': 5.79} 58%|█████▊ | 5786/10000 [9:06:17<6:28:35, 5.53s/it][2025-06-19 22:36:02,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:36:02,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.77 | bwd_microstep: 3330.96 | bwd_inner_microstep: 3330.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 22:36:02,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.77 | bwd: 3330.98 | bwd_inner: 3330.17 | bwd_allreduce: 0.77 | step: 6.90 58%|█████▊ | 5787/10000 [9:06:23<6:27:26, 5.52s/it] {'loss': 0.0019, 'grad_norm': 0.41920459270477295, 'learning_rate': 1.5903034090459425e-05, 'epoch': 5.79} 58%|█████▊ | 5787/10000 [9:06:23<6:27:26, 5.52s/it][2025-06-19 22:36:07,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.50 | optimizer_step: 2.72 [2025-06-19 22:36:07,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.35 | bwd_microstep: 3331.13 | bwd_inner_microstep: 3330.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-19 22:36:07,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.35 | bwd: 3331.15 | bwd_inner: 3330.34 | bwd_allreduce: 0.76 | step: 6.79 58%|█████▊ | 5788/10000 [9:06:28<6:26:23, 5.50s/it] {'loss': 0.0181, 'grad_norm': 4.546582221984863, 'learning_rate': 1.5896694158626987e-05, 'epoch': 5.79} 58%|█████▊ | 5788/10000 [9:06:28<6:26:23, 5.50s/it][2025-06-19 22:36:13,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:36:13,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.44 | bwd_microstep: 3386.92 | bwd_inner_microstep: 3386.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 22:36:13,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.44 | bwd: 3386.94 | bwd_inner: 3386.14 | bwd_allreduce: 0.76 | step: 6.68 58%|█████▊ | 5789/10000 [9:06:34<6:27:20, 5.52s/it] {'loss': 0.0056, 'grad_norm': 0.8593683838844299, 'learning_rate': 1.5890354657212305e-05, 'epoch': 5.79} 58%|█████▊ | 5789/10000 [9:06:34<6:27:20, 5.52s/it][2025-06-19 22:36:18,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.74 [2025-06-19 22:36:18,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.88 | bwd_microstep: 3335.04 | bwd_inner_microstep: 3334.18 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.65 [2025-06-19 22:36:18,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.88 | bwd: 3335.05 | bwd_inner: 3334.18 | bwd_allreduce: 0.82 | step: 7.65 58%|█████▊ | 5790/10000 [9:06:39<6:26:40, 5.51s/it] {'loss': 0.0032, 'grad_norm': 0.6327633261680603, 
'learning_rate': 1.588401558688033e-05, 'epoch': 5.79} 58%|█████▊ | 5790/10000 [9:06:39<6:26:40, 5.51s/it][2025-06-19 22:36:24,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:36:24,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.86 | bwd_microstep: 3406.30 | bwd_inner_microstep: 3405.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 22:36:24,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.86 | bwd: 3406.31 | bwd_inner: 3405.51 | bwd_allreduce: 0.76 | step: 6.68 58%|█████▊ | 5791/10000 [9:06:45<6:28:04, 5.53s/it] {'loss': 0.0009, 'grad_norm': 0.11023538559675217, 'learning_rate': 1.587767694829602e-05, 'epoch': 5.79} 58%|█████▊ | 5791/10000 [9:06:45<6:28:04, 5.53s/it][2025-06-19 22:36:30,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:36:30,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.48 | bwd_microstep: 3410.35 | bwd_inner_microstep: 3409.54 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 22:36:30,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.48 | bwd: 3410.36 | bwd_inner: 3409.54 | bwd_allreduce: 0.78 | step: 7.20 58%|█████▊ | 5792/10000 [9:06:50<6:29:10, 5.55s/it] {'loss': 0.0105, 'grad_norm': 1.2210441827774048, 'learning_rate': 1.5871338742124263e-05, 'epoch': 5.79} 58%|█████▊ | 5792/10000 [9:06:50<6:29:10, 5.55s/it][2025-06-19 22:36:35,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:36:35,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.12 | bwd_microstep: 3323.09 | bwd_inner_microstep: 3322.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 22:36:35,601] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.12 | bwd: 3323.11 | bwd_inner: 3322.31 | bwd_allreduce: 0.76 | step: 6.73 58%|█████▊ | 5793/10000 [9:06:56<6:27:22, 5.52s/it] {'loss': 0.0007, 'grad_norm': 0.04775119945406914, 'learning_rate': 1.5865000969029915e-05, 'epoch': 5.79} 58%|█████▊ | 5793/10000 [9:06:56<6:27:22, 5.52s/it][2025-06-19 22:36:41,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:36:41,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.23 | bwd_microstep: 3383.73 | bwd_inner_microstep: 3382.91 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.17 [2025-06-19 22:36:41,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.23 | bwd: 3383.74 | bwd_inner: 3382.91 | bwd_allreduce: 0.78 | step: 7.18 58%|█████▊ | 5794/10000 [9:07:01<6:28:12, 5.54s/it] {'loss': 0.0013, 'grad_norm': 0.18540960550308228, 'learning_rate': 1.5858663629677778e-05, 'epoch': 5.79} 58%|█████▊ | 5794/10000 [9:07:01<6:28:12, 5.54s/it][2025-06-19 22:36:46,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:36:46,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.56 | bwd_microstep: 3327.69 | bwd_inner_microstep: 3326.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.59 [2025-06-19 22:36:46,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.56 | bwd: 3327.71 | bwd_inner: 3326.90 | bwd_allreduce: 0.76 | step: 6.59 58%|█████▊ | 5795/10000 [9:07:07<6:26:44, 5.52s/it] {'loss': 0.0252, 'grad_norm': 3.2989556789398193, 'learning_rate': 1.58523267247326e-05, 'epoch': 5.79} 58%|█████▊ | 5795/10000 [9:07:07<6:26:44, 5.52s/it][2025-06-19 22:36:52,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
22:36:52,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.53 | bwd_microstep: 3377.25 | bwd_inner_microstep: 3376.41 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.79 [2025-06-19 22:36:52,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.53 | bwd: 3377.27 | bwd_inner: 3376.41 | bwd_allreduce: 0.80 | step: 6.80 58%|█████▊ | 5796/10000 [9:07:12<6:27:14, 5.53s/it] {'loss': 0.0002, 'grad_norm': 0.019502965733408928, 'learning_rate': 1.5845990254859095e-05, 'epoch': 5.8} 58%|█████▊ | 5796/10000 [9:07:12<6:27:14, 5.53s/it][2025-06-19 22:36:57,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:36:57,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.51 | bwd_microstep: 3333.54 | bwd_inner_microstep: 3332.58 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.71 [2025-06-19 22:36:57,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.51 | bwd: 3333.56 | bwd_inner: 3332.58 | bwd_allreduce: 0.93 | step: 7.72 58%|█████▊ | 5797/10000 [9:07:18<6:26:17, 5.51s/it] {'loss': 0.0011, 'grad_norm': 0.20138084888458252, 'learning_rate': 1.5839654220721937e-05, 'epoch': 5.8} 58%|█████▊ | 5797/10000 [9:07:18<6:26:17, 5.51s/it][2025-06-19 22:37:03,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:37:03,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.35 | bwd_microstep: 3376.70 | bwd_inner_microstep: 3375.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 22:37:03,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.35 | bwd: 3376.71 | bwd_inner: 3375.91 | bwd_allreduce: 0.76 | step: 6.68 58%|█████▊ | 5798/10000 [9:07:24<6:26:59, 5.53s/it] {'loss': 0.0005, 'grad_norm': 0.15093912184238434, 'learning_rate': 1.5833318622985742e-05, 
'epoch': 5.8} 58%|█████▊ | 5798/10000 [9:07:24<6:26:59, 5.53s/it][2025-06-19 22:37:08,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:37:08,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.83 | bwd_microstep: 3379.20 | bwd_inner_microstep: 3378.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 22:37:08,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.83 | bwd: 3379.22 | bwd_inner: 3378.42 | bwd_allreduce: 0.76 | step: 6.65 58%|█████▊ | 5799/10000 [9:07:29<6:27:20, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.013743460178375244, 'learning_rate': 1.5826983462315092e-05, 'epoch': 5.8} 58%|█████▊ | 5799/10000 [9:07:29<6:27:20, 5.53s/it][2025-06-19 22:37:14,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:37:14,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.19 | bwd_microstep: 3377.05 | bwd_inner_microstep: 3376.09 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.05 [2025-06-19 22:37:14,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.19 | bwd: 3377.07 | bwd_inner: 3376.09 | bwd_allreduce: 0.93 | step: 7.05 58%|█████▊ | 5800/10000 [9:07:35<6:27:31, 5.54s/it] {'loss': 0.116, 'grad_norm': 3.205732583999634, 'learning_rate': 1.58206487393745e-05, 'epoch': 5.8} 58%|█████▊ | 5800/10000 [9:07:35<6:27:31, 5.54s/it][2025-06-19 22:37:19,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:37:19,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.98 | bwd_microstep: 3337.55 | bwd_inner_microstep: 3336.54 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.68 [2025-06-19 22:37:19,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd: 2109.98 | bwd: 3337.57 | bwd_inner: 3336.54 | bwd_allreduce: 0.97 | step: 7.69 58%|█████▊ | 5801/10000 [9:07:40<6:26:29, 5.52s/it] {'loss': 0.007, 'grad_norm': 1.2525875568389893, 'learning_rate': 1.581431445482846e-05, 'epoch': 5.8} 58%|█████▊ | 5801/10000 [9:07:40<6:26:29, 5.52s/it][2025-06-19 22:37:25,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:37:25,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.53 | bwd_microstep: 3385.19 | bwd_inner_microstep: 3384.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 22:37:25,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.53 | bwd: 3385.20 | bwd_inner: 3384.39 | bwd_allreduce: 0.76 | step: 6.66 58%|█████▊ | 5802/10000 [9:07:46<6:27:09, 5.53s/it] {'loss': 0.0079, 'grad_norm': 1.687820315361023, 'learning_rate': 1.5807980609341408e-05, 'epoch': 5.8} 58%|█████▊ | 5802/10000 [9:07:46<6:27:09, 5.53s/it][2025-06-19 22:37:30,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:37:30,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.15 | bwd_microstep: 3327.98 | bwd_inner_microstep: 3327.02 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.10 [2025-06-19 22:37:30,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.15 | bwd: 3327.99 | bwd_inner: 3327.02 | bwd_allreduce: 0.92 | step: 7.10 58%|█████▊ | 5803/10000 [9:07:51<6:25:57, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.011032843962311745, 'learning_rate': 1.5801647203577735e-05, 'epoch': 5.8} 58%|█████▊ | 5803/10000 [9:07:51<6:25:57, 5.52s/it][2025-06-19 22:37:36,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:37:36,323] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2110.80 | bwd_microstep: 3323.23 | bwd_inner_microstep: 3322.35 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.38 [2025-06-19 22:37:36,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.80 | bwd: 3323.25 | bwd_inner: 3322.35 | bwd_allreduce: 0.83 | step: 7.38 58%|█████▊ | 5804/10000 [9:07:57<6:25:02, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.064730703830719, 'learning_rate': 1.579531423820179e-05, 'epoch': 5.8} 58%|█████▊ | 5804/10000 [9:07:57<6:25:02, 5.51s/it][2025-06-19 22:37:41,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:37:41,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.68 | bwd_microstep: 3368.30 | bwd_inner_microstep: 3367.49 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.94 [2025-06-19 22:37:41,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.68 | bwd: 3368.32 | bwd_inner: 3367.49 | bwd_allreduce: 0.79 | step: 6.95 58%|█████▊ | 5805/10000 [9:08:02<6:25:37, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.09995787590742111, 'learning_rate': 1.5788981713877862e-05, 'epoch': 5.8} 58%|█████▊ | 5805/10000 [9:08:02<6:25:37, 5.52s/it][2025-06-19 22:37:47,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:37:47,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.47 | bwd_microstep: 3330.11 | bwd_inner_microstep: 3329.20 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.54 [2025-06-19 22:37:47,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.47 | bwd: 3330.13 | bwd_inner: 3329.20 | bwd_allreduce: 0.89 | step: 7.55 58%|█████▊ | 5806/10000 [9:08:08<6:24:45, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.005886436440050602, 'learning_rate': 1.5782649631270207e-05, 'epoch': 5.81} 58%|█████▊ | 5806/10000 [9:08:08<6:24:45, 
5.50s/it][2025-06-19 22:37:52,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:37:52,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.21 | bwd_microstep: 3328.87 | bwd_inner_microstep: 3328.05 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.37 [2025-06-19 22:37:52,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.21 | bwd: 3328.89 | bwd_inner: 3328.05 | bwd_allreduce: 0.79 | step: 7.38 58%|█████▊ | 5807/10000 [9:08:13<6:24:15, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.007595466915518045, 'learning_rate': 1.5776317991043036e-05, 'epoch': 5.81} 58%|█████▊ | 5807/10000 [9:08:13<6:24:15, 5.50s/it][2025-06-19 22:37:58,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:37:58,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.11 | bwd_microstep: 3320.42 | bwd_inner_microstep: 3319.60 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.90 [2025-06-19 22:37:58,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.11 | bwd: 3320.44 | bwd_inner: 3319.60 | bwd_allreduce: 0.79 | step: 6.91 58%|█████▊ | 5808/10000 [9:08:19<6:23:32, 5.49s/it] {'loss': 0.0012, 'grad_norm': 0.31188133358955383, 'learning_rate': 1.57699867938605e-05, 'epoch': 5.81} 58%|█████▊ | 5808/10000 [9:08:19<6:23:32, 5.49s/it][2025-06-19 22:38:03,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:38:03,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.26 | bwd_microstep: 3323.33 | bwd_inner_microstep: 3322.52 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-19 22:38:03,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.26 | bwd: 3323.35 | bwd_inner: 3322.52 | 
bwd_allreduce: 0.79 | step: 7.22 58%|█████▊ | 5809/10000 [9:08:24<6:23:03, 5.48s/it] {'loss': 0.0855, 'grad_norm': 4.802022457122803, 'learning_rate': 1.5763656040386723e-05, 'epoch': 5.81} 58%|█████▊ | 5809/10000 [9:08:24<6:23:03, 5.48s/it][2025-06-19 22:38:09,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:38:09,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.67 | bwd_microstep: 3371.64 | bwd_inner_microstep: 3370.77 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.54 [2025-06-19 22:38:09,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.67 | bwd: 3371.66 | bwd_inner: 3370.77 | bwd_allreduce: 0.83 | step: 7.55 58%|█████▊ | 5810/10000 [9:08:30<6:24:32, 5.51s/it] {'loss': 0.0024, 'grad_norm': 0.5551368594169617, 'learning_rate': 1.575732573128576e-05, 'epoch': 5.81} 58%|█████▊ | 5810/10000 [9:08:30<6:24:32, 5.51s/it][2025-06-19 22:38:14,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:38:14,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.50 | bwd_microstep: 3324.93 | bwd_inner_microstep: 3324.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 22:38:14,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.50 | bwd: 3324.94 | bwd_inner: 3324.14 | bwd_allreduce: 0.76 | step: 6.67 58%|█████▊ | 5811/10000 [9:08:35<6:23:41, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.054163575172424316, 'learning_rate': 1.575099586722164e-05, 'epoch': 5.81} 58%|█████▊ | 5811/10000 [9:08:35<6:23:41, 5.50s/it][2025-06-19 22:38:20,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 22:38:20,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.64 | 
bwd_microstep: 3381.40 | bwd_inner_microstep: 3380.36 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.66 [2025-06-19 22:38:20,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.64 | bwd: 3381.42 | bwd_inner: 3380.36 | bwd_allreduce: 1.01 | step: 7.66 58%|█████▊ | 5812/10000 [9:08:41<6:24:50, 5.51s/it] {'loss': 0.0079, 'grad_norm': 1.3692787885665894, 'learning_rate': 1.5744666448858333e-05, 'epoch': 5.81} 58%|█████▊ | 5812/10000 [9:08:41<6:24:50, 5.51s/it][2025-06-19 22:38:25,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:38:25,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.96 | bwd_microstep: 3337.21 | bwd_inner_microstep: 3336.28 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.03 [2025-06-19 22:38:25,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.96 | bwd: 3337.23 | bwd_inner: 3336.28 | bwd_allreduce: 0.90 | step: 7.03 58%|█████▊ | 5813/10000 [9:08:46<6:24:14, 5.51s/it] {'loss': 0.0008, 'grad_norm': 0.21941964328289032, 'learning_rate': 1.5738337476859765e-05, 'epoch': 5.81} 58%|█████▊ | 5813/10000 [9:08:46<6:24:14, 5.51s/it][2025-06-19 22:38:31,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:38:31,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.83 | bwd_microstep: 3329.32 | bwd_inner_microstep: 3328.35 | bwd_allreduce_microstep: 0.93 | step_microstep: 6.63 [2025-06-19 22:38:31,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.83 | bwd: 3329.34 | bwd_inner: 3328.35 | bwd_allreduce: 0.94 | step: 6.62 58%|█████▊ | 5814/10000 [9:08:52<6:23:38, 5.50s/it] {'loss': 0.0051, 'grad_norm': 1.317183494567871, 'learning_rate': 1.5732008951889825e-05, 'epoch': 5.81} 58%|█████▊ | 5814/10000 [9:08:52<6:23:38, 5.50s/it][2025-06-19 22:38:36,803] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:38:36,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.27 | bwd_microstep: 3324.93 | bwd_inner_microstep: 3324.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 22:38:36,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.27 | bwd: 3324.95 | bwd_inner: 3324.14 | bwd_allreduce: 0.76 | step: 6.78 58%|█████▊ | 5815/10000 [9:08:57<6:23:07, 5.49s/it] {'loss': 0.0329, 'grad_norm': 6.150333881378174, 'learning_rate': 1.572568087461233e-05, 'epoch': 5.81} 58%|█████▊ | 5815/10000 [9:08:57<6:23:07, 5.49s/it][2025-06-19 22:38:42,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:38:42,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.36 | bwd_microstep: 3323.50 | bwd_inner_microstep: 3322.65 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.47 [2025-06-19 22:38:42,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.36 | bwd: 3323.51 | bwd_inner: 3322.65 | bwd_allreduce: 0.82 | step: 7.48 58%|█████▊ | 5816/10000 [9:09:03<6:22:30, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.019244983792304993, 'learning_rate': 1.571935324569107e-05, 'epoch': 5.82} 58%|█████▊ | 5816/10000 [9:09:03<6:22:30, 5.49s/it][2025-06-19 22:38:47,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:38:47,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.63 | bwd_microstep: 3321.76 | bwd_inner_microstep: 3320.92 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.96 [2025-06-19 22:38:47,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.63 | bwd: 3321.78 | bwd_inner: 3320.92 | bwd_allreduce: 0.80 | step: 6.96 
58%|█████▊ | 5817/10000 [9:09:08<6:22:05, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.16287463903427124, 'learning_rate': 1.5713026065789792e-05, 'epoch': 5.82} 58%|█████▊ | 5817/10000 [9:09:08<6:22:05, 5.48s/it][2025-06-19 22:38:53,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 22:38:53,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.45 | bwd_microstep: 3372.85 | bwd_inner_microstep: 3372.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 22:38:53,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.46 | bwd: 3372.87 | bwd_inner: 3372.04 | bwd_allreduce: 0.78 | step: 6.96 58%|█████▊ | 5818/10000 [9:09:14<6:23:46, 5.51s/it] {'loss': 0.0023, 'grad_norm': 0.4876319468021393, 'learning_rate': 1.570669933557218e-05, 'epoch': 5.82} 58%|█████▊ | 5818/10000 [9:09:14<6:23:46, 5.51s/it][2025-06-19 22:38:58,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:38:58,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.75 | bwd_microstep: 3322.72 | bwd_inner_microstep: 3321.87 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.00 [2025-06-19 22:38:58,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.75 | bwd: 3322.74 | bwd_inner: 3321.87 | bwd_allreduce: 0.81 | step: 7.00 58%|█████▊ | 5819/10000 [9:09:19<6:23:21, 5.50s/it] {'loss': 0.0018, 'grad_norm': 0.2114289253950119, 'learning_rate': 1.570037305570189e-05, 'epoch': 5.82} 58%|█████▊ | 5819/10000 [9:09:19<6:23:21, 5.50s/it][2025-06-19 22:39:04,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:39:04,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.83 | bwd_microstep: 3323.61 | 
bwd_inner_microstep: 3322.80 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-19 22:39:04,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.84 | bwd: 3323.63 | bwd_inner: 3322.80 | bwd_allreduce: 0.78 | step: 7.14 58%|█████▊ | 5820/10000 [9:09:25<6:23:06, 5.50s/it] {'loss': 0.0245, 'grad_norm': 3.9422762393951416, 'learning_rate': 1.5694047226842502e-05, 'epoch': 5.82} 58%|█████▊ | 5820/10000 [9:09:25<6:23:06, 5.50s/it][2025-06-19 22:39:09,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 22:39:09,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.21 | bwd_microstep: 3329.05 | bwd_inner_microstep: 3328.03 | bwd_allreduce_microstep: 0.91 | step_microstep: 8.39 [2025-06-19 22:39:09,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.21 | bwd: 3329.10 | bwd_inner: 3328.03 | bwd_allreduce: 0.96 | step: 8.40 58%|█████▊ | 5821/10000 [9:09:30<6:23:03, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.028503309935331345, 'learning_rate': 1.568772184965758e-05, 'epoch': 5.82} 58%|█████▊ | 5821/10000 [9:09:30<6:23:03, 5.50s/it][2025-06-19 22:39:15,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:39:15,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.12 | bwd_microstep: 3315.02 | bwd_inner_microstep: 3314.16 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.08 [2025-06-19 22:39:15,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.12 | bwd: 3315.04 | bwd_inner: 3314.16 | bwd_allreduce: 0.82 | step: 7.09 58%|█████▊ | 5822/10000 [9:09:36<6:22:15, 5.49s/it] {'loss': 0.0035, 'grad_norm': 1.0305594205856323, 'learning_rate': 1.568139692481062e-05, 'epoch': 5.82} 58%|█████▊ | 5822/10000 [9:09:36<6:22:15, 5.49s/it][2025-06-19 22:39:20,818] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 22:39:20,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.49 | bwd_microstep: 3366.20 | bwd_inner_microstep: 3365.34 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.05 [2025-06-19 22:39:20,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.49 | bwd: 3366.22 | bwd_inner: 3365.34 | bwd_allreduce: 0.83 | step: 7.06 58%|█████▊ | 5823/10000 [9:09:41<6:23:42, 5.51s/it] {'loss': 0.0011, 'grad_norm': 0.12906739115715027, 'learning_rate': 1.567507245296508e-05, 'epoch': 5.82} 58%|█████▊ | 5823/10000 [9:09:41<6:23:42, 5.51s/it][2025-06-19 22:39:26,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:39:26,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.27 | bwd_microstep: 3368.62 | bwd_inner_microstep: 3367.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 22:39:26,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.27 | bwd: 3368.63 | bwd_inner: 3367.84 | bwd_allreduce: 0.75 | step: 6.58 58%|█████▊ | 5824/10000 [9:09:47<6:24:25, 5.52s/it] {'loss': 0.0033, 'grad_norm': 1.3694268465042114, 'learning_rate': 1.5668748434784376e-05, 'epoch': 5.82} 58%|█████▊ | 5824/10000 [9:09:47<6:24:25, 5.52s/it][2025-06-19 22:39:31,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:39:31,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2154.24 | bwd_microstep: 3387.44 | bwd_inner_microstep: 3386.64 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.37 [2025-06-19 22:39:31,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2154.24 | bwd: 3387.45 | bwd_inner: 3386.64 | bwd_allreduce: 0.77 | step: 7.37 
58%|█████▊ | 5825/10000 [9:09:52<6:25:31, 5.54s/it] {'loss': 0.0079, 'grad_norm': 1.1509273052215576, 'learning_rate': 1.566242487093185e-05, 'epoch': 5.83} 58%|█████▊ | 5825/10000 [9:09:52<6:25:31, 5.54s/it][2025-06-19 22:39:37,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:39:37,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.64 | bwd_microstep: 3334.21 | bwd_inner_microstep: 3333.36 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.33 [2025-06-19 22:39:37,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.64 | bwd: 3334.24 | bwd_inner: 3333.36 | bwd_allreduce: 0.81 | step: 7.34 58%|█████▊ | 5826/10000 [9:09:58<6:24:09, 5.52s/it] {'loss': 0.0647, 'grad_norm': 6.625924110412598, 'learning_rate': 1.5656101762070823e-05, 'epoch': 5.83} 58%|█████▊ | 5826/10000 [9:09:58<6:24:09, 5.52s/it][2025-06-19 22:39:42,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:39:42,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.99 | bwd_microstep: 3321.73 | bwd_inner_microstep: 3320.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 22:39:42,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.99 | bwd: 3321.74 | bwd_inner: 3320.93 | bwd_allreduce: 0.77 | step: 6.69 58%|█████▊ | 5827/10000 [9:10:03<6:23:23, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.036003150045871735, 'learning_rate': 1.564977910886456e-05, 'epoch': 5.83} 58%|█████▊ | 5827/10000 [9:10:03<6:23:23, 5.51s/it][2025-06-19 22:39:48,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.91 [2025-06-19 22:39:48,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.01 | bwd_microstep: 3315.26 | 
bwd_inner_microstep: 3314.36 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.65 [2025-06-19 22:39:48,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.01 | bwd: 3315.28 | bwd_inner: 3314.36 | bwd_allreduce: 0.85 | step: 7.66 58%|█████▊ | 5828/10000 [9:10:09<6:22:20, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.09971751272678375, 'learning_rate': 1.564345691197628e-05, 'epoch': 5.83} 58%|█████▊ | 5828/10000 [9:10:09<6:22:20, 5.50s/it][2025-06-19 22:39:53,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 22:39:53,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2160.54 | bwd_microstep: 3387.69 | bwd_inner_microstep: 3386.73 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.55 [2025-06-19 22:39:53,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2160.54 | bwd: 3387.72 | bwd_inner: 3386.73 | bwd_allreduce: 0.92 | step: 7.56 58%|█████▊ | 5829/10000 [9:10:14<6:24:12, 5.53s/it] {'loss': 0.0113, 'grad_norm': 1.9623603820800781, 'learning_rate': 1.5637135172069155e-05, 'epoch': 5.83} 58%|█████▊ | 5829/10000 [9:10:14<6:24:12, 5.53s/it][2025-06-19 22:39:59,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:39:59,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.31 | bwd_microstep: 3328.02 | bwd_inner_microstep: 3327.12 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.50 [2025-06-19 22:39:59,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.31 | bwd: 3328.05 | bwd_inner: 3327.12 | bwd_allreduce: 0.85 | step: 7.50 58%|█████▊ | 5830/10000 [9:10:20<6:23:59, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.024775028228759766, 'learning_rate': 1.5630813889806297e-05, 'epoch': 5.83} 58%|█████▊ | 5830/10000 [9:10:20<6:23:59, 5.53s/it][2025-06-19 22:40:05,006] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:40:05,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.57 | bwd_microstep: 3323.27 | bwd_inner_microstep: 3322.46 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.15 [2025-06-19 22:40:05,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.57 | bwd: 3323.29 | bwd_inner: 3322.46 | bwd_allreduce: 0.79 | step: 7.15 58%|█████▊ | 5831/10000 [9:10:25<6:23:30, 5.52s/it] {'loss': 0.0034, 'grad_norm': 0.495505690574646, 'learning_rate': 1.5624493065850784e-05, 'epoch': 5.83} 58%|█████▊ | 5831/10000 [9:10:25<6:23:30, 5.52s/it][2025-06-19 22:40:10,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:40:10,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.88 | bwd_microstep: 3320.79 | bwd_inner_microstep: 3319.47 | bwd_allreduce_microstep: 1.23 | step_microstep: 8.40 [2025-06-19 22:40:10,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.88 | bwd: 3320.82 | bwd_inner: 3319.47 | bwd_allreduce: 1.28 | step: 8.39 58%|█████▊ | 5832/10000 [9:10:31<6:22:45, 5.51s/it] {'loss': 0.0246, 'grad_norm': 3.735145092010498, 'learning_rate': 1.561817270086564e-05, 'epoch': 5.83} 58%|█████▊ | 5832/10000 [9:10:31<6:22:45, 5.51s/it][2025-06-19 22:40:15,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:40:15,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.10 | bwd_microstep: 3329.53 | bwd_inner_microstep: 3328.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 22:40:15,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.10 | bwd: 3329.54 | bwd_inner: 3328.74 | bwd_allreduce: 0.76 | step: 6.68 58%|█████▊ | 
5833/10000 [9:10:36<6:22:24, 5.51s/it] {'loss': 0.0074, 'grad_norm': 0.9346563220024109, 'learning_rate': 1.5611852795513843e-05, 'epoch': 5.83} 58%|█████▊ | 5833/10000 [9:10:36<6:22:24, 5.51s/it][2025-06-19 22:40:21,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.72 [2025-06-19 22:40:21,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.19 | bwd_microstep: 3330.05 | bwd_inner_microstep: 3328.89 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.92 [2025-06-19 22:40:21,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.19 | bwd: 3330.08 | bwd_inner: 3328.89 | bwd_allreduce: 1.11 | step: 7.91 58%|█████▊ | 5834/10000 [9:10:42<6:21:43, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.08303319662809372, 'learning_rate': 1.5605533350458332e-05, 'epoch': 5.83} 58%|█████▊ | 5834/10000 [9:10:42<6:21:43, 5.50s/it][2025-06-19 22:40:27,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:40:27,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.44 | bwd_microstep: 3375.61 | bwd_inner_microstep: 3374.62 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.94 [2025-06-19 22:40:27,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.44 | bwd: 3375.63 | bwd_inner: 3374.62 | bwd_allreduce: 0.96 | step: 7.96 58%|█████▊ | 5835/10000 [9:10:47<6:22:54, 5.52s/it] {'loss': 0.0035, 'grad_norm': 0.5281729102134705, 'learning_rate': 1.5599214366361963e-05, 'epoch': 5.83} 58%|█████▊ | 5835/10000 [9:10:47<6:22:54, 5.52s/it][2025-06-19 22:40:32,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:40:32,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.29 | bwd_microstep: 3318.31 | bwd_inner_microstep: 
3317.40 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.19 [2025-06-19 22:40:32,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.29 | bwd: 3318.32 | bwd_inner: 3317.40 | bwd_allreduce: 0.88 | step: 7.20 58%|█████▊ | 5836/10000 [9:10:53<6:22:09, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.1190762147307396, 'learning_rate': 1.5592895843887586e-05, 'epoch': 5.84} 58%|█████▊ | 5836/10000 [9:10:53<6:22:09, 5.51s/it][2025-06-19 22:40:38,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 22:40:38,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.43 | bwd_microstep: 3340.63 | bwd_inner_microstep: 3339.51 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.56 [2025-06-19 22:40:38,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.43 | bwd: 3340.67 | bwd_inner: 3339.51 | bwd_allreduce: 1.07 | step: 7.54 58%|█████▊ | 5837/10000 [9:10:58<6:22:05, 5.51s/it] {'loss': 0.0144, 'grad_norm': 1.7022738456726074, 'learning_rate': 1.558657778369798e-05, 'epoch': 5.84} 58%|█████▊ | 5837/10000 [9:10:58<6:22:05, 5.51s/it][2025-06-19 22:40:43,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:40:43,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.84 | bwd_microstep: 3319.20 | bwd_inner_microstep: 3318.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 22:40:43,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.84 | bwd: 3319.21 | bwd_inner: 3318.40 | bwd_allreduce: 0.76 | step: 6.92 58%|█████▊ | 5838/10000 [9:11:04<6:21:29, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.10958836227655411, 'learning_rate': 1.5580260186455887e-05, 'epoch': 5.84} 58%|█████▊ | 5838/10000 [9:11:04<6:21:29, 5.50s/it][2025-06-19 22:40:48,987] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:40:48,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.79 | bwd_microstep: 3319.39 | bwd_inner_microstep: 3318.58 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.74 [2025-06-19 22:40:48,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.79 | bwd: 3319.41 | bwd_inner: 3318.58 | bwd_allreduce: 0.78 | step: 6.74 58%|█████▊ | 5839/10000 [9:11:09<6:21:03, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.052467674016952515, 'learning_rate': 1.557394305282399e-05, 'epoch': 5.84} 58%|█████▊ | 5839/10000 [9:11:09<6:21:03, 5.49s/it][2025-06-19 22:40:54,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:40:54,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.26 | bwd_microstep: 3315.86 | bwd_inner_microstep: 3315.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-19 22:40:54,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.26 | bwd: 3315.87 | bwd_inner: 3315.07 | bwd_allreduce: 0.76 | step: 6.79 58%|█████▊ | 5840/10000 [9:11:15<6:20:20, 5.49s/it] {'loss': 0.0026, 'grad_norm': 0.35020601749420166, 'learning_rate': 1.556762638346492e-05, 'epoch': 5.84} 58%|█████▊ | 5840/10000 [9:11:15<6:20:20, 5.49s/it][2025-06-19 22:40:59,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:40:59,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.04 | bwd_microstep: 3326.59 | bwd_inner_microstep: 3325.72 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.34 [2025-06-19 22:40:59,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.04 | bwd: 3326.61 | bwd_inner: 3325.72 | bwd_allreduce: 0.83 | step: 7.34 58%|█████▊ | 5841/10000 [9:11:20<6:20:10, 
5.48s/it] {'loss': 0.0007, 'grad_norm': 0.1329565793275833, 'learning_rate': 1.5561310179041268e-05, 'epoch': 5.84} 58%|█████▊ | 5841/10000 [9:11:20<6:20:10, 5.48s/it][2025-06-19 22:41:05,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 22:41:05,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.53 | bwd_microstep: 3320.54 | bwd_inner_microstep: 3319.46 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.31 [2025-06-19 22:41:05,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.53 | bwd: 3320.56 | bwd_inner: 3319.46 | bwd_allreduce: 1.04 | step: 7.31 58%|█████▊ | 5842/10000 [9:11:26<6:19:50, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.014401711523532867, 'learning_rate': 1.5554994440215582e-05, 'epoch': 5.84} 58%|█████▊ | 5842/10000 [9:11:26<6:19:50, 5.48s/it][2025-06-19 22:41:10,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 22:41:10,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.17 | bwd_microstep: 3321.77 | bwd_inner_microstep: 3320.43 | bwd_allreduce_microstep: 1.25 | step_microstep: 9.05 [2025-06-19 22:41:10,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.17 | bwd: 3321.80 | bwd_inner: 3320.43 | bwd_allreduce: 1.29 | step: 9.04 58%|█████▊ | 5843/10000 [9:11:31<6:19:28, 5.48s/it] {'loss': 0.0011, 'grad_norm': 0.16084276139736176, 'learning_rate': 1.5548679167650346e-05, 'epoch': 5.84} 58%|█████▊ | 5843/10000 [9:11:31<6:19:28, 5.48s/it][2025-06-19 22:41:16,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:41:16,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.63 | bwd_microstep: 3372.72 | bwd_inner_microstep: 3371.87 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 7.29 [2025-06-19 22:41:16,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.63 | bwd: 3372.74 | bwd_inner: 3371.87 | bwd_allreduce: 0.81 | step: 7.30 58%|█████▊ | 5844/10000 [9:11:37<6:20:59, 5.50s/it] {'loss': 0.047, 'grad_norm': 7.411352634429932, 'learning_rate': 1.554236436200801e-05, 'epoch': 5.84} 58%|█████▊ | 5844/10000 [9:11:37<6:20:59, 5.50s/it][2025-06-19 22:41:21,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:41:21,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.76 | bwd_microstep: 3323.23 | bwd_inner_microstep: 3322.36 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.14 [2025-06-19 22:41:21,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.76 | bwd: 3323.25 | bwd_inner: 3322.36 | bwd_allreduce: 0.83 | step: 7.14 58%|█████▊ | 5845/10000 [9:11:42<6:20:29, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.027837766334414482, 'learning_rate': 1.5536050023950963e-05, 'epoch': 5.84} 58%|█████▊ | 5845/10000 [9:11:42<6:20:29, 5.49s/it][2025-06-19 22:41:27,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.72 [2025-06-19 22:41:27,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.41 | bwd_microstep: 3376.58 | bwd_inner_microstep: 3375.32 | bwd_allreduce_microstep: 1.17 | step_microstep: 8.68 [2025-06-19 22:41:27,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.41 | bwd: 3376.60 | bwd_inner: 3375.32 | bwd_allreduce: 1.21 | step: 8.71 58%|█████▊ | 5846/10000 [9:11:48<6:21:46, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.005644946359097958, 'learning_rate': 1.5529736154141545e-05, 'epoch': 5.85} 58%|█████▊ | 5846/10000 [9:11:48<6:21:46, 5.51s/it][2025-06-19 22:41:32,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 22:41:32,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.20 | bwd_microstep: 3325.36 | bwd_inner_microstep: 3324.42 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.51 [2025-06-19 22:41:32,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.20 | bwd: 3325.38 | bwd_inner: 3324.42 | bwd_allreduce: 0.91 | step: 7.51 58%|█████▊ | 5847/10000 [9:11:53<6:21:00, 5.50s/it] {'loss': 0.119, 'grad_norm': 5.427291393280029, 'learning_rate': 1.5523422753242064e-05, 'epoch': 5.85} 58%|█████▊ | 5847/10000 [9:11:53<6:21:00, 5.50s/it][2025-06-19 22:41:38,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:41:38,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.15 | bwd_microstep: 3376.66 | bwd_inner_microstep: 3375.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 22:41:38,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.15 | bwd: 3376.67 | bwd_inner: 3375.88 | bwd_allreduce: 0.76 | step: 6.57 58%|█████▊ | 5848/10000 [9:11:59<6:22:04, 5.52s/it] {'loss': 0.0005, 'grad_norm': 0.17457540333271027, 'learning_rate': 1.5517109821914757e-05, 'epoch': 5.85} 58%|█████▊ | 5848/10000 [9:11:59<6:22:04, 5.52s/it][2025-06-19 22:41:43,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:41:43,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.51 | bwd_microstep: 3323.65 | bwd_inner_microstep: 3322.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.68 [2025-06-19 22:41:43,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.51 | bwd: 3323.67 | bwd_inner: 3322.85 | bwd_allreduce: 0.78 | step: 6.68 58%|█████▊ | 5849/10000 [9:12:04<6:20:54, 5.51s/it] 
{'loss': 0.0006, 'grad_norm': 0.06331804394721985, 'learning_rate': 1.551079736082182e-05, 'epoch': 5.85} 58%|█████▊ | 5849/10000 [9:12:04<6:20:54, 5.51s/it][2025-06-19 22:41:49,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:41:49,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.35 | bwd_microstep: 3319.11 | bwd_inner_microstep: 3318.26 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.86 [2025-06-19 22:41:49,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.35 | bwd: 3319.12 | bwd_inner: 3318.26 | bwd_allreduce: 0.82 | step: 6.86 58%|█████▊ | 5850/10000 [9:12:10<6:19:52, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.021838726475834846, 'learning_rate': 1.5504485370625418e-05, 'epoch': 5.85} 58%|█████▊ | 5850/10000 [9:12:10<6:19:52, 5.49s/it][2025-06-19 22:41:54,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:41:54,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.15 | bwd_microstep: 3321.15 | bwd_inner_microstep: 3320.37 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 22:41:54,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.15 | bwd: 3321.17 | bwd_inner: 3320.37 | bwd_allreduce: 0.76 | step: 6.56 59%|█████▊ | 5851/10000 [9:12:15<6:19:22, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.05138440430164337, 'learning_rate': 1.549817385198763e-05, 'epoch': 5.85} 59%|█████▊ | 5851/10000 [9:12:15<6:19:22, 5.49s/it][2025-06-19 22:42:00,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:42:00,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.43 | bwd_microstep: 3320.01 | bwd_inner_microstep: 3319.11 | bwd_allreduce_microstep: 0.85 | 
step_microstep: 7.00 [2025-06-19 22:42:00,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.44 | bwd: 3320.03 | bwd_inner: 3319.11 | bwd_allreduce: 0.87 | step: 7.00 59%|█████▊ | 5852/10000 [9:12:21<6:18:45, 5.48s/it] {'loss': 0.1014, 'grad_norm': 3.5191397666931152, 'learning_rate': 1.5491862805570504e-05, 'epoch': 5.85} 59%|█████▊ | 5852/10000 [9:12:21<6:18:45, 5.48s/it][2025-06-19 22:42:05,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:42:05,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.29 | bwd_microstep: 3370.78 | bwd_inner_microstep: 3369.97 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.13 [2025-06-19 22:42:05,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.29 | bwd: 3370.80 | bwd_inner: 3369.97 | bwd_allreduce: 0.79 | step: 7.13 59%|█████▊ | 5853/10000 [9:12:26<6:20:00, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.29987144470214844, 'learning_rate': 1.5485552232036048e-05, 'epoch': 5.85} 59%|█████▊ | 5853/10000 [9:12:26<6:20:00, 5.50s/it][2025-06-19 22:42:11,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 22:42:11,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.18 | bwd_microstep: 3320.18 | bwd_inner_microstep: 3319.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 22:42:11,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.18 | bwd: 3320.20 | bwd_inner: 3319.41 | bwd_allreduce: 0.75 | step: 6.60 59%|█████▊ | 5854/10000 [9:12:32<6:19:22, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.01703653857111931, 'learning_rate': 1.5479242132046217e-05, 'epoch': 5.85} 59%|█████▊ | 5854/10000 [9:12:32<6:19:22, 5.49s/it][2025-06-19 22:42:16,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:42:16,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.72 | bwd_microstep: 3317.62 | bwd_inner_microstep: 3316.80 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.78 [2025-06-19 22:42:16,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.72 | bwd: 3317.64 | bwd_inner: 3316.80 | bwd_allreduce: 0.80 | step: 6.79 59%|█████▊ | 5855/10000 [9:12:37<6:18:35, 5.48s/it] {'loss': 0.0533, 'grad_norm': 7.137799263000488, 'learning_rate': 1.5472932506262902e-05, 'epoch': 5.86} 59%|█████▊ | 5855/10000 [9:12:37<6:18:35, 5.48s/it][2025-06-19 22:42:22,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:42:22,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.27 | bwd_microstep: 3314.44 | bwd_inner_microstep: 3313.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 22:42:22,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.27 | bwd: 3314.45 | bwd_inner: 3313.65 | bwd_allreduce: 0.76 | step: 6.75 59%|█████▊ | 5856/10000 [9:12:43<6:18:03, 5.47s/it] {'loss': 0.0054, 'grad_norm': 0.8717175126075745, 'learning_rate': 1.5466623355347958e-05, 'epoch': 5.86} 59%|█████▊ | 5856/10000 [9:12:43<6:18:03, 5.47s/it][2025-06-19 22:42:27,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:42:27,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.82 | bwd_microstep: 3314.66 | bwd_inner_microstep: 3313.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 22:42:27,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.82 | bwd: 3314.67 | bwd_inner: 3313.87 | bwd_allreduce: 0.76 | step: 6.65 59%|█████▊ | 5857/10000 [9:12:48<6:17:45, 5.47s/it] {'loss': 0.1616, 'grad_norm': 
6.655195713043213, 'learning_rate': 1.5460314679963183e-05, 'epoch': 5.86} 59%|█████▊ | 5857/10000 [9:12:48<6:17:45, 5.47s/it][2025-06-19 22:42:33,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:42:33,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.85 | bwd_microstep: 3363.71 | bwd_inner_microstep: 3362.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 22:42:33,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.85 | bwd: 3363.72 | bwd_inner: 3362.90 | bwd_allreduce: 0.78 | step: 7.19 59%|█████▊ | 5858/10000 [9:12:54<6:18:50, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.15448230504989624, 'learning_rate': 1.5454006480770325e-05, 'epoch': 5.86} 59%|█████▊ | 5858/10000 [9:12:54<6:18:50, 5.49s/it][2025-06-19 22:42:38,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:42:38,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.99 | bwd_microstep: 3356.89 | bwd_inner_microstep: 3355.98 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.98 [2025-06-19 22:42:38,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.98 | bwd: 3356.91 | bwd_inner: 3355.98 | bwd_allreduce: 0.88 | step: 6.99 59%|█████▊ | 5859/10000 [9:12:59<6:19:23, 5.50s/it] {'loss': 0.054, 'grad_norm': 3.7528438568115234, 'learning_rate': 1.544769875843109e-05, 'epoch': 5.86} 59%|█████▊ | 5859/10000 [9:12:59<6:19:23, 5.50s/it][2025-06-19 22:42:44,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:42:44,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.90 | bwd_microstep: 3318.78 | bwd_inner_microstep: 3317.96 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.88 [2025-06-19 
22:42:44,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.90 | bwd: 3318.80 | bwd_inner: 3317.97 | bwd_allreduce: 0.79 | step: 6.88 59%|█████▊ | 5860/10000 [9:13:05<6:18:42, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.526850163936615, 'learning_rate': 1.5441391513607138e-05, 'epoch': 5.86} 59%|█████▊ | 5860/10000 [9:13:05<6:18:42, 5.49s/it][2025-06-19 22:42:49,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:42:49,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.26 | bwd_microstep: 3313.27 | bwd_inner_microstep: 3312.38 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.66 [2025-06-19 22:42:49,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.26 | bwd: 3313.29 | bwd_inner: 3312.38 | bwd_allreduce: 0.84 | step: 7.66 59%|█████▊ | 5861/10000 [9:13:10<6:17:52, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004853988066315651, 'learning_rate': 1.543508474696005e-05, 'epoch': 5.86} 59%|█████▊ | 5861/10000 [9:13:10<6:17:52, 5.48s/it][2025-06-19 22:42:55,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:42:55,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.00 | bwd_microstep: 3318.53 | bwd_inner_microstep: 3317.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 22:42:55,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.00 | bwd: 3318.54 | bwd_inner: 3317.74 | bwd_allreduce: 0.76 | step: 6.69 59%|█████▊ | 5862/10000 [9:13:16<6:17:29, 5.47s/it] {'loss': 0.0016, 'grad_norm': 0.3335912823677063, 'learning_rate': 1.5428778459151384e-05, 'epoch': 5.86} 59%|█████▊ | 5862/10000 [9:13:16<6:17:29, 5.47s/it][2025-06-19 22:43:00,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 
2.73 [2025-06-19 22:43:00,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.14 | bwd_microstep: 3313.01 | bwd_inner_microstep: 3312.05 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.56 [2025-06-19 22:43:00,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.14 | bwd: 3313.03 | bwd_inner: 3312.05 | bwd_allreduce: 0.92 | step: 7.56 59%|█████▊ | 5863/10000 [9:13:21<6:17:06, 5.47s/it] {'loss': 0.0646, 'grad_norm': 3.9225878715515137, 'learning_rate': 1.542247265084264e-05, 'epoch': 5.86} 59%|█████▊ | 5863/10000 [9:13:21<6:17:06, 5.47s/it][2025-06-19 22:43:06,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:43:06,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.57 | bwd_microstep: 3318.48 | bwd_inner_microstep: 3317.58 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.97 [2025-06-19 22:43:06,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.57 | bwd: 3318.49 | bwd_inner: 3317.58 | bwd_allreduce: 0.87 | step: 6.98 59%|█████▊ | 5864/10000 [9:13:26<6:17:01, 5.47s/it] {'loss': 0.0082, 'grad_norm': 1.0231168270111084, 'learning_rate': 1.5416167322695273e-05, 'epoch': 5.86} 59%|█████▊ | 5864/10000 [9:13:26<6:17:01, 5.47s/it][2025-06-19 22:43:11,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:43:11,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.36 | bwd_microstep: 3363.53 | bwd_inner_microstep: 3362.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 22:43:11,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.36 | bwd: 3363.54 | bwd_inner: 3362.74 | bwd_allreduce: 0.76 | step: 6.64 59%|█████▊ | 5865/10000 [9:13:32<6:18:16, 5.49s/it] {'loss': 0.0645, 'grad_norm': 4.458878993988037, 'learning_rate': 
1.5409862475370684e-05, 'epoch': 5.87} 59%|█████▊ | 5865/10000 [9:13:32<6:18:16, 5.49s/it][2025-06-19 22:43:17,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:43:17,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.24 | bwd_microstep: 3368.96 | bwd_inner_microstep: 3367.96 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.61 [2025-06-19 22:43:17,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.24 | bwd: 3368.98 | bwd_inner: 3367.96 | bwd_allreduce: 0.96 | step: 7.61 59%|█████▊ | 5866/10000 [9:13:38<6:19:13, 5.50s/it] {'loss': 0.0101, 'grad_norm': 1.8007487058639526, 'learning_rate': 1.5403558109530214e-05, 'epoch': 5.87} 59%|█████▊ | 5866/10000 [9:13:38<6:19:13, 5.50s/it][2025-06-19 22:43:22,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:43:22,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.87 | bwd_microstep: 3320.09 | bwd_inner_microstep: 3319.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 22:43:22,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.87 | bwd: 3320.10 | bwd_inner: 3319.30 | bwd_allreduce: 0.76 | step: 6.60 59%|█████▊ | 5867/10000 [9:13:43<6:18:19, 5.49s/it] {'loss': 0.101, 'grad_norm': 8.278355598449707, 'learning_rate': 1.5397254225835165e-05, 'epoch': 5.87} 59%|█████▊ | 5867/10000 [9:13:43<6:18:19, 5.49s/it][2025-06-19 22:43:28,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:43:28,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.62 | bwd_microstep: 3333.29 | bwd_inner_microstep: 3332.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 22:43:28,143] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.62 | bwd: 3333.30 | bwd_inner: 3332.50 | bwd_allreduce: 0.76 | step: 6.68 59%|█████▊ | 5868/10000 [9:13:48<6:17:49, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.30710339546203613, 'learning_rate': 1.5390950824946786e-05, 'epoch': 5.87} 59%|█████▊ | 5868/10000 [9:13:48<6:17:49, 5.49s/it][2025-06-19 22:43:33,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:43:33,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.90 | bwd_microstep: 3323.44 | bwd_inner_microstep: 3322.63 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 22:43:33,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.90 | bwd: 3323.45 | bwd_inner: 3322.63 | bwd_allreduce: 0.78 | step: 7.11 59%|█████▊ | 5869/10000 [9:13:54<6:17:27, 5.48s/it] {'loss': 0.0009, 'grad_norm': 0.17226462066173553, 'learning_rate': 1.5384647907526272e-05, 'epoch': 5.87} 59%|█████▊ | 5869/10000 [9:13:54<6:17:27, 5.48s/it][2025-06-19 22:43:39,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:43:39,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.70 | bwd_microstep: 3325.30 | bwd_inner_microstep: 3324.22 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.83 [2025-06-19 22:43:39,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.70 | bwd: 3325.32 | bwd_inner: 3324.22 | bwd_allreduce: 1.04 | step: 7.83 59%|█████▊ | 5870/10000 [9:13:59<6:17:07, 5.48s/it] {'loss': 0.0126, 'grad_norm': 4.0093817710876465, 'learning_rate': 1.5378345474234778e-05, 'epoch': 5.87} 59%|█████▊ | 5870/10000 [9:13:59<6:17:07, 5.48s/it][2025-06-19 22:43:44,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
22:43:44,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.54 | bwd_microstep: 3363.47 | bwd_inner_microstep: 3362.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 22:43:44,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.54 | bwd: 3363.48 | bwd_inner: 3362.67 | bwd_allreduce: 0.77 | step: 6.99 59%|█████▊ | 5871/10000 [9:14:05<6:17:58, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.24309459328651428, 'learning_rate': 1.537204352573339e-05, 'epoch': 5.87} 59%|█████▊ | 5871/10000 [9:14:05<6:17:58, 5.49s/it][2025-06-19 22:43:50,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:43:50,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.78 | bwd_microstep: 3368.69 | bwd_inner_microstep: 3367.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 22:43:50,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.78 | bwd: 3368.70 | bwd_inner: 3367.90 | bwd_allreduce: 0.75 | step: 6.63 59%|█████▊ | 5872/10000 [9:14:10<6:18:39, 5.50s/it] {'loss': 0.0024, 'grad_norm': 0.46713513135910034, 'learning_rate': 1.5365742062683158e-05, 'epoch': 5.87} 59%|█████▊ | 5872/10000 [9:14:10<6:18:39, 5.50s/it][2025-06-19 22:43:55,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:43:55,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.07 | bwd_microstep: 3321.46 | bwd_inner_microstep: 3320.43 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.21 [2025-06-19 22:43:55,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.07 | bwd: 3321.48 | bwd_inner: 3320.43 | bwd_allreduce: 0.99 | step: 7.21 59%|█████▊ | 5873/10000 [9:14:16<6:17:47, 5.49s/it] {'loss': 0.0052, 'grad_norm': 1.2271106243133545, 'learning_rate': 1.5359441085745074e-05, 
'epoch': 5.87} 59%|█████▊ | 5873/10000 [9:14:16<6:17:47, 5.49s/it][2025-06-19 22:44:01,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:44:01,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.23 | bwd_microstep: 3316.29 | bwd_inner_microstep: 3315.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.84 [2025-06-19 22:44:01,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.23 | bwd: 3316.30 | bwd_inner: 3315.50 | bwd_allreduce: 0.76 | step: 6.85 59%|█████▊ | 5874/10000 [9:14:21<6:17:09, 5.48s/it] {'loss': 0.0487, 'grad_norm': 4.526946067810059, 'learning_rate': 1.5353140595580083e-05, 'epoch': 5.87} 59%|█████▊ | 5874/10000 [9:14:21<6:17:09, 5.48s/it][2025-06-19 22:44:06,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:44:06,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.46 | bwd_microstep: 3318.91 | bwd_inner_microstep: 3317.96 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.37 [2025-06-19 22:44:06,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.46 | bwd: 3318.92 | bwd_inner: 3317.96 | bwd_allreduce: 0.92 | step: 7.38 59%|█████▉ | 5875/10000 [9:14:27<6:16:38, 5.48s/it] {'loss': 0.0009, 'grad_norm': 0.09732680767774582, 'learning_rate': 1.5346840592849083e-05, 'epoch': 5.88} 59%|█████▉ | 5875/10000 [9:14:27<6:16:38, 5.48s/it][2025-06-19 22:44:12,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:44:12,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.19 | bwd_microstep: 3316.07 | bwd_inner_microstep: 3315.12 | bwd_allreduce_microstep: 0.91 | step_microstep: 6.62 [2025-06-19 22:44:12,005] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2110.19 | bwd: 3316.09 | bwd_inner: 3315.12 | bwd_allreduce: 0.93 | step: 6.62 59%|█████▉ | 5876/10000 [9:14:32<6:16:19, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.07061199843883514, 'learning_rate': 1.5340541078212903e-05, 'epoch': 5.88} 59%|█████▉ | 5876/10000 [9:14:32<6:16:19, 5.48s/it][2025-06-19 22:44:17,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:44:17,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.17 | bwd_microstep: 3368.96 | bwd_inner_microstep: 3368.14 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.25 [2025-06-19 22:44:17,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.17 | bwd: 3368.98 | bwd_inner: 3368.14 | bwd_allreduce: 0.80 | step: 7.26 59%|█████▉ | 5877/10000 [9:14:38<6:17:23, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.37125062942504883, 'learning_rate': 1.5334242052332336e-05, 'epoch': 5.88} 59%|█████▉ | 5877/10000 [9:14:38<6:17:23, 5.49s/it][2025-06-19 22:44:23,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:44:23,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.33 | bwd_microstep: 3363.49 | bwd_inner_microstep: 3362.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-19 22:44:23,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.33 | bwd: 3363.50 | bwd_inner: 3362.69 | bwd_allreduce: 0.77 | step: 6.79 59%|█████▉ | 5878/10000 [9:14:43<6:18:01, 5.50s/it] {'loss': 0.0063, 'grad_norm': 0.7996113300323486, 'learning_rate': 1.5327943515868127e-05, 'epoch': 5.88} 59%|█████▉ | 5878/10000 [9:14:43<6:18:01, 5.50s/it][2025-06-19 22:44:28,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:44:28,604] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.94 | bwd_microstep: 3369.83 | bwd_inner_microstep: 3369.03 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 22:44:28,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.94 | bwd: 3369.84 | bwd_inner: 3369.03 | bwd_allreduce: 0.77 | step: 7.14 59%|█████▉ | 5879/10000 [9:14:49<6:18:43, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.030595799908041954, 'learning_rate': 1.5321645469480956e-05, 'epoch': 5.88} 59%|█████▉ | 5879/10000 [9:14:49<6:18:43, 5.51s/it][2025-06-19 22:44:34,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:44:34,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.02 | bwd_microstep: 3319.26 | bwd_inner_microstep: 3318.14 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.16 [2025-06-19 22:44:34,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.02 | bwd: 3319.28 | bwd_inner: 3318.14 | bwd_allreduce: 1.09 | step: 7.17 59%|█████▉ | 5880/10000 [9:14:54<6:17:37, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.07172452658414841, 'learning_rate': 1.531534791383147e-05, 'epoch': 5.88} 59%|█████▉ | 5880/10000 [9:14:54<6:17:37, 5.50s/it][2025-06-19 22:44:39,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 22:44:39,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.92 | bwd_microstep: 3318.40 | bwd_inner_microstep: 3317.17 | bwd_allreduce_microstep: 1.17 | step_microstep: 8.13 [2025-06-19 22:44:39,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.92 | bwd: 3318.42 | bwd_inner: 3317.17 | bwd_allreduce: 1.19 | step: 8.13 59%|█████▉ | 5881/10000 [9:15:00<6:16:56, 5.49s/it] {'loss': 0.002, 'grad_norm': 0.2388119399547577, 'learning_rate': 1.5309050849580242e-05, 'epoch': 5.88} 
59%|█████▉ | 5881/10000 [9:15:00<6:16:56, 5.49s/it][2025-06-19 22:44:45,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:44:45,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.76 | bwd_microstep: 3319.58 | bwd_inner_microstep: 3318.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 22:44:45,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.76 | bwd: 3319.60 | bwd_inner: 3318.79 | bwd_allreduce: 0.76 | step: 6.70 59%|█████▉ | 5882/10000 [9:15:05<6:16:19, 5.48s/it] {'loss': 0.0024, 'grad_norm': 0.40363040566444397, 'learning_rate': 1.5302754277387806e-05, 'epoch': 5.88} 59%|█████▉ | 5882/10000 [9:15:05<6:16:19, 5.48s/it][2025-06-19 22:44:50,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.79 [2025-06-19 22:44:50,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.09 | bwd_microstep: 3326.48 | bwd_inner_microstep: 3325.56 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.48 [2025-06-19 22:44:50,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.09 | bwd: 3326.49 | bwd_inner: 3325.57 | bwd_allreduce: 0.88 | step: 7.48 59%|█████▉ | 5883/10000 [9:15:11<6:16:02, 5.48s/it] {'loss': 0.0182, 'grad_norm': 7.703607559204102, 'learning_rate': 1.5296458197914646e-05, 'epoch': 5.88} 59%|█████▉ | 5883/10000 [9:15:11<6:16:02, 5.48s/it][2025-06-19 22:44:55,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 22:44:55,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.75 | bwd_microstep: 3323.14 | bwd_inner_microstep: 3322.29 | bwd_allreduce_microstep: 0.80 | step_microstep: 8.12 [2025-06-19 22:44:55,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2105.75 | bwd: 3323.16 | bwd_inner: 3322.29 | bwd_allreduce: 0.82 | step: 8.13 59%|█████▉ | 5884/10000 [9:15:16<6:15:47, 5.48s/it] {'loss': 0.0108, 'grad_norm': 2.7765657901763916, 'learning_rate': 1.5290162611821192e-05, 'epoch': 5.88} 59%|█████▉ | 5884/10000 [9:15:16<6:15:47, 5.48s/it][2025-06-19 22:45:01,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 22:45:01,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.33 | bwd_microstep: 3323.52 | bwd_inner_microstep: 3322.36 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.30 [2025-06-19 22:45:01,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.33 | bwd: 3323.54 | bwd_inner: 3322.36 | bwd_allreduce: 1.12 | step: 8.30 59%|█████▉ | 5885/10000 [9:15:22<6:16:03, 5.48s/it] {'loss': 0.0251, 'grad_norm': 4.213940143585205, 'learning_rate': 1.5283867519767827e-05, 'epoch': 5.88} 59%|█████▉ | 5885/10000 [9:15:22<6:16:03, 5.48s/it][2025-06-19 22:45:06,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:45:06,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.30 | bwd_microstep: 3319.18 | bwd_inner_microstep: 3318.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 22:45:06,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.30 | bwd: 3319.19 | bwd_inner: 3318.38 | bwd_allreduce: 0.77 | step: 6.83 59%|█████▉ | 5886/10000 [9:15:27<6:15:44, 5.48s/it] {'loss': 0.0174, 'grad_norm': 1.566551685333252, 'learning_rate': 1.5277572922414866e-05, 'epoch': 5.89} 59%|█████▉ | 5886/10000 [9:15:27<6:15:44, 5.48s/it][2025-06-19 22:45:12,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:45:12,394] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2105.46 | bwd_microstep: 3328.94 | bwd_inner_microstep: 3328.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-19 22:45:12,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.46 | bwd: 3328.95 | bwd_inner: 3328.13 | bwd_allreduce: 0.78 | step: 7.23 59%|█████▉ | 5887/10000 [9:15:33<6:15:33, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.005054227542132139, 'learning_rate': 1.5271278820422584e-05, 'epoch': 5.89} 59%|█████▉ | 5887/10000 [9:15:33<6:15:33, 5.48s/it][2025-06-19 22:45:17,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:45:17,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.06 | bwd_microstep: 3317.20 | bwd_inner_microstep: 3316.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-19 22:45:17,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.06 | bwd: 3317.22 | bwd_inner: 3316.38 | bwd_allreduce: 0.78 | step: 6.84 59%|█████▉ | 5888/10000 [9:15:38<6:15:08, 5.47s/it] {'loss': 0.0039, 'grad_norm': 1.0787670612335205, 'learning_rate': 1.5264985214451208e-05, 'epoch': 5.89} 59%|█████▉ | 5888/10000 [9:15:38<6:15:08, 5.47s/it][2025-06-19 22:45:23,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:45:23,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.12 | bwd_microstep: 3369.22 | bwd_inner_microstep: 3368.39 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.34 [2025-06-19 22:45:23,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.12 | bwd: 3369.23 | bwd_inner: 3368.39 | bwd_allreduce: 0.79 | step: 7.34 59%|█████▉ | 5889/10000 [9:15:44<6:16:20, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.021253474056720734, 'learning_rate': 1.525869210516091e-05, 'epoch': 5.89} 59%|█████▉ | 5889/10000 [9:15:44<6:16:20, 
5.49s/it][2025-06-19 22:45:28,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:45:28,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.33 | bwd_microstep: 3316.89 | bwd_inner_microstep: 3316.00 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.98 [2025-06-19 22:45:28,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.33 | bwd: 3316.90 | bwd_inner: 3316.00 | bwd_allreduce: 0.86 | step: 6.98 59%|█████▉ | 5890/10000 [9:15:49<6:15:41, 5.48s/it] {'loss': 0.0532, 'grad_norm': 11.052403450012207, 'learning_rate': 1.5252399493211807e-05, 'epoch': 5.89} 59%|█████▉ | 5890/10000 [9:15:49<6:15:41, 5.48s/it][2025-06-19 22:45:34,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:45:34,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.07 | bwd_microstep: 3340.70 | bwd_inner_microstep: 3339.88 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.16 [2025-06-19 22:45:34,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.07 | bwd: 3340.72 | bwd_inner: 3339.88 | bwd_allreduce: 0.79 | step: 7.16 59%|█████▉ | 5891/10000 [9:15:55<6:15:39, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.08790736645460129, 'learning_rate': 1.524610737926396e-05, 'epoch': 5.89} 59%|█████▉ | 5891/10000 [9:15:55<6:15:39, 5.49s/it][2025-06-19 22:45:39,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:45:39,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.27 | bwd_microstep: 3324.19 | bwd_inner_microstep: 3323.30 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.92 [2025-06-19 22:45:39,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.27 | bwd: 3324.21 | bwd_inner: 3323.30 | 
bwd_allreduce: 0.86 | step: 6.93 59%|█████▉ | 5892/10000 [9:16:00<6:15:11, 5.48s/it] {'loss': 0.001, 'grad_norm': 0.21550631523132324, 'learning_rate': 1.5239815763977384e-05, 'epoch': 5.89} 59%|█████▉ | 5892/10000 [9:16:00<6:15:11, 5.48s/it][2025-06-19 22:45:45,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:45:45,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.89 | bwd_microstep: 3376.49 | bwd_inner_microstep: 3375.62 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.93 [2025-06-19 22:45:45,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.89 | bwd: 3376.50 | bwd_inner: 3375.62 | bwd_allreduce: 0.83 | step: 6.93 59%|█████▉ | 5893/10000 [9:16:06<6:16:22, 5.50s/it] {'loss': 0.0426, 'grad_norm': 4.8162713050842285, 'learning_rate': 1.5233524648012041e-05, 'epoch': 5.89} 59%|█████▉ | 5893/10000 [9:16:06<6:16:22, 5.50s/it][2025-06-19 22:45:50,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:45:50,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.45 | bwd_microstep: 3333.24 | bwd_inner_microstep: 3332.21 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.82 [2025-06-19 22:45:50,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.45 | bwd: 3333.26 | bwd_inner: 3332.21 | bwd_allreduce: 1.00 | step: 7.83 59%|█████▉ | 5894/10000 [9:16:11<6:15:59, 5.49s/it] {'loss': 0.0254, 'grad_norm': 2.295189380645752, 'learning_rate': 1.5227234032027842e-05, 'epoch': 5.89} 59%|█████▉ | 5894/10000 [9:16:11<6:15:59, 5.49s/it][2025-06-19 22:45:56,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:45:56,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.42 | 
bwd_microstep: 3373.04 | bwd_inner_microstep: 3372.22 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.36 [2025-06-19 22:45:56,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.42 | bwd: 3373.06 | bwd_inner: 3372.22 | bwd_allreduce: 0.79 | step: 7.36 59%|█████▉ | 5895/10000 [9:16:17<6:17:05, 5.51s/it] {'loss': 0.0046, 'grad_norm': 1.1172845363616943, 'learning_rate': 1.5220943916684652e-05, 'epoch': 5.89} 59%|█████▉ | 5895/10000 [9:16:17<6:17:05, 5.51s/it][2025-06-19 22:46:01,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:46:01,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.29 | bwd_microstep: 3334.49 | bwd_inner_microstep: 3333.51 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.37 [2025-06-19 22:46:01,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.29 | bwd: 3334.51 | bwd_inner: 3333.51 | bwd_allreduce: 0.95 | step: 7.37 59%|█████▉ | 5896/10000 [9:16:22<6:16:25, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.018657084554433823, 'learning_rate': 1.5214654302642254e-05, 'epoch': 5.9} 59%|█████▉ | 5896/10000 [9:16:22<6:16:25, 5.50s/it][2025-06-19 22:46:07,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:46:07,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.77 | bwd_microstep: 3330.01 | bwd_inner_microstep: 3329.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 22:46:07,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.77 | bwd: 3330.03 | bwd_inner: 3329.23 | bwd_allreduce: 0.76 | step: 6.65 59%|█████▉ | 5897/10000 [9:16:28<6:15:51, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.09452037513256073, 'learning_rate': 1.520836519056041e-05, 'epoch': 5.9} 59%|█████▉ | 5897/10000 [9:16:28<6:15:51, 5.50s/it][2025-06-19 22:46:12,907] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:46:12,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.43 | bwd_microstep: 3374.28 | bwd_inner_microstep: 3373.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 22:46:12,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.43 | bwd: 3374.30 | bwd_inner: 3373.49 | bwd_allreduce: 0.76 | step: 6.73 59%|█████▉ | 5898/10000 [9:16:33<6:16:52, 5.51s/it] {'loss': 0.0006, 'grad_norm': 0.08321049064397812, 'learning_rate': 1.520207658109882e-05, 'epoch': 5.9} 59%|█████▉ | 5898/10000 [9:16:33<6:16:52, 5.51s/it][2025-06-19 22:46:18,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:46:18,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.93 | bwd_microstep: 3407.42 | bwd_inner_microstep: 3406.47 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.26 [2025-06-19 22:46:18,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.93 | bwd: 3407.44 | bwd_inner: 3406.47 | bwd_allreduce: 0.92 | step: 7.27 59%|█████▉ | 5899/10000 [9:16:39<6:18:16, 5.53s/it] {'loss': 0.0099, 'grad_norm': 1.33432936668396, 'learning_rate': 1.5195788474917128e-05, 'epoch': 5.9} 59%|█████▉ | 5899/10000 [9:16:39<6:18:16, 5.53s/it][2025-06-19 22:46:23,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:46:23,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.14 | bwd_microstep: 3337.29 | bwd_inner_microstep: 3336.47 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.05 [2025-06-19 22:46:23,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.14 | bwd: 3337.30 | bwd_inner: 3336.47 | bwd_allreduce: 0.79 | step: 7.06 
59%|█████▉ | 5900/10000 [9:16:44<6:17:28, 5.52s/it] {'loss': 0.0012, 'grad_norm': 0.15642905235290527, 'learning_rate': 1.5189500872674934e-05, 'epoch': 5.9} 59%|█████▉ | 5900/10000 [9:16:44<6:17:28, 5.52s/it][2025-06-19 22:46:29,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 22:46:29,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.85 | bwd_microstep: 3388.27 | bwd_inner_microstep: 3387.36 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.94 [2025-06-19 22:46:29,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.85 | bwd: 3388.28 | bwd_inner: 3387.36 | bwd_allreduce: 0.88 | step: 6.94 59%|█████▉ | 5901/10000 [9:16:50<6:18:04, 5.53s/it] {'loss': 0.0213, 'grad_norm': 2.9038939476013184, 'learning_rate': 1.5183213775031767e-05, 'epoch': 5.9} 59%|█████▉ | 5901/10000 [9:16:50<6:18:04, 5.53s/it][2025-06-19 22:46:35,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:46:35,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.02 | bwd_microstep: 3410.62 | bwd_inner_microstep: 3409.74 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.53 [2025-06-19 22:46:35,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.02 | bwd: 3410.65 | bwd_inner: 3409.74 | bwd_allreduce: 0.84 | step: 7.53 59%|█████▉ | 5902/10000 [9:16:55<6:19:24, 5.55s/it] {'loss': 0.025, 'grad_norm': 6.234543800354004, 'learning_rate': 1.5176927182647118e-05, 'epoch': 5.9} 59%|█████▉ | 5902/10000 [9:16:55<6:19:24, 5.55s/it][2025-06-19 22:46:40,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:46:40,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2187.61 | bwd_microstep: 3404.13 | 
bwd_inner_microstep: 3403.31 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 22:46:40,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2187.61 | bwd: 3404.15 | bwd_inner: 3403.31 | bwd_allreduce: 0.79 | step: 7.30 59%|█████▉ | 5903/10000 [9:17:01<6:20:55, 5.58s/it] {'loss': 0.0003, 'grad_norm': 0.032412417232990265, 'learning_rate': 1.5170641096180426e-05, 'epoch': 5.9} 59%|█████▉ | 5903/10000 [9:17:01<6:20:55, 5.58s/it][2025-06-19 22:46:46,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-19 22:46:46,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.00 | bwd_microstep: 3382.05 | bwd_inner_microstep: 3380.97 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.73 [2025-06-19 22:46:46,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.00 | bwd: 3382.07 | bwd_inner: 3380.97 | bwd_allreduce: 1.04 | step: 7.73 59%|█████▉ | 5904/10000 [9:17:07<6:20:25, 5.57s/it] {'loss': 0.0295, 'grad_norm': 2.587994337081909, 'learning_rate': 1.5164355516291064e-05, 'epoch': 5.9} 59%|█████▉ | 5904/10000 [9:17:07<6:20:25, 5.57s/it][2025-06-19 22:46:51,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:46:51,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.10 | bwd_microstep: 3323.20 | bwd_inner_microstep: 3322.39 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 22:46:51,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.10 | bwd: 3323.22 | bwd_inner: 3322.39 | bwd_allreduce: 0.78 | step: 7.04 59%|█████▉ | 5905/10000 [9:17:12<6:18:21, 5.54s/it] {'loss': 0.0002, 'grad_norm': 0.03456811234354973, 'learning_rate': 1.5158070443638376e-05, 'epoch': 5.91} 59%|█████▉ | 5905/10000 [9:17:12<6:18:21, 5.54s/it][2025-06-19 22:46:57,301] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 22:46:57,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.26 | bwd_microstep: 3325.05 | bwd_inner_microstep: 3324.19 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.74 [2025-06-19 22:46:57,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.26 | bwd: 3325.08 | bwd_inner: 3324.19 | bwd_allreduce: 0.82 | step: 7.74 59%|█████▉ | 5906/10000 [9:17:18<6:17:02, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.004046890418976545, 'learning_rate': 1.515178587888162e-05, 'epoch': 5.91} 59%|█████▉ | 5906/10000 [9:17:18<6:17:02, 5.53s/it][2025-06-19 22:47:02,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.82 [2025-06-19 22:47:02,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.72 | bwd_microstep: 3323.21 | bwd_inner_microstep: 3322.33 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.31 [2025-06-19 22:47:02,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.72 | bwd: 3323.23 | bwd_inner: 3322.33 | bwd_allreduce: 0.84 | step: 7.32 59%|█████▉ | 5907/10000 [9:17:23<6:16:48, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.014431123621761799, 'learning_rate': 1.5145501822680022e-05, 'epoch': 5.91} 59%|█████▉ | 5907/10000 [9:17:23<6:16:48, 5.52s/it][2025-06-19 22:47:08,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 22:47:08,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.04 | bwd_microstep: 3328.82 | bwd_inner_microstep: 3327.65 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.53 [2025-06-19 22:47:08,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.04 | bwd: 3328.86 | bwd_inner: 3327.65 | bwd_allreduce: 1.11 | step: 8.53 
59%|█████▉ | 5908/10000 [9:17:29<6:16:38, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.06841219216585159, 'learning_rate': 1.5139218275692753e-05, 'epoch': 5.91} 59%|█████▉ | 5908/10000 [9:17:29<6:16:38, 5.52s/it][2025-06-19 22:47:13,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 22:47:13,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.30 | bwd_microstep: 3318.58 | bwd_inner_microstep: 3317.67 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.09 [2025-06-19 22:47:13,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.30 | bwd: 3318.60 | bwd_inner: 3317.67 | bwd_allreduce: 0.88 | step: 7.09 59%|█████▉ | 5909/10000 [9:17:34<6:16:24, 5.52s/it] {'loss': 0.0005, 'grad_norm': 0.10051063448190689, 'learning_rate': 1.5132935238578928e-05, 'epoch': 5.91} 59%|█████▉ | 5909/10000 [9:17:34<6:16:24, 5.52s/it][2025-06-19 22:47:19,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-19 22:47:19,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.71 | bwd_microstep: 3320.83 | bwd_inner_microstep: 3319.70 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.59 [2025-06-19 22:47:19,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.71 | bwd: 3320.86 | bwd_inner: 3319.70 | bwd_allreduce: 1.09 | step: 8.60 59%|█████▉ | 5910/10000 [9:17:40<6:16:24, 5.52s/it] {'loss': 0.0006, 'grad_norm': 0.06283529847860336, 'learning_rate': 1.5126652711997607e-05, 'epoch': 5.91} 59%|█████▉ | 5910/10000 [9:17:40<6:16:24, 5.52s/it][2025-06-19 22:47:24,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:47:24,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.69 | bwd_microstep: 3369.99 | 
bwd_inner_microstep: 3369.11 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.52 [2025-06-19 22:47:24,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.69 | bwd: 3370.02 | bwd_inner: 3369.11 | bwd_allreduce: 0.84 | step: 7.52 59%|█████▉ | 5911/10000 [9:17:45<6:17:17, 5.54s/it] {'loss': 0.1005, 'grad_norm': 10.95688247680664, 'learning_rate': 1.5120370696607809e-05, 'epoch': 5.91} 59%|█████▉ | 5911/10000 [9:17:45<6:17:17, 5.54s/it][2025-06-19 22:47:30,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:47:30,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.54 | bwd_microstep: 3335.08 | bwd_inner_microstep: 3334.19 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.85 [2025-06-19 22:47:30,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.54 | bwd: 3335.10 | bwd_inner: 3334.19 | bwd_allreduce: 0.85 | step: 7.85 59%|█████▉ | 5912/10000 [9:17:51<6:16:55, 5.53s/it] {'loss': 0.0005, 'grad_norm': 0.07999254763126373, 'learning_rate': 1.5114089193068468e-05, 'epoch': 5.91} 59%|█████▉ | 5912/10000 [9:17:51<6:16:55, 5.53s/it][2025-06-19 22:47:35,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 22:47:35,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.53 | bwd_microstep: 3322.09 | bwd_inner_microstep: 3321.16 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.73 [2025-06-19 22:47:35,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.53 | bwd: 3322.11 | bwd_inner: 3321.16 | bwd_allreduce: 0.90 | step: 7.75 59%|█████▉ | 5913/10000 [9:17:56<6:15:37, 5.51s/it] {'loss': 0.0037, 'grad_norm': 0.5026253461837769, 'learning_rate': 1.5107808202038495e-05, 'epoch': 5.91} 59%|█████▉ | 5913/10000 [9:17:56<6:15:37, 5.51s/it][2025-06-19 22:47:41,430] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 22:47:41,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.47 | bwd_microstep: 3331.10 | bwd_inner_microstep: 3330.23 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.92 [2025-06-19 22:47:41,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.47 | bwd: 3331.12 | bwd_inner: 3330.23 | bwd_allreduce: 0.85 | step: 6.92 59%|█████▉ | 5914/10000 [9:18:02<6:14:48, 5.50s/it] {'loss': 0.0006, 'grad_norm': 0.14906594157218933, 'learning_rate': 1.5101527724176737e-05, 'epoch': 5.91} 59%|█████▉ | 5914/10000 [9:18:02<6:14:48, 5.50s/it][2025-06-19 22:47:46,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:47:46,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.43 | bwd_microstep: 3369.14 | bwd_inner_microstep: 3368.24 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.86 [2025-06-19 22:47:46,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.43 | bwd: 3369.17 | bwd_inner: 3368.24 | bwd_allreduce: 0.86 | step: 7.87 59%|█████▉ | 5915/10000 [9:18:07<6:15:36, 5.52s/it] {'loss': 0.0071, 'grad_norm': 0.7330738306045532, 'learning_rate': 1.5095247760141984e-05, 'epoch': 5.92} 59%|█████▉ | 5915/10000 [9:18:07<6:15:36, 5.52s/it][2025-06-19 22:47:52,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:47:52,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.95 | bwd_microstep: 3322.14 | bwd_inner_microstep: 3321.24 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.52 [2025-06-19 22:47:52,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.95 | bwd: 3322.16 | bwd_inner: 3321.24 | bwd_allreduce: 0.85 | step: 7.53 
59%|█████▉ | 5916/10000 [9:18:13<6:15:31, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.027862440794706345, 'learning_rate': 1.5088968310592984e-05, 'epoch': 5.92} 59%|█████▉ | 5916/10000 [9:18:13<6:15:31, 5.52s/it][2025-06-19 22:47:58,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:47:58,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.43 | bwd_microstep: 3321.03 | bwd_inner_microstep: 3320.13 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.25 [2025-06-19 22:47:58,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.43 | bwd: 3321.05 | bwd_inner: 3320.13 | bwd_allreduce: 0.85 | step: 7.25 59%|█████▉ | 5917/10000 [9:18:18<6:15:31, 5.52s/it] {'loss': 0.0131, 'grad_norm': 1.06965172290802, 'learning_rate': 1.5082689376188408e-05, 'epoch': 5.92} 59%|█████▉ | 5917/10000 [9:18:18<6:15:31, 5.52s/it][2025-06-19 22:48:03,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:48:03,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2175.24 | bwd_microstep: 3385.17 | bwd_inner_microstep: 3384.29 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.31 [2025-06-19 22:48:03,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2175.24 | bwd: 3385.20 | bwd_inner: 3384.29 | bwd_allreduce: 0.83 | step: 7.32 59%|█████▉ | 5918/10000 [9:18:24<6:17:10, 5.54s/it] {'loss': 0.002, 'grad_norm': 0.3194243311882019, 'learning_rate': 1.5076410957586896e-05, 'epoch': 5.92} 59%|█████▉ | 5918/10000 [9:18:24<6:17:10, 5.54s/it][2025-06-19 22:48:09,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:48:09,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.47 | bwd_microstep: 3329.16 | 
bwd_inner_microstep: 3328.34 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-19 22:48:09,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.47 | bwd: 3329.18 | bwd_inner: 3328.34 | bwd_allreduce: 0.78 | step: 7.18 59%|█████▉ | 5919/10000 [9:18:29<6:16:37, 5.54s/it] {'loss': 0.0007, 'grad_norm': 0.07547981292009354, 'learning_rate': 1.5070133055447024e-05, 'epoch': 5.92} 59%|█████▉ | 5919/10000 [9:18:29<6:16:37, 5.54s/it][2025-06-19 22:48:14,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:48:14,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.74 | bwd_microstep: 3334.83 | bwd_inner_microstep: 3333.95 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.83 [2025-06-19 22:48:14,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.74 | bwd: 3334.84 | bwd_inner: 3333.95 | bwd_allreduce: 0.84 | step: 6.84 59%|█████▉ | 5920/10000 [9:18:35<6:15:39, 5.52s/it] {'loss': 0.0119, 'grad_norm': 1.2120914459228516, 'learning_rate': 1.5063855670427315e-05, 'epoch': 5.92} 59%|█████▉ | 5920/10000 [9:18:35<6:15:39, 5.52s/it][2025-06-19 22:48:20,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 22:48:20,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.50 | bwd_microstep: 3328.68 | bwd_inner_microstep: 3327.62 | bwd_allreduce_microstep: 0.98 | step_microstep: 6.98 [2025-06-19 22:48:20,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.50 | bwd: 3328.71 | bwd_inner: 3327.62 | bwd_allreduce: 1.01 | step: 6.97 59%|█████▉ | 5921/10000 [9:18:40<6:14:44, 5.51s/it] {'loss': 0.0009, 'grad_norm': 0.11442264914512634, 'learning_rate': 1.5057578803186244e-05, 'epoch': 5.92} 59%|█████▉ | 5921/10000 [9:18:40<6:14:44, 5.51s/it][2025-06-19 22:48:25,611] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:48:25,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.68 | bwd_microstep: 3322.78 | bwd_inner_microstep: 3321.78 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.24 [2025-06-19 22:48:25,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.68 | bwd: 3322.79 | bwd_inner: 3321.78 | bwd_allreduce: 0.96 | step: 7.25 59%|█████▉ | 5922/10000 [9:18:46<6:14:13, 5.51s/it] {'loss': 0.0336, 'grad_norm': 3.609333038330078, 'learning_rate': 1.505130245438221e-05, 'epoch': 5.92} 59%|█████▉ | 5922/10000 [9:18:46<6:14:13, 5.51s/it][2025-06-19 22:48:31,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:48:31,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.08 | bwd_microstep: 3333.02 | bwd_inner_microstep: 3332.20 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-19 22:48:31,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.08 | bwd: 3333.03 | bwd_inner: 3332.20 | bwd_allreduce: 0.79 | step: 7.29 59%|█████▉ | 5923/10000 [9:18:51<6:13:48, 5.50s/it] {'loss': 0.0014, 'grad_norm': 0.4889966547489166, 'learning_rate': 1.504502662467358e-05, 'epoch': 5.92} 59%|█████▉ | 5923/10000 [9:18:51<6:13:48, 5.50s/it][2025-06-19 22:48:36,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:48:36,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.12 | bwd_microstep: 3310.62 | bwd_inner_microstep: 3309.75 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.47 [2025-06-19 22:48:36,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.12 | bwd: 3310.65 | bwd_inner: 3309.75 | bwd_allreduce: 0.83 | step: 7.47 59%|█████▉ | 
5924/10000 [9:18:57<6:13:09, 5.49s/it] {'loss': 0.0078, 'grad_norm': 2.099708080291748, 'learning_rate': 1.5038751314718663e-05, 'epoch': 5.92} 59%|█████▉ | 5924/10000 [9:18:57<6:13:09, 5.49s/it][2025-06-19 22:48:42,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:48:42,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.57 | bwd_microstep: 3364.06 | bwd_inner_microstep: 3363.17 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.56 [2025-06-19 22:48:42,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.57 | bwd: 3364.08 | bwd_inner: 3363.17 | bwd_allreduce: 0.84 | step: 7.57 59%|█████▉ | 5925/10000 [9:19:02<6:14:09, 5.51s/it] {'loss': 0.0004, 'grad_norm': 0.07312425225973129, 'learning_rate': 1.5032476525175703e-05, 'epoch': 5.92} 59%|█████▉ | 5925/10000 [9:19:02<6:14:09, 5.51s/it][2025-06-19 22:48:47,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:48:47,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.97 | bwd_microstep: 3320.24 | bwd_inner_microstep: 3319.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 22:48:47,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.97 | bwd: 3320.26 | bwd_inner: 3319.44 | bwd_allreduce: 0.77 | step: 6.76 59%|█████▉ | 5926/10000 [9:19:08<6:13:25, 5.50s/it] {'loss': 0.0046, 'grad_norm': 0.8997000455856323, 'learning_rate': 1.5026202256702909e-05, 'epoch': 5.93} 59%|█████▉ | 5926/10000 [9:19:08<6:13:25, 5.50s/it][2025-06-19 22:48:53,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:48:53,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.90 | bwd_microstep: 3377.85 | bwd_inner_microstep: 3377.02 
| bwd_allreduce_microstep: 0.78 | step_microstep: 6.83 [2025-06-19 22:48:53,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.90 | bwd: 3377.86 | bwd_inner: 3377.02 | bwd_allreduce: 0.80 | step: 6.84 59%|█████▉ | 5927/10000 [9:19:13<6:14:42, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.012399302795529366, 'learning_rate': 1.5019928509958408e-05, 'epoch': 5.93} 59%|█████▉ | 5927/10000 [9:19:13<6:14:42, 5.52s/it][2025-06-19 22:48:58,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:48:58,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.19 | bwd_microstep: 3398.65 | bwd_inner_microstep: 3397.83 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.02 [2025-06-19 22:48:58,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.19 | bwd: 3398.67 | bwd_inner: 3397.83 | bwd_allreduce: 0.79 | step: 7.02 59%|█████▉ | 5928/10000 [9:19:19<6:15:56, 5.54s/it] {'loss': 0.0071, 'grad_norm': 0.9332082867622375, 'learning_rate': 1.5013655285600292e-05, 'epoch': 5.93} 59%|█████▉ | 5928/10000 [9:19:19<6:15:56, 5.54s/it][2025-06-19 22:49:04,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:49:04,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.91 | bwd_microstep: 3316.87 | bwd_inner_microstep: 3316.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-19 22:49:04,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.91 | bwd: 3316.88 | bwd_inner: 3316.07 | bwd_allreduce: 0.77 | step: 6.97 59%|█████▉ | 5929/10000 [9:19:25<6:14:34, 5.52s/it] {'loss': 0.0813, 'grad_norm': 6.94362735748291, 'learning_rate': 1.5007382584286595e-05, 'epoch': 5.93} 59%|█████▉ | 5929/10000 [9:19:25<6:14:34, 5.52s/it][2025-06-19 22:49:09,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:49:09,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.19 | bwd_microstep: 3315.01 | bwd_inner_microstep: 3314.13 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.35 [2025-06-19 22:49:09,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.19 | bwd: 3315.04 | bwd_inner: 3314.13 | bwd_allreduce: 0.84 | step: 7.35 59%|█████▉ | 5930/10000 [9:19:30<6:13:20, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.033591076731681824, 'learning_rate': 1.5001110406675294e-05, 'epoch': 5.93} 59%|█████▉ | 5930/10000 [9:19:30<6:13:20, 5.50s/it][2025-06-19 22:49:15,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:49:15,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.30 | bwd_microstep: 3315.56 | bwd_inner_microstep: 3314.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 22:49:15,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.30 | bwd: 3315.58 | bwd_inner: 3314.78 | bwd_allreduce: 0.75 | step: 6.58 59%|█████▉ | 5931/10000 [9:19:35<6:12:15, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006666163448244333, 'learning_rate': 1.499483875342432e-05, 'epoch': 5.93} 59%|█████▉ | 5931/10000 [9:19:35<6:12:15, 5.49s/it][2025-06-19 22:49:20,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:49:20,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.25 | bwd_microstep: 3326.60 | bwd_inner_microstep: 3325.43 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.95 [2025-06-19 22:49:20,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.25 | bwd: 3326.61 | bwd_inner: 3325.43 | bwd_allreduce: 1.14 | step: 7.96 59%|█████▉ | 5932/10000 [9:19:41<6:11:51, 5.48s/it] 
{'loss': 0.0007, 'grad_norm': 0.08491696417331696, 'learning_rate': 1.498856762519152e-05, 'epoch': 5.93} 59%|█████▉ | 5932/10000 [9:19:41<6:11:51, 5.48s/it][2025-06-19 22:49:26,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:49:26,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.10 | bwd_microstep: 3336.97 | bwd_inner_microstep: 3336.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 22:49:26,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.09 | bwd: 3336.99 | bwd_inner: 3336.18 | bwd_allreduce: 0.77 | step: 6.72 59%|█████▉ | 5933/10000 [9:19:46<6:11:54, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.06956922262907028, 'learning_rate': 1.4982297022634722e-05, 'epoch': 5.93} 59%|█████▉ | 5933/10000 [9:19:46<6:11:54, 5.49s/it][2025-06-19 22:49:31,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:49:31,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.82 | bwd_microstep: 3375.13 | bwd_inner_microstep: 3374.26 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.93 [2025-06-19 22:49:31,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.82 | bwd: 3375.15 | bwd_inner: 3374.26 | bwd_allreduce: 0.83 | step: 6.93 59%|█████▉ | 5934/10000 [9:19:52<6:13:19, 5.51s/it] {'loss': 0.0008, 'grad_norm': 0.14739446341991425, 'learning_rate': 1.4976026946411675e-05, 'epoch': 5.93} 59%|█████▉ | 5934/10000 [9:19:52<6:13:19, 5.51s/it][2025-06-19 22:49:37,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:49:37,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.87 | bwd_microstep: 3318.55 | bwd_inner_microstep: 3317.69 | bwd_allreduce_microstep: 0.80 | 
step_microstep: 7.89 [2025-06-19 22:49:37,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.87 | bwd: 3318.58 | bwd_inner: 3317.69 | bwd_allreduce: 0.82 | step: 7.91 59%|█████▉ | 5935/10000 [9:19:57<6:12:21, 5.50s/it] {'loss': 0.0086, 'grad_norm': 1.1907405853271484, 'learning_rate': 1.496975739718008e-05, 'epoch': 5.94} 59%|█████▉ | 5935/10000 [9:19:57<6:12:21, 5.50s/it][2025-06-19 22:49:42,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:49:42,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.78 | bwd_microstep: 3335.32 | bwd_inner_microstep: 3334.39 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.08 [2025-06-19 22:49:42,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.79 | bwd: 3335.33 | bwd_inner: 3334.39 | bwd_allreduce: 0.90 | step: 7.08 59%|█████▉ | 5936/10000 [9:20:03<6:11:58, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0017603762680664659, 'learning_rate': 1.4963488375597601e-05, 'epoch': 5.94} 59%|█████▉ | 5936/10000 [9:20:03<6:11:58, 5.49s/it][2025-06-19 22:49:48,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:49:48,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.07 | bwd_microstep: 3327.14 | bwd_inner_microstep: 3326.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 22:49:48,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.07 | bwd: 3327.15 | bwd_inner: 3326.35 | bwd_allreduce: 0.76 | step: 6.64 59%|█████▉ | 5937/10000 [9:20:08<6:11:33, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.0646585077047348, 'learning_rate': 1.4957219882321807e-05, 'epoch': 5.94} 59%|█████▉ | 5937/10000 [9:20:08<6:11:33, 5.49s/it][2025-06-19 22:49:53,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:49:53,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.62 | bwd_microstep: 3311.54 | bwd_inner_microstep: 3310.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 22:49:53,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.62 | bwd: 3311.55 | bwd_inner: 3310.73 | bwd_allreduce: 0.77 | step: 7.10 59%|█████▉ | 5938/10000 [9:20:14<6:10:42, 5.48s/it] {'loss': 0.0431, 'grad_norm': 10.689864158630371, 'learning_rate': 1.4950951918010247e-05, 'epoch': 5.94} 59%|█████▉ | 5938/10000 [9:20:14<6:10:42, 5.48s/it][2025-06-19 22:49:59,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:49:59,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.94 | bwd_microstep: 3361.57 | bwd_inner_microstep: 3360.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 22:49:59,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.94 | bwd: 3361.59 | bwd_inner: 3360.79 | bwd_allreduce: 0.76 | step: 6.71 59%|█████▉ | 5939/10000 [9:20:19<6:11:37, 5.49s/it] {'loss': 0.0174, 'grad_norm': 1.8618133068084717, 'learning_rate': 1.4944684483320394e-05, 'epoch': 5.94} 59%|█████▉ | 5939/10000 [9:20:19<6:11:37, 5.49s/it][2025-06-19 22:50:04,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:50:04,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.70 | bwd_microstep: 3312.36 | bwd_inner_microstep: 3311.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 22:50:04,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.70 | bwd: 3312.37 | bwd_inner: 3311.56 | bwd_allreduce: 0.77 | step: 6.72 59%|█████▉ | 5940/10000 [9:20:25<6:10:47, 5.48s/it] {'loss': 0.0863, 'grad_norm': 
5.749686241149902, 'learning_rate': 1.4938417578909676e-05, 'epoch': 5.94} 59%|█████▉ | 5940/10000 [9:20:25<6:10:47, 5.48s/it][2025-06-19 22:50:09,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:50:09,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.04 | bwd_microstep: 3312.22 | bwd_inner_microstep: 3311.44 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 22:50:09,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.04 | bwd: 3312.23 | bwd_inner: 3311.44 | bwd_allreduce: 0.75 | step: 6.65 59%|█████▉ | 5941/10000 [9:20:30<6:10:09, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0018789580790326, 'learning_rate': 1.4932151205435466e-05, 'epoch': 5.94} 59%|█████▉ | 5941/10000 [9:20:30<6:10:09, 5.47s/it][2025-06-19 22:50:15,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:50:15,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.67 | bwd_microstep: 3367.75 | bwd_inner_microstep: 3366.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-19 22:50:15,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.67 | bwd: 3367.76 | bwd_inner: 3366.94 | bwd_allreduce: 0.77 | step: 7.06 59%|█████▉ | 5942/10000 [9:20:36<6:11:21, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0007697553955949843, 'learning_rate': 1.4925885363555073e-05, 'epoch': 5.94} 59%|█████▉ | 5942/10000 [9:20:36<6:11:21, 5.49s/it][2025-06-19 22:50:20,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:50:20,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.70 | bwd_microstep: 3326.11 | bwd_inner_microstep: 3325.30 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.86 [2025-06-19 
22:50:20,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.70 | bwd: 3326.13 | bwd_inner: 3325.30 | bwd_allreduce: 0.79 | step: 6.86 59%|█████▉ | 5943/10000 [9:20:41<6:10:53, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0022346959449350834, 'learning_rate': 1.4919620053925755e-05, 'epoch': 5.94} 59%|█████▉ | 5943/10000 [9:20:41<6:10:53, 5.49s/it][2025-06-19 22:50:26,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:50:26,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.83 | bwd_microstep: 3364.42 | bwd_inner_microstep: 3363.59 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.34 [2025-06-19 22:50:26,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.83 | bwd: 3364.43 | bwd_inner: 3363.59 | bwd_allreduce: 0.80 | step: 7.35 59%|█████▉ | 5944/10000 [9:20:47<6:11:43, 5.50s/it] {'loss': 0.0051, 'grad_norm': 1.1288869380950928, 'learning_rate': 1.491335527720471e-05, 'epoch': 5.94} 59%|█████▉ | 5944/10000 [9:20:47<6:11:43, 5.50s/it][2025-06-19 22:50:32,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:50:32,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.40 | bwd_microstep: 3388.67 | bwd_inner_microstep: 3387.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 22:50:32,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.40 | bwd: 3388.68 | bwd_inner: 3387.87 | bwd_allreduce: 0.76 | step: 6.75 59%|█████▉ | 5945/10000 [9:20:52<6:12:58, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.03904523700475693, 'learning_rate': 1.4907091034049095e-05, 'epoch': 5.95} 59%|█████▉ | 5945/10000 [9:20:52<6:12:58, 5.52s/it][2025-06-19 22:50:37,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | 
optimizer_step: 2.73 [2025-06-19 22:50:37,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.83 | bwd_microstep: 3312.61 | bwd_inner_microstep: 3311.56 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.19 [2025-06-19 22:50:37,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.83 | bwd: 3312.63 | bwd_inner: 3311.56 | bwd_allreduce: 1.01 | step: 7.19 59%|█████▉ | 5946/10000 [9:20:58<6:11:33, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.009942847304046154, 'learning_rate': 1.4900827325115996e-05, 'epoch': 5.95} 59%|█████▉ | 5946/10000 [9:20:58<6:11:33, 5.50s/it][2025-06-19 22:50:43,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:50:43,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.99 | bwd_microstep: 3322.56 | bwd_inner_microstep: 3321.73 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.86 [2025-06-19 22:50:43,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.99 | bwd: 3322.58 | bwd_inner: 3321.73 | bwd_allreduce: 0.80 | step: 6.86 59%|█████▉ | 5947/10000 [9:21:03<6:10:56, 5.49s/it] {'loss': 0.1133, 'grad_norm': 5.159591197967529, 'learning_rate': 1.489456415106244e-05, 'epoch': 5.95} 59%|█████▉ | 5947/10000 [9:21:03<6:10:56, 5.49s/it][2025-06-19 22:50:48,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:50:48,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.55 | bwd_microstep: 3319.24 | bwd_inner_microstep: 3318.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 22:50:48,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.55 | bwd: 3319.25 | bwd_inner: 3318.44 | bwd_allreduce: 0.77 | step: 6.66 59%|█████▉ | 5948/10000 [9:21:09<6:10:15, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.033065035939216614, 
'learning_rate': 1.4888301512545409e-05, 'epoch': 5.95} 59%|█████▉ | 5948/10000 [9:21:09<6:10:15, 5.48s/it][2025-06-19 22:50:53,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:50:53,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.58 | bwd_microstep: 3363.08 | bwd_inner_microstep: 3362.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-19 22:50:53,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.58 | bwd: 3363.09 | bwd_inner: 3362.28 | bwd_allreduce: 0.77 | step: 7.08 59%|█████▉ | 5949/10000 [9:21:14<6:11:01, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01646195724606514, 'learning_rate': 1.4882039410221825e-05, 'epoch': 5.95} 59%|█████▉ | 5949/10000 [9:21:14<6:11:01, 5.50s/it][2025-06-19 22:50:59,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.74 [2025-06-19 22:50:59,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.16 | bwd_microstep: 3310.79 | bwd_inner_microstep: 3309.93 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.90 [2025-06-19 22:50:59,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.15 | bwd: 3310.80 | bwd_inner: 3309.93 | bwd_allreduce: 0.83 | step: 6.90 60%|█████▉ | 5950/10000 [9:21:20<6:10:05, 5.48s/it] {'loss': 0.0101, 'grad_norm': 1.3404958248138428, 'learning_rate': 1.4875777844748553e-05, 'epoch': 5.95} 60%|█████▉ | 5950/10000 [9:21:20<6:10:05, 5.48s/it][2025-06-19 22:51:04,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:51:04,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.84 | bwd_microstep: 3358.18 | bwd_inner_microstep: 3357.26 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.28 [2025-06-19 22:51:04,978] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.84 | bwd: 3358.19 | bwd_inner: 3357.26 | bwd_allreduce: 0.88 | step: 7.28 60%|█████▉ | 5951/10000 [9:21:25<6:10:55, 5.50s/it] {'loss': 0.0328, 'grad_norm': 6.483086109161377, 'learning_rate': 1.486951681678241e-05, 'epoch': 5.95} 60%|█████▉ | 5951/10000 [9:21:25<6:10:55, 5.50s/it][2025-06-19 22:51:10,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:51:10,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.95 | bwd_microstep: 3397.21 | bwd_inner_microstep: 3396.39 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.17 [2025-06-19 22:51:10,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.95 | bwd: 3397.22 | bwd_inner: 3396.39 | bwd_allreduce: 0.79 | step: 7.17 60%|█████▉ | 5952/10000 [9:21:31<6:12:24, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.010802479460835457, 'learning_rate': 1.4863256326980134e-05, 'epoch': 5.95} 60%|█████▉ | 5952/10000 [9:21:31<6:12:24, 5.52s/it][2025-06-19 22:51:16,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:51:16,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.75 | bwd_microstep: 3315.85 | bwd_inner_microstep: 3315.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 22:51:16,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.75 | bwd: 3315.86 | bwd_inner: 3315.04 | bwd_allreduce: 0.78 | step: 6.86 60%|█████▉ | 5953/10000 [9:21:36<6:11:08, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.08257003873586655, 'learning_rate': 1.485699637599843e-05, 'epoch': 5.95} 60%|█████▉ | 5953/10000 [9:21:36<6:11:08, 5.50s/it][2025-06-19 22:51:21,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 
22:51:21,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.95 | bwd_microstep: 3366.38 | bwd_inner_microstep: 3365.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-19 22:51:21,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.95 | bwd: 3366.39 | bwd_inner: 3365.57 | bwd_allreduce: 0.78 | step: 7.25 60%|█████▉ | 5954/10000 [9:21:42<6:11:31, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.02071458287537098, 'learning_rate': 1.4850736964493937e-05, 'epoch': 5.95} 60%|█████▉ | 5954/10000 [9:21:42<6:11:31, 5.51s/it][2025-06-19 22:51:26,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:51:26,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.87 | bwd_microstep: 3314.34 | bwd_inner_microstep: 3313.53 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.74 [2025-06-19 22:51:26,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.87 | bwd: 3314.36 | bwd_inner: 3313.53 | bwd_allreduce: 0.78 | step: 6.74 60%|█████▉ | 5955/10000 [9:21:47<6:10:20, 5.49s/it] {'loss': 0.0194, 'grad_norm': 3.0877110958099365, 'learning_rate': 1.4844478093123237e-05, 'epoch': 5.96} 60%|█████▉ | 5955/10000 [9:21:47<6:10:20, 5.49s/it][2025-06-19 22:51:32,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:51:32,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.11 | bwd_microstep: 3372.41 | bwd_inner_microstep: 3371.45 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.22 [2025-06-19 22:51:32,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.11 | bwd: 3372.43 | bwd_inner: 3371.45 | bwd_allreduce: 0.92 | step: 7.22 60%|█████▉ | 5956/10000 [9:21:53<6:11:14, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.01760197803378105, 'learning_rate': 1.4838219762542868e-05, 
'epoch': 5.96} 60%|█████▉ | 5956/10000 [9:21:53<6:11:14, 5.51s/it][2025-06-19 22:51:38,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:51:38,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.89 | bwd_microstep: 3323.56 | bwd_inner_microstep: 3322.74 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-19 22:51:38,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.89 | bwd: 3323.58 | bwd_inner: 3322.74 | bwd_allreduce: 0.79 | step: 7.23 60%|█████▉ | 5957/10000 [9:21:58<6:10:21, 5.50s/it] {'loss': 0.0054, 'grad_norm': 1.1097197532653809, 'learning_rate': 1.4831961973409277e-05, 'epoch': 5.96} 60%|█████▉ | 5957/10000 [9:21:58<6:10:21, 5.50s/it][2025-06-19 22:51:43,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.85 [2025-06-19 22:51:43,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.67 | bwd_microstep: 3319.91 | bwd_inner_microstep: 3318.93 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.56 [2025-06-19 22:51:43,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.67 | bwd: 3319.93 | bwd_inner: 3318.93 | bwd_allreduce: 0.95 | step: 7.57 60%|█████▉ | 5958/10000 [9:22:04<6:09:36, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.1820741891860962, 'learning_rate': 1.4825704726378893e-05, 'epoch': 5.96} 60%|█████▉ | 5958/10000 [9:22:04<6:09:36, 5.49s/it][2025-06-19 22:51:48,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:51:48,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.46 | bwd_microstep: 3311.24 | bwd_inner_microstep: 3310.39 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.04 [2025-06-19 22:51:48,922] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2099.46 | bwd: 3311.26 | bwd_inner: 3310.39 | bwd_allreduce: 0.82 | step: 7.04 60%|█████▉ | 5959/10000 [9:22:09<6:08:49, 5.48s/it] {'loss': 0.0034, 'grad_norm': 0.9909940958023071, 'learning_rate': 1.4819448022108067e-05, 'epoch': 5.96} 60%|█████▉ | 5959/10000 [9:22:09<6:08:49, 5.48s/it][2025-06-19 22:51:54,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:51:54,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.12 | bwd_microstep: 3316.48 | bwd_inner_microstep: 3315.66 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.13 [2025-06-19 22:51:54,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.12 | bwd: 3316.50 | bwd_inner: 3315.66 | bwd_allreduce: 0.79 | step: 7.13 60%|█████▉ | 5960/10000 [9:22:15<6:08:25, 5.47s/it] {'loss': 0.0049, 'grad_norm': 1.4664742946624756, 'learning_rate': 1.48131918612531e-05, 'epoch': 5.96} 60%|█████▉ | 5960/10000 [9:22:15<6:08:25, 5.47s/it][2025-06-19 22:51:59,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:51:59,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.19 | bwd_microstep: 3363.56 | bwd_inner_microstep: 3362.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 22:51:59,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.19 | bwd: 3363.58 | bwd_inner: 3362.75 | bwd_allreduce: 0.78 | step: 7.19 60%|█████▉ | 5961/10000 [9:22:20<6:09:34, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.01620614156126976, 'learning_rate': 1.480693624447024e-05, 'epoch': 5.96} 60%|█████▉ | 5961/10000 [9:22:20<6:09:34, 5.49s/it][2025-06-19 22:52:05,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:52:05,384] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.26 | bwd_microstep: 3319.39 | bwd_inner_microstep: 3318.37 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.40 [2025-06-19 22:52:05,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.26 | bwd: 3319.41 | bwd_inner: 3318.37 | bwd_allreduce: 0.99 | step: 7.40 60%|█████▉ | 5962/10000 [9:22:26<6:08:59, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.04603861644864082, 'learning_rate': 1.480068117241566e-05, 'epoch': 5.96} 60%|█████▉ | 5962/10000 [9:22:26<6:08:59, 5.48s/it][2025-06-19 22:52:10,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:52:10,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.49 | bwd_microstep: 3366.18 | bwd_inner_microstep: 3365.36 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.79 [2025-06-19 22:52:10,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.49 | bwd: 3366.20 | bwd_inner: 3365.36 | bwd_allreduce: 0.79 | step: 6.80 60%|█████▉ | 5963/10000 [9:22:31<6:09:55, 5.50s/it] {'loss': 0.0052, 'grad_norm': 1.0306439399719238, 'learning_rate': 1.4794426645745494e-05, 'epoch': 5.96} 60%|█████▉ | 5963/10000 [9:22:31<6:09:55, 5.50s/it][2025-06-19 22:52:16,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:52:16,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.99 | bwd_microstep: 3357.95 | bwd_inner_microstep: 3357.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-19 22:52:16,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.99 | bwd: 3357.97 | bwd_inner: 3357.15 | bwd_allreduce: 0.77 | step: 6.91 60%|█████▉ | 5964/10000 [9:22:37<6:10:37, 5.51s/it] {'loss': 0.0426, 'grad_norm': 10.484381675720215, 'learning_rate': 1.4788172665115814e-05, 'epoch': 5.96} 
60%|█████▉ | 5964/10000 [9:22:37<6:10:37, 5.51s/it][2025-06-19 22:52:21,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:52:21,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.70 | bwd_microstep: 3310.10 | bwd_inner_microstep: 3309.30 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 22:52:21,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.70 | bwd: 3310.12 | bwd_inner: 3309.30 | bwd_allreduce: 0.78 | step: 7.01 60%|█████▉ | 5965/10000 [9:22:42<6:09:19, 5.49s/it] {'loss': 0.0014, 'grad_norm': 0.3094710111618042, 'learning_rate': 1.478191923118263e-05, 'epoch': 5.96} 60%|█████▉ | 5965/10000 [9:22:42<6:09:19, 5.49s/it][2025-06-19 22:52:27,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.84 [2025-06-19 22:52:27,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.87 | bwd_microstep: 3314.79 | bwd_inner_microstep: 3314.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 22:52:27,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.87 | bwd: 3314.81 | bwd_inner: 3314.00 | bwd_allreduce: 0.76 | step: 6.82 60%|█████▉ | 5966/10000 [9:22:48<6:08:43, 5.48s/it] {'loss': 0.0037, 'grad_norm': 0.6809855103492737, 'learning_rate': 1.4775666344601916e-05, 'epoch': 5.97} 60%|█████▉ | 5966/10000 [9:22:48<6:08:43, 5.48s/it][2025-06-19 22:52:32,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 22:52:32,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.31 | bwd_microstep: 3311.93 | bwd_inner_microstep: 3311.04 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.02 [2025-06-19 22:52:32,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2106.31 | bwd: 3311.95 | bwd_inner: 3311.04 | bwd_allreduce: 0.85 | step: 7.02 60%|█████▉ | 5967/10000 [9:22:53<6:08:04, 5.48s/it] {'loss': 0.1639, 'grad_norm': 4.020680904388428, 'learning_rate': 1.4769414006029541e-05, 'epoch': 5.97} 60%|█████▉ | 5967/10000 [9:22:53<6:08:04, 5.48s/it][2025-06-19 22:52:38,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:52:38,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.52 | bwd_microstep: 3359.12 | bwd_inner_microstep: 3358.32 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 22:52:38,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.52 | bwd: 3359.14 | bwd_inner: 3358.32 | bwd_allreduce: 0.78 | step: 6.98 60%|█████▉ | 5968/10000 [9:22:59<6:08:53, 5.49s/it] {'loss': 0.0025, 'grad_norm': 0.4773583710193634, 'learning_rate': 1.476316221612136e-05, 'epoch': 5.97} 60%|█████▉ | 5968/10000 [9:22:59<6:08:53, 5.49s/it][2025-06-19 22:52:43,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:52:43,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.61 | bwd_microstep: 3308.47 | bwd_inner_microstep: 3307.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 22:52:43,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.61 | bwd: 3308.48 | bwd_inner: 3307.68 | bwd_allreduce: 0.76 | step: 6.69 60%|█████▉ | 5969/10000 [9:23:04<6:08:04, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.06775651127099991, 'learning_rate': 1.4756910975533161e-05, 'epoch': 5.97} 60%|█████▉ | 5969/10000 [9:23:04<6:08:04, 5.48s/it][2025-06-19 22:52:49,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:52:49,358] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2149.14 | bwd_microstep: 3369.13 | bwd_inner_microstep: 3368.24 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.46 [2025-06-19 22:52:49,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.14 | bwd: 3369.15 | bwd_inner: 3368.24 | bwd_allreduce: 0.86 | step: 7.46 60%|█████▉ | 5970/10000 [9:23:10<6:09:34, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.01769857481122017, 'learning_rate': 1.4750660284920662e-05, 'epoch': 5.97} 60%|█████▉ | 5970/10000 [9:23:10<6:09:34, 5.50s/it][2025-06-19 22:52:54,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 22:52:54,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2093.85 | bwd_microstep: 3303.83 | bwd_inner_microstep: 3303.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 22:52:54,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.85 | bwd: 3303.84 | bwd_inner: 3303.03 | bwd_allreduce: 0.77 | step: 6.74 60%|█████▉ | 5971/10000 [9:23:15<6:08:11, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.026308929547667503, 'learning_rate': 1.4744410144939541e-05, 'epoch': 5.97} 60%|█████▉ | 5971/10000 [9:23:15<6:08:11, 5.48s/it][2025-06-19 22:53:00,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 22:53:00,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.77 | bwd_microstep: 3365.53 | bwd_inner_microstep: 3364.68 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.89 [2025-06-19 22:53:00,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.77 | bwd: 3365.54 | bwd_inner: 3364.68 | bwd_allreduce: 0.82 | step: 6.89 60%|█████▉ | 5972/10000 [9:23:21<6:09:10, 5.50s/it] {'loss': 0.0429, 'grad_norm': 9.72445011138916, 'learning_rate': 1.4738160556245411e-05, 'epoch': 5.97} 60%|█████▉ | 5972/10000 [9:23:21<6:09:10, 
5.50s/it][2025-06-19 22:53:05,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:53:05,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.00 | bwd_microstep: 3363.38 | bwd_inner_microstep: 3362.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-19 22:53:05,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.00 | bwd: 3363.39 | bwd_inner: 3362.59 | bwd_allreduce: 0.76 | step: 7.07 60%|█████▉ | 5973/10000 [9:23:26<6:09:30, 5.51s/it] {'loss': 0.0006, 'grad_norm': 0.1716519445180893, 'learning_rate': 1.4731911519493811e-05, 'epoch': 5.97} 60%|█████▉ | 5973/10000 [9:23:26<6:09:30, 5.51s/it][2025-06-19 22:53:11,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 22:53:11,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.23 | bwd_microstep: 3362.00 | bwd_inner_microstep: 3361.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 22:53:11,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.23 | bwd: 3362.01 | bwd_inner: 3361.21 | bwd_allreduce: 0.76 | step: 6.78 60%|█████▉ | 5974/10000 [9:23:32<6:09:46, 5.51s/it] {'loss': 0.0033, 'grad_norm': 0.9848546981811523, 'learning_rate': 1.4725663035340241e-05, 'epoch': 5.97} 60%|█████▉ | 5974/10000 [9:23:32<6:09:46, 5.51s/it][2025-06-19 22:53:16,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:53:16,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.54 | bwd_microstep: 3365.25 | bwd_inner_microstep: 3364.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.61 [2025-06-19 22:53:16,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.54 | bwd: 3365.27 | bwd_inner: 3364.45 | 
bwd_allreduce: 0.77 | step: 6.61 60%|█████▉ | 5975/10000 [9:23:37<6:10:09, 5.52s/it] {'loss': 0.0025, 'grad_norm': 0.3679084777832031, 'learning_rate': 1.471941510444014e-05, 'epoch': 5.97} 60%|█████▉ | 5975/10000 [9:23:37<6:10:09, 5.52s/it][2025-06-19 22:53:22,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 22:53:22,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.10 | bwd_microstep: 3331.21 | bwd_inner_microstep: 3330.27 | bwd_allreduce_microstep: 0.86 | step_microstep: 8.26 [2025-06-19 22:53:22,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.10 | bwd: 3331.24 | bwd_inner: 3330.27 | bwd_allreduce: 0.89 | step: 8.26 60%|█████▉ | 5976/10000 [9:23:43<6:09:13, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.036483265459537506, 'learning_rate': 1.4713167727448885e-05, 'epoch': 5.98} 60%|█████▉ | 5976/10000 [9:23:43<6:09:13, 5.51s/it][2025-06-19 22:53:27,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 22:53:27,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.33 | bwd_microstep: 3313.72 | bwd_inner_microstep: 3312.80 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.42 [2025-06-19 22:53:27,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.33 | bwd: 3313.74 | bwd_inner: 3312.80 | bwd_allreduce: 0.88 | step: 7.43 60%|█████▉ | 5977/10000 [9:23:48<6:09:01, 5.50s/it] {'loss': 0.0119, 'grad_norm': 1.5810531377792358, 'learning_rate': 1.4706920905021807e-05, 'epoch': 5.98} 60%|█████▉ | 5977/10000 [9:23:48<6:09:01, 5.50s/it][2025-06-19 22:53:33,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:53:33,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2162.15 | 
bwd_microstep: 3364.02 | bwd_inner_microstep: 3363.21 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.06 [2025-06-19 22:53:33,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2162.15 | bwd: 3364.03 | bwd_inner: 3363.21 | bwd_allreduce: 0.79 | step: 7.07 60%|█████▉ | 5978/10000 [9:23:54<6:10:13, 5.52s/it] {'loss': 0.001, 'grad_norm': 0.1712339222431183, 'learning_rate': 1.470067463781415e-05, 'epoch': 5.98} 60%|█████▉ | 5978/10000 [9:23:54<6:10:13, 5.52s/it][2025-06-19 22:53:38,901] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 22:53:38,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.08 | bwd_microstep: 3305.87 | bwd_inner_microstep: 3304.75 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.58 [2025-06-19 22:53:38,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.08 | bwd: 3305.92 | bwd_inner: 3304.75 | bwd_allreduce: 1.06 | step: 7.58 60%|█████▉ | 5979/10000 [9:23:59<6:08:39, 5.50s/it] {'loss': 0.0037, 'grad_norm': 0.6510158181190491, 'learning_rate': 1.4694428926481132e-05, 'epoch': 5.98} 60%|█████▉ | 5979/10000 [9:23:59<6:08:39, 5.50s/it][2025-06-19 22:53:44,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:53:44,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.54 | bwd_microstep: 3312.29 | bwd_inner_microstep: 3311.49 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-19 22:53:44,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.54 | bwd: 3312.31 | bwd_inner: 3311.49 | bwd_allreduce: 0.77 | step: 6.95 60%|█████▉ | 5980/10000 [9:24:05<6:07:55, 5.49s/it] {'loss': 0.0032, 'grad_norm': 0.41495755314826965, 'learning_rate': 1.4688183771677897e-05, 'epoch': 5.98} 60%|█████▉ | 5980/10000 [9:24:05<6:07:55, 5.49s/it][2025-06-19 22:53:49,901] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:53:49,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.98 | bwd_microstep: 3366.70 | bwd_inner_microstep: 3365.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 22:53:49,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.98 | bwd: 3366.71 | bwd_inner: 3365.89 | bwd_allreduce: 0.78 | step: 7.11 60%|█████▉ | 5981/10000 [9:24:10<6:08:32, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.013232593424618244, 'learning_rate': 1.468193917405953e-05, 'epoch': 5.98} 60%|█████▉ | 5981/10000 [9:24:10<6:08:32, 5.50s/it][2025-06-19 22:53:55,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:53:55,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.13 | bwd_microstep: 3382.58 | bwd_inner_microstep: 3381.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-19 22:53:55,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.13 | bwd: 3382.60 | bwd_inner: 3381.77 | bwd_allreduce: 0.78 | step: 6.79 60%|█████▉ | 5982/10000 [9:24:16<6:09:16, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.019631659612059593, 'learning_rate': 1.4675695134281069e-05, 'epoch': 5.98} 60%|█████▉ | 5982/10000 [9:24:16<6:09:16, 5.51s/it][2025-06-19 22:54:00,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.93 [2025-06-19 22:54:00,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.25 | bwd_microstep: 3388.11 | bwd_inner_microstep: 3387.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 22:54:00,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.25 | bwd: 3388.13 | bwd_inner: 3387.31 | bwd_allreduce: 0.77 | step: 7.03 
60%|█████▉ | 5983/10000 [9:24:21<6:09:54, 5.53s/it] {'loss': 0.0033, 'grad_norm': 0.8683913350105286, 'learning_rate': 1.466945165299747e-05, 'epoch': 5.98} 60%|█████▉ | 5983/10000 [9:24:21<6:09:54, 5.53s/it][2025-06-19 22:54:06,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:54:06,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.72 | bwd_microstep: 3330.72 | bwd_inner_microstep: 3329.91 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.03 [2025-06-19 22:54:06,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.72 | bwd: 3330.74 | bwd_inner: 3329.91 | bwd_allreduce: 0.78 | step: 7.03 60%|█████▉ | 5984/10000 [9:24:27<6:08:50, 5.51s/it] {'loss': 0.0036, 'grad_norm': 0.4565627872943878, 'learning_rate': 1.4663208730863651e-05, 'epoch': 5.98} 60%|█████▉ | 5984/10000 [9:24:27<6:08:50, 5.51s/it][2025-06-19 22:54:12,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:54:12,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.89 | bwd_microstep: 3380.25 | bwd_inner_microstep: 3379.35 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.98 [2025-06-19 22:54:12,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.89 | bwd: 3380.27 | bwd_inner: 3379.35 | bwd_allreduce: 0.85 | step: 7.99 60%|█████▉ | 5985/10000 [9:24:32<6:09:33, 5.52s/it] {'loss': 0.0008, 'grad_norm': 0.256398469209671, 'learning_rate': 1.4656966368534471e-05, 'epoch': 5.99} 60%|█████▉ | 5985/10000 [9:24:32<6:09:33, 5.52s/it][2025-06-19 22:54:17,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 22:54:17,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.18 | bwd_microstep: 3329.57 | 
bwd_inner_microstep: 3328.66 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.45 [2025-06-19 22:54:17,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.18 | bwd: 3329.60 | bwd_inner: 3328.66 | bwd_allreduce: 0.87 | step: 7.46 60%|█████▉ | 5986/10000 [9:24:38<6:09:31, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.010546360164880753, 'learning_rate': 1.465072456666472e-05, 'epoch': 5.99} 60%|█████▉ | 5986/10000 [9:24:38<6:09:31, 5.52s/it][2025-06-19 22:54:23,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:54:23,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2164.07 | bwd_microstep: 3325.54 | bwd_inner_microstep: 3324.65 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.37 [2025-06-19 22:54:23,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2164.08 | bwd: 3325.57 | bwd_inner: 3324.65 | bwd_allreduce: 0.85 | step: 7.38 60%|█████▉ | 5987/10000 [9:24:43<6:09:40, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0034514162689447403, 'learning_rate': 1.464448332590914e-05, 'epoch': 5.99} 60%|█████▉ | 5987/10000 [9:24:43<6:09:40, 5.53s/it][2025-06-19 22:54:28,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:54:28,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.50 | bwd_microstep: 3315.45 | bwd_inner_microstep: 3314.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 22:54:28,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.50 | bwd: 3315.47 | bwd_inner: 3314.66 | bwd_allreduce: 0.76 | step: 6.73 60%|█████▉ | 5988/10000 [9:24:49<6:09:02, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.005894254427403212, 'learning_rate': 1.4638242646922396e-05, 'epoch': 5.99} 60%|█████▉ | 5988/10000 [9:24:49<6:09:02, 5.52s/it][2025-06-19 22:54:34,147] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:54:34,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.94 | bwd_microstep: 3377.36 | bwd_inner_microstep: 3376.54 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.13 [2025-06-19 22:54:34,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.94 | bwd: 3377.38 | bwd_inner: 3376.54 | bwd_allreduce: 0.79 | step: 7.13 60%|█████▉ | 5989/10000 [9:24:54<6:09:51, 5.53s/it] {'loss': 0.0343, 'grad_norm': 4.050424575805664, 'learning_rate': 1.4632002530359114e-05, 'epoch': 5.99} 60%|█████▉ | 5989/10000 [9:24:54<6:09:51, 5.53s/it][2025-06-19 22:54:39,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:54:39,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.85 | bwd_microstep: 3373.01 | bwd_inner_microstep: 3372.07 | bwd_allreduce_microstep: 0.87 | step_microstep: 8.02 [2025-06-19 22:54:39,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.85 | bwd: 3373.03 | bwd_inner: 3372.07 | bwd_allreduce: 0.89 | step: 8.03 60%|█████▉ | 5990/10000 [9:25:00<6:10:15, 5.54s/it] {'loss': 0.0532, 'grad_norm': 5.088200569152832, 'learning_rate': 1.4625762976873856e-05, 'epoch': 5.99} 60%|█████▉ | 5990/10000 [9:25:00<6:10:15, 5.54s/it][2025-06-19 22:54:45,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:54:45,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2167.45 | bwd_microstep: 3403.69 | bwd_inner_microstep: 3402.59 | bwd_allreduce_microstep: 0.98 | step_microstep: 8.48 [2025-06-19 22:54:45,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2167.45 | bwd: 3403.74 | bwd_inner: 3402.59 | bwd_allreduce: 1.03 | step: 8.47 60%|█████▉ 
| 5991/10000 [9:25:06<6:11:41, 5.56s/it] {'loss': 0.0032, 'grad_norm': 0.766189455986023, 'learning_rate': 1.4619523987121118e-05, 'epoch': 5.99} 60%|█████▉ | 5991/10000 [9:25:06<6:11:41, 5.56s/it][2025-06-19 22:54:50,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.78 [2025-06-19 22:54:50,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.24 | bwd_microstep: 3334.07 | bwd_inner_microstep: 3332.90 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.30 [2025-06-19 22:54:50,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.24 | bwd: 3334.08 | bwd_inner: 3332.90 | bwd_allreduce: 1.14 | step: 7.31 60%|█████▉ | 5992/10000 [9:25:11<6:10:33, 5.55s/it] {'loss': 0.0002, 'grad_norm': 0.03048357367515564, 'learning_rate': 1.4613285561755351e-05, 'epoch': 5.99} 60%|█████▉ | 5992/10000 [9:25:11<6:10:33, 5.55s/it][2025-06-19 22:54:56,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 22:54:56,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.43 | bwd_microstep: 3320.60 | bwd_inner_microstep: 3319.62 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.19 [2025-06-19 22:54:56,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.43 | bwd: 3320.61 | bwd_inner: 3319.62 | bwd_allreduce: 0.95 | step: 7.19 60%|█████▉ | 5993/10000 [9:25:17<6:09:14, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0031407279893755913, 'learning_rate': 1.4607047701430919e-05, 'epoch': 5.99} 60%|█████▉ | 5993/10000 [9:25:17<6:09:14, 5.53s/it][2025-06-19 22:55:01,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 22:55:01,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.96 | bwd_microstep: 3325.68 | bwd_inner_microstep: 
3324.76 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.27 [2025-06-19 22:55:01,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.96 | bwd: 3325.70 | bwd_inner: 3324.76 | bwd_allreduce: 0.87 | step: 8.28 60%|█████▉ | 5994/10000 [9:25:22<6:08:29, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.007284337189048529, 'learning_rate': 1.4600810406802152e-05, 'epoch': 5.99} 60%|█████▉ | 5994/10000 [9:25:22<6:08:29, 5.52s/it][2025-06-19 22:55:07,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:55:07,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.59 | bwd_microstep: 3373.21 | bwd_inner_microstep: 3372.33 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.34 [2025-06-19 22:55:07,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.59 | bwd: 3373.23 | bwd_inner: 3372.33 | bwd_allreduce: 0.84 | step: 7.35 60%|█████▉ | 5995/10000 [9:25:28<6:09:21, 5.53s/it] {'loss': 0.0042, 'grad_norm': 0.4564915895462036, 'learning_rate': 1.4594573678523319e-05, 'epoch': 6.0} 60%|█████▉ | 5995/10000 [9:25:28<6:09:21, 5.53s/it][2025-06-19 22:55:12,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:55:12,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.22 | bwd_microstep: 3370.75 | bwd_inner_microstep: 3369.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 22:55:12,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.22 | bwd: 3370.76 | bwd_inner: 3369.95 | bwd_allreduce: 0.76 | step: 6.66 60%|█████▉ | 5996/10000 [9:25:33<6:09:34, 5.54s/it] {'loss': 0.0079, 'grad_norm': 1.4043564796447754, 'learning_rate': 1.4588337517248615e-05, 'epoch': 6.0} 60%|█████▉ | 5996/10000 [9:25:33<6:09:34, 5.54s/it][2025-06-19 22:55:18,403] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 22:55:18,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.86 | bwd_microstep: 3326.01 | bwd_inner_microstep: 3325.22 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 22:55:18,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.86 | bwd: 3326.03 | bwd_inner: 3325.22 | bwd_allreduce: 0.76 | step: 6.67 60%|█████▉ | 5997/10000 [9:25:39<6:08:09, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.003730031196027994, 'learning_rate': 1.4582101923632195e-05, 'epoch': 6.0} 60%|█████▉ | 5997/10000 [9:25:39<6:08:09, 5.52s/it][2025-06-19 22:55:23,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:55:23,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.78 | bwd_microstep: 3321.09 | bwd_inner_microstep: 3320.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.71 [2025-06-19 22:55:23,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.78 | bwd: 3321.10 | bwd_inner: 3320.28 | bwd_allreduce: 0.78 | step: 6.72 60%|█████▉ | 5998/10000 [9:25:44<6:07:00, 5.50s/it] {'loss': 0.0121, 'grad_norm': 2.2609033584594727, 'learning_rate': 1.4575866898328137e-05, 'epoch': 6.0} 60%|█████▉ | 5998/10000 [9:25:44<6:07:00, 5.50s/it][2025-06-19 22:55:29,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 22:55:29,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.33 | bwd_microstep: 3332.47 | bwd_inner_microstep: 3331.48 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.32 [2025-06-19 22:55:29,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.33 | bwd: 3332.48 | bwd_inner: 3331.48 | bwd_allreduce: 0.96 | step: 7.32 60%|█████▉ | 5999/10000 [9:25:50<6:06:30, 5.50s/it] 
{'loss': 0.0003, 'grad_norm': 0.06407798081636429, 'learning_rate': 1.4569632441990465e-05, 'epoch': 6.0} 60%|█████▉ | 5999/10000 [9:25:50<6:06:30, 5.50s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2025-06-19 22:55:37,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:55:37,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.72 | bwd_microstep: 3370.12 | bwd_inner_microstep: 3369.32 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 22:55:37,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.72 | bwd: 3370.13 | bwd_inner: 3369.32 | bwd_allreduce: 0.77 | step: 6.88 60%|██████ | 6000/10000 [9:25:57<6:49:49, 6.15s/it] {'loss': 0.0082, 'grad_norm': 0.9639946818351746, 'learning_rate': 1.4563398555273143e-05, 'epoch': 6.0} 60%|██████ | 6000/10000 [9:25:57<6:49:49, 6.15s/it]evaluate! 
[INFO|trainer.py:3910] 2025-06-19 22:55:48,624 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 22:55:48,628 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 22:55:48,628 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 22:56:45,946 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-19 22:56:45,950 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 22:56:45,950 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 22:56:45,951 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json evaluate! [INFO|trainer.py:3910] 2025-06-19 22:57:05,615 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-19 22:57:05,624 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-19 22:57:05,625 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-19 22:58:09,377 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. 
[INFO|tokenization_utils_base.py:2491] 2025-06-19 22:58:09,381 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-19 22:58:09,382 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-19 22:58:09,382 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json [2025-06-19 22:58:14,347] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 22:58:20,222] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 22:58:26,125] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 22:58:32,075] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers 
warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-19 22:58:49,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 22:58:49,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2089.66 | bwd_microstep: 3261.80 | bwd_inner_microstep: 3260.71 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.82 [2025-06-19 22:58:49,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2089.63 | bwd: 3261.81 | bwd_inner: 3260.71 | bwd_allreduce: 1.06 | step: 7.82 60%|██████ | 6001/10000 [9:29:09<68:47:40, 61.93s/it] {'loss': 0.0011, 'grad_norm': 0.16448062658309937, 'learning_rate': 1.4557165238830085e-05, 'epoch': 6.0} 60%|██████ | 6001/10000 [9:29:09<68:47:40, 61.93s/it][2025-06-19 22:58:54,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:58:54,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2081.40 | bwd_microstep: 3283.26 | bwd_inner_microstep: 3282.37 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.29 [2025-06-19 22:58:54,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2081.40 | bwd: 3283.29 | bwd_inner: 3282.37 | bwd_allreduce: 0.85 | step: 7.28 60%|██████ | 6002/10000 [9:29:15<49:57:00, 44.98s/it] {'loss': 0.0126, 'grad_norm': 1.1723755598068237, 'learning_rate': 1.455093249331514e-05, 'epoch': 6.0} 60%|██████ | 6002/10000 [9:29:15<49:57:00, 44.98s/it][2025-06-19 22:59:00,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:59:00,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.30 | bwd_microstep: 3343.32 | bwd_inner_microstep: 3342.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.56 [2025-06-19 22:59:00,035] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.30 | bwd: 3343.33 | bwd_inner: 3342.52 | bwd_allreduce: 0.77 | step: 7.57 60%|██████ | 6003/10000 [9:29:20<36:47:28, 33.14s/it] {'loss': 0.0207, 'grad_norm': 2.0015804767608643, 'learning_rate': 1.4544700319382079e-05, 'epoch': 6.0} 60%|██████ | 6003/10000 [9:29:20<36:47:28, 33.14s/it][2025-06-19 22:59:05,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.72 [2025-06-19 22:59:05,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2094.39 | bwd_microstep: 3291.57 | bwd_inner_microstep: 3290.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.96 [2025-06-19 22:59:05,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2094.39 | bwd: 3291.59 | bwd_inner: 3290.78 | bwd_allreduce: 0.76 | step: 6.97 60%|██████ | 6004/10000 [9:29:26<27:33:14, 24.82s/it] {'loss': 0.0007, 'grad_norm': 0.31700241565704346, 'learning_rate': 1.4538468717684628e-05, 'epoch': 6.0} 60%|██████ | 6004/10000 [9:29:26<27:33:14, 24.82s/it][2025-06-19 22:59:10,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.73 [2025-06-19 22:59:10,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.53 | bwd_microstep: 3345.51 | bwd_inner_microstep: 3344.31 | bwd_allreduce_microstep: 1.08 | step_microstep: 9.86 [2025-06-19 22:59:10,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.53 | bwd: 3345.55 | bwd_inner: 3344.31 | bwd_allreduce: 1.13 | step: 9.87 60%|██████ | 6005/10000 [9:29:31<21:06:53, 19.03s/it] {'loss': 0.0021, 'grad_norm': 0.5545454621315002, 'learning_rate': 1.4532237688876469e-05, 'epoch': 6.0} 60%|██████ | 6005/10000 [9:29:31<21:06:53, 19.03s/it][2025-06-19 22:59:16,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 
2.73 [2025-06-19 22:59:16,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.92 | bwd_microstep: 3346.18 | bwd_inner_microstep: 3345.40 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.63 [2025-06-19 22:59:16,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.92 | bwd: 3346.19 | bwd_inner: 3345.40 | bwd_allreduce: 0.75 | step: 6.63 60%|██████ | 6006/10000 [9:29:37<16:37:01, 14.98s/it] {'loss': 0.0015, 'grad_norm': 0.312593549489975, 'learning_rate': 1.4526007233611198e-05, 'epoch': 6.01} 60%|██████ | 6006/10000 [9:29:37<16:37:01, 14.98s/it][2025-06-19 22:59:21,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:59:21,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.10 | bwd_microstep: 3310.27 | bwd_inner_microstep: 3309.46 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 22:59:21,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.10 | bwd: 3310.28 | bwd_inner: 3309.46 | bwd_allreduce: 0.77 | step: 7.12 60%|██████ | 6007/10000 [9:29:42<13:26:34, 12.12s/it] {'loss': 0.0006, 'grad_norm': 0.11306097358465195, 'learning_rate': 1.451977735254237e-05, 'epoch': 6.01} 60%|██████ | 6007/10000 [9:29:42<13:26:34, 12.12s/it][2025-06-19 22:59:27,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:59:27,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.96 | bwd_microstep: 3296.51 | bwd_inner_microstep: 3295.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 22:59:27,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.96 | bwd: 3296.53 | bwd_inner: 3295.72 | bwd_allreduce: 0.76 | step: 6.85 60%|██████ | 6008/10000 [9:29:48<11:12:56, 10.11s/it] {'loss': 0.0001, 'grad_norm': 0.021117860451340675, 
'learning_rate': 1.4513548046323457e-05, 'epoch': 6.01} 60%|██████ | 6008/10000 [9:29:48<11:12:56, 10.11s/it][2025-06-19 22:59:32,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 22:59:32,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2084.95 | bwd_microstep: 3281.39 | bwd_inner_microstep: 3280.58 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.65 [2025-06-19 22:59:32,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2084.95 | bwd: 3281.40 | bwd_inner: 3280.58 | bwd_allreduce: 0.78 | step: 6.66 60%|██████ | 6009/10000 [9:29:53<9:39:04, 8.71s/it] {'loss': 0.0002, 'grad_norm': 0.07080797851085663, 'learning_rate': 1.450731931560789e-05, 'epoch': 6.01} 60%|██████ | 6009/10000 [9:29:53<9:39:04, 8.71s/it][2025-06-19 22:59:38,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 22:59:38,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.43 | bwd_microstep: 3336.55 | bwd_inner_microstep: 3335.66 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.99 [2025-06-19 22:59:38,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.43 | bwd: 3336.57 | bwd_inner: 3335.66 | bwd_allreduce: 0.84 | step: 6.98 60%|██████ | 6010/10000 [9:29:59<8:34:40, 7.74s/it] {'loss': 0.0, 'grad_norm': 0.005023526959121227, 'learning_rate': 1.450109116104903e-05, 'epoch': 6.01} 60%|██████ | 6010/10000 [9:29:59<8:34:40, 7.74s/it][2025-06-19 22:59:43,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 22:59:43,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.46 | bwd_microstep: 3287.47 | bwd_inner_microstep: 3286.66 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 22:59:43,713] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.46 | bwd: 3287.49 | bwd_inner: 3286.66 | bwd_allreduce: 0.78 | step: 7.11 60%|██████ | 6011/10000 [9:30:04<7:48:27, 7.05s/it] {'loss': 0.0005, 'grad_norm': 0.09140510112047195, 'learning_rate': 1.4494863583300192e-05, 'epoch': 6.01} 60%|██████ | 6011/10000 [9:30:04<7:48:27, 7.05s/it][2025-06-19 22:59:49,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.72 [2025-06-19 22:59:49,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2094.93 | bwd_microstep: 3287.74 | bwd_inner_microstep: 3286.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-19 22:59:49,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2094.93 | bwd: 3287.76 | bwd_inner: 3286.95 | bwd_allreduce: 0.77 | step: 7.05 60%|██████ | 6012/10000 [9:30:09<7:15:55, 6.56s/it] {'loss': 0.0005, 'grad_norm': 0.10647223144769669, 'learning_rate': 1.4488636583014616e-05, 'epoch': 6.01} 60%|██████ | 6012/10000 [9:30:09<7:15:55, 6.56s/it][2025-06-19 22:59:54,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 22:59:54,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2091.24 | bwd_microstep: 3294.75 | bwd_inner_microstep: 3293.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 22:59:54,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2091.24 | bwd: 3294.76 | bwd_inner: 3293.96 | bwd_allreduce: 0.76 | step: 6.67 60%|██████ | 6013/10000 [9:30:15<6:53:09, 6.22s/it] {'loss': 0.0001, 'grad_norm': 0.019502462819218636, 'learning_rate': 1.448241016084548e-05, 'epoch': 6.01} 60%|██████ | 6013/10000 [9:30:15<6:53:09, 6.22s/it][2025-06-19 22:59:59,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 
22:59:59,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2091.99 | bwd_microstep: 3297.43 | bwd_inner_microstep: 3296.61 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.41 [2025-06-19 22:59:59,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2091.99 | bwd: 3297.44 | bwd_inner: 3296.61 | bwd_allreduce: 0.79 | step: 7.41 60%|██████ | 6014/10000 [9:30:20<6:37:21, 5.98s/it] {'loss': 0.0004, 'grad_norm': 0.06732027977705002, 'learning_rate': 1.4476184317445907e-05, 'epoch': 6.01} 60%|██████ | 6014/10000 [9:30:20<6:37:21, 5.98s/it][2025-06-19 23:00:05,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:00:05,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2090.34 | bwd_microstep: 3292.40 | bwd_inner_microstep: 3291.56 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.05 [2025-06-19 23:00:05,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.34 | bwd: 3292.42 | bwd_inner: 3291.56 | bwd_allreduce: 0.81 | step: 7.06 60%|██████ | 6015/10000 [9:30:26<6:26:11, 5.81s/it] {'loss': 0.0002, 'grad_norm': 0.025679713115096092, 'learning_rate': 1.4469959053468965e-05, 'epoch': 6.01} 60%|██████ | 6015/10000 [9:30:26<6:26:11, 5.81s/it][2025-06-19 23:00:10,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-19 23:00:10,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2091.22 | bwd_microstep: 3294.72 | bwd_inner_microstep: 3293.67 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.64 [2025-06-19 23:00:10,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2091.22 | bwd: 3294.75 | bwd_inner: 3293.67 | bwd_allreduce: 1.01 | step: 7.64 60%|██████ | 6016/10000 [9:30:31<6:18:23, 5.70s/it] {'loss': 0.0174, 'grad_norm': 2.999133825302124, 'learning_rate': 1.4463734369567652e-05, 
'epoch': 6.02} 60%|██████ | 6016/10000 [9:30:31<6:18:23, 5.70s/it][2025-06-19 23:00:16,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:00:16,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.71 | bwd_microstep: 3307.51 | bwd_inner_microstep: 3306.61 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.65 [2025-06-19 23:00:16,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.71 | bwd: 3307.54 | bwd_inner: 3306.61 | bwd_allreduce: 0.85 | step: 7.66 60%|██████ | 6017/10000 [9:30:37<6:14:14, 5.64s/it] {'loss': 0.0119, 'grad_norm': 1.2970553636550903, 'learning_rate': 1.4457510266394919e-05, 'epoch': 6.02} 60%|██████ | 6017/10000 [9:30:37<6:14:14, 5.64s/it][2025-06-19 23:00:21,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 23:00:21,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.11 | bwd_microstep: 3347.81 | bwd_inner_microstep: 3346.76 | bwd_allreduce_microstep: 0.97 | step_microstep: 8.39 [2025-06-19 23:00:21,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.11 | bwd: 3347.84 | bwd_inner: 3346.76 | bwd_allreduce: 1.00 | step: 8.40 60%|██████ | 6018/10000 [9:30:42<6:12:17, 5.61s/it] {'loss': 0.0, 'grad_norm': 0.0011783936060965061, 'learning_rate': 1.4451286744603625e-05, 'epoch': 6.02} 60%|██████ | 6018/10000 [9:30:42<6:12:17, 5.61s/it][2025-06-19 23:00:27,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-19 23:00:27,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2166.65 | bwd_microstep: 3384.95 | bwd_inner_microstep: 3383.89 | bwd_allreduce_microstep: 0.93 | step_microstep: 8.63 [2025-06-19 23:00:27,478] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2166.65 | bwd: 3385.00 | bwd_inner: 3383.89 | bwd_allreduce: 0.98 | step: 8.64 60%|██████ | 6019/10000 [9:30:48<6:12:10, 5.61s/it] {'loss': 0.0002, 'grad_norm': 0.05304984003305435, 'learning_rate': 1.4445063804846599e-05, 'epoch': 6.02} 60%|██████ | 6019/10000 [9:30:48<6:12:10, 5.61s/it][2025-06-19 23:00:33,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-19 23:00:33,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2165.71 | bwd_microstep: 3355.51 | bwd_inner_microstep: 3354.53 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.95 [2025-06-19 23:00:33,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2165.71 | bwd: 3355.54 | bwd_inner: 3354.53 | bwd_allreduce: 0.96 | step: 7.96 60%|██████ | 6020/10000 [9:30:53<6:11:15, 5.60s/it] {'loss': 0.0012, 'grad_norm': 0.2094409018754959, 'learning_rate': 1.4438841447776602e-05, 'epoch': 6.02} 60%|██████ | 6020/10000 [9:30:53<6:11:15, 5.60s/it][2025-06-19 23:00:38,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:00:38,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.63 | bwd_microstep: 3309.53 | bwd_inner_microstep: 3308.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.46 [2025-06-19 23:00:38,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.63 | bwd: 3309.54 | bwd_inner: 3308.73 | bwd_allreduce: 0.77 | step: 7.48 60%|██████ | 6021/10000 [9:30:59<6:08:53, 5.56s/it] {'loss': 0.0, 'grad_norm': 0.001944928546436131, 'learning_rate': 1.4432619674046322e-05, 'epoch': 6.02} 60%|██████ | 6021/10000 [9:30:59<6:08:53, 5.56s/it][2025-06-19 23:00:43,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:00:43,985] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.01 | bwd_microstep: 3303.21 | bwd_inner_microstep: 3302.37 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.02 [2025-06-19 23:00:43,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.01 | bwd: 3303.22 | bwd_inner: 3302.37 | bwd_allreduce: 0.81 | step: 7.02 60%|██████ | 6022/10000 [9:31:04<6:06:30, 5.53s/it] {'loss': 0.0051, 'grad_norm': 1.2189338207244873, 'learning_rate': 1.4426398484308405e-05, 'epoch': 6.02} 60%|██████ | 6022/10000 [9:31:04<6:06:30, 5.53s/it][2025-06-19 23:00:49,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.79 | optimizer_step: 2.73 [2025-06-19 23:00:49,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.47 | bwd_microstep: 3379.49 | bwd_inner_microstep: 3378.37 | bwd_allreduce_microstep: 1.03 | step_microstep: 8.41 [2025-06-19 23:00:49,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.47 | bwd: 3379.52 | bwd_inner: 3378.37 | bwd_allreduce: 1.08 | step: 8.42 60%|██████ | 6023/10000 [9:31:10<6:07:36, 5.55s/it] {'loss': 0.0124, 'grad_norm': 1.5829111337661743, 'learning_rate': 1.4420177879215419e-05, 'epoch': 6.02} 60%|██████ | 6023/10000 [9:31:10<6:07:36, 5.55s/it][2025-06-19 23:00:55,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:00:55,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.38 | bwd_microstep: 3354.54 | bwd_inner_microstep: 3353.62 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.02 [2025-06-19 23:00:55,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.38 | bwd: 3354.55 | bwd_inner: 3353.62 | bwd_allreduce: 0.89 | step: 7.03 60%|██████ | 6024/10000 [9:31:15<6:07:07, 5.54s/it] {'loss': 0.0033, 'grad_norm': 1.142056941986084, 'learning_rate': 1.4413957859419871e-05, 'epoch': 6.02} 
60%|██████ | 6024/10000 [9:31:15<6:07:07, 5.54s/it][2025-06-19 23:01:00,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:01:00,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.64 | bwd_microstep: 3301.82 | bwd_inner_microstep: 3301.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:01:00,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.64 | bwd: 3301.84 | bwd_inner: 3301.03 | bwd_allreduce: 0.76 | step: 6.72 60%|██████ | 6025/10000 [9:31:21<6:05:14, 5.51s/it] {'loss': 0.0053, 'grad_norm': 0.895354151725769, 'learning_rate': 1.4407738425574223e-05, 'epoch': 6.03} 60%|██████ | 6025/10000 [9:31:21<6:05:14, 5.51s/it][2025-06-19 23:01:05,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:01:05,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.16 | bwd_microstep: 3297.26 | bwd_inner_microstep: 3296.16 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.01 [2025-06-19 23:01:05,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.17 | bwd: 3297.28 | bwd_inner: 3296.16 | bwd_allreduce: 1.07 | step: 7.02 60%|██████ | 6026/10000 [9:31:26<6:03:35, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002730201929807663, 'learning_rate': 1.4401519578330855e-05, 'epoch': 6.03} 60%|██████ | 6026/10000 [9:31:26<6:03:35, 5.49s/it][2025-06-19 23:01:11,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:01:11,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.36 | bwd_microstep: 3357.89 | bwd_inner_microstep: 3356.99 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.52 [2025-06-19 23:01:11,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2128.36 | bwd: 3357.92 | bwd_inner: 3356.99 | bwd_allreduce: 0.86 | step: 7.51 60%|██████ | 6027/10000 [9:31:32<6:04:18, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.03242168575525284, 'learning_rate': 1.4395301318342112e-05, 'epoch': 6.03} 60%|██████ | 6027/10000 [9:31:32<6:04:18, 5.50s/it][2025-06-19 23:01:16,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:01:16,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.52 | bwd_microstep: 3310.48 | bwd_inner_microstep: 3309.58 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.23 [2025-06-19 23:01:16,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.52 | bwd: 3310.49 | bwd_inner: 3309.58 | bwd_allreduce: 0.87 | step: 7.24 60%|██████ | 6028/10000 [9:31:37<6:03:24, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.10968898236751556, 'learning_rate': 1.4389083646260238e-05, 'epoch': 6.03} 60%|██████ | 6028/10000 [9:31:37<6:03:24, 5.49s/it][2025-06-19 23:01:22,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 23:01:22,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.31 | bwd_microstep: 3347.92 | bwd_inner_microstep: 3346.37 | bwd_allreduce_microstep: 1.42 | step_microstep: 9.42 [2025-06-19 23:01:22,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.31 | bwd: 3347.96 | bwd_inner: 3346.37 | bwd_allreduce: 1.48 | step: 9.42 60%|██████ | 6029/10000 [9:31:43<6:03:40, 5.49s/it] {'loss': 0.0646, 'grad_norm': 6.763617038726807, 'learning_rate': 1.438286656273745e-05, 'epoch': 6.03} 60%|██████ | 6029/10000 [9:31:43<6:03:40, 5.49s/it][2025-06-19 23:01:27,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:01:27,935] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2104.65 | bwd_microstep: 3307.28 | bwd_inner_microstep: 3306.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 23:01:27,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.65 | bwd: 3307.29 | bwd_inner: 3306.49 | bwd_allreduce: 0.76 | step: 6.64 60%|██████ | 6030/10000 [9:31:48<6:02:41, 5.48s/it] {'loss': 0.0123, 'grad_norm': 3.8557629585266113, 'learning_rate': 1.4376650068425889e-05, 'epoch': 6.03} 60%|██████ | 6030/10000 [9:31:48<6:02:41, 5.48s/it][2025-06-19 23:01:33,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:01:33,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2090.43 | bwd_microstep: 3298.87 | bwd_inner_microstep: 3298.01 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.44 [2025-06-19 23:01:33,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.43 | bwd: 3298.89 | bwd_inner: 3298.01 | bwd_allreduce: 0.83 | step: 7.45 60%|██████ | 6031/10000 [9:31:54<6:01:32, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.0559801310300827, 'learning_rate': 1.4370434163977635e-05, 'epoch': 6.03} 60%|██████ | 6031/10000 [9:31:54<6:01:32, 5.47s/it][2025-06-19 23:01:38,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:01:38,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2095.36 | bwd_microstep: 3303.75 | bwd_inner_microstep: 3302.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 23:01:38,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2095.36 | bwd: 3303.77 | bwd_inner: 3302.95 | bwd_allreduce: 0.77 | step: 6.91 60%|██████ | 6032/10000 [9:31:59<6:00:56, 5.46s/it] {'loss': 0.0003, 'grad_norm': 0.056242864578962326, 'learning_rate': 1.4364218850044713e-05, 'epoch': 6.03} 60%|██████ | 6032/10000 
[9:31:59<6:00:56, 5.46s/it][2025-06-19 23:01:44,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:01:44,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.18 | bwd_microstep: 3349.72 | bwd_inner_microstep: 3348.66 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.20 [2025-06-19 23:01:44,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.18 | bwd: 3349.75 | bwd_inner: 3348.66 | bwd_allreduce: 1.02 | step: 7.20 60%|██████ | 6033/10000 [9:32:05<6:01:48, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.08900108933448792, 'learning_rate': 1.4358004127279079e-05, 'epoch': 6.03} 60%|██████ | 6033/10000 [9:32:05<6:01:48, 5.47s/it][2025-06-19 23:01:49,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:01:49,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.83 | bwd_microstep: 3311.57 | bwd_inner_microstep: 3310.51 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.26 [2025-06-19 23:01:49,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.83 | bwd: 3311.59 | bwd_inner: 3310.51 | bwd_allreduce: 1.02 | step: 7.26 60%|██████ | 6034/10000 [9:32:10<6:01:20, 5.47s/it] {'loss': 0.0007, 'grad_norm': 0.06981809437274933, 'learning_rate': 1.4351789996332621e-05, 'epoch': 6.03} 60%|██████ | 6034/10000 [9:32:10<6:01:20, 5.47s/it][2025-06-19 23:01:55,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:01:55,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.04 | bwd_microstep: 3311.16 | bwd_inner_microstep: 3309.87 | bwd_allreduce_microstep: 1.14 | step_microstep: 7.92 [2025-06-19 23:01:55,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.04 | bwd: 3311.21 | 
bwd_inner: 3309.87 | bwd_allreduce: 1.21 | step: 7.90 60%|██████ | 6035/10000 [9:32:16<6:00:52, 5.46s/it] {'loss': 0.0036, 'grad_norm': 1.1590814590454102, 'learning_rate': 1.4345576457857183e-05, 'epoch': 6.04} 60%|██████ | 6035/10000 [9:32:16<6:00:52, 5.46s/it][2025-06-19 23:02:00,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:02:00,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.09 | bwd_microstep: 3308.13 | bwd_inner_microstep: 3307.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 23:02:00,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.09 | bwd: 3308.15 | bwd_inner: 3307.34 | bwd_allreduce: 0.76 | step: 6.70 60%|██████ | 6036/10000 [9:32:21<6:00:46, 5.46s/it] {'loss': 0.0435, 'grad_norm': 2.908485174179077, 'learning_rate': 1.433936351250453e-05, 'epoch': 6.04} 60%|██████ | 6036/10000 [9:32:21<6:00:46, 5.46s/it][2025-06-19 23:02:06,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:02:06,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.87 | bwd_microstep: 3304.95 | bwd_inner_microstep: 3304.14 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.72 [2025-06-19 23:02:06,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.87 | bwd: 3304.97 | bwd_inner: 3304.14 | bwd_allreduce: 0.78 | step: 6.72 60%|██████ | 6037/10000 [9:32:26<6:00:23, 5.46s/it] {'loss': 0.002, 'grad_norm': 0.5647035241127014, 'learning_rate': 1.4333151160926368e-05, 'epoch': 6.04} 60%|██████ | 6037/10000 [9:32:26<6:00:23, 5.46s/it][2025-06-19 23:02:11,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:02:11,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2105.65 | bwd_microstep: 3307.82 | bwd_inner_microstep: 3306.89 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.08 [2025-06-19 23:02:11,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.65 | bwd: 3307.84 | bwd_inner: 3306.89 | bwd_allreduce: 0.89 | step: 7.08 60%|██████ | 6038/10000 [9:32:32<6:00:15, 5.46s/it] {'loss': 0.0002, 'grad_norm': 0.029852572828531265, 'learning_rate': 1.4326939403774363e-05, 'epoch': 6.04} 60%|██████ | 6038/10000 [9:32:32<6:00:15, 5.46s/it][2025-06-19 23:02:17,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:02:17,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.30 | bwd_microstep: 3301.58 | bwd_inner_microstep: 3300.56 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.29 [2025-06-19 23:02:17,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.30 | bwd: 3301.61 | bwd_inner: 3300.56 | bwd_allreduce: 0.99 | step: 7.29 60%|██████ | 6039/10000 [9:32:37<5:59:55, 5.45s/it] {'loss': 0.0032, 'grad_norm': 0.7988036274909973, 'learning_rate': 1.4320728241700069e-05, 'epoch': 6.04} 60%|██████ | 6039/10000 [9:32:37<5:59:55, 5.45s/it][2025-06-19 23:02:22,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:02:22,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.59 | bwd_microstep: 3306.00 | bwd_inner_microstep: 3304.99 | bwd_allreduce_microstep: 0.96 | step_microstep: 6.96 [2025-06-19 23:02:22,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.59 | bwd: 3306.02 | bwd_inner: 3304.99 | bwd_allreduce: 0.97 | step: 6.97 60%|██████ | 6040/10000 [9:32:43<5:59:43, 5.45s/it] {'loss': 0.0012, 'grad_norm': 0.13367187976837158, 'learning_rate': 1.4314517675355029e-05, 'epoch': 6.04} 60%|██████ | 6040/10000 [9:32:43<5:59:43, 5.45s/it][2025-06-19 
23:02:27,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:02:27,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.73 | bwd_microstep: 3354.81 | bwd_inner_microstep: 3354.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-19 23:02:27,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.73 | bwd: 3354.82 | bwd_inner: 3354.00 | bwd_allreduce: 0.78 | step: 6.90 60%|██████ | 6041/10000 [9:32:48<6:00:53, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.02900429628789425, 'learning_rate': 1.4308307705390698e-05, 'epoch': 6.04} 60%|██████ | 6041/10000 [9:32:48<6:00:53, 5.47s/it][2025-06-19 23:02:33,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:02:33,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.99 | bwd_microstep: 3309.10 | bwd_inner_microstep: 3308.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 23:02:33,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.99 | bwd: 3309.11 | bwd_inner: 3308.30 | bwd_allreduce: 0.77 | step: 6.74 60%|██████ | 6042/10000 [9:32:54<6:00:29, 5.46s/it] {'loss': 0.0174, 'grad_norm': 1.7057641744613647, 'learning_rate': 1.4302098332458475e-05, 'epoch': 6.04} 60%|██████ | 6042/10000 [9:32:54<6:00:29, 5.46s/it][2025-06-19 23:02:38,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:02:38,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.39 | bwd_microstep: 3356.86 | bwd_inner_microstep: 3356.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 23:02:38,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.39 | bwd: 3356.88 | bwd_inner: 3356.07 | bwd_allreduce: 0.76 
| step: 6.79 60%|██████ | 6043/10000 [9:32:59<6:01:26, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.03315321356058121, 'learning_rate': 1.4295889557209699e-05, 'epoch': 6.04} 60%|██████ | 6043/10000 [9:32:59<6:01:26, 5.48s/it][2025-06-19 23:02:44,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:02:44,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.40 | bwd_microstep: 3362.03 | bwd_inner_microstep: 3361.01 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.28 [2025-06-19 23:02:44,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.40 | bwd: 3362.05 | bwd_inner: 3361.01 | bwd_allreduce: 0.99 | step: 7.29 60%|██████ | 6044/10000 [9:33:05<6:02:27, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.07790524512529373, 'learning_rate': 1.4289681380295624e-05, 'epoch': 6.04} 60%|██████ | 6044/10000 [9:33:05<6:02:27, 5.50s/it][2025-06-19 23:02:49,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:02:49,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.45 | bwd_microstep: 3314.24 | bwd_inner_microstep: 3313.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 23:02:49,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.45 | bwd: 3314.25 | bwd_inner: 3313.45 | bwd_allreduce: 0.76 | step: 6.62 60%|██████ | 6045/10000 [9:33:10<6:01:34, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00031656018109060824, 'learning_rate': 1.4283473802367471e-05, 'epoch': 6.04} 60%|██████ | 6045/10000 [9:33:10<6:01:34, 5.49s/it][2025-06-19 23:02:55,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:02:55,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.42 | bwd_microstep: 3317.46 | 
bwd_inner_microstep: 3316.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.58 [2025-06-19 23:02:55,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.42 | bwd: 3317.47 | bwd_inner: 3316.67 | bwd_allreduce: 0.76 | step: 6.58 60%|██████ | 6046/10000 [9:33:16<6:00:53, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00027461309218779206, 'learning_rate': 1.4277266824076388e-05, 'epoch': 6.05} 60%|██████ | 6046/10000 [9:33:16<6:00:53, 5.48s/it][2025-06-19 23:03:00,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:03:00,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.45 | bwd_microstep: 3368.20 | bwd_inner_microstep: 3367.34 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.94 [2025-06-19 23:03:00,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.45 | bwd: 3368.21 | bwd_inner: 3367.33 | bwd_allreduce: 0.84 | step: 6.94 60%|██████ | 6047/10000 [9:33:21<6:01:50, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.08769695460796356, 'learning_rate': 1.4271060446073452e-05, 'epoch': 6.05} 60%|██████ | 6047/10000 [9:33:21<6:01:50, 5.49s/it][2025-06-19 23:03:06,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.74 [2025-06-19 23:03:06,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.29 | bwd_microstep: 3367.30 | bwd_inner_microstep: 3366.39 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.23 [2025-06-19 23:03:06,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.29 | bwd: 3367.33 | bwd_inner: 3366.39 | bwd_allreduce: 0.89 | step: 7.23 60%|██████ | 6048/10000 [9:33:27<6:02:34, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001652666600421071, 'learning_rate': 1.4264854669009692e-05, 'epoch': 6.05} 60%|██████ | 6048/10000 [9:33:27<6:02:34, 5.50s/it][2025-06-19 23:03:11,915] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:03:11,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.61 | bwd_microstep: 3320.52 | bwd_inner_microstep: 3319.67 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.18 [2025-06-19 23:03:11,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.61 | bwd: 3320.53 | bwd_inner: 3319.67 | bwd_allreduce: 0.82 | step: 7.18 60%|██████ | 6049/10000 [9:33:32<6:01:35, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.003936633467674255, 'learning_rate': 1.4258649493536054e-05, 'epoch': 6.05} 60%|██████ | 6049/10000 [9:33:32<6:01:35, 5.49s/it][2025-06-19 23:03:17,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:03:17,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.04 | bwd_microstep: 3316.85 | bwd_inner_microstep: 3316.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 23:03:17,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.04 | bwd: 3316.86 | bwd_inner: 3316.06 | bwd_allreduce: 0.76 | step: 6.74 60%|██████ | 6050/10000 [9:33:38<6:00:56, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.03212415426969528, 'learning_rate': 1.4252444920303438e-05, 'epoch': 6.05} 60%|██████ | 6050/10000 [9:33:38<6:00:56, 5.48s/it][2025-06-19 23:03:22,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:03:22,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.51 | bwd_microstep: 3366.67 | bwd_inner_microstep: 3365.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 23:03:22,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.51 | bwd: 3366.68 | bwd_inner: 3365.87 | bwd_allreduce: 0.76 | step: 6.73 
61%|██████ | 6051/10000 [9:33:43<6:01:54, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.009593692608177662, 'learning_rate': 1.4246240949962675e-05, 'epoch': 6.05} 61%|██████ | 6051/10000 [9:33:43<6:01:54, 5.50s/it][2025-06-19 23:03:28,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:03:28,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.40 | bwd_microstep: 3359.50 | bwd_inner_microstep: 3358.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.31 [2025-06-19 23:03:28,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.40 | bwd: 3359.52 | bwd_inner: 3358.69 | bwd_allreduce: 0.78 | step: 7.31 61%|██████ | 6052/10000 [9:33:49<6:02:24, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.05938392132520676, 'learning_rate': 1.4240037583164532e-05, 'epoch': 6.05} 61%|██████ | 6052/10000 [9:33:49<6:02:24, 5.51s/it][2025-06-19 23:03:33,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:03:33,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.64 | bwd_microstep: 3364.45 | bwd_inner_microstep: 3363.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:03:33,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.64 | bwd: 3364.46 | bwd_inner: 3363.66 | bwd_allreduce: 0.76 | step: 6.71 61%|██████ | 6053/10000 [9:33:54<6:02:50, 5.52s/it] {'loss': 0.0763, 'grad_norm': 6.3335280418396, 'learning_rate': 1.4233834820559724e-05, 'epoch': 6.05} 61%|██████ | 6053/10000 [9:33:54<6:02:50, 5.52s/it][2025-06-19 23:03:39,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:03:39,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.14 | bwd_microstep: 3368.72 | 
bwd_inner_microstep: 3367.76 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.71 [2025-06-19 23:03:39,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.14 | bwd: 3368.74 | bwd_inner: 3367.76 | bwd_allreduce: 0.93 | step: 7.72 61%|██████ | 6054/10000 [9:34:00<6:03:05, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.009875445626676083, 'learning_rate': 1.422763266279887e-05, 'epoch': 6.05} 61%|██████ | 6054/10000 [9:34:00<6:03:05, 5.52s/it][2025-06-19 23:03:44,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:03:44,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.83 | bwd_microstep: 3313.41 | bwd_inner_microstep: 3312.59 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.18 [2025-06-19 23:03:44,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.83 | bwd: 3313.43 | bwd_inner: 3312.59 | bwd_allreduce: 0.80 | step: 7.18 61%|██████ | 6055/10000 [9:34:05<6:01:52, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.11404848098754883, 'learning_rate': 1.4221431110532562e-05, 'epoch': 6.05} 61%|██████ | 6055/10000 [9:34:05<6:01:52, 5.50s/it][2025-06-19 23:03:50,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:03:50,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.01 | bwd_microstep: 3315.58 | bwd_inner_microstep: 3314.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 23:03:50,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.01 | bwd: 3315.60 | bwd_inner: 3314.78 | bwd_allreduce: 0.77 | step: 7.08 61%|██████ | 6056/10000 [9:34:11<6:00:59, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.0990685448050499, 'learning_rate': 1.4215230164411309e-05, 'epoch': 6.06} 61%|██████ | 6056/10000 [9:34:11<6:00:59, 5.49s/it][2025-06-19 23:03:55,913] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:03:55,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.91 | bwd_microstep: 3321.57 | bwd_inner_microstep: 3320.68 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.95 [2025-06-19 23:03:55,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.91 | bwd: 3321.59 | bwd_inner: 3320.68 | bwd_allreduce: 0.86 | step: 6.95 61%|██████ | 6057/10000 [9:34:16<6:00:32, 5.49s/it] {'loss': 0.0033, 'grad_norm': 0.47406989336013794, 'learning_rate': 1.4209029825085568e-05, 'epoch': 6.06} 61%|██████ | 6057/10000 [9:34:16<6:00:32, 5.49s/it][2025-06-19 23:04:01,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:04:01,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.96 | bwd_microstep: 3325.27 | bwd_inner_microstep: 3324.47 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-19 23:04:01,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.96 | bwd: 3325.29 | bwd_inner: 3324.47 | bwd_allreduce: 0.77 | step: 7.20 61%|██████ | 6058/10000 [9:34:22<6:00:10, 5.48s/it] {'loss': 0.1135, 'grad_norm': 5.599523067474365, 'learning_rate': 1.4202830093205721e-05, 'epoch': 6.06} 61%|██████ | 6058/10000 [9:34:22<6:00:10, 5.48s/it][2025-06-19 23:04:06,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:04:06,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.54 | bwd_microstep: 3320.08 | bwd_inner_microstep: 3319.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 23:04:06,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.54 | bwd: 3320.09 | bwd_inner: 3319.29 | bwd_allreduce: 0.76 | step: 6.59 
61%|██████ | 6059/10000 [9:34:27<5:59:45, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.006179208867251873, 'learning_rate': 1.4196630969422089e-05, 'epoch': 6.06} 61%|██████ | 6059/10000 [9:34:27<5:59:45, 5.48s/it][2025-06-19 23:04:12,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:04:12,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.00 | bwd_microstep: 3378.87 | bwd_inner_microstep: 3377.91 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.07 [2025-06-19 23:04:12,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.00 | bwd: 3378.88 | bwd_inner: 3377.91 | bwd_allreduce: 0.93 | step: 7.08 61%|██████ | 6060/10000 [9:34:33<6:01:03, 5.50s/it] {'loss': 0.0329, 'grad_norm': 7.260401248931885, 'learning_rate': 1.4190432454384934e-05, 'epoch': 6.06} 61%|██████ | 6060/10000 [9:34:33<6:01:03, 5.50s/it][2025-06-19 23:04:17,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:04:17,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.57 | bwd_microstep: 3323.70 | bwd_inner_microstep: 3322.88 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.99 [2025-06-19 23:04:17,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.57 | bwd: 3323.72 | bwd_inner: 3322.88 | bwd_allreduce: 0.79 | step: 6.99 61%|██████ | 6061/10000 [9:34:38<6:00:29, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.4059150218963623, 'learning_rate': 1.4184234548744454e-05, 'epoch': 6.06} 61%|██████ | 6061/10000 [9:34:38<6:00:29, 5.49s/it][2025-06-19 23:04:23,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 23:04:23,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.33 | bwd_microstep: 3323.03 | 
bwd_inner_microstep: 3321.93 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.80 [2025-06-19 23:04:23,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.33 | bwd: 3323.05 | bwd_inner: 3321.93 | bwd_allreduce: 1.07 | step: 7.80 61%|██████ | 6062/10000 [9:34:44<6:00:08, 5.49s/it] {'loss': 0.0006, 'grad_norm': 0.09358952194452286, 'learning_rate': 1.4178037253150775e-05, 'epoch': 6.06} 61%|██████ | 6062/10000 [9:34:44<6:00:08, 5.49s/it][2025-06-19 23:04:28,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:04:28,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.77 | bwd_microstep: 3325.61 | bwd_inner_microstep: 3324.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 23:04:28,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.77 | bwd: 3325.62 | bwd_inner: 3324.82 | bwd_allreduce: 0.76 | step: 6.81 61%|██████ | 6063/10000 [9:34:49<5:59:51, 5.48s/it] {'loss': 0.0051, 'grad_norm': 2.0297110080718994, 'learning_rate': 1.4171840568253979e-05, 'epoch': 6.06} 61%|██████ | 6063/10000 [9:34:49<5:59:51, 5.48s/it][2025-06-19 23:04:34,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:04:34,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.74 | bwd_microstep: 3373.12 | bwd_inner_microstep: 3372.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 23:04:34,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.74 | bwd: 3373.13 | bwd_inner: 3372.31 | bwd_allreduce: 0.78 | step: 7.08 61%|██████ | 6064/10000 [9:34:55<6:00:54, 5.50s/it] {'loss': 0.001, 'grad_norm': 0.13570749759674072, 'learning_rate': 1.4165644494704047e-05, 'epoch': 6.06} 61%|██████ | 6064/10000 [9:34:55<6:00:54, 5.50s/it][2025-06-19 23:04:39,910] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:04:39,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.31 | bwd_microstep: 3368.90 | bwd_inner_microstep: 3368.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 23:04:39,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.31 | bwd: 3368.91 | bwd_inner: 3368.11 | bwd_allreduce: 0.76 | step: 6.71 61%|██████ | 6065/10000 [9:35:00<6:01:34, 5.51s/it] {'loss': 0.0298, 'grad_norm': 3.3210830688476562, 'learning_rate': 1.4159449033150931e-05, 'epoch': 6.07} 61%|██████ | 6065/10000 [9:35:00<6:01:34, 5.51s/it][2025-06-19 23:04:45,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:04:45,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.07 | bwd_microstep: 3377.38 | bwd_inner_microstep: 3376.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 23:04:45,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.07 | bwd: 3377.39 | bwd_inner: 3376.58 | bwd_allreduce: 0.76 | step: 6.74 61%|██████ | 6066/10000 [9:35:06<6:02:10, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0019283808069303632, 'learning_rate': 1.4153254184244502e-05, 'epoch': 6.07} 61%|██████ | 6066/10000 [9:35:06<6:02:10, 5.52s/it][2025-06-19 23:04:51,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:04:51,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.87 | bwd_microstep: 3370.48 | bwd_inner_microstep: 3369.45 | bwd_allreduce_microstep: 0.98 | step_microstep: 6.95 [2025-06-19 23:04:51,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.87 | bwd: 3370.50 | bwd_inner: 3369.45 | bwd_allreduce: 1.00 | step: 6.95 
61%|██████ | 6067/10000 [9:35:11<6:02:34, 5.53s/it] {'loss': 0.0006, 'grad_norm': 0.08418568223714828, 'learning_rate': 1.4147059948634576e-05, 'epoch': 6.07} 61%|██████ | 6067/10000 [9:35:11<6:02:34, 5.53s/it][2025-06-19 23:04:56,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:04:56,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.92 | bwd_microstep: 3333.14 | bwd_inner_microstep: 3332.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 23:04:56,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.92 | bwd: 3333.15 | bwd_inner: 3332.34 | bwd_allreduce: 0.76 | step: 7.07 61%|██████ | 6068/10000 [9:35:17<6:01:34, 5.52s/it] {'loss': 0.0011, 'grad_norm': 0.20000626146793365, 'learning_rate': 1.4140866326970902e-05, 'epoch': 6.07} 61%|██████ | 6068/10000 [9:35:17<6:01:34, 5.52s/it][2025-06-19 23:05:01,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:05:01,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.21 | bwd_microstep: 3334.52 | bwd_inner_microstep: 3333.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 23:05:01,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.21 | bwd: 3334.53 | bwd_inner: 3333.72 | bwd_allreduce: 0.77 | step: 6.72 61%|██████ | 6069/10000 [9:35:22<6:00:44, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0011784904636442661, 'learning_rate': 1.4134673319903151e-05, 'epoch': 6.07} 61%|██████ | 6069/10000 [9:35:22<6:00:44, 5.51s/it][2025-06-19 23:05:07,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:05:07,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.78 | bwd_microstep: 3329.90 | 
bwd_inner_microstep: 3328.99 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.46 [2025-06-19 23:05:07,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.78 | bwd: 3329.92 | bwd_inner: 3328.99 | bwd_allreduce: 0.87 | step: 7.46 61%|██████ | 6070/10000 [9:35:28<6:00:16, 5.50s/it] {'loss': 0.001, 'grad_norm': 0.14634132385253906, 'learning_rate': 1.4128480928080945e-05, 'epoch': 6.07} 61%|██████ | 6070/10000 [9:35:28<6:00:16, 5.50s/it][2025-06-19 23:05:12,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:05:12,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.74 | bwd_microstep: 3326.17 | bwd_inner_microstep: 3325.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 23:05:12,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.74 | bwd: 3326.18 | bwd_inner: 3325.36 | bwd_allreduce: 0.78 | step: 7.09 61%|██████ | 6071/10000 [9:35:33<5:59:55, 5.50s/it] {'loss': 0.0033, 'grad_norm': 0.8595883250236511, 'learning_rate': 1.4122289152153843e-05, 'epoch': 6.07} 61%|██████ | 6071/10000 [9:35:33<5:59:55, 5.50s/it][2025-06-19 23:05:18,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:05:18,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.60 | bwd_microstep: 3326.92 | bwd_inner_microstep: 3325.94 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.44 [2025-06-19 23:05:18,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.60 | bwd: 3326.94 | bwd_inner: 3325.94 | bwd_allreduce: 0.95 | step: 7.45 61%|██████ | 6072/10000 [9:35:39<5:59:24, 5.49s/it] {'loss': 0.0029, 'grad_norm': 0.3233245611190796, 'learning_rate': 1.4116097992771329e-05, 'epoch': 6.07} 61%|██████ | 6072/10000 [9:35:39<5:59:24, 5.49s/it][2025-06-19 23:05:23,978] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-19 23:05:23,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.69 | bwd_microstep: 3381.59 | bwd_inner_microstep: 3380.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 23:05:23,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.69 | bwd: 3381.60 | bwd_inner: 3380.78 | bwd_allreduce: 0.78 | step: 7.08 61%|██████ | 6073/10000 [9:35:44<6:00:38, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.014931712299585342, 'learning_rate': 1.4109907450582837e-05, 'epoch': 6.07} 61%|██████ | 6073/10000 [9:35:44<6:00:38, 5.51s/it][2025-06-19 23:05:29,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:05:29,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.73 | bwd_microstep: 3380.10 | bwd_inner_microstep: 3379.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-19 23:05:29,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.73 | bwd: 3380.12 | bwd_inner: 3379.32 | bwd_allreduce: 0.76 | step: 6.75 61%|██████ | 6074/10000 [9:35:50<6:01:27, 5.52s/it] {'loss': 0.005, 'grad_norm': 0.822711706161499, 'learning_rate': 1.41037175262377e-05, 'epoch': 6.07} 61%|██████ | 6074/10000 [9:35:50<6:01:27, 5.52s/it][2025-06-19 23:05:35,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:05:35,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.96 | bwd_microstep: 3379.35 | bwd_inner_microstep: 3378.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:05:35,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.96 | bwd: 3379.37 | bwd_inner: 3378.56 | bwd_allreduce: 0.77 | step: 6.71 61%|██████ 
| 6075/10000 [9:35:55<6:01:57, 5.53s/it] {'loss': 0.0008, 'grad_norm': 0.12226215749979019, 'learning_rate': 1.4097528220385235e-05, 'epoch': 6.08} 61%|██████ | 6075/10000 [9:35:55<6:01:57, 5.53s/it][2025-06-19 23:05:40,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:05:40,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.84 | bwd_microstep: 3370.72 | bwd_inner_microstep: 3369.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 23:05:40,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.84 | bwd: 3370.74 | bwd_inner: 3369.93 | bwd_allreduce: 0.76 | step: 6.73 61%|██████ | 6076/10000 [9:36:01<6:01:56, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.013706923462450504, 'learning_rate': 1.4091339533674665e-05, 'epoch': 6.08} 61%|██████ | 6076/10000 [9:36:01<6:01:56, 5.53s/it][2025-06-19 23:05:46,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:05:46,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.34 | bwd_microstep: 3322.76 | bwd_inner_microstep: 3321.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:05:46,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.34 | bwd: 3322.77 | bwd_inner: 3321.96 | bwd_allreduce: 0.76 | step: 6.72 61%|██████ | 6077/10000 [9:36:06<6:00:27, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.02027312107384205, 'learning_rate': 1.4085151466755159e-05, 'epoch': 6.08} 61%|██████ | 6077/10000 [9:36:06<6:00:27, 5.51s/it][2025-06-19 23:05:51,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:05:51,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.14 | bwd_microstep: 3382.73 | bwd_inner_microstep: 
3381.91 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.28 [2025-06-19 23:05:51,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.14 | bwd: 3382.74 | bwd_inner: 3381.91 | bwd_allreduce: 0.78 | step: 7.28 61%|██████ | 6078/10000 [9:36:12<6:01:09, 5.53s/it] {'loss': 0.0002, 'grad_norm': 0.01800442300736904, 'learning_rate': 1.4078964020275816e-05, 'epoch': 6.08} 61%|██████ | 6078/10000 [9:36:12<6:01:09, 5.53s/it][2025-06-19 23:05:57,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:05:57,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.94 | bwd_microstep: 3321.88 | bwd_inner_microstep: 3321.07 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.75 [2025-06-19 23:05:57,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.94 | bwd: 3321.90 | bwd_inner: 3321.07 | bwd_allreduce: 0.79 | step: 6.75 61%|██████ | 6079/10000 [9:36:17<5:59:55, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.013122263364493847, 'learning_rate': 1.4072777194885658e-05, 'epoch': 6.08} 61%|██████ | 6079/10000 [9:36:17<5:59:55, 5.51s/it][2025-06-19 23:06:02,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:06:02,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.46 | bwd_microstep: 3330.95 | bwd_inner_microstep: 3330.08 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.29 [2025-06-19 23:06:02,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.46 | bwd: 3330.96 | bwd_inner: 3330.08 | bwd_allreduce: 0.83 | step: 7.30 61%|██████ | 6080/10000 [9:36:23<5:59:31, 5.50s/it] {'loss': 0.0021, 'grad_norm': 0.45249733328819275, 'learning_rate': 1.4066590991233669e-05, 'epoch': 6.08} 61%|██████ | 6080/10000 [9:36:23<5:59:31, 5.50s/it][2025-06-19 23:06:08,148] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:06:08,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.25 | bwd_microstep: 3372.02 | bwd_inner_microstep: 3371.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-19 23:06:08,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.26 | bwd: 3372.03 | bwd_inner: 3371.21 | bwd_allreduce: 0.78 | step: 6.97 61%|██████ | 6081/10000 [9:36:28<6:00:17, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.08338294178247452, 'learning_rate': 1.4060405409968741e-05, 'epoch': 6.08} 61%|██████ | 6081/10000 [9:36:28<6:00:17, 5.52s/it][2025-06-19 23:06:13,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:06:13,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.30 | bwd_microstep: 3324.14 | bwd_inner_microstep: 3323.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 23:06:13,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.30 | bwd: 3324.16 | bwd_inner: 3323.35 | bwd_allreduce: 0.76 | step: 6.88 61%|██████ | 6082/10000 [9:36:34<5:59:23, 5.50s/it] {'loss': 0.0058, 'grad_norm': 0.8079464435577393, 'learning_rate': 1.4054220451739725e-05, 'epoch': 6.08} 61%|██████ | 6082/10000 [9:36:34<5:59:23, 5.50s/it][2025-06-19 23:06:19,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:06:19,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.29 | bwd_microstep: 3371.57 | bwd_inner_microstep: 3370.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 23:06:19,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.29 | bwd: 3371.59 | bwd_inner: 3370.79 | bwd_allreduce: 0.76 | step: 6.63 61%|██████ | 6083/10000 [9:36:39<6:00:02, 
5.51s/it] {'loss': 0.2375, 'grad_norm': 6.177032947540283, 'learning_rate': 1.4048036117195387e-05, 'epoch': 6.08} 61%|██████ | 6083/10000 [9:36:39<6:00:02, 5.51s/it][2025-06-19 23:06:24,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:06:24,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.11 | bwd_microstep: 3328.74 | bwd_inner_microstep: 3327.93 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.10 [2025-06-19 23:06:24,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.11 | bwd: 3328.76 | bwd_inner: 3327.93 | bwd_allreduce: 0.79 | step: 7.10 61%|██████ | 6084/10000 [9:36:45<5:59:15, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.023132655769586563, 'learning_rate': 1.4041852406984436e-05, 'epoch': 6.08} 61%|██████ | 6084/10000 [9:36:45<5:59:15, 5.50s/it][2025-06-19 23:06:30,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:06:30,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.60 | bwd_microstep: 3332.16 | bwd_inner_microstep: 3331.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 23:06:30,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.60 | bwd: 3332.17 | bwd_inner: 3331.35 | bwd_allreduce: 0.77 | step: 7.05 61%|██████ | 6085/10000 [9:36:50<5:58:42, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.7078049182891846, 'learning_rate': 1.4035669321755511e-05, 'epoch': 6.08} 61%|██████ | 6085/10000 [9:36:50<5:58:42, 5.50s/it][2025-06-19 23:06:35,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:06:35,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.32 | bwd_microstep: 3328.29 | bwd_inner_microstep: 3327.50 | bwd_allreduce_microstep: 
0.75 | step_microstep: 6.70 [2025-06-19 23:06:35,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.32 | bwd: 3328.31 | bwd_inner: 3327.50 | bwd_allreduce: 0.76 | step: 6.70 61%|██████ | 6086/10000 [9:36:56<5:58:08, 5.49s/it] {'loss': 0.0063, 'grad_norm': 0.8793833255767822, 'learning_rate': 1.4029486862157195e-05, 'epoch': 6.09} 61%|██████ | 6086/10000 [9:36:56<5:58:08, 5.49s/it][2025-06-19 23:06:41,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:06:41,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.40 | bwd_microstep: 3323.68 | bwd_inner_microstep: 3322.74 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.00 [2025-06-19 23:06:41,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.40 | bwd: 3323.69 | bwd_inner: 3322.74 | bwd_allreduce: 0.91 | step: 7.00 61%|██████ | 6087/10000 [9:37:01<5:57:45, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.029913349077105522, 'learning_rate': 1.4023305028837996e-05, 'epoch': 6.09} 61%|██████ | 6087/10000 [9:37:01<5:57:45, 5.49s/it][2025-06-19 23:06:46,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:06:46,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.37 | bwd_microstep: 3374.42 | bwd_inner_microstep: 3373.52 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.17 [2025-06-19 23:06:46,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.37 | bwd: 3374.45 | bwd_inner: 3373.52 | bwd_allreduce: 0.86 | step: 7.17 61%|██████ | 6088/10000 [9:37:07<5:58:55, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.021096356213092804, 'learning_rate': 1.4017123822446372e-05, 'epoch': 6.09} 61%|██████ | 6088/10000 [9:37:07<5:58:55, 5.51s/it][2025-06-19 23:06:52,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:06:52,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.07 | bwd_microstep: 3368.20 | bwd_inner_microstep: 3367.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 23:06:52,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.07 | bwd: 3368.21 | bwd_inner: 3367.41 | bwd_allreduce: 0.76 | step: 6.69 61%|██████ | 6089/10000 [9:37:12<5:59:29, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.013743000105023384, 'learning_rate': 1.4010943243630681e-05, 'epoch': 6.09} 61%|██████ | 6089/10000 [9:37:12<5:59:29, 5.52s/it][2025-06-19 23:06:57,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:06:57,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.17 | bwd_microstep: 3316.30 | bwd_inner_microstep: 3315.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 23:06:57,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.17 | bwd: 3316.31 | bwd_inner: 3315.52 | bwd_allreduce: 0.75 | step: 6.55 61%|██████ | 6090/10000 [9:37:18<5:58:15, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.012860064394772053, 'learning_rate': 1.400476329303925e-05, 'epoch': 6.09} 61%|██████ | 6090/10000 [9:37:18<5:58:15, 5.50s/it][2025-06-19 23:07:03,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:07:03,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.28 | bwd_microstep: 3322.03 | bwd_inner_microstep: 3321.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-19 23:07:03,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.28 | bwd: 3322.05 | bwd_inner: 3321.22 | bwd_allreduce: 0.78 | step: 6.90 61%|██████ | 6091/10000 [9:37:23<5:57:37, 5.49s/it] {'loss': 0.0126, 
'grad_norm': 3.9618730545043945, 'learning_rate': 1.3998583971320323e-05, 'epoch': 6.09} 61%|██████ | 6091/10000 [9:37:23<5:57:37, 5.49s/it][2025-06-19 23:07:08,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:07:08,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.72 | bwd_microstep: 3332.88 | bwd_inner_microstep: 3331.91 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.65 [2025-06-19 23:07:08,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.72 | bwd: 3332.89 | bwd_inner: 3331.91 | bwd_allreduce: 0.94 | step: 7.66 61%|██████ | 6092/10000 [9:37:29<5:57:25, 5.49s/it] {'loss': 0.005, 'grad_norm': 1.2314324378967285, 'learning_rate': 1.3992405279122083e-05, 'epoch': 6.09} 61%|██████ | 6092/10000 [9:37:29<5:57:25, 5.49s/it][2025-06-19 23:07:14,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:07:14,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.39 | bwd_microstep: 3317.13 | bwd_inner_microstep: 3316.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 23:07:14,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.39 | bwd: 3317.14 | bwd_inner: 3316.35 | bwd_allreduce: 0.76 | step: 6.64 61%|██████ | 6093/10000 [9:37:34<5:56:58, 5.48s/it] {'loss': 0.0174, 'grad_norm': 3.7988040447235107, 'learning_rate': 1.398622721709265e-05, 'epoch': 6.09} 61%|██████ | 6093/10000 [9:37:34<5:56:58, 5.48s/it][2025-06-19 23:07:19,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:07:19,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.39 | bwd_microstep: 3336.59 | bwd_inner_microstep: 3335.66 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.39 
[2025-06-19 23:07:19,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.39 | bwd: 3336.61 | bwd_inner: 3335.66 | bwd_allreduce: 0.90 | step: 7.39 61%|██████ | 6094/10000 [9:37:40<5:56:54, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.054079677909612656, 'learning_rate': 1.3980049785880073e-05, 'epoch': 6.09} 61%|██████ | 6094/10000 [9:37:40<5:56:54, 5.48s/it][2025-06-19 23:07:25,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:07:25,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.49 | bwd_microstep: 3336.91 | bwd_inner_microstep: 3336.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 23:07:25,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.49 | bwd: 3336.93 | bwd_inner: 3336.13 | bwd_allreduce: 0.76 | step: 6.64 61%|██████ | 6095/10000 [9:37:45<5:56:47, 5.48s/it] {'loss': 0.0174, 'grad_norm': 3.0336403846740723, 'learning_rate': 1.397387298613233e-05, 'epoch': 6.09} 61%|██████ | 6095/10000 [9:37:45<5:56:47, 5.48s/it][2025-06-19 23:07:30,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:07:30,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.35 | bwd_microstep: 3326.43 | bwd_inner_microstep: 3325.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 23:07:30,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.35 | bwd: 3326.45 | bwd_inner: 3325.65 | bwd_allreduce: 0.76 | step: 6.65 61%|██████ | 6096/10000 [9:37:51<5:56:28, 5.48s/it] {'loss': 0.0051, 'grad_norm': 1.1198374032974243, 'learning_rate': 1.3967696818497343e-05, 'epoch': 6.1} 61%|██████ | 6096/10000 [9:37:51<5:56:28, 5.48s/it][2025-06-19 23:07:35,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-19 23:07:35,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.36 | bwd_microstep: 3315.79 | bwd_inner_microstep: 3314.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 23:07:35,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.36 | bwd: 3315.80 | bwd_inner: 3314.99 | bwd_allreduce: 0.76 | step: 7.00 61%|██████ | 6097/10000 [9:37:56<5:55:57, 5.47s/it] {'loss': 0.001, 'grad_norm': 0.09907326847314835, 'learning_rate': 1.3961521283622963e-05, 'epoch': 6.1} 61%|██████ | 6097/10000 [9:37:56<5:55:57, 5.47s/it][2025-06-19 23:07:41,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:07:41,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.39 | bwd_microstep: 3385.11 | bwd_inner_microstep: 3384.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:07:41,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.39 | bwd: 3385.12 | bwd_inner: 3384.33 | bwd_allreduce: 0.75 | step: 6.66 61%|██████ | 6098/10000 [9:38:02<5:57:30, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.0114402174949646, 'learning_rate': 1.3955346382156974e-05, 'epoch': 6.1} 61%|██████ | 6098/10000 [9:38:02<5:57:30, 5.50s/it][2025-06-19 23:07:46,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:07:46,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.36 | bwd_microstep: 3330.39 | bwd_inner_microstep: 3329.55 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.88 [2025-06-19 23:07:46,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.36 | bwd: 3330.40 | bwd_inner: 3329.55 | bwd_allreduce: 0.80 | step: 6.88 61%|██████ | 6099/10000 [9:38:07<5:57:03, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004749496467411518, 
'learning_rate': 1.3949172114747105e-05, 'epoch': 6.1} 61%|██████ | 6099/10000 [9:38:07<5:57:03, 5.49s/it][2025-06-19 23:07:52,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:07:52,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.48 | bwd_microstep: 3368.59 | bwd_inner_microstep: 3367.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 23:07:52,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.48 | bwd: 3368.60 | bwd_inner: 3367.80 | bwd_allreduce: 0.76 | step: 6.68 61%|██████ | 6100/10000 [9:38:13<5:57:45, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.2553156018257141, 'learning_rate': 1.394299848204099e-05, 'epoch': 6.1} 61%|██████ | 6100/10000 [9:38:13<5:57:45, 5.50s/it][2025-06-19 23:07:58,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:07:58,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.81 | bwd_microstep: 3381.93 | bwd_inner_microstep: 3381.09 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.29 [2025-06-19 23:07:58,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.81 | bwd: 3381.95 | bwd_inner: 3381.09 | bwd_allreduce: 0.80 | step: 7.30 61%|██████ | 6101/10000 [9:38:18<5:58:54, 5.52s/it] {'loss': 0.0426, 'grad_norm': 4.605076313018799, 'learning_rate': 1.3936825484686224e-05, 'epoch': 6.1} 61%|██████ | 6101/10000 [9:38:18<5:58:54, 5.52s/it][2025-06-19 23:08:03,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:08:03,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.28 | bwd_microstep: 3323.51 | bwd_inner_microstep: 3322.67 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.06 [2025-06-19 23:08:03,538] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.28 | bwd: 3323.52 | bwd_inner: 3322.67 | bwd_allreduce: 0.80 | step: 7.06 61%|██████ | 6102/10000 [9:38:24<5:57:46, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.047660935670137405, 'learning_rate': 1.393065312333032e-05, 'epoch': 6.1} 61%|██████ | 6102/10000 [9:38:24<5:57:46, 5.51s/it][2025-06-19 23:08:09,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:08:09,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.34 | bwd_microstep: 3373.53 | bwd_inner_microstep: 3372.57 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.55 [2025-06-19 23:08:09,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.34 | bwd: 3373.54 | bwd_inner: 3372.57 | bwd_allreduce: 0.93 | step: 7.55 61%|██████ | 6103/10000 [9:38:29<5:58:23, 5.52s/it] {'loss': 0.0244, 'grad_norm': 7.052847385406494, 'learning_rate': 1.3924481398620739e-05, 'epoch': 6.1} 61%|██████ | 6103/10000 [9:38:29<5:58:23, 5.52s/it][2025-06-19 23:08:14,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:08:14,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.38 | bwd_microstep: 3324.55 | bwd_inner_microstep: 3323.57 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.58 [2025-06-19 23:08:14,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.38 | bwd: 3324.57 | bwd_inner: 3323.57 | bwd_allreduce: 0.95 | step: 7.58 61%|██████ | 6104/10000 [9:38:35<5:57:33, 5.51s/it] {'loss': 0.0013, 'grad_norm': 0.24671311676502228, 'learning_rate': 1.3918310311204866e-05, 'epoch': 6.1} 61%|██████ | 6104/10000 [9:38:35<5:57:33, 5.51s/it][2025-06-19 23:08:20,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 
23:08:20,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.16 | bwd_microstep: 3367.78 | bwd_inner_microstep: 3366.84 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.43 [2025-06-19 23:08:20,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.16 | bwd: 3367.80 | bwd_inner: 3366.84 | bwd_allreduce: 0.91 | step: 7.44 61%|██████ | 6105/10000 [9:38:40<5:58:12, 5.52s/it] {'loss': 0.002, 'grad_norm': 0.7052962183952332, 'learning_rate': 1.391213986173001e-05, 'epoch': 6.11} 61%|██████ | 6105/10000 [9:38:40<5:58:12, 5.52s/it][2025-06-19 23:08:25,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:08:25,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.93 | bwd_microstep: 3324.40 | bwd_inner_microstep: 3323.30 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.61 [2025-06-19 23:08:25,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.93 | bwd: 3324.42 | bwd_inner: 3323.30 | bwd_allreduce: 1.06 | step: 7.61 61%|██████ | 6106/10000 [9:38:46<5:57:24, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0045251911506056786, 'learning_rate': 1.3905970050843427e-05, 'epoch': 6.11} 61%|██████ | 6106/10000 [9:38:46<5:57:24, 5.51s/it][2025-06-19 23:08:31,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:08:31,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.13 | bwd_microstep: 3394.18 | bwd_inner_microstep: 3393.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-19 23:08:31,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.13 | bwd: 3394.20 | bwd_inner: 3393.38 | bwd_allreduce: 0.77 | step: 6.87 61%|██████ | 6107/10000 [9:38:51<5:58:39, 5.53s/it] {'loss': 0.0024, 'grad_norm': 0.5378208160400391, 'learning_rate': 1.3899800879192302e-05, 
'epoch': 6.11} 61%|██████ | 6107/10000 [9:38:51<5:58:39, 5.53s/it][2025-06-19 23:08:36,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:08:36,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.35 | bwd_microstep: 3376.15 | bwd_inner_microstep: 3375.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 23:08:36,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.35 | bwd: 3376.17 | bwd_inner: 3375.36 | bwd_allreduce: 0.76 | step: 6.96 61%|██████ | 6108/10000 [9:38:57<5:58:52, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.005639049224555492, 'learning_rate': 1.3893632347423756e-05, 'epoch': 6.11} 61%|██████ | 6108/10000 [9:38:57<5:58:52, 5.53s/it][2025-06-19 23:08:42,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:08:42,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.88 | bwd_microstep: 3326.58 | bwd_inner_microstep: 3325.71 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.61 [2025-06-19 23:08:42,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.88 | bwd: 3326.61 | bwd_inner: 3325.71 | bwd_allreduce: 0.84 | step: 7.62 61%|██████ | 6109/10000 [9:39:02<5:57:38, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0010032269638031721, 'learning_rate': 1.3887464456184839e-05, 'epoch': 6.11} 61%|██████ | 6109/10000 [9:39:02<5:57:38, 5.51s/it][2025-06-19 23:08:47,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.88 [2025-06-19 23:08:47,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.29 | bwd_microstep: 3378.64 | bwd_inner_microstep: 3377.82 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.08 [2025-06-19 23:08:47,729] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2129.29 | bwd: 3378.66 | bwd_inner: 3377.82 | bwd_allreduce: 0.79 | step: 7.08 61%|██████ | 6110/10000 [9:39:08<5:58:09, 5.52s/it] {'loss': 0.0016, 'grad_norm': 0.3012080192565918, 'learning_rate': 1.3881297206122526e-05, 'epoch': 6.11} 61%|██████ | 6110/10000 [9:39:08<5:58:09, 5.52s/it][2025-06-19 23:08:53,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:08:53,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.06 | bwd_microstep: 3317.16 | bwd_inner_microstep: 3316.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.30 [2025-06-19 23:08:53,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.06 | bwd: 3317.17 | bwd_inner: 3316.36 | bwd_allreduce: 0.77 | step: 7.30 61%|██████ | 6111/10000 [9:39:13<5:57:01, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.0071691772900521755, 'learning_rate': 1.387513059788374e-05, 'epoch': 6.11} 61%|██████ | 6111/10000 [9:39:14<5:57:01, 5.51s/it][2025-06-19 23:08:58,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:08:58,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.33 | bwd_microstep: 3368.68 | bwd_inner_microstep: 3367.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 23:08:58,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.33 | bwd: 3368.69 | bwd_inner: 3367.89 | bwd_allreduce: 0.76 | step: 6.63 61%|██████ | 6112/10000 [9:39:19<5:57:23, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.04787006601691246, 'learning_rate': 1.3868964632115325e-05, 'epoch': 6.11} 61%|██████ | 6112/10000 [9:39:19<5:57:23, 5.52s/it][2025-06-19 23:09:04,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:09:04,200] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.60 | bwd_microstep: 3314.22 | bwd_inner_microstep: 3313.42 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.03 [2025-06-19 23:09:04,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.60 | bwd: 3314.23 | bwd_inner: 3313.42 | bwd_allreduce: 0.77 | step: 7.03 61%|██████ | 6113/10000 [9:39:24<5:56:23, 5.50s/it] {'loss': 0.2765, 'grad_norm': 10.963606834411621, 'learning_rate': 1.3862799309464068e-05, 'epoch': 6.11} 61%|██████ | 6113/10000 [9:39:25<5:56:23, 5.50s/it][2025-06-19 23:09:09,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:09:09,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.24 | bwd_microstep: 3323.01 | bwd_inner_microstep: 3322.22 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 23:09:09,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.24 | bwd: 3323.02 | bwd_inner: 3322.22 | bwd_allreduce: 0.75 | step: 6.73 61%|██████ | 6114/10000 [9:39:30<5:55:40, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.07055936753749847, 'learning_rate': 1.3856634630576688e-05, 'epoch': 6.11} 61%|██████ | 6114/10000 [9:39:30<5:55:40, 5.49s/it][2025-06-19 23:09:15,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:09:15,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.41 | bwd_microstep: 3317.91 | bwd_inner_microstep: 3317.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 23:09:15,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.41 | bwd: 3317.92 | bwd_inner: 3317.13 | bwd_allreduce: 0.75 | step: 6.60 61%|██████ | 6115/10000 [9:39:35<5:54:55, 5.48s/it] {'loss': 0.0079, 'grad_norm': 2.121962785720825, 'learning_rate': 1.3850470596099811e-05, 'epoch': 6.12} 
61%|██████ | 6115/10000 [9:39:35<5:54:55, 5.48s/it][2025-06-19 23:09:20,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:09:20,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.78 | bwd_microstep: 3316.10 | bwd_inner_microstep: 3315.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 23:09:20,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.78 | bwd: 3316.11 | bwd_inner: 3315.32 | bwd_allreduce: 0.75 | step: 6.63 61%|██████ | 6116/10000 [9:39:41<5:54:18, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004587284754961729, 'learning_rate': 1.3844307206680032e-05, 'epoch': 6.12} 61%|██████ | 6116/10000 [9:39:41<5:54:18, 5.47s/it][2025-06-19 23:09:26,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.73 [2025-06-19 23:09:26,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.96 | bwd_microstep: 3318.81 | bwd_inner_microstep: 3318.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.23 [2025-06-19 23:09:26,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.96 | bwd: 3318.82 | bwd_inner: 3318.01 | bwd_allreduce: 0.77 | step: 7.23 61%|██████ | 6117/10000 [9:39:46<5:54:02, 5.47s/it] {'loss': 0.0006, 'grad_norm': 0.11301089823246002, 'learning_rate': 1.3838144462963858e-05, 'epoch': 6.12} 61%|██████ | 6117/10000 [9:39:46<5:54:02, 5.47s/it][2025-06-19 23:09:31,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 23:09:31,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.62 | bwd_microstep: 3364.93 | bwd_inner_microstep: 3364.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 23:09:31,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2126.62 | bwd: 3364.94 | bwd_inner: 3364.15 | bwd_allreduce: 0.76 | step: 6.66 61%|██████ | 6118/10000 [9:39:52<5:55:03, 5.49s/it] {'loss': 0.0078, 'grad_norm': 1.2825123071670532, 'learning_rate': 1.383198236559773e-05, 'epoch': 6.12} 61%|██████ | 6118/10000 [9:39:52<5:55:03, 5.49s/it][2025-06-19 23:09:37,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 23:09:37,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.24 | bwd_microstep: 3312.16 | bwd_inner_microstep: 3311.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 23:09:37,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.24 | bwd: 3312.17 | bwd_inner: 3311.38 | bwd_allreduce: 0.75 | step: 6.58 61%|██████ | 6119/10000 [9:39:57<5:54:25, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.15215247869491577, 'learning_rate': 1.382582091522803e-05, 'epoch': 6.12} 61%|██████ | 6119/10000 [9:39:57<5:54:25, 5.48s/it][2025-06-19 23:09:42,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:09:42,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.74 | bwd_microstep: 3368.90 | bwd_inner_microstep: 3368.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 23:09:42,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.74 | bwd: 3368.92 | bwd_inner: 3368.12 | bwd_allreduce: 0.75 | step: 6.71 61%|██████ | 6120/10000 [9:40:03<5:55:18, 5.49s/it] {'loss': 0.001, 'grad_norm': 0.20513007044792175, 'learning_rate': 1.3819660112501054e-05, 'epoch': 6.12} 61%|██████ | 6120/10000 [9:40:03<5:55:18, 5.49s/it][2025-06-19 23:09:48,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:09:48,034] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2111.65 | bwd_microstep: 3322.62 | bwd_inner_microstep: 3321.44 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.91 [2025-06-19 23:09:48,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.65 | bwd: 3322.65 | bwd_inner: 3321.44 | bwd_allreduce: 1.15 | step: 7.91 61%|██████ | 6121/10000 [9:40:08<5:54:49, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.0077810026705265045, 'learning_rate': 1.381349995806305e-05, 'epoch': 6.12} 61%|██████ | 6121/10000 [9:40:08<5:54:49, 5.49s/it][2025-06-19 23:09:53,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:09:53,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.05 | bwd_microstep: 3374.34 | bwd_inner_microstep: 3373.52 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.08 [2025-06-19 23:09:53,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.05 | bwd: 3374.35 | bwd_inner: 3373.52 | bwd_allreduce: 0.79 | step: 7.08 61%|██████ | 6122/10000 [9:40:14<5:55:55, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.05359760671854019, 'learning_rate': 1.380734045256019e-05, 'epoch': 6.12} 61%|██████ | 6122/10000 [9:40:14<5:55:55, 5.51s/it][2025-06-19 23:09:59,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:09:59,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.94 | bwd_microstep: 3316.05 | bwd_inner_microstep: 3314.95 | bwd_allreduce_microstep: 1.03 | step_microstep: 8.12 [2025-06-19 23:09:59,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.94 | bwd: 3316.07 | bwd_inner: 3314.95 | bwd_allreduce: 1.06 | step: 8.12 61%|██████ | 6123/10000 [9:40:19<5:55:00, 5.49s/it] {'loss': 0.004, 'grad_norm': 0.8392695188522339, 'learning_rate': 1.3801181596638574e-05, 'epoch': 6.12} 61%|██████ | 6123/10000 [9:40:19<5:55:00, 
5.49s/it][2025-06-19 23:10:04,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:10:04,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.00 | bwd_microstep: 3314.55 | bwd_inner_microstep: 3313.77 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.66 [2025-06-19 23:10:04,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.01 | bwd: 3314.56 | bwd_inner: 3313.77 | bwd_allreduce: 0.75 | step: 6.66 61%|██████ | 6124/10000 [9:40:25<5:54:15, 5.48s/it] {'loss': 0.0055, 'grad_norm': 1.6282026767730713, 'learning_rate': 1.3795023390944247e-05, 'epoch': 6.12} 61%|██████ | 6124/10000 [9:40:25<5:54:15, 5.48s/it][2025-06-19 23:10:09,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:10:09,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.71 | bwd_microstep: 3314.88 | bwd_inner_microstep: 3314.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 23:10:09,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.71 | bwd: 3314.90 | bwd_inner: 3314.09 | bwd_allreduce: 0.77 | step: 7.07 61%|██████▏ | 6125/10000 [9:40:30<5:53:43, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.06965663284063339, 'learning_rate': 1.3788865836123158e-05, 'epoch': 6.12} 61%|██████▏ | 6125/10000 [9:40:30<5:53:43, 5.48s/it][2025-06-19 23:10:15,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:10:15,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.27 | bwd_microstep: 3314.82 | bwd_inner_microstep: 3313.80 | bwd_allreduce_microstep: 0.97 | step_microstep: 6.90 [2025-06-19 23:10:15,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.27 | bwd: 3314.84 | bwd_inner: 3313.80 
| bwd_allreduce: 0.99 | step: 6.90 61%|██████▏ | 6126/10000 [9:40:36<5:53:12, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.03772442787885666, 'learning_rate': 1.3782708932821218e-05, 'epoch': 6.13} 61%|██████▏ | 6126/10000 [9:40:36<5:53:12, 5.47s/it][2025-06-19 23:10:20,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:10:20,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.02 | bwd_microstep: 3314.49 | bwd_inner_microstep: 3313.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 23:10:20,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.02 | bwd: 3314.51 | bwd_inner: 3313.69 | bwd_allreduce: 0.77 | step: 7.14 61%|██████▏ | 6127/10000 [9:40:41<5:52:56, 5.47s/it] {'loss': 0.0247, 'grad_norm': 2.641786575317383, 'learning_rate': 1.3776552681684254e-05, 'epoch': 6.13} 61%|██████▏ | 6127/10000 [9:40:41<5:52:56, 5.47s/it][2025-06-19 23:10:26,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.70 | optimizer_step: 2.73 [2025-06-19 23:10:26,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.58 | bwd_microstep: 3317.79 | bwd_inner_microstep: 3317.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 23:10:26,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.58 | bwd: 3317.80 | bwd_inner: 3317.00 | bwd_allreduce: 0.77 | step: 6.95 61%|██████▏ | 6128/10000 [9:40:47<5:52:34, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.01956409588456154, 'learning_rate': 1.3770397083358032e-05, 'epoch': 6.13} 61%|██████▏ | 6128/10000 [9:40:47<5:52:34, 5.46s/it][2025-06-19 23:10:31,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:10:31,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.63 
| bwd_microstep: 3317.41 | bwd_inner_microstep: 3316.54 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.43 [2025-06-19 23:10:31,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.63 | bwd: 3317.43 | bwd_inner: 3316.54 | bwd_allreduce: 0.83 | step: 7.43 61%|██████▏ | 6129/10000 [9:40:52<5:52:32, 5.46s/it] {'loss': 0.0355, 'grad_norm': 6.733515739440918, 'learning_rate': 1.3764242138488246e-05, 'epoch': 6.13} 61%|██████▏ | 6129/10000 [9:40:52<5:52:32, 5.46s/it][2025-06-19 23:10:37,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:10:37,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.31 | bwd_microstep: 3310.18 | bwd_inner_microstep: 3309.35 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.89 [2025-06-19 23:10:37,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.31 | bwd: 3310.20 | bwd_inner: 3309.35 | bwd_allreduce: 0.79 | step: 6.89 61%|██████▏ | 6130/10000 [9:40:58<5:52:27, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.007127837743610144, 'learning_rate': 1.3758087847720515e-05, 'epoch': 6.13} 61%|██████▏ | 6130/10000 [9:40:58<5:52:27, 5.46s/it][2025-06-19 23:10:42,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:10:42,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.73 | bwd_microstep: 3315.96 | bwd_inner_microstep: 3315.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-19 23:10:42,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.73 | bwd: 3315.97 | bwd_inner: 3315.15 | bwd_allreduce: 0.78 | step: 7.19 61%|██████▏ | 6131/10000 [9:41:03<5:52:21, 5.46s/it] {'loss': 0.001, 'grad_norm': 0.10057229548692703, 'learning_rate': 1.3751934211700399e-05, 'epoch': 6.13} 61%|██████▏ | 6131/10000 [9:41:03<5:52:21, 5.46s/it][2025-06-19 
23:10:48,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:10:48,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.87 | bwd_microstep: 3322.11 | bwd_inner_microstep: 3321.05 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.11 [2025-06-19 23:10:48,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.87 | bwd: 3322.13 | bwd_inner: 3321.05 | bwd_allreduce: 1.03 | step: 8.12 61%|██████▏ | 6132/10000 [9:41:09<5:52:28, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0010524644749239087, 'learning_rate': 1.3745781231073388e-05, 'epoch': 6.13} 61%|██████▏ | 6132/10000 [9:41:09<5:52:28, 5.47s/it][2025-06-19 23:10:53,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:10:53,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.22 | bwd_microstep: 3319.26 | bwd_inner_microstep: 3318.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-19 23:10:53,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.22 | bwd: 3319.28 | bwd_inner: 3318.47 | bwd_allreduce: 0.77 | step: 6.97 61%|██████▏ | 6133/10000 [9:41:14<5:52:26, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.05468262732028961, 'learning_rate': 1.3739628906484897e-05, 'epoch': 6.13} 61%|██████▏ | 6133/10000 [9:41:14<5:52:26, 5.47s/it][2025-06-19 23:10:59,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:10:59,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.77 | bwd_microstep: 3361.31 | bwd_inner_microstep: 3360.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 23:10:59,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.77 | bwd: 3361.32 | bwd_inner: 3360.52 | bwd_allreduce: 
0.76 | step: 6.67 61%|██████▏ | 6134/10000 [9:41:20<5:53:20, 5.48s/it] {'loss': 0.0012, 'grad_norm': 0.4848146140575409, 'learning_rate': 1.3733477238580286e-05, 'epoch': 6.13} 61%|██████▏ | 6134/10000 [9:41:20<5:53:20, 5.48s/it][2025-06-19 23:11:04,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:11:04,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.91 | bwd_microstep: 3318.93 | bwd_inner_microstep: 3318.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 23:11:04,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.91 | bwd: 3318.95 | bwd_inner: 3318.14 | bwd_allreduce: 0.76 | step: 6.69 61%|██████▏ | 6135/10000 [9:41:25<5:52:41, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.040879663079977036, 'learning_rate': 1.3727326228004823e-05, 'epoch': 6.13} 61%|██████▏ | 6135/10000 [9:41:25<5:52:41, 5.48s/it][2025-06-19 23:11:10,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:11:10,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.04 | bwd_microstep: 3365.68 | bwd_inner_microstep: 3364.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-19 23:11:10,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.04 | bwd: 3365.70 | bwd_inner: 3364.87 | bwd_allreduce: 0.78 | step: 7.22 61%|██████▏ | 6136/10000 [9:41:30<5:53:38, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.014407322742044926, 'learning_rate': 1.3721175875403725e-05, 'epoch': 6.14} 61%|██████▏ | 6136/10000 [9:41:30<5:53:38, 5.49s/it][2025-06-19 23:11:15,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:11:15,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.83 | 
bwd_microstep: 3366.84 | bwd_inner_microstep: 3366.00 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.80 [2025-06-19 23:11:15,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.83 | bwd: 3366.85 | bwd_inner: 3366.00 | bwd_allreduce: 0.80 | step: 6.81 61%|██████▏ | 6137/10000 [9:41:36<5:54:30, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.008709062822163105, 'learning_rate': 1.3715026181422136e-05, 'epoch': 6.14} 61%|██████▏ | 6137/10000 [9:41:36<5:54:30, 5.51s/it][2025-06-19 23:11:21,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:11:21,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.88 | bwd_microstep: 3364.62 | bwd_inner_microstep: 3363.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.82 [2025-06-19 23:11:21,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.88 | bwd: 3364.63 | bwd_inner: 3363.83 | bwd_allreduce: 0.76 | step: 6.83 61%|██████▏ | 6138/10000 [9:41:42<5:54:51, 5.51s/it] {'loss': 0.0079, 'grad_norm': 2.347837209701538, 'learning_rate': 1.3708877146705133e-05, 'epoch': 6.14} 61%|██████▏ | 6138/10000 [9:41:42<5:54:51, 5.51s/it][2025-06-19 23:11:26,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:11:26,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.24 | bwd_microstep: 3317.19 | bwd_inner_microstep: 3316.15 | bwd_allreduce_microstep: 0.97 | step_microstep: 8.20 [2025-06-19 23:11:26,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.24 | bwd: 3317.21 | bwd_inner: 3316.15 | bwd_allreduce: 1.00 | step: 8.20 61%|██████▏ | 6139/10000 [9:41:47<5:53:45, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005729308468289673, 'learning_rate': 1.3702728771897721e-05, 'epoch': 6.14} 61%|██████▏ | 6139/10000 [9:41:47<5:53:45, 5.50s/it][2025-06-19 23:11:32,245] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.73 [2025-06-19 23:11:32,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.47 | bwd_microstep: 3364.82 | bwd_inner_microstep: 3364.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 23:11:32,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.47 | bwd: 3364.83 | bwd_inner: 3364.01 | bwd_allreduce: 0.78 | step: 7.12 61%|██████▏ | 6140/10000 [9:41:53<5:54:17, 5.51s/it] {'loss': 0.0016, 'grad_norm': 0.22864188253879547, 'learning_rate': 1.3696581057644834e-05, 'epoch': 6.14} 61%|██████▏ | 6140/10000 [9:41:53<5:54:17, 5.51s/it][2025-06-19 23:11:37,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:11:37,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.68 | bwd_microstep: 3314.51 | bwd_inner_microstep: 3313.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.09 [2025-06-19 23:11:37,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.68 | bwd: 3314.52 | bwd_inner: 3313.72 | bwd_allreduce: 0.76 | step: 7.10 61%|██████▏ | 6141/10000 [9:41:58<5:53:13, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.07943969964981079, 'learning_rate': 1.3690434004591335e-05, 'epoch': 6.14} 61%|██████▏ | 6141/10000 [9:41:58<5:53:13, 5.49s/it][2025-06-19 23:11:43,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:11:43,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.84 | bwd_microstep: 3318.53 | bwd_inner_microstep: 3317.69 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.96 [2025-06-19 23:11:43,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.84 | bwd: 3318.55 | bwd_inner: 3317.69 | bwd_allreduce: 0.81 | step: 
6.97 61%|██████▏ | 6142/10000 [9:42:03<5:52:30, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.016907798126339912, 'learning_rate': 1.3684287613382026e-05, 'epoch': 6.14} 61%|██████▏ | 6142/10000 [9:42:03<5:52:30, 5.48s/it][2025-06-19 23:11:48,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:11:48,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.66 | bwd_microstep: 3307.15 | bwd_inner_microstep: 3306.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 23:11:48,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.66 | bwd: 3307.16 | bwd_inner: 3306.36 | bwd_allreduce: 0.76 | step: 6.99 61%|██████▏ | 6143/10000 [9:42:09<5:51:48, 5.47s/it] {'loss': 0.0012, 'grad_norm': 0.2601931691169739, 'learning_rate': 1.3678141884661635e-05, 'epoch': 6.14} 61%|██████▏ | 6143/10000 [9:42:09<5:51:48, 5.47s/it][2025-06-19 23:11:54,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.90 [2025-06-19 23:11:54,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.68 | bwd_microstep: 3316.55 | bwd_inner_microstep: 3315.58 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.08 [2025-06-19 23:11:54,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.68 | bwd: 3316.57 | bwd_inner: 3315.58 | bwd_allreduce: 0.93 | step: 7.08 61%|██████▏ | 6144/10000 [9:42:14<5:51:33, 5.47s/it] {'loss': 0.002, 'grad_norm': 0.6302437782287598, 'learning_rate': 1.3671996819074824e-05, 'epoch': 6.14} 61%|██████▏ | 6144/10000 [9:42:14<5:51:33, 5.47s/it][2025-06-19 23:11:59,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:11:59,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.03 | bwd_microstep: 3323.98 | 
bwd_inner_microstep: 3323.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 23:11:59,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.03 | bwd: 3323.99 | bwd_inner: 3323.18 | bwd_allreduce: 0.76 | step: 6.71 61%|██████▏ | 6145/10000 [9:42:20<5:51:24, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.000841859495267272, 'learning_rate': 1.3665852417266173e-05, 'epoch': 6.14} 61%|██████▏ | 6145/10000 [9:42:20<5:51:24, 5.47s/it][2025-06-19 23:12:05,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:12:05,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.52 | bwd_microstep: 3314.21 | bwd_inner_microstep: 3313.34 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.44 [2025-06-19 23:12:05,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.52 | bwd: 3314.23 | bwd_inner: 3313.34 | bwd_allreduce: 0.83 | step: 7.44 61%|██████▏ | 6146/10000 [9:42:25<5:51:06, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.044941626489162445, 'learning_rate': 1.3659708679880207e-05, 'epoch': 6.15} 61%|██████▏ | 6146/10000 [9:42:25<5:51:06, 5.47s/it][2025-06-19 23:12:10,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:12:10,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.93 | bwd_microstep: 3307.81 | bwd_inner_microstep: 3307.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 23:12:10,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.93 | bwd: 3307.83 | bwd_inner: 3307.03 | bwd_allreduce: 0.76 | step: 6.77 61%|██████▏ | 6147/10000 [9:42:31<5:50:35, 5.46s/it] {'loss': 0.0005, 'grad_norm': 0.1715192198753357, 'learning_rate': 1.3653565607561375e-05, 'epoch': 6.15} 61%|██████▏ | 6147/10000 [9:42:31<5:50:35, 5.46s/it][2025-06-19 23:12:15,985] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:12:15,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.16 | bwd_microstep: 3365.42 | bwd_inner_microstep: 3364.59 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.72 [2025-06-19 23:12:15,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.16 | bwd: 3365.44 | bwd_inner: 3364.59 | bwd_allreduce: 0.80 | step: 6.72 61%|██████▏ | 6148/10000 [9:42:36<5:52:01, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.07566431909799576, 'learning_rate': 1.364742320095405e-05, 'epoch': 6.15} 61%|██████▏ | 6148/10000 [9:42:36<5:52:01, 5.48s/it][2025-06-19 23:12:21,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:12:21,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.19 | bwd_microstep: 3313.63 | bwd_inner_microstep: 3312.82 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 23:12:21,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.19 | bwd: 3313.65 | bwd_inner: 3312.82 | bwd_allreduce: 0.78 | step: 7.09 61%|██████▏ | 6149/10000 [9:42:42<5:51:19, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.020433321595191956, 'learning_rate': 1.3641281460702562e-05, 'epoch': 6.15} 61%|██████▏ | 6149/10000 [9:42:42<5:51:19, 5.47s/it][2025-06-19 23:12:26,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:12:26,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.73 | bwd_microstep: 3312.72 | bwd_inner_microstep: 3311.80 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.98 [2025-06-19 23:12:26,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.73 | bwd: 3312.74 | bwd_inner: 3311.80 | bwd_allreduce: 0.90 | step: 6.98 
62%|██████▏ | 6150/10000 [9:42:47<5:50:57, 5.47s/it] {'loss': 0.002, 'grad_norm': 0.9505004286766052, 'learning_rate': 1.3635140387451129e-05, 'epoch': 6.15} 62%|██████▏ | 6150/10000 [9:42:47<5:50:57, 5.47s/it][2025-06-19 23:12:32,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:12:32,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.14 | bwd_microstep: 3312.18 | bwd_inner_microstep: 3311.26 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.04 [2025-06-19 23:12:32,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.14 | bwd: 3312.20 | bwd_inner: 3311.26 | bwd_allreduce: 0.90 | step: 7.05 62%|██████▏ | 6151/10000 [9:42:53<5:50:41, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.02078181318938732, 'learning_rate': 1.3628999981843926e-05, 'epoch': 6.15} 62%|██████▏ | 6151/10000 [9:42:53<5:50:41, 5.47s/it][2025-06-19 23:12:37,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:12:37,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.07 | bwd_microstep: 3311.98 | bwd_inner_microstep: 3311.11 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.50 [2025-06-19 23:12:37,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.07 | bwd: 3312.00 | bwd_inner: 3311.11 | bwd_allreduce: 0.83 | step: 7.51 62%|██████▏ | 6152/10000 [9:42:58<5:50:33, 5.47s/it] {'loss': 0.0, 'grad_norm': 8.374256867682561e-05, 'learning_rate': 1.3622860244525058e-05, 'epoch': 6.15} 62%|██████▏ | 6152/10000 [9:42:58<5:50:33, 5.47s/it][2025-06-19 23:12:43,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:12:43,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.93 | bwd_microstep: 3309.75 | 
bwd_inner_microstep: 3308.77 | bwd_allreduce_microstep: 0.93 | step_microstep: 6.96 [2025-06-19 23:12:43,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.93 | bwd: 3309.76 | bwd_inner: 3308.77 | bwd_allreduce: 0.94 | step: 6.96 62%|██████▏ | 6153/10000 [9:43:04<5:50:16, 5.46s/it] {'loss': 0.0005, 'grad_norm': 0.056520767509937286, 'learning_rate': 1.3616721176138552e-05, 'epoch': 6.15} 62%|██████▏ | 6153/10000 [9:43:04<5:50:16, 5.46s/it][2025-06-19 23:12:48,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:12:48,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.47 | bwd_microstep: 3384.83 | bwd_inner_microstep: 3383.72 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.65 [2025-06-19 23:12:48,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.47 | bwd: 3384.86 | bwd_inner: 3383.72 | bwd_allreduce: 1.07 | step: 7.65 62%|██████▏ | 6154/10000 [9:43:09<5:51:57, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.007050614804029465, 'learning_rate': 1.3610582777328372e-05, 'epoch': 6.15} 62%|██████▏ | 6154/10000 [9:43:09<5:51:57, 5.49s/it][2025-06-19 23:12:54,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.75 [2025-06-19 23:12:54,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.28 | bwd_microstep: 3354.05 | bwd_inner_microstep: 3353.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 23:12:54,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.28 | bwd: 3354.07 | bwd_inner: 3353.26 | bwd_allreduce: 0.76 | step: 6.96 62%|██████▏ | 6155/10000 [9:43:15<5:52:30, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.6394698023796082, 'learning_rate': 1.3604445048738404e-05, 'epoch': 6.16} 62%|██████▏ | 6155/10000 [9:43:15<5:52:30, 5.50s/it][2025-06-19 23:12:59,808] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:12:59,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.49 | bwd_microstep: 3311.66 | bwd_inner_microstep: 3310.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 23:12:59,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.49 | bwd: 3311.68 | bwd_inner: 3310.87 | bwd_allreduce: 0.76 | step: 6.70 62%|██████▏ | 6156/10000 [9:43:20<5:51:25, 5.49s/it] {'loss': 0.0427, 'grad_norm': 5.677585601806641, 'learning_rate': 1.3598307991012467e-05, 'epoch': 6.16} 62%|██████▏ | 6156/10000 [9:43:20<5:51:25, 5.49s/it][2025-06-19 23:13:05,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:13:05,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.09 | bwd_microstep: 3356.62 | bwd_inner_microstep: 3355.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 23:13:05,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.09 | bwd: 3356.64 | bwd_inner: 3355.83 | bwd_allreduce: 0.76 | step: 6.85 62%|██████▏ | 6157/10000 [9:43:26<5:52:01, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.36987191438674927, 'learning_rate': 1.3592171604794309e-05, 'epoch': 6.16} 62%|██████▏ | 6157/10000 [9:43:26<5:52:01, 5.50s/it][2025-06-19 23:13:10,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:13:10,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.39 | bwd_microstep: 3358.33 | bwd_inner_microstep: 3357.41 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.97 [2025-06-19 23:13:10,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.39 | bwd: 3358.35 | bwd_inner: 3357.41 | bwd_allreduce: 0.89 | step: 6.97 
62%|██████▏ | 6158/10000 [9:43:31<5:52:21, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.013974180445075035, 'learning_rate': 1.3586035890727608e-05, 'epoch': 6.16} 62%|██████▏ | 6158/10000 [9:43:31<5:52:21, 5.50s/it][2025-06-19 23:13:16,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 23:13:16,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.23 | bwd_microstep: 3307.62 | bwd_inner_microstep: 3306.54 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.62 [2025-06-19 23:13:16,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.23 | bwd: 3307.64 | bwd_inner: 3306.54 | bwd_allreduce: 1.04 | step: 7.62 62%|██████▏ | 6159/10000 [9:43:37<5:51:20, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.022099914029240608, 'learning_rate': 1.3579900849455978e-05, 'epoch': 6.16} 62%|██████▏ | 6159/10000 [9:43:37<5:51:20, 5.49s/it][2025-06-19 23:13:21,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:13:21,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.37 | bwd_microstep: 3310.32 | bwd_inner_microstep: 3309.42 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.98 [2025-06-19 23:13:21,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.37 | bwd: 3310.33 | bwd_inner: 3309.42 | bwd_allreduce: 0.87 | step: 6.99 62%|██████▏ | 6160/10000 [9:43:42<5:50:39, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.02528713457286358, 'learning_rate': 1.3573766481622958e-05, 'epoch': 6.16} 62%|██████▏ | 6160/10000 [9:43:42<5:50:39, 5.48s/it][2025-06-19 23:13:27,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:13:27,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.09 | bwd_microstep: 3359.99 | 
bwd_inner_microstep: 3359.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 23:13:27,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.09 | bwd: 3360.01 | bwd_inner: 3359.21 | bwd_allreduce: 0.75 | step: 6.53 62%|██████▏ | 6161/10000 [9:43:48<5:51:25, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.07221610099077225, 'learning_rate': 1.3567632787872005e-05, 'epoch': 6.16} 62%|██████▏ | 6161/10000 [9:43:48<5:51:25, 5.49s/it][2025-06-19 23:13:32,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:13:32,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.78 | bwd_microstep: 3315.37 | bwd_inner_microstep: 3314.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 23:13:32,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.78 | bwd: 3315.38 | bwd_inner: 3314.57 | bwd_allreduce: 0.77 | step: 6.74 62%|██████▏ | 6162/10000 [9:43:53<5:50:45, 5.48s/it] {'loss': 0.0021, 'grad_norm': 0.8071189522743225, 'learning_rate': 1.3561499768846513e-05, 'epoch': 6.16} 62%|██████▏ | 6162/10000 [9:43:53<5:50:45, 5.48s/it][2025-06-19 23:13:38,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:13:38,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.76 | bwd_microstep: 3307.63 | bwd_inner_microstep: 3306.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.75 [2025-06-19 23:13:38,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.96 | bwd: 3307.65 | bwd_inner: 3306.83 | bwd_allreduce: 0.77 | step: 6.76 62%|██████▏ | 6163/10000 [9:43:58<5:49:55, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0015164593933150172, 'learning_rate': 1.3555367425189818e-05, 'epoch': 6.16} 62%|██████▏ | 6163/10000 [9:43:58<5:49:55, 5.47s/it][2025-06-19 23:13:43,646] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:13:43,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.95 | bwd_microstep: 3312.50 | bwd_inner_microstep: 3311.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 23:13:43,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.95 | bwd: 3312.51 | bwd_inner: 3311.69 | bwd_allreduce: 0.78 | step: 6.75 62%|██████▏ | 6164/10000 [9:44:04<5:49:30, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00334992166608572, 'learning_rate': 1.3549235757545165e-05, 'epoch': 6.16} 62%|██████▏ | 6164/10000 [9:44:04<5:49:30, 5.47s/it][2025-06-19 23:13:49,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:13:49,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.89 | bwd_microstep: 3362.42 | bwd_inner_microstep: 3361.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 23:13:49,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.90 | bwd: 3362.43 | bwd_inner: 3361.62 | bwd_allreduce: 0.76 | step: 6.68 62%|██████▏ | 6165/10000 [9:44:09<5:50:38, 5.49s/it] {'loss': 0.0645, 'grad_norm': 7.052885055541992, 'learning_rate': 1.354310476655575e-05, 'epoch': 6.17} 62%|██████▏ | 6165/10000 [9:44:09<5:50:38, 5.49s/it][2025-06-19 23:13:54,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:13:54,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.15 | bwd_microstep: 3373.72 | bwd_inner_microstep: 3372.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 23:13:54,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.15 | bwd: 3373.73 | bwd_inner: 3372.94 | bwd_allreduce: 0.75 | step: 6.58 
62%|██████▏ | 6166/10000 [9:44:15<5:51:32, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005088737234473228, 'learning_rate': 1.3536974452864673e-05, 'epoch': 6.17} 62%|██████▏ | 6166/10000 [9:44:15<5:51:32, 5.50s/it][2025-06-19 23:14:00,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:14:00,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.73 | bwd_microstep: 3310.91 | bwd_inner_microstep: 3310.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-19 23:14:00,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.73 | bwd: 3310.92 | bwd_inner: 3310.10 | bwd_allreduce: 0.77 | step: 7.13 62%|██████▏ | 6167/10000 [9:44:20<5:50:26, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.007183111272752285, 'learning_rate': 1.353084481711498e-05, 'epoch': 6.17} 62%|██████▏ | 6167/10000 [9:44:20<5:50:26, 5.49s/it][2025-06-19 23:14:05,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:14:05,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.46 | bwd_microstep: 3303.42 | bwd_inner_microstep: 3302.58 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.71 [2025-06-19 23:14:05,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.46 | bwd: 3303.44 | bwd_inner: 3302.58 | bwd_allreduce: 0.80 | step: 6.73 62%|██████▏ | 6168/10000 [9:44:26<5:49:34, 5.47s/it] {'loss': 0.0006, 'grad_norm': 0.09369296580553055, 'learning_rate': 1.352471585994964e-05, 'epoch': 6.17} 62%|██████▏ | 6168/10000 [9:44:26<5:49:34, 5.47s/it][2025-06-19 23:14:11,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:14:11,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.51 | bwd_microstep: 3308.97 | 
bwd_inner_microstep: 3308.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 23:14:11,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.51 | bwd: 3308.98 | bwd_inner: 3308.19 | bwd_allreduce: 0.75 | step: 6.62 62%|██████▏ | 6169/10000 [9:44:31<5:49:04, 5.47s/it] {'loss': 0.1881, 'grad_norm': 4.524322509765625, 'learning_rate': 1.3518587582011553e-05, 'epoch': 6.17} 62%|██████▏ | 6169/10000 [9:44:31<5:49:04, 5.47s/it][2025-06-19 23:14:16,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:14:16,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.46 | bwd_microstep: 3314.16 | bwd_inner_microstep: 3313.14 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.13 [2025-06-19 23:14:16,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.46 | bwd: 3314.18 | bwd_inner: 3313.14 | bwd_allreduce: 1.00 | step: 7.13 62%|██████▏ | 6170/10000 [9:44:37<5:48:49, 5.46s/it] {'loss': 0.0025, 'grad_norm': 0.3437742292881012, 'learning_rate': 1.3512459983943557e-05, 'epoch': 6.17} 62%|██████▏ | 6170/10000 [9:44:37<5:48:49, 5.46s/it][2025-06-19 23:14:22,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:14:22,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.99 | bwd_microstep: 3350.80 | bwd_inner_microstep: 3349.77 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.95 [2025-06-19 23:14:22,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.99 | bwd: 3350.82 | bwd_inner: 3349.77 | bwd_allreduce: 0.99 | step: 7.96 62%|██████▏ | 6171/10000 [9:44:42<5:49:42, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.07360417395830154, 'learning_rate': 1.3506333066388391e-05, 'epoch': 6.17} 62%|██████▏ | 6171/10000 [9:44:42<5:49:42, 5.48s/it][2025-06-19 23:14:27,497] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:14:27,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.63 | bwd_microstep: 3316.34 | bwd_inner_microstep: 3315.47 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.14 [2025-06-19 23:14:27,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.63 | bwd: 3316.37 | bwd_inner: 3315.47 | bwd_allreduce: 0.84 | step: 7.14 62%|██████▏ | 6172/10000 [9:44:48<5:49:20, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.09988315403461456, 'learning_rate': 1.3500206829988748e-05, 'epoch': 6.17} 62%|██████▏ | 6172/10000 [9:44:48<5:49:20, 5.48s/it][2025-06-19 23:14:33,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:14:33,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.55 | bwd_microstep: 3356.57 | bwd_inner_microstep: 3355.49 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.54 [2025-06-19 23:14:33,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.55 | bwd: 3356.61 | bwd_inner: 3355.49 | bwd_allreduce: 1.04 | step: 7.54 62%|██████▏ | 6173/10000 [9:44:53<5:50:16, 5.49s/it] {'loss': 0.0119, 'grad_norm': 4.028948783874512, 'learning_rate': 1.3494081275387243e-05, 'epoch': 6.17} 62%|██████▏ | 6173/10000 [9:44:53<5:50:16, 5.49s/it][2025-06-19 23:14:38,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:14:38,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.16 | bwd_microstep: 3316.39 | bwd_inner_microstep: 3315.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 23:14:38,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.16 | bwd: 3316.40 | bwd_inner: 3315.60 | bwd_allreduce: 0.77 | step: 6.75 
62%|██████▏ | 6174/10000 [9:44:59<5:49:32, 5.48s/it] {'loss': 0.0034, 'grad_norm': 0.5116754174232483, 'learning_rate': 1.3487956403226413e-05, 'epoch': 6.17} 62%|██████▏ | 6174/10000 [9:44:59<5:49:32, 5.48s/it][2025-06-19 23:14:43,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.72 [2025-06-19 23:14:43,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.67 | bwd_microstep: 3321.48 | bwd_inner_microstep: 3320.27 | bwd_allreduce_microstep: 1.14 | step_microstep: 8.92 [2025-06-19 23:14:43,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.67 | bwd: 3321.49 | bwd_inner: 3320.27 | bwd_allreduce: 1.17 | step: 8.92 62%|██████▏ | 6175/10000 [9:45:04<5:49:17, 5.48s/it] {'loss': 0.0119, 'grad_norm': 2.652724027633667, 'learning_rate': 1.3481832214148744e-05, 'epoch': 6.17} 62%|██████▏ | 6175/10000 [9:45:04<5:49:17, 5.48s/it][2025-06-19 23:14:49,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:14:49,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.76 | bwd_microstep: 3316.29 | bwd_inner_microstep: 3315.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 23:14:49,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.76 | bwd: 3316.30 | bwd_inner: 3315.51 | bwd_allreduce: 0.76 | step: 6.63 62%|██████▏ | 6176/10000 [9:45:10<5:49:03, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0004598663072101772, 'learning_rate': 1.3475708708796615e-05, 'epoch': 6.18} 62%|██████▏ | 6176/10000 [9:45:10<5:49:03, 5.48s/it][2025-06-19 23:14:54,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:14:54,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.71 | bwd_microstep: 3364.23 | 
bwd_inner_microstep: 3363.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 23:14:54,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.71 | bwd: 3364.24 | bwd_inner: 3363.44 | bwd_allreduce: 0.76 | step: 6.84 62%|██████▏ | 6177/10000 [9:45:15<5:50:00, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.011100009083747864, 'learning_rate': 1.3469585887812367e-05, 'epoch': 6.18} 62%|██████▏ | 6177/10000 [9:45:15<5:50:00, 5.49s/it][2025-06-19 23:15:00,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:15:00,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.25 | bwd_microstep: 3313.07 | bwd_inner_microstep: 3312.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 23:15:00,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.25 | bwd: 3313.08 | bwd_inner: 3312.27 | bwd_allreduce: 0.77 | step: 6.83 62%|██████▏ | 6178/10000 [9:45:21<5:49:14, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.1783929318189621, 'learning_rate': 1.3463463751838246e-05, 'epoch': 6.18} 62%|██████▏ | 6178/10000 [9:45:21<5:49:14, 5.48s/it][2025-06-19 23:15:05,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:15:05,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.17 | bwd_microstep: 3319.14 | bwd_inner_microstep: 3318.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 23:15:05,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.17 | bwd: 3319.15 | bwd_inner: 3318.34 | bwd_allreduce: 0.76 | step: 6.79 62%|██████▏ | 6179/10000 [9:45:26<5:48:42, 5.48s/it] {'loss': 0.0016, 'grad_norm': 0.1615324318408966, 'learning_rate': 1.3457342301516444e-05, 'epoch': 6.18} 62%|██████▏ | 6179/10000 [9:45:26<5:48:42, 5.48s/it][2025-06-19 23:15:11,343] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:15:11,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.70 | bwd_microstep: 3315.84 | bwd_inner_microstep: 3315.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 23:15:11,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.70 | bwd: 3315.85 | bwd_inner: 3315.05 | bwd_allreduce: 0.76 | step: 6.60 62%|██████▏ | 6180/10000 [9:45:32<5:48:23, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.01068862248212099, 'learning_rate': 1.345122153748907e-05, 'epoch': 6.18} 62%|██████▏ | 6180/10000 [9:45:32<5:48:23, 5.47s/it][2025-06-19 23:15:16,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:15:16,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.65 | bwd_microstep: 3369.99 | bwd_inner_microstep: 3369.03 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.10 [2025-06-19 23:15:16,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.65 | bwd: 3370.00 | bwd_inner: 3369.03 | bwd_allreduce: 0.92 | step: 7.10 62%|██████▏ | 6181/10000 [9:45:37<5:49:38, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.009189530275762081, 'learning_rate': 1.344510146039816e-05, 'epoch': 6.18} 62%|██████▏ | 6181/10000 [9:45:37<5:49:38, 5.49s/it][2025-06-19 23:15:22,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:15:22,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.08 | bwd_microstep: 3317.20 | bwd_inner_microstep: 3316.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 23:15:22,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.08 | bwd: 3317.22 | bwd_inner: 3316.40 | bwd_allreduce: 0.77 | step: 6.88 
62%|██████▏ | 6182/10000 [9:45:43<5:49:01, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.1464623361825943, 'learning_rate': 1.3438982070885684e-05, 'epoch': 6.18} 62%|██████▏ | 6182/10000 [9:45:43<5:49:01, 5.48s/it][2025-06-19 23:15:27,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:15:27,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.74 | bwd_microstep: 3319.16 | bwd_inner_microstep: 3318.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 23:15:27,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.74 | bwd: 3319.18 | bwd_inner: 3318.17 | bwd_allreduce: 0.97 | step: 7.01 62%|██████▏ | 6183/10000 [9:45:48<5:48:35, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.005703811999410391, 'learning_rate': 1.3432863369593538e-05, 'epoch': 6.18} 62%|██████▏ | 6183/10000 [9:45:48<5:48:35, 5.48s/it][2025-06-19 23:15:33,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:15:33,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.89 | bwd_microstep: 3363.09 | bwd_inner_microstep: 3362.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 23:15:33,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.89 | bwd: 3363.10 | bwd_inner: 3362.30 | bwd_allreduce: 0.76 | step: 6.66 62%|██████▏ | 6184/10000 [9:45:54<5:49:27, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.06836222112178802, 'learning_rate': 1.3426745357163546e-05, 'epoch': 6.18} 62%|██████▏ | 6184/10000 [9:45:54<5:49:27, 5.49s/it][2025-06-19 23:15:38,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:15:38,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.77 | bwd_microstep: 3316.76 | 
bwd_inner_microstep: 3315.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 23:15:38,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.77 | bwd: 3316.78 | bwd_inner: 3315.98 | bwd_allreduce: 0.76 | step: 6.72 62%|██████▏ | 6185/10000 [9:45:59<5:48:39, 5.48s/it] {'loss': 0.0426, 'grad_norm': 3.683759927749634, 'learning_rate': 1.3420628034237468e-05, 'epoch': 6.18} 62%|██████▏ | 6185/10000 [9:45:59<5:48:39, 5.48s/it][2025-06-19 23:15:44,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:15:44,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.59 | bwd_microstep: 3314.34 | bwd_inner_microstep: 3313.37 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.24 [2025-06-19 23:15:44,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.59 | bwd: 3314.36 | bwd_inner: 3313.37 | bwd_allreduce: 0.94 | step: 7.24 62%|██████▏ | 6186/10000 [9:46:05<5:48:05, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.003982941620051861, 'learning_rate': 1.3414511401456964e-05, 'epoch': 6.19} 62%|██████▏ | 6186/10000 [9:46:05<5:48:05, 5.48s/it][2025-06-19 23:15:49,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:15:49,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.08 | bwd_microstep: 3372.47 | bwd_inner_microstep: 3371.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 23:15:49,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.08 | bwd: 3372.49 | bwd_inner: 3371.67 | bwd_allreduce: 0.77 | step: 7.07 62%|██████▏ | 6187/10000 [9:46:10<5:49:19, 5.50s/it] {'loss': 0.0051, 'grad_norm': 1.799606442451477, 'learning_rate': 1.340839545946365e-05, 'epoch': 6.19} 62%|██████▏ | 6187/10000 [9:46:10<5:49:19, 5.50s/it][2025-06-19 23:15:55,388] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:15:55,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.41 | bwd_microstep: 3402.89 | bwd_inner_microstep: 3402.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-19 23:15:55,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.41 | bwd: 3402.90 | bwd_inner: 3402.10 | bwd_allreduce: 0.76 | step: 6.74 62%|██████▏ | 6188/10000 [9:46:16<5:50:45, 5.52s/it] {'loss': 0.002, 'grad_norm': 0.3935287892818451, 'learning_rate': 1.3402280208899061e-05, 'epoch': 6.19} 62%|██████▏ | 6188/10000 [9:46:16<5:50:45, 5.52s/it][2025-06-19 23:16:00,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:16:00,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.72 | bwd_microstep: 3318.70 | bwd_inner_microstep: 3317.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 23:16:00,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.72 | bwd: 3318.71 | bwd_inner: 3317.92 | bwd_allreduce: 0.76 | step: 6.71 62%|██████▏ | 6189/10000 [9:46:21<5:49:35, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.04494724050164223, 'learning_rate': 1.3396165650404655e-05, 'epoch': 6.19} 62%|██████▏ | 6189/10000 [9:46:21<5:49:35, 5.50s/it][2025-06-19 23:16:06,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:16:06,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.46 | bwd_microstep: 3318.54 | bwd_inner_microstep: 3317.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 23:16:06,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.46 | bwd: 3318.56 | bwd_inner: 3317.76 | bwd_allreduce: 0.75 | step: 6.55 
62%|██████▏ | 6190/10000 [9:46:27<5:48:43, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.07101314514875412, 'learning_rate': 1.3390051784621827e-05, 'epoch': 6.19} 62%|██████▏ | 6190/10000 [9:46:27<5:48:43, 5.49s/it][2025-06-19 23:16:11,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:16:11,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.43 | bwd_microstep: 3319.17 | bwd_inner_microstep: 3318.33 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.11 [2025-06-19 23:16:11,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.43 | bwd: 3319.19 | bwd_inner: 3318.33 | bwd_allreduce: 0.80 | step: 7.11 62%|██████▏ | 6191/10000 [9:46:32<5:48:07, 5.48s/it] {'loss': 0.0033, 'grad_norm': 0.4692719876766205, 'learning_rate': 1.3383938612191887e-05, 'epoch': 6.19} 62%|██████▏ | 6191/10000 [9:46:32<5:48:07, 5.48s/it][2025-06-19 23:16:17,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:16:17,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.87 | bwd_microstep: 3320.49 | bwd_inner_microstep: 3319.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 23:16:17,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.87 | bwd: 3320.50 | bwd_inner: 3319.70 | bwd_allreduce: 0.76 | step: 6.89 62%|██████▏ | 6192/10000 [9:46:38<5:47:42, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0041443887166678905, 'learning_rate': 1.337782613375608e-05, 'epoch': 6.19} 62%|██████▏ | 6192/10000 [9:46:38<5:47:42, 5.48s/it][2025-06-19 23:16:22,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:16:22,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.03 | bwd_microstep: 3373.13 | 
bwd_inner_microstep: 3372.09 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.49 [2025-06-19 23:16:22,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.03 | bwd: 3373.15 | bwd_inner: 3372.09 | bwd_allreduce: 1.01 | step: 7.49 62%|██████▏ | 6193/10000 [9:46:43<5:48:45, 5.50s/it] {'loss': 0.0009, 'grad_norm': 0.14549174904823303, 'learning_rate': 1.3371714349955576e-05, 'epoch': 6.19} 62%|██████▏ | 6193/10000 [9:46:43<5:48:45, 5.50s/it][2025-06-19 23:16:28,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:16:28,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.34 | bwd_microstep: 3367.92 | bwd_inner_microstep: 3367.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 23:16:28,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.34 | bwd: 3367.94 | bwd_inner: 3367.12 | bwd_allreduce: 0.77 | step: 6.83 62%|██████▏ | 6194/10000 [9:46:49<5:49:30, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.007370163686573505, 'learning_rate': 1.3365603261431474e-05, 'epoch': 6.19} 62%|██████▏ | 6194/10000 [9:46:49<5:49:30, 5.51s/it][2025-06-19 23:16:33,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 23:16:33,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.85 | bwd_microstep: 3367.95 | bwd_inner_microstep: 3367.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 23:16:33,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.85 | bwd: 3367.96 | bwd_inner: 3367.15 | bwd_allreduce: 0.77 | step: 7.01 62%|██████▏ | 6195/10000 [9:46:54<5:49:57, 5.52s/it] {'loss': 0.002, 'grad_norm': 0.6226342916488647, 'learning_rate': 1.3359492868824809e-05, 'epoch': 6.2} 62%|██████▏ | 6195/10000 [9:46:54<5:49:57, 5.52s/it][2025-06-19 23:16:39,329] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:16:39,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.14 | bwd_microstep: 3320.45 | bwd_inner_microstep: 3319.67 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.55 [2025-06-19 23:16:39,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.14 | bwd: 3320.47 | bwd_inner: 3319.67 | bwd_allreduce: 0.75 | step: 6.55 62%|██████▏ | 6196/10000 [9:47:00<5:48:49, 5.50s/it] {'loss': 0.0082, 'grad_norm': 1.6798914670944214, 'learning_rate': 1.335338317277651e-05, 'epoch': 6.2} 62%|██████▏ | 6196/10000 [9:47:00<5:48:49, 5.50s/it][2025-06-19 23:16:44,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:16:44,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.55 | bwd_microstep: 3321.65 | bwd_inner_microstep: 3320.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 23:16:44,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.54 | bwd: 3321.66 | bwd_inner: 3320.87 | bwd_allreduce: 0.75 | step: 6.60 62%|██████▏ | 6197/10000 [9:47:05<5:48:01, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.00736148189753294, 'learning_rate': 1.3347274173927471e-05, 'epoch': 6.2} 62%|██████▏ | 6197/10000 [9:47:05<5:48:01, 5.49s/it][2025-06-19 23:16:50,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:16:50,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.49 | bwd_microstep: 3323.97 | bwd_inner_microstep: 3323.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 23:16:50,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.49 | bwd: 3323.99 | bwd_inner: 3323.18 | bwd_allreduce: 0.76 | step: 6.81 
62%|██████▏ | 6198/10000 [9:47:11<5:47:30, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.009725164622068405, 'learning_rate': 1.3341165872918497e-05, 'epoch': 6.2} 62%|██████▏ | 6198/10000 [9:47:11<5:47:30, 5.48s/it][2025-06-19 23:16:55,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:16:55,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.99 | bwd_microstep: 3377.55 | bwd_inner_microstep: 3376.71 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.23 [2025-06-19 23:16:55,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.99 | bwd: 3377.57 | bwd_inner: 3376.71 | bwd_allreduce: 0.81 | step: 7.23 62%|██████▏ | 6199/10000 [9:47:16<5:48:53, 5.51s/it] {'loss': 0.0021, 'grad_norm': 0.44266170263290405, 'learning_rate': 1.3335058270390315e-05, 'epoch': 6.2} 62%|██████▏ | 6199/10000 [9:47:16<5:48:53, 5.51s/it][2025-06-19 23:17:01,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:17:01,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.53 | bwd_microstep: 3316.61 | bwd_inner_microstep: 3315.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:17:01,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.53 | bwd: 3316.62 | bwd_inner: 3315.83 | bwd_allreduce: 0.75 | step: 6.66 62%|██████▏ | 6200/10000 [9:47:22<5:48:01, 5.50s/it] {'loss': 0.0329, 'grad_norm': 4.350081443786621, 'learning_rate': 1.3328951366983594e-05, 'epoch': 6.2} 62%|██████▏ | 6200/10000 [9:47:22<5:48:01, 5.50s/it][2025-06-19 23:17:06,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:17:06,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.94 | bwd_microstep: 3372.83 | 
bwd_inner_microstep: 3372.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 23:17:06,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.94 | bwd: 3372.84 | bwd_inner: 3372.04 | bwd_allreduce: 0.76 | step: 6.60 62%|██████▏ | 6201/10000 [9:47:27<5:48:51, 5.51s/it] {'loss': 0.0011, 'grad_norm': 0.1778804212808609, 'learning_rate': 1.3322845163338907e-05, 'epoch': 6.2} 62%|██████▏ | 6201/10000 [9:47:27<5:48:51, 5.51s/it][2025-06-19 23:17:12,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:17:12,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.79 | bwd_microstep: 3380.88 | bwd_inner_microstep: 3380.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 23:17:12,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.79 | bwd: 3380.89 | bwd_inner: 3380.10 | bwd_allreduce: 0.75 | step: 6.61 62%|██████▏ | 6202/10000 [9:47:33<5:49:34, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.001341356779448688, 'learning_rate': 1.3316739660096773e-05, 'epoch': 6.2} 62%|██████▏ | 6202/10000 [9:47:33<5:49:34, 5.52s/it][2025-06-19 23:17:17,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.90 [2025-06-19 23:17:17,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.91 | bwd_microstep: 3410.75 | bwd_inner_microstep: 3409.93 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.00 [2025-06-19 23:17:17,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.91 | bwd: 3410.77 | bwd_inner: 3409.93 | bwd_allreduce: 0.80 | step: 7.00 62%|██████▏ | 6203/10000 [9:47:38<5:50:45, 5.54s/it] {'loss': 0.0001, 'grad_norm': 0.009410626254975796, 'learning_rate': 1.3310634857897631e-05, 'epoch': 6.2} 62%|██████▏ | 6203/10000 [9:47:38<5:50:45, 5.54s/it][2025-06-19 23:17:23,537] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:17:23,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.16 | bwd_microstep: 3386.26 | bwd_inner_microstep: 3385.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 23:17:23,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.16 | bwd: 3386.27 | bwd_inner: 3385.47 | bwd_allreduce: 0.76 | step: 6.68 62%|██████▏ | 6204/10000 [9:47:44<5:51:00, 5.55s/it] {'loss': 0.0532, 'grad_norm': 11.381250381469727, 'learning_rate': 1.3304530757381847e-05, 'epoch': 6.2} 62%|██████▏ | 6204/10000 [9:47:44<5:51:00, 5.55s/it][2025-06-19 23:17:29,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:17:29,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.56 | bwd_microstep: 3406.39 | bwd_inner_microstep: 3405.41 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.27 [2025-06-19 23:17:29,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.56 | bwd: 3406.40 | bwd_inner: 3405.41 | bwd_allreduce: 0.95 | step: 7.27 62%|██████▏ | 6205/10000 [9:47:49<5:51:42, 5.56s/it] {'loss': 0.0, 'grad_norm': 0.004581758286803961, 'learning_rate': 1.3298427359189719e-05, 'epoch': 6.21} 62%|██████▏ | 6205/10000 [9:47:49<5:51:42, 5.56s/it][2025-06-19 23:17:34,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:17:34,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.81 | bwd_microstep: 3326.27 | bwd_inner_microstep: 3325.48 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.82 [2025-06-19 23:17:34,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.81 | bwd: 3326.28 | bwd_inner: 3325.48 | bwd_allreduce: 0.76 | step: 6.82 
62%|██████▏ | 6206/10000 [9:47:55<5:49:57, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.005458840634673834, 'learning_rate': 1.3292324663961452e-05, 'epoch': 6.21} 62%|██████▏ | 6206/10000 [9:47:55<5:49:57, 5.53s/it][2025-06-19 23:17:40,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:17:40,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.27 | bwd_microstep: 3323.05 | bwd_inner_microstep: 3322.07 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.16 [2025-06-19 23:17:40,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.27 | bwd: 3323.07 | bwd_inner: 3322.07 | bwd_allreduce: 0.94 | step: 7.16 62%|██████▏ | 6207/10000 [9:48:00<5:48:44, 5.52s/it] {'loss': 0.0037, 'grad_norm': 0.4235774278640747, 'learning_rate': 1.3286222672337195e-05, 'epoch': 6.21} 62%|██████▏ | 6207/10000 [9:48:00<5:48:44, 5.52s/it][2025-06-19 23:17:45,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:17:45,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.41 | bwd_microstep: 3378.25 | bwd_inner_microstep: 3377.35 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.86 [2025-06-19 23:17:45,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.41 | bwd: 3378.26 | bwd_inner: 3377.35 | bwd_allreduce: 0.87 | step: 6.86 62%|██████▏ | 6208/10000 [9:48:06<5:49:23, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.010549698024988174, 'learning_rate': 1.3280121384957019e-05, 'epoch': 6.21} 62%|██████▏ | 6208/10000 [9:48:06<5:49:23, 5.53s/it][2025-06-19 23:17:51,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:17:51,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.16 | bwd_microstep: 3383.21 | 
bwd_inner_microstep: 3382.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 23:17:51,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.16 | bwd: 3383.23 | bwd_inner: 3382.43 | bwd_allreduce: 0.76 | step: 6.60 62%|██████▏ | 6209/10000 [9:48:11<5:49:52, 5.54s/it] {'loss': 0.0002, 'grad_norm': 0.024171942844986916, 'learning_rate': 1.3274020802460918e-05, 'epoch': 6.21} 62%|██████▏ | 6209/10000 [9:48:11<5:49:52, 5.54s/it][2025-06-19 23:17:56,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:17:56,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.24 | bwd_microstep: 3317.48 | bwd_inner_microstep: 3316.52 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.10 [2025-06-19 23:17:56,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.24 | bwd: 3317.49 | bwd_inner: 3316.52 | bwd_allreduce: 0.92 | step: 7.11 62%|██████▏ | 6210/10000 [9:48:17<5:48:18, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.02389034815132618, 'learning_rate': 1.326792092548883e-05, 'epoch': 6.21} 62%|██████▏ | 6210/10000 [9:48:17<5:48:18, 5.51s/it][2025-06-19 23:18:02,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:18:02,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.56 | bwd_microstep: 3372.57 | bwd_inner_microstep: 3371.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 23:18:02,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.56 | bwd: 3372.58 | bwd_inner: 3371.78 | bwd_allreduce: 0.76 | step: 6.64 62%|██████▏ | 6211/10000 [9:48:22<5:48:53, 5.52s/it] {'loss': 0.0104, 'grad_norm': 1.0833945274353027, 'learning_rate': 1.3261821754680589e-05, 'epoch': 6.21} 62%|██████▏ | 6211/10000 [9:48:22<5:48:53, 5.52s/it][2025-06-19 23:18:07,747] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:18:07,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.85 | bwd_microstep: 3378.07 | bwd_inner_microstep: 3377.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-19 23:18:07,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.85 | bwd: 3378.08 | bwd_inner: 3377.26 | bwd_allreduce: 0.78 | step: 7.25 62%|██████▏ | 6212/10000 [9:48:28<5:49:16, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0004423221980687231, 'learning_rate': 1.3255723290675968e-05, 'epoch': 6.21} 62%|██████▏ | 6212/10000 [9:48:28<5:49:16, 5.53s/it][2025-06-19 23:18:13,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:18:13,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.98 | bwd_microstep: 3321.24 | bwd_inner_microstep: 3320.35 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.13 [2025-06-19 23:18:13,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.98 | bwd: 3321.26 | bwd_inner: 3320.35 | bwd_allreduce: 0.85 | step: 7.14 62%|██████▏ | 6213/10000 [9:48:34<5:48:04, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.05709286779165268, 'learning_rate': 1.3249625534114676e-05, 'epoch': 6.21} 62%|██████▏ | 6213/10000 [9:48:34<5:48:04, 5.51s/it][2025-06-19 23:18:18,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 23:18:18,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.86 | bwd_microstep: 3323.78 | bwd_inner_microstep: 3322.63 | bwd_allreduce_microstep: 1.08 | step_microstep: 8.15 [2025-06-19 23:18:18,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.86 | bwd: 3323.80 | bwd_inner: 3322.63 | bwd_allreduce: 1.11 | step: 8.16 
62%|██████▏ | 6214/10000 [9:48:39<5:47:18, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.026037193834781647, 'learning_rate': 1.3243528485636335e-05, 'epoch': 6.21} 62%|██████▏ | 6214/10000 [9:48:39<5:47:18, 5.50s/it][2025-06-19 23:18:24,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:18:24,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.86 | bwd_microstep: 3319.99 | bwd_inner_microstep: 3319.02 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.37 [2025-06-19 23:18:24,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.86 | bwd: 3320.01 | bwd_inner: 3319.02 | bwd_allreduce: 0.93 | step: 7.38 62%|██████▏ | 6215/10000 [9:48:44<5:46:32, 5.49s/it] {'loss': 0.0244, 'grad_norm': 5.038841247558594, 'learning_rate': 1.3237432145880496e-05, 'epoch': 6.21} 62%|██████▏ | 6215/10000 [9:48:44<5:46:32, 5.49s/it][2025-06-19 23:18:29,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:18:29,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.48 | bwd_microstep: 3364.53 | bwd_inner_microstep: 3363.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 23:18:29,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.48 | bwd: 3364.54 | bwd_inner: 3363.75 | bwd_allreduce: 0.76 | step: 6.67 62%|██████▏ | 6216/10000 [9:48:50<5:47:12, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.08722402900457382, 'learning_rate': 1.3231336515486644e-05, 'epoch': 6.22} 62%|██████▏ | 6216/10000 [9:48:50<5:47:12, 5.51s/it][2025-06-19 23:18:35,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:18:35,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.77 | bwd_microstep: 3406.81 | 
bwd_inner_microstep: 3406.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 23:18:35,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.77 | bwd: 3406.82 | bwd_inner: 3406.02 | bwd_allreduce: 0.76 | step: 6.76 62%|██████▏ | 6217/10000 [9:48:56<5:48:39, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.020705049857497215, 'learning_rate': 1.3225241595094173e-05, 'epoch': 6.22} 62%|██████▏ | 6217/10000 [9:48:56<5:48:39, 5.53s/it][2025-06-19 23:18:40,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-19 23:18:40,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.20 | bwd_microstep: 3334.31 | bwd_inner_microstep: 3333.37 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.18 [2025-06-19 23:18:40,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.20 | bwd: 3334.33 | bwd_inner: 3333.37 | bwd_allreduce: 0.91 | step: 7.18 62%|██████▏ | 6218/10000 [9:49:01<5:47:44, 5.52s/it] {'loss': 0.002, 'grad_norm': 0.40336328744888306, 'learning_rate': 1.3219147385342419e-05, 'epoch': 6.22} 62%|██████▏ | 6218/10000 [9:49:01<5:47:44, 5.52s/it][2025-06-19 23:18:46,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:18:46,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.52 | bwd_microstep: 3372.40 | bwd_inner_microstep: 3371.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 23:18:46,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.52 | bwd: 3372.41 | bwd_inner: 3371.61 | bwd_allreduce: 0.77 | step: 6.82 62%|██████▏ | 6219/10000 [9:49:07<5:48:04, 5.52s/it] {'loss': 0.0015, 'grad_norm': 0.23967769742012024, 'learning_rate': 1.3213053886870628e-05, 'epoch': 6.22} 62%|██████▏ | 6219/10000 [9:49:07<5:48:04, 5.52s/it][2025-06-19 23:18:51,781] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:18:51,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.89 | bwd_microstep: 3318.11 | bwd_inner_microstep: 3317.14 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.64 [2025-06-19 23:18:51,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.89 | bwd: 3318.12 | bwd_inner: 3317.14 | bwd_allreduce: 0.94 | step: 6.64 62%|██████▏ | 6220/10000 [9:49:12<5:46:52, 5.51s/it] {'loss': 0.0025, 'grad_norm': 0.5167974829673767, 'learning_rate': 1.320696110031799e-05, 'epoch': 6.22} 62%|██████▏ | 6220/10000 [9:49:12<5:46:52, 5.51s/it][2025-06-19 23:18:57,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:18:57,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.27 | bwd_microstep: 3327.49 | bwd_inner_microstep: 3326.47 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.02 [2025-06-19 23:18:57,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.27 | bwd: 3327.52 | bwd_inner: 3326.47 | bwd_allreduce: 0.98 | step: 7.01 62%|██████▏ | 6221/10000 [9:49:18<5:46:11, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.12282097339630127, 'learning_rate': 1.3200869026323613e-05, 'epoch': 6.22} 62%|██████▏ | 6221/10000 [9:49:18<5:46:11, 5.50s/it][2025-06-19 23:19:02,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:19:02,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.80 | bwd_microstep: 3375.07 | bwd_inner_microstep: 3373.99 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.27 [2025-06-19 23:19:02,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.80 | bwd: 3375.10 | bwd_inner: 3373.99 | bwd_allreduce: 1.04 | step: 7.27 
62%|██████▏ | 6222/10000 [9:49:23<5:47:06, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.011041156947612762, 'learning_rate': 1.3194777665526507e-05, 'epoch': 6.22} 62%|██████▏ | 6222/10000 [9:49:23<5:47:06, 5.51s/it][2025-06-19 23:19:08,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:19:08,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.04 | bwd_microstep: 3318.74 | bwd_inner_microstep: 3317.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-19 23:19:08,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.04 | bwd: 3318.76 | bwd_inner: 3317.92 | bwd_allreduce: 0.79 | step: 7.15 62%|██████▏ | 6223/10000 [9:49:29<5:46:03, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.13763414323329926, 'learning_rate': 1.3188687018565642e-05, 'epoch': 6.22} 62%|██████▏ | 6223/10000 [9:49:29<5:46:03, 5.50s/it][2025-06-19 23:19:13,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:19:13,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.62 | bwd_microstep: 3325.97 | bwd_inner_microstep: 3325.13 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.77 [2025-06-19 23:19:13,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.62 | bwd: 3325.99 | bwd_inner: 3325.13 | bwd_allreduce: 0.80 | step: 6.77 62%|██████▏ | 6224/10000 [9:49:34<5:45:31, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.02590864896774292, 'learning_rate': 1.3182597086079898e-05, 'epoch': 6.22} 62%|██████▏ | 6224/10000 [9:49:34<5:45:31, 5.49s/it][2025-06-19 23:19:19,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:19:19,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.91 | bwd_microstep: 3312.50 | 
bwd_inner_microstep: 3311.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 23:19:19,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.91 | bwd: 3312.52 | bwd_inner: 3311.72 | bwd_allreduce: 0.76 | step: 6.68 62%|██████▏ | 6225/10000 [9:49:40<5:44:53, 5.48s/it] {'loss': 0.0009, 'grad_norm': 0.1176910400390625, 'learning_rate': 1.3176507868708077e-05, 'epoch': 6.22} 62%|██████▏ | 6225/10000 [9:49:40<5:44:53, 5.48s/it][2025-06-19 23:19:24,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:19:24,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.47 | bwd_microstep: 3374.21 | bwd_inner_microstep: 3373.26 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.13 [2025-06-19 23:19:24,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.47 | bwd: 3374.23 | bwd_inner: 3373.26 | bwd_allreduce: 0.93 | step: 7.13 62%|██████▏ | 6226/10000 [9:49:45<5:46:01, 5.50s/it] {'loss': 0.0009, 'grad_norm': 0.11131773144006729, 'learning_rate': 1.3170419367088918e-05, 'epoch': 6.23} 62%|██████▏ | 6226/10000 [9:49:45<5:46:01, 5.50s/it][2025-06-19 23:19:30,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:19:30,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.04 | bwd_microstep: 3311.76 | bwd_inner_microstep: 3310.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 23:19:30,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.04 | bwd: 3311.78 | bwd_inner: 3310.97 | bwd_allreduce: 0.76 | step: 6.82 62%|██████▏ | 6227/10000 [9:49:50<5:44:59, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.09039822965860367, 'learning_rate': 1.3164331581861065e-05, 'epoch': 6.23} 62%|██████▏ | 6227/10000 [9:49:51<5:44:59, 5.49s/it][2025-06-19 23:19:35,722] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:19:35,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.25 | bwd_microstep: 3360.81 | bwd_inner_microstep: 3360.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-19 23:19:35,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.25 | bwd: 3360.82 | bwd_inner: 3360.00 | bwd_allreduce: 0.78 | step: 6.99 62%|██████▏ | 6228/10000 [9:49:56<5:45:36, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01194828748703003, 'learning_rate': 1.3158244513663103e-05, 'epoch': 6.23} 62%|██████▏ | 6228/10000 [9:49:56<5:45:36, 5.50s/it][2025-06-19 23:19:41,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:19:41,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.84 | bwd_microstep: 3325.86 | bwd_inner_microstep: 3324.91 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.42 [2025-06-19 23:19:41,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.84 | bwd: 3325.88 | bwd_inner: 3324.91 | bwd_allreduce: 0.93 | step: 7.43 62%|██████▏ | 6229/10000 [9:50:01<5:45:06, 5.49s/it] {'loss': 0.0008, 'grad_norm': 0.07590026408433914, 'learning_rate': 1.3152158163133542e-05, 'epoch': 6.23} 62%|██████▏ | 6229/10000 [9:50:02<5:45:06, 5.49s/it][2025-06-19 23:19:46,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:19:46,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.40 | bwd_microstep: 3366.26 | bwd_inner_microstep: 3365.48 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:19:46,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.40 | bwd: 3366.28 | bwd_inner: 3365.48 | bwd_allreduce: 0.76 | step: 6.67 
62%|██████▏ | 6230/10000 [9:50:07<5:45:42, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.013616537675261497, 'learning_rate': 1.3146072530910803e-05, 'epoch': 6.23} 62%|██████▏ | 6230/10000 [9:50:07<5:45:42, 5.50s/it][2025-06-19 23:19:52,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:19:52,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.52 | bwd_microstep: 3374.07 | bwd_inner_microstep: 3373.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 23:19:52,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.52 | bwd: 3374.08 | bwd_inner: 3373.26 | bwd_allreduce: 0.77 | step: 7.16 62%|██████▏ | 6231/10000 [9:50:13<5:46:24, 5.51s/it] {'loss': 0.0055, 'grad_norm': 0.9586828947067261, 'learning_rate': 1.3139987617633256e-05, 'epoch': 6.23} 62%|██████▏ | 6231/10000 [9:50:13<5:46:24, 5.51s/it][2025-06-19 23:19:57,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.76 | optimizer_step: 2.73 [2025-06-19 23:19:57,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.56 | bwd_microstep: 3367.40 | bwd_inner_microstep: 3366.50 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.49 [2025-06-19 23:19:57,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.55 | bwd: 3367.42 | bwd_inner: 3366.50 | bwd_allreduce: 0.87 | step: 7.49 62%|██████▏ | 6232/10000 [9:50:18<5:46:43, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.003134855069220066, 'learning_rate': 1.3133903423939161e-05, 'epoch': 6.23} 62%|██████▏ | 6232/10000 [9:50:18<5:46:43, 5.52s/it][2025-06-19 23:20:03,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:20:03,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.54 | bwd_microstep: 3317.22 | 
bwd_inner_microstep: 3316.35 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.09 [2025-06-19 23:20:03,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.54 | bwd: 3317.23 | bwd_inner: 3316.35 | bwd_allreduce: 0.83 | step: 7.09 62%|██████▏ | 6233/10000 [9:50:24<5:45:30, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.022282129153609276, 'learning_rate': 1.312781995046673e-05, 'epoch': 6.23} 62%|██████▏ | 6233/10000 [9:50:24<5:45:30, 5.50s/it][2025-06-19 23:20:08,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:20:08,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.31 | bwd_microstep: 3315.94 | bwd_inner_microstep: 3315.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 23:20:08,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.31 | bwd: 3315.96 | bwd_inner: 3315.14 | bwd_allreduce: 0.77 | step: 6.95 62%|██████▏ | 6234/10000 [9:50:29<5:44:38, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.034364525228738785, 'learning_rate': 1.3121737197854092e-05, 'epoch': 6.23} 62%|██████▏ | 6234/10000 [9:50:29<5:44:38, 5.49s/it][2025-06-19 23:20:14,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:20:14,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.88 | bwd_microstep: 3323.38 | bwd_inner_microstep: 3322.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 23:20:14,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.88 | bwd: 3323.39 | bwd_inner: 3322.59 | bwd_allreduce: 0.76 | step: 6.77 62%|██████▏ | 6235/10000 [9:50:35<5:44:16, 5.49s/it] {'loss': 0.002, 'grad_norm': 0.8184264302253723, 'learning_rate': 1.3115655166739297e-05, 'epoch': 6.24} 62%|██████▏ | 6235/10000 [9:50:35<5:44:16, 5.49s/it][2025-06-19 23:20:19,663] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 23:20:19,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.67 | bwd_microstep: 3310.29 | bwd_inner_microstep: 3309.42 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.79 [2025-06-19 23:20:19,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.67 | bwd: 3310.31 | bwd_inner: 3309.42 | bwd_allreduce: 0.83 | step: 7.80 62%|██████▏ | 6236/10000 [9:50:40<5:43:41, 5.48s/it] {'loss': 0.0762, 'grad_norm': 10.972761154174805, 'learning_rate': 1.310957385776033e-05, 'epoch': 6.24} 62%|██████▏ | 6236/10000 [9:50:40<5:43:41, 5.48s/it][2025-06-19 23:20:25,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:20:25,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.92 | bwd_microstep: 3322.03 | bwd_inner_microstep: 3321.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 23:20:25,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.92 | bwd: 3322.05 | bwd_inner: 3321.24 | bwd_allreduce: 0.76 | step: 6.81 62%|██████▏ | 6237/10000 [9:50:45<5:43:23, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004400560166686773, 'learning_rate': 1.3103493271555082e-05, 'epoch': 6.24} 62%|██████▏ | 6237/10000 [9:50:45<5:43:23, 5.48s/it][2025-06-19 23:20:30,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:20:30,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.66 | bwd_microstep: 3313.08 | bwd_inner_microstep: 3312.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 23:20:30,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.66 | bwd: 3313.09 | bwd_inner: 3312.30 | bwd_allreduce: 0.75 | step: 6.68 
62%|██████▏ | 6238/10000 [9:50:51<5:42:52, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.010679579339921474, 'learning_rate': 1.309741340876138e-05, 'epoch': 6.24} 62%|██████▏ | 6238/10000 [9:50:51<5:42:52, 5.47s/it][2025-06-19 23:20:36,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:20:36,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.54 | bwd_microstep: 3369.82 | bwd_inner_microstep: 3368.87 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.73 [2025-06-19 23:20:36,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.54 | bwd: 3369.84 | bwd_inner: 3368.87 | bwd_allreduce: 0.92 | step: 7.74 62%|██████▏ | 6239/10000 [9:50:56<5:44:20, 5.49s/it] {'loss': 0.0014, 'grad_norm': 0.38240572810173035, 'learning_rate': 1.3091334270016977e-05, 'epoch': 6.24} 62%|██████▏ | 6239/10000 [9:50:56<5:44:20, 5.49s/it][2025-06-19 23:20:41,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:20:41,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.69 | bwd_microstep: 3319.52 | bwd_inner_microstep: 3318.53 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.53 [2025-06-19 23:20:41,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.69 | bwd: 3319.54 | bwd_inner: 3318.53 | bwd_allreduce: 0.96 | step: 7.54 62%|██████▏ | 6240/10000 [9:51:02<5:43:52, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.003911939915269613, 'learning_rate': 1.3085255855959545e-05, 'epoch': 6.24} 62%|██████▏ | 6240/10000 [9:51:02<5:43:52, 5.49s/it][2025-06-19 23:20:47,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:20:47,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.10 | bwd_microstep: 3319.45 | 
bwd_inner_microstep: 3318.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 23:20:47,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.10 | bwd: 3319.47 | bwd_inner: 3318.66 | bwd_allreduce: 0.76 | step: 6.90 62%|██████▏ | 6241/10000 [9:51:07<5:43:18, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0069722989574074745, 'learning_rate': 1.3079178167226689e-05, 'epoch': 6.24} 62%|██████▏ | 6241/10000 [9:51:07<5:43:18, 5.48s/it][2025-06-19 23:20:52,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:20:52,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.16 | bwd_microstep: 3362.80 | bwd_inner_microstep: 3362.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 23:20:52,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.16 | bwd: 3362.82 | bwd_inner: 3362.01 | bwd_allreduce: 0.77 | step: 6.81 62%|██████▏ | 6242/10000 [9:51:13<5:44:03, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002803053241223097, 'learning_rate': 1.3073101204455913e-05, 'epoch': 6.24} 62%|██████▏ | 6242/10000 [9:51:13<5:44:03, 5.49s/it][2025-06-19 23:20:58,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:20:58,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.23 | bwd_microstep: 3313.53 | bwd_inner_microstep: 3312.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-19 23:20:58,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.23 | bwd: 3313.55 | bwd_inner: 3312.74 | bwd_allreduce: 0.77 | step: 7.15 62%|██████▏ | 6243/10000 [9:51:18<5:43:10, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002581462962552905, 'learning_rate': 1.306702496828467e-05, 'epoch': 6.24} 62%|██████▏ | 6243/10000 [9:51:18<5:43:10, 5.48s/it][2025-06-19 23:21:03,508] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:21:03,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.54 | bwd_microstep: 3318.72 | bwd_inner_microstep: 3317.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 23:21:03,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.54 | bwd: 3318.73 | bwd_inner: 3317.93 | bwd_allreduce: 0.77 | step: 6.75 62%|██████▏ | 6244/10000 [9:51:24<5:42:38, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.0645771250128746, 'learning_rate': 1.3060949459350334e-05, 'epoch': 6.24} 62%|██████▏ | 6244/10000 [9:51:24<5:42:38, 5.47s/it][2025-06-19 23:21:08,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 23:21:08,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.51 | bwd_microstep: 3316.86 | bwd_inner_microstep: 3315.56 | bwd_allreduce_microstep: 1.05 | step_microstep: 8.25 [2025-06-19 23:21:08,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.51 | bwd: 3316.88 | bwd_inner: 3315.56 | bwd_allreduce: 1.08 | step: 8.26 62%|██████▏ | 6245/10000 [9:51:29<5:42:35, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004015833139419556, 'learning_rate': 1.3054874678290194e-05, 'epoch': 6.25} 62%|██████▏ | 6245/10000 [9:51:29<5:42:35, 5.47s/it][2025-06-19 23:21:14,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:21:14,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.22 | bwd_microstep: 3313.27 | bwd_inner_microstep: 3312.32 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.03 [2025-06-19 23:21:14,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.22 | bwd: 3313.29 | bwd_inner: 3312.32 | bwd_allreduce: 0.93 | step: 7.03 
62%|██████▏ | 6246/10000 [9:51:35<5:42:19, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.02408658340573311, 'learning_rate': 1.304880062574148e-05, 'epoch': 6.25} 62%|██████▏ | 6246/10000 [9:51:35<5:42:19, 5.47s/it][2025-06-19 23:21:20,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:21:20,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.71 | bwd_microstep: 3385.03 | bwd_inner_microstep: 3383.97 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.58 [2025-06-19 23:21:20,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.71 | bwd: 3385.04 | bwd_inner: 3383.97 | bwd_allreduce: 1.03 | step: 7.58 62%|██████▏ | 6247/10000 [9:51:40<5:43:49, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01591920293867588, 'learning_rate': 1.3042727302341307e-05, 'epoch': 6.25} 62%|██████▏ | 6247/10000 [9:51:40<5:43:49, 5.50s/it][2025-06-19 23:21:25,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:21:25,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.16 | bwd_microstep: 3318.75 | bwd_inner_microstep: 3317.93 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.53 [2025-06-19 23:21:25,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.16 | bwd: 3318.76 | bwd_inner: 3317.93 | bwd_allreduce: 0.79 | step: 7.53 62%|██████▏ | 6248/10000 [9:51:46<5:43:19, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002727850340306759, 'learning_rate': 1.3036654708726749e-05, 'epoch': 6.25} 62%|██████▏ | 6248/10000 [9:51:46<5:43:19, 5.49s/it][2025-06-19 23:21:30,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:21:30,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.94 | bwd_microstep: 3317.04 | 
bwd_inner_microstep: 3316.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-19 23:21:30,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.94 | bwd: 3317.06 | bwd_inner: 3316.24 | bwd_allreduce: 0.78 | step: 6.91 62%|██████▏ | 6249/10000 [9:51:51<5:42:36, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.01860811561346054, 'learning_rate': 1.3030582845534796e-05, 'epoch': 6.25} 62%|██████▏ | 6249/10000 [9:51:51<5:42:36, 5.48s/it][2025-06-19 23:21:36,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:21:36,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.90 | bwd_microstep: 3367.22 | bwd_inner_microstep: 3366.26 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.53 [2025-06-19 23:21:36,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.90 | bwd: 3367.24 | bwd_inner: 3366.26 | bwd_allreduce: 0.93 | step: 7.53 62%|██████▎ | 6250/10000 [9:51:57<5:43:30, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.009395432658493519, 'learning_rate': 1.3024511713402355e-05, 'epoch': 6.25} 62%|██████▎ | 6250/10000 [9:51:57<5:43:30, 5.50s/it][2025-06-19 23:21:42,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:21:42,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.58 | bwd_microstep: 3386.65 | bwd_inner_microstep: 3385.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 23:21:42,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.58 | bwd: 3386.67 | bwd_inner: 3385.87 | bwd_allreduce: 0.76 | step: 6.62 63%|██████▎ | 6251/10000 [9:52:02<5:44:48, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.005098256282508373, 'learning_rate': 1.3018441312966266e-05, 'epoch': 6.25} 63%|██████▎ | 6251/10000 [9:52:02<5:44:48, 5.52s/it][2025-06-19 23:21:47,564] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:21:47,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.11 | bwd_microstep: 3359.72 | bwd_inner_microstep: 3358.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 23:21:47,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.11 | bwd: 3359.73 | bwd_inner: 3358.94 | bwd_allreduce: 0.76 | step: 6.67 63%|██████▎ | 6252/10000 [9:52:08<5:44:49, 5.52s/it] {'loss': 0.0006, 'grad_norm': 0.12409611791372299, 'learning_rate': 1.3012371644863278e-05, 'epoch': 6.25} 63%|██████▎ | 6252/10000 [9:52:08<5:44:49, 5.52s/it][2025-06-19 23:21:53,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:21:53,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.47 | bwd_microstep: 3364.61 | bwd_inner_microstep: 3363.54 | bwd_allreduce_microstep: 1.01 | step_microstep: 8.08 [2025-06-19 23:21:53,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.47 | bwd: 3364.63 | bwd_inner: 3363.54 | bwd_allreduce: 1.04 | step: 8.08 63%|██████▎ | 6253/10000 [9:52:13<5:45:00, 5.52s/it] {'loss': 0.021, 'grad_norm': 3.331386089324951, 'learning_rate': 1.3006302709730072e-05, 'epoch': 6.25} 63%|██████▎ | 6253/10000 [9:52:13<5:45:00, 5.52s/it][2025-06-19 23:21:58,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:21:58,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.97 | bwd_microstep: 3325.14 | bwd_inner_microstep: 3324.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-19 23:21:58,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.97 | bwd: 3325.15 | bwd_inner: 3324.35 | bwd_allreduce: 0.76 | step: 6.83 
63%|██████▎ | 6254/10000 [9:52:19<5:43:54, 5.51s/it] {'loss': 0.0013, 'grad_norm': 0.18081672489643097, 'learning_rate': 1.3000234508203254e-05, 'epoch': 6.25} 63%|██████▎ | 6254/10000 [9:52:19<5:43:54, 5.51s/it][2025-06-19 23:22:04,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:22:04,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.18 | bwd_microstep: 3318.33 | bwd_inner_microstep: 3317.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 23:22:04,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.18 | bwd: 3318.34 | bwd_inner: 3317.54 | bwd_allreduce: 0.76 | step: 6.65 63%|██████▎ | 6255/10000 [9:52:24<5:42:51, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0006331705371849239, 'learning_rate': 1.2994167040919348e-05, 'epoch': 6.25} 63%|██████▎ | 6255/10000 [9:52:24<5:42:51, 5.49s/it][2025-06-19 23:22:09,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:22:09,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.63 | bwd_microstep: 3365.83 | bwd_inner_microstep: 3365.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 23:22:09,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.63 | bwd: 3365.84 | bwd_inner: 3365.04 | bwd_allreduce: 0.76 | step: 6.79 63%|██████▎ | 6256/10000 [9:52:30<5:43:29, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.045410506427288055, 'learning_rate': 1.2988100308514816e-05, 'epoch': 6.26} 63%|██████▎ | 6256/10000 [9:52:30<5:43:29, 5.50s/it][2025-06-19 23:22:15,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:22:15,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.86 | bwd_microstep: 3361.17 | 
bwd_inner_microstep: 3360.34 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.14 [2025-06-19 23:22:15,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.86 | bwd: 3361.18 | bwd_inner: 3360.34 | bwd_allreduce: 0.80 | step: 7.14 63%|██████▎ | 6257/10000 [9:52:35<5:43:55, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.013713655062019825, 'learning_rate': 1.2982034311626003e-05, 'epoch': 6.26} 63%|██████▎ | 6257/10000 [9:52:35<5:43:55, 5.51s/it][2025-06-19 23:22:20,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:22:20,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.84 | bwd_microstep: 3308.47 | bwd_inner_microstep: 3307.66 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.98 [2025-06-19 23:22:20,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.84 | bwd: 3308.48 | bwd_inner: 3307.66 | bwd_allreduce: 0.78 | step: 6.98 63%|██████▎ | 6258/10000 [9:52:41<5:42:44, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005695837549865246, 'learning_rate': 1.2975969050889225e-05, 'epoch': 6.26} 63%|██████▎ | 6258/10000 [9:52:41<5:42:44, 5.50s/it][2025-06-19 23:22:26,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:22:26,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.16 | bwd_microstep: 3309.94 | bwd_inner_microstep: 3308.83 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.41 [2025-06-19 23:22:26,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.16 | bwd: 3309.96 | bwd_inner: 3308.83 | bwd_allreduce: 1.06 | step: 7.42 63%|██████▎ | 6259/10000 [9:52:46<5:42:01, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005967906676232815, 'learning_rate': 1.2969904526940696e-05, 'epoch': 6.26} 63%|██████▎ | 6259/10000 [9:52:46<5:42:01, 5.49s/it][2025-06-19 23:22:31,463] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 23:22:31,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.70 | bwd_microstep: 3305.26 | bwd_inner_microstep: 3304.23 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.83 [2025-06-19 23:22:31,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.70 | bwd: 3305.27 | bwd_inner: 3304.23 | bwd_allreduce: 0.99 | step: 7.83 63%|██████▎ | 6260/10000 [9:52:52<5:41:23, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.05939522758126259, 'learning_rate': 1.2963840740416548e-05, 'epoch': 6.26} 63%|██████▎ | 6260/10000 [9:52:52<5:41:23, 5.48s/it][2025-06-19 23:22:36,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:22:36,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.86 | bwd_microstep: 3359.22 | bwd_inner_microstep: 3358.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-19 23:22:36,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.86 | bwd: 3359.23 | bwd_inner: 3358.43 | bwd_allreduce: 0.76 | step: 6.74 63%|██████▎ | 6261/10000 [9:52:57<5:42:08, 5.49s/it] {'loss': 0.004, 'grad_norm': 0.5218322277069092, 'learning_rate': 1.295777769195286e-05, 'epoch': 6.26} 63%|██████▎ | 6261/10000 [9:52:57<5:42:08, 5.49s/it][2025-06-19 23:22:42,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:22:42,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.00 | bwd_microstep: 3316.13 | bwd_inner_microstep: 3315.29 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.85 [2025-06-19 23:22:42,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.00 | bwd: 3316.14 | bwd_inner: 3315.29 | bwd_allreduce: 0.81 | step: 6.85 
63%|██████▎ | 6262/10000 [9:53:03<5:41:33, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.1114918440580368, 'learning_rate': 1.2951715382185602e-05, 'epoch': 6.26} 63%|██████▎ | 6262/10000 [9:53:03<5:41:33, 5.48s/it][2025-06-19 23:22:47,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:22:47,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.30 | bwd_microstep: 3306.41 | bwd_inner_microstep: 3305.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 23:22:47,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.30 | bwd: 3306.43 | bwd_inner: 3305.62 | bwd_allreduce: 0.76 | step: 6.80 63%|██████▎ | 6263/10000 [9:53:08<5:40:46, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00268141133710742, 'learning_rate': 1.2945653811750686e-05, 'epoch': 6.26} 63%|██████▎ | 6263/10000 [9:53:08<5:40:46, 5.47s/it][2025-06-19 23:22:53,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:22:53,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.23 | bwd_microstep: 3311.81 | bwd_inner_microstep: 3311.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 23:22:53,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.23 | bwd: 3311.82 | bwd_inner: 3311.00 | bwd_allreduce: 0.78 | step: 7.09 63%|██████▎ | 6264/10000 [9:53:14<5:40:28, 5.47s/it] {'loss': 0.0006, 'grad_norm': 0.11662724614143372, 'learning_rate': 1.2939592981283948e-05, 'epoch': 6.26} 63%|██████▎ | 6264/10000 [9:53:14<5:40:28, 5.47s/it][2025-06-19 23:22:58,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 23:22:58,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.90 | bwd_microstep: 3359.36 | 
bwd_inner_microstep: 3358.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.03 [2025-06-19 23:22:58,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.90 | bwd: 3359.37 | bwd_inner: 3358.58 | bwd_allreduce: 0.75 | step: 7.04 63%|██████▎ | 6265/10000 [9:53:19<5:41:27, 5.49s/it] {'loss': 0.0016, 'grad_norm': 0.1929531842470169, 'learning_rate': 1.2933532891421136e-05, 'epoch': 6.26} 63%|██████▎ | 6265/10000 [9:53:19<5:41:27, 5.49s/it][2025-06-19 23:23:04,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.72 | optimizer_step: 2.73 [2025-06-19 23:23:04,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.73 | bwd_microstep: 3361.77 | bwd_inner_microstep: 3360.92 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.14 [2025-06-19 23:23:04,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.73 | bwd: 3361.79 | bwd_inner: 3360.92 | bwd_allreduce: 0.81 | step: 7.15 63%|██████▎ | 6266/10000 [9:53:25<5:42:09, 5.50s/it] {'loss': 0.0021, 'grad_norm': 0.5705777406692505, 'learning_rate': 1.292747354279794e-05, 'epoch': 6.27} 63%|██████▎ | 6266/10000 [9:53:25<5:42:09, 5.50s/it][2025-06-19 23:23:09,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 23:23:09,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.03 | bwd_microstep: 3311.58 | bwd_inner_microstep: 3310.56 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.50 [2025-06-19 23:23:09,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.03 | bwd: 3311.60 | bwd_inner: 3310.56 | bwd_allreduce: 1.00 | step: 7.50 63%|██████▎ | 6267/10000 [9:53:30<5:41:21, 5.49s/it] {'loss': 0.0428, 'grad_norm': 5.468526363372803, 'learning_rate': 1.292141493604993e-05, 'epoch': 6.27} 63%|██████▎ | 6267/10000 [9:53:30<5:41:21, 5.49s/it][2025-06-19 23:23:15,448] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:23:15,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.55 | bwd_microstep: 3395.52 | bwd_inner_microstep: 3394.69 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.24 [2025-06-19 23:23:15,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.55 | bwd: 3395.53 | bwd_inner: 3394.69 | bwd_allreduce: 0.79 | step: 7.24 63%|██████▎ | 6268/10000 [9:53:36<5:43:00, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.015195326879620552, 'learning_rate': 1.2915357071812645e-05, 'epoch': 6.27} 63%|██████▎ | 6268/10000 [9:53:36<5:43:00, 5.51s/it][2025-06-19 23:23:20,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:23:20,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.61 | bwd_microstep: 3320.60 | bwd_inner_microstep: 3319.71 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.69 [2025-06-19 23:23:20,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.61 | bwd: 3320.62 | bwd_inner: 3319.71 | bwd_allreduce: 0.86 | step: 7.69 63%|██████▎ | 6269/10000 [9:53:41<5:42:07, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.010036755353212357, 'learning_rate': 1.290929995072152e-05, 'epoch': 6.27} 63%|██████▎ | 6269/10000 [9:53:41<5:42:07, 5.50s/it][2025-06-19 23:23:26,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:23:26,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.19 | bwd_microstep: 3354.87 | bwd_inner_microstep: 3354.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-19 23:23:26,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.19 | bwd: 3354.88 | bwd_inner: 3354.08 | bwd_allreduce: 0.76 | step: 6.82 
63%|██████▎ | 6270/10000 [9:53:47<5:42:32, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0014460945967584848, 'learning_rate': 1.2903243573411923e-05, 'epoch': 6.27} 63%|██████▎ | 6270/10000 [9:53:47<5:42:32, 5.51s/it][2025-06-19 23:23:32,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:23:32,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.59 | bwd_microstep: 3384.15 | bwd_inner_microstep: 3383.15 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.44 [2025-06-19 23:23:32,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.59 | bwd: 3384.17 | bwd_inner: 3383.15 | bwd_allreduce: 0.96 | step: 7.44 63%|██████▎ | 6271/10000 [9:53:52<5:43:24, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.019964920356869698, 'learning_rate': 1.2897187940519142e-05, 'epoch': 6.27} 63%|██████▎ | 6271/10000 [9:53:52<5:43:24, 5.53s/it][2025-06-19 23:23:37,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:23:37,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.26 | bwd_microstep: 3311.72 | bwd_inner_microstep: 3310.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.16 [2025-06-19 23:23:37,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.26 | bwd: 3311.73 | bwd_inner: 3310.93 | bwd_allreduce: 0.76 | step: 7.16 63%|██████▎ | 6272/10000 [9:53:58<5:42:12, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0017387280240654945, 'learning_rate': 1.2891133052678374e-05, 'epoch': 6.27} 63%|██████▎ | 6272/10000 [9:53:58<5:42:12, 5.51s/it][2025-06-19 23:23:42,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:23:42,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.45 | bwd_microstep: 3313.88 | 
bwd_inner_microstep: 3313.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-19 23:23:42,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.46 | bwd: 3313.89 | bwd_inner: 3313.10 | bwd_allreduce: 0.75 | step: 6.60 63%|██████▎ | 6273/10000 [9:54:03<5:41:19, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.5061084032058716, 'learning_rate': 1.288507891052476e-05, 'epoch': 6.27} 63%|██████▎ | 6273/10000 [9:54:03<5:41:19, 5.50s/it][2025-06-19 23:23:48,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:23:48,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.30 | bwd_microstep: 3318.80 | bwd_inner_microstep: 3317.98 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-19 23:23:48,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.30 | bwd: 3318.81 | bwd_inner: 3317.98 | bwd_allreduce: 0.78 | step: 7.03 63%|██████▎ | 6274/10000 [9:54:09<5:40:27, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.020994264632463455, 'learning_rate': 1.2879025514693341e-05, 'epoch': 6.27} 63%|██████▎ | 6274/10000 [9:54:09<5:40:27, 5.48s/it][2025-06-19 23:23:53,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:23:53,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.34 | bwd_microstep: 3372.40 | bwd_inner_microstep: 3371.52 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.98 [2025-06-19 23:23:53,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.34 | bwd: 3372.41 | bwd_inner: 3371.52 | bwd_allreduce: 0.84 | step: 6.99 63%|██████▎ | 6275/10000 [9:54:14<5:41:30, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.3954133689403534, 'learning_rate': 1.28729728658191e-05, 'epoch': 6.28} 63%|██████▎ | 6275/10000 [9:54:14<5:41:30, 5.50s/it][2025-06-19 23:23:59,405] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:23:59,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.66 | bwd_microstep: 3319.38 | bwd_inner_microstep: 3318.47 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.97 [2025-06-19 23:23:59,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.66 | bwd: 3319.39 | bwd_inner: 3318.47 | bwd_allreduce: 0.88 | step: 6.98 63%|██████▎ | 6276/10000 [9:54:20<5:40:44, 5.49s/it] {'loss': 0.002, 'grad_norm': 0.6968031525611877, 'learning_rate': 1.2866920964536929e-05, 'epoch': 6.28} 63%|██████▎ | 6276/10000 [9:54:20<5:40:44, 5.49s/it][2025-06-19 23:24:04,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:24:04,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.11 | bwd_microstep: 3316.98 | bwd_inner_microstep: 3316.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 23:24:04,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.11 | bwd: 3316.99 | bwd_inner: 3316.19 | bwd_allreduce: 0.76 | step: 6.69 63%|██████▎ | 6277/10000 [9:54:25<5:40:06, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001230378751643002, 'learning_rate': 1.286086981148165e-05, 'epoch': 6.28} 63%|██████▎ | 6277/10000 [9:54:25<5:40:06, 5.48s/it][2025-06-19 23:24:10,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-19 23:24:10,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.72 | bwd_microstep: 3374.27 | bwd_inner_microstep: 3373.17 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.76 [2025-06-19 23:24:10,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.72 | bwd: 3374.28 | bwd_inner: 3373.17 | bwd_allreduce: 1.06 | step: 7.76 
63%|██████▎ | 6278/10000 [9:54:31<5:41:18, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.004590924363583326, 'learning_rate': 1.285481940728798e-05, 'epoch': 6.28} 63%|██████▎ | 6278/10000 [9:54:31<5:41:18, 5.50s/it][2025-06-19 23:24:15,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:24:15,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.69 | bwd_microstep: 3318.88 | bwd_inner_microstep: 3317.86 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.49 [2025-06-19 23:24:15,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.69 | bwd: 3318.90 | bwd_inner: 3317.86 | bwd_allreduce: 0.99 | step: 7.50 63%|██████▎ | 6279/10000 [9:54:36<5:40:44, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.09878667443990707, 'learning_rate': 1.28487697525906e-05, 'epoch': 6.28} 63%|██████▎ | 6279/10000 [9:54:36<5:40:44, 5.49s/it][2025-06-19 23:24:21,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:24:21,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.85 | bwd_microstep: 3368.11 | bwd_inner_microstep: 3367.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 23:24:21,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.85 | bwd: 3368.12 | bwd_inner: 3367.33 | bwd_allreduce: 0.75 | step: 6.58 63%|██████▎ | 6280/10000 [9:54:42<5:41:20, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0027518467977643013, 'learning_rate': 1.2842720848024079e-05, 'epoch': 6.28} 63%|██████▎ | 6280/10000 [9:54:42<5:41:20, 5.51s/it][2025-06-19 23:24:26,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:24:26,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.23 | bwd_microstep: 3368.79 | 
bwd_inner_microstep: 3367.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 23:24:26,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.23 | bwd: 3368.80 | bwd_inner: 3367.99 | bwd_allreduce: 0.77 | step: 6.70 63%|██████▎ | 6281/10000 [9:54:47<5:41:53, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.057643406093120575, 'learning_rate': 1.2836672694222925e-05, 'epoch': 6.28} 63%|██████▎ | 6281/10000 [9:54:47<5:41:53, 5.52s/it][2025-06-19 23:24:32,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:24:32,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.34 | bwd_microstep: 3316.03 | bwd_inner_microstep: 3315.05 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.08 [2025-06-19 23:24:32,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.34 | bwd: 3316.05 | bwd_inner: 3315.05 | bwd_allreduce: 0.95 | step: 7.08 63%|██████▎ | 6282/10000 [9:54:53<5:40:43, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.05021445080637932, 'learning_rate': 1.283062529182157e-05, 'epoch': 6.28} 63%|██████▎ | 6282/10000 [9:54:53<5:40:43, 5.50s/it][2025-06-19 23:24:37,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:24:37,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.36 | bwd_microstep: 3324.56 | bwd_inner_microstep: 3323.75 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.26 [2025-06-19 23:24:37,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.36 | bwd: 3324.58 | bwd_inner: 3323.75 | bwd_allreduce: 0.78 | step: 7.26 63%|██████▎ | 6283/10000 [9:54:58<5:40:09, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.008383404463529587, 'learning_rate': 1.2824578641454335e-05, 'epoch': 6.28} 63%|██████▎ | 6283/10000 [9:54:58<5:40:09, 5.49s/it][2025-06-19 23:24:43,354] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:24:43,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.53 | bwd_microstep: 3314.52 | bwd_inner_microstep: 3313.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:24:43,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.53 | bwd: 3314.54 | bwd_inner: 3313.73 | bwd_allreduce: 0.76 | step: 6.71 63%|██████▎ | 6284/10000 [9:55:04<5:39:26, 5.48s/it] {'loss': 0.001, 'grad_norm': 0.3002290427684784, 'learning_rate': 1.2818532743755503e-05, 'epoch': 6.28} 63%|██████▎ | 6284/10000 [9:55:04<5:39:26, 5.48s/it][2025-06-19 23:24:48,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:24:48,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.44 | bwd_microstep: 3310.99 | bwd_inner_microstep: 3309.97 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.73 [2025-06-19 23:24:48,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.45 | bwd: 3311.00 | bwd_inner: 3309.97 | bwd_allreduce: 0.99 | step: 7.73 63%|██████▎ | 6285/10000 [9:55:09<5:38:51, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.005878337658941746, 'learning_rate': 1.281248759935925e-05, 'epoch': 6.29} 63%|██████▎ | 6285/10000 [9:55:09<5:38:51, 5.47s/it][2025-06-19 23:24:54,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:24:54,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.69 | bwd_microstep: 3366.25 | bwd_inner_microstep: 3365.24 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.07 [2025-06-19 23:24:54,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.69 | bwd: 3366.27 | bwd_inner: 3365.24 | bwd_allreduce: 0.96 | step: 7.06 
63%|██████▎ | 6286/10000 [9:55:15<5:40:01, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00042373911128379405, 'learning_rate': 1.2806443208899695e-05, 'epoch': 6.29} 63%|██████▎ | 6286/10000 [9:55:15<5:40:01, 5.49s/it][2025-06-19 23:24:59,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:24:59,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.59 | bwd_microstep: 3321.56 | bwd_inner_microstep: 3320.69 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.48 [2025-06-19 23:24:59,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.59 | bwd: 3321.58 | bwd_inner: 3320.69 | bwd_allreduce: 0.83 | step: 7.49 63%|██████▎ | 6287/10000 [9:55:20<5:39:43, 5.49s/it] {'loss': 0.0532, 'grad_norm': 6.355321407318115, 'learning_rate': 1.2800399573010864e-05, 'epoch': 6.29} 63%|██████▎ | 6287/10000 [9:55:20<5:39:43, 5.49s/it][2025-06-19 23:25:05,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:25:05,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.94 | bwd_microstep: 3328.48 | bwd_inner_microstep: 3327.59 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.93 [2025-06-19 23:25:05,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.94 | bwd: 3328.49 | bwd_inner: 3327.59 | bwd_allreduce: 0.86 | step: 6.93 63%|██████▎ | 6288/10000 [9:55:26<5:39:20, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.06378895044326782, 'learning_rate': 1.2794356692326696e-05, 'epoch': 6.29} 63%|██████▎ | 6288/10000 [9:55:26<5:39:20, 5.48s/it][2025-06-19 23:25:10,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:25:10,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.73 | bwd_microstep: 3317.20 | 
bwd_inner_microstep: 3316.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-19 23:25:10,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.73 | bwd: 3317.21 | bwd_inner: 3316.40 | bwd_allreduce: 0.77 | step: 7.14 63%|██████▎ | 6289/10000 [9:55:31<5:38:50, 5.48s/it] {'loss': 0.0009, 'grad_norm': 0.14949950575828552, 'learning_rate': 1.2788314567481072e-05, 'epoch': 6.29} 63%|██████▎ | 6289/10000 [9:55:31<5:38:50, 5.48s/it][2025-06-19 23:25:16,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:25:16,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.09 | bwd_microstep: 3362.51 | bwd_inner_microstep: 3361.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 23:25:16,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.09 | bwd: 3362.52 | bwd_inner: 3361.73 | bwd_allreduce: 0.76 | step: 6.60 63%|██████▎ | 6290/10000 [9:55:37<5:39:49, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.018255596980452538, 'learning_rate': 1.278227319910778e-05, 'epoch': 6.29} 63%|██████▎ | 6290/10000 [9:55:37<5:39:49, 5.50s/it][2025-06-19 23:25:21,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-19 23:25:21,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.43 | bwd_microstep: 3328.67 | bwd_inner_microstep: 3327.81 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.98 [2025-06-19 23:25:21,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.43 | bwd: 3328.69 | bwd_inner: 3327.81 | bwd_allreduce: 0.83 | step: 6.99 63%|██████▎ | 6291/10000 [9:55:42<5:39:23, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.008212507702410221, 'learning_rate': 1.2776232587840529e-05, 'epoch': 6.29} 63%|██████▎ | 6291/10000 [9:55:42<5:39:23, 5.49s/it][2025-06-19 23:25:27,248] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:25:27,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.97 | bwd_microstep: 3313.79 | bwd_inner_microstep: 3312.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.30 [2025-06-19 23:25:27,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.97 | bwd: 3313.81 | bwd_inner: 3312.99 | bwd_allreduce: 0.77 | step: 7.31 63%|██████▎ | 6292/10000 [9:55:48<5:38:53, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0006786130834370852, 'learning_rate': 1.277019273431296e-05, 'epoch': 6.29} 63%|██████▎ | 6292/10000 [9:55:48<5:38:53, 5.48s/it][2025-06-19 23:25:32,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:25:32,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.46 | bwd_microstep: 3376.08 | bwd_inner_microstep: 3375.16 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.64 [2025-06-19 23:25:32,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.47 | bwd: 3376.09 | bwd_inner: 3375.16 | bwd_allreduce: 0.88 | step: 6.64 63%|██████▎ | 6293/10000 [9:55:53<5:40:04, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.017707809805870056, 'learning_rate': 1.2764153639158613e-05, 'epoch': 6.29} 63%|██████▎ | 6293/10000 [9:55:53<5:40:04, 5.50s/it][2025-06-19 23:25:38,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:25:38,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.80 | bwd_microstep: 3326.37 | bwd_inner_microstep: 3325.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 23:25:38,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.80 | bwd: 3326.38 | bwd_inner: 3325.57 | bwd_allreduce: 0.77 | step: 6.80 
63%|██████▎ | 6294/10000 [9:55:59<5:39:26, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0015153712593019009, 'learning_rate': 1.2758115303010965e-05, 'epoch': 6.29} 63%|██████▎ | 6294/10000 [9:55:59<5:39:26, 5.50s/it][2025-06-19 23:25:43,761] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:25:43,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.89 | bwd_microstep: 3329.61 | bwd_inner_microstep: 3328.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.11 [2025-06-19 23:25:43,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.89 | bwd: 3329.62 | bwd_inner: 3328.81 | bwd_allreduce: 0.77 | step: 7.11 63%|██████▎ | 6295/10000 [9:56:04<5:39:09, 5.49s/it] {'loss': 0.0079, 'grad_norm': 1.3968019485473633, 'learning_rate': 1.2752077726503411e-05, 'epoch': 6.29} 63%|██████▎ | 6295/10000 [9:56:04<5:39:09, 5.49s/it][2025-06-19 23:25:49,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:25:49,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.72 | bwd_microstep: 3377.23 | bwd_inner_microstep: 3376.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 23:25:49,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.72 | bwd: 3377.24 | bwd_inner: 3376.44 | bwd_allreduce: 0.76 | step: 6.68 63%|██████▎ | 6296/10000 [9:56:10<5:40:04, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.011551020666956902, 'learning_rate': 1.2746040910269267e-05, 'epoch': 6.3} 63%|██████▎ | 6296/10000 [9:56:10<5:40:04, 5.51s/it][2025-06-19 23:25:54,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:25:54,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.23 | bwd_microstep: 3393.67 | 
bwd_inner_microstep: 3392.88 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 23:25:54,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.23 | bwd: 3393.68 | bwd_inner: 3392.88 | bwd_allreduce: 0.77 | step: 6.76 63%|██████▎ | 6297/10000 [9:56:15<5:40:55, 5.52s/it] {'loss': 0.0033, 'grad_norm': 0.568859875202179, 'learning_rate': 1.2740004854941768e-05, 'epoch': 6.3} 63%|██████▎ | 6297/10000 [9:56:15<5:40:55, 5.52s/it][2025-06-19 23:26:00,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:26:00,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.27 | bwd_microstep: 3338.41 | bwd_inner_microstep: 3337.56 | bwd_allreduce_microstep: 0.79 | step_microstep: 8.20 [2025-06-19 23:26:00,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.27 | bwd: 3338.42 | bwd_inner: 3337.56 | bwd_allreduce: 0.81 | step: 8.21 63%|██████▎ | 6298/10000 [9:56:21<5:40:07, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0021487113554030657, 'learning_rate': 1.273396956115406e-05, 'epoch': 6.3} 63%|██████▎ | 6298/10000 [9:56:21<5:40:07, 5.51s/it][2025-06-19 23:26:05,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:26:05,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.66 | bwd_microstep: 3325.98 | bwd_inner_microstep: 3325.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 23:26:05,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.65 | bwd: 3326.00 | bwd_inner: 3325.19 | bwd_allreduce: 0.76 | step: 6.88 63%|██████▎ | 6299/10000 [9:56:26<5:39:29, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0023992944043129683, 'learning_rate': 1.2727935029539222e-05, 'epoch': 6.3} 63%|██████▎ | 6299/10000 [9:56:26<5:39:29, 5.50s/it][2025-06-19 23:26:11,306] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:26:11,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.16 | bwd_microstep: 3322.56 | bwd_inner_microstep: 3321.62 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.04 [2025-06-19 23:26:11,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.16 | bwd: 3322.57 | bwd_inner: 3321.62 | bwd_allreduce: 0.91 | step: 7.04 63%|██████▎ | 6300/10000 [9:56:32<5:38:45, 5.49s/it] {'loss': 0.0532, 'grad_norm': 3.6878538131713867, 'learning_rate': 1.2721901260730252e-05, 'epoch': 6.3} 63%|██████▎ | 6300/10000 [9:56:32<5:38:45, 5.49s/it][2025-06-19 23:26:16,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:26:16,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.39 | bwd_microstep: 3334.31 | bwd_inner_microstep: 3333.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 23:26:16,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.39 | bwd: 3334.32 | bwd_inner: 3333.52 | bwd_allreduce: 0.76 | step: 6.70 63%|██████▎ | 6301/10000 [9:56:37<5:38:34, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.000245108938543126, 'learning_rate': 1.2715868255360058e-05, 'epoch': 6.3} 63%|██████▎ | 6301/10000 [9:56:37<5:38:34, 5.49s/it][2025-06-19 23:26:22,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.70 | optimizer_step: 2.73 [2025-06-19 23:26:22,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.23 | bwd_microstep: 3378.19 | bwd_inner_microstep: 3377.39 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 23:26:22,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.23 | bwd: 3378.21 | bwd_inner: 3377.39 | bwd_allreduce: 0.77 | step: 7.12 
63%|██████▎ | 6302/10000 [9:56:43<5:39:29, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.05908212810754776, 'learning_rate': 1.2709836014061486e-05, 'epoch': 6.3} 63%|██████▎ | 6302/10000 [9:56:43<5:39:29, 5.51s/it][2025-06-19 23:26:27,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:26:27,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.81 | bwd_microstep: 3375.85 | bwd_inner_microstep: 3375.04 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-19 23:26:27,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.81 | bwd: 3375.86 | bwd_inner: 3375.04 | bwd_allreduce: 0.79 | step: 7.24 63%|██████▎ | 6303/10000 [9:56:48<5:40:07, 5.52s/it] {'loss': 0.0005, 'grad_norm': 0.18541668355464935, 'learning_rate': 1.2703804537467272e-05, 'epoch': 6.3} 63%|██████▎ | 6303/10000 [9:56:48<5:40:07, 5.52s/it][2025-06-19 23:26:33,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:26:33,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.37 | bwd_microstep: 3411.09 | bwd_inner_microstep: 3409.96 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.85 [2025-06-19 23:26:33,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.37 | bwd: 3411.11 | bwd_inner: 3409.96 | bwd_allreduce: 1.09 | step: 7.86 63%|██████▎ | 6304/10000 [9:56:54<5:41:39, 5.55s/it] {'loss': 0.0012, 'grad_norm': 0.47295236587524414, 'learning_rate': 1.2697773826210101e-05, 'epoch': 6.3} 63%|██████▎ | 6304/10000 [9:56:54<5:41:39, 5.55s/it][2025-06-19 23:26:39,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:26:39,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.28 | bwd_microstep: 3377.64 | 
bwd_inner_microstep: 3376.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:26:39,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.28 | bwd: 3377.66 | bwd_inner: 3376.86 | bwd_allreduce: 0.76 | step: 6.67 63%|██████▎ | 6305/10000 [9:56:59<5:41:53, 5.55s/it] {'loss': 0.0, 'grad_norm': 0.0002453165652696043, 'learning_rate': 1.2691743880922564e-05, 'epoch': 6.3} 63%|██████▎ | 6305/10000 [9:56:59<5:41:53, 5.55s/it][2025-06-19 23:26:44,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:26:44,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.78 | bwd_microstep: 3327.71 | bwd_inner_microstep: 3326.77 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.88 [2025-06-19 23:26:44,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.78 | bwd: 3327.72 | bwd_inner: 3326.77 | bwd_allreduce: 0.91 | step: 6.89 63%|██████▎ | 6306/10000 [9:57:05<5:40:32, 5.53s/it] {'loss': 0.0009, 'grad_norm': 0.1023770347237587, 'learning_rate': 1.2685714702237176e-05, 'epoch': 6.31} 63%|██████▎ | 6306/10000 [9:57:05<5:40:32, 5.53s/it][2025-06-19 23:26:50,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:26:50,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.69 | bwd_microstep: 3367.31 | bwd_inner_microstep: 3366.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.03 [2025-06-19 23:26:50,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.69 | bwd: 3367.33 | bwd_inner: 3366.51 | bwd_allreduce: 0.77 | step: 7.03 63%|██████▎ | 6307/10000 [9:57:10<5:40:32, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.010235433466732502, 'learning_rate': 1.2679686290786375e-05, 'epoch': 6.31} 63%|██████▎ | 6307/10000 [9:57:10<5:40:32, 5.53s/it][2025-06-19 23:26:55,549] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:26:55,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.35 | bwd_microstep: 3321.91 | bwd_inner_microstep: 3321.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-19 23:26:55,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.35 | bwd: 3321.92 | bwd_inner: 3321.12 | bwd_allreduce: 0.76 | step: 6.62 63%|██████▎ | 6308/10000 [9:57:16<5:39:14, 5.51s/it] {'loss': 0.0006, 'grad_norm': 0.11389745026826859, 'learning_rate': 1.26736586472025e-05, 'epoch': 6.31} 63%|██████▎ | 6308/10000 [9:57:16<5:39:14, 5.51s/it][2025-06-19 23:27:01,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:27:01,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.93 | bwd_microstep: 3326.72 | bwd_inner_microstep: 3325.78 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.24 [2025-06-19 23:27:01,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.93 | bwd: 3326.73 | bwd_inner: 3325.78 | bwd_allreduce: 0.91 | step: 7.25 63%|██████▎ | 6309/10000 [9:57:21<5:38:29, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.026594718918204308, 'learning_rate': 1.2667631772117832e-05, 'epoch': 6.31} 63%|██████▎ | 6309/10000 [9:57:21<5:38:29, 5.50s/it][2025-06-19 23:27:06,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.91 [2025-06-19 23:27:06,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.07 | bwd_microstep: 3327.25 | bwd_inner_microstep: 3326.40 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.01 [2025-06-19 23:27:06,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.07 | bwd: 3327.27 | bwd_inner: 3326.40 | bwd_allreduce: 0.81 | step: 7.01 
63%|██████▎ | 6310/10000 [9:57:27<5:38:04, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.036410119384527206, 'learning_rate': 1.266160566616456e-05, 'epoch': 6.31} 63%|██████▎ | 6310/10000 [9:57:27<5:38:04, 5.50s/it][2025-06-19 23:27:12,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:27:12,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.92 | bwd_microstep: 3368.81 | bwd_inner_microstep: 3367.97 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.81 [2025-06-19 23:27:12,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.92 | bwd: 3368.83 | bwd_inner: 3367.97 | bwd_allreduce: 0.80 | step: 6.82 63%|██████▎ | 6311/10000 [9:57:32<5:38:45, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.007085443940013647, 'learning_rate': 1.2655580329974795e-05, 'epoch': 6.31} 63%|██████▎ | 6311/10000 [9:57:32<5:38:45, 5.51s/it][2025-06-19 23:27:17,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:27:17,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.17 | bwd_microstep: 3315.83 | bwd_inner_microstep: 3315.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 23:27:17,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.18 | bwd: 3315.84 | bwd_inner: 3315.05 | bwd_allreduce: 0.76 | step: 6.72 63%|██████▎ | 6312/10000 [9:57:38<5:37:50, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001021954114548862, 'learning_rate': 1.2649555764180574e-05, 'epoch': 6.31} 63%|██████▎ | 6312/10000 [9:57:38<5:37:50, 5.50s/it][2025-06-19 23:27:23,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:27:23,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.17 | bwd_microstep: 3378.55 | 
bwd_inner_microstep: 3377.77 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 23:27:23,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.17 | bwd: 3378.56 | bwd_inner: 3377.77 | bwd_allreduce: 0.76 | step: 6.64 63%|██████▎ | 6313/10000 [9:57:43<5:38:47, 5.51s/it] {'loss': 0.0178, 'grad_norm': 2.9944584369659424, 'learning_rate': 1.264353196941383e-05, 'epoch': 6.31} 63%|██████▎ | 6313/10000 [9:57:43<5:38:47, 5.51s/it][2025-06-19 23:27:28,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:27:28,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.82 | bwd_microstep: 3314.47 | bwd_inner_microstep: 3313.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 23:27:28,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.82 | bwd: 3314.48 | bwd_inner: 3313.67 | bwd_allreduce: 0.77 | step: 6.74 63%|██████▎ | 6314/10000 [9:57:49<5:37:44, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.003566490253433585, 'learning_rate': 1.2637508946306443e-05, 'epoch': 6.31} 63%|██████▎ | 6314/10000 [9:57:49<5:37:44, 5.50s/it][2025-06-19 23:27:33,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:27:33,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.29 | bwd_microstep: 3316.98 | bwd_inner_microstep: 3316.16 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.82 [2025-06-19 23:27:33,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.29 | bwd: 3317.00 | bwd_inner: 3316.16 | bwd_allreduce: 0.80 | step: 6.82 63%|██████▎ | 6315/10000 [9:57:54<5:37:00, 5.49s/it] {'loss': 0.0017, 'grad_norm': 0.3889349102973938, 'learning_rate': 1.2631486695490196e-05, 'epoch': 6.32} 63%|██████▎ | 6315/10000 [9:57:54<5:37:00, 5.49s/it][2025-06-19 23:27:39,461] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:27:39,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.59 | bwd_microstep: 3322.18 | bwd_inner_microstep: 3321.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 23:27:39,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.59 | bwd: 3322.20 | bwd_inner: 3321.39 | bwd_allreduce: 0.76 | step: 6.69 63%|██████▎ | 6316/10000 [9:58:00<5:36:34, 5.48s/it] {'loss': 0.0022, 'grad_norm': 0.39758145809173584, 'learning_rate': 1.2625465217596795e-05, 'epoch': 6.32} 63%|██████▎ | 6316/10000 [9:58:00<5:36:34, 5.48s/it][2025-06-19 23:27:44,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:27:44,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.64 | bwd_microstep: 3332.07 | bwd_inner_microstep: 3331.15 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.48 [2025-06-19 23:27:44,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.64 | bwd: 3332.09 | bwd_inner: 3331.15 | bwd_allreduce: 0.89 | step: 7.49 63%|██████▎ | 6317/10000 [9:58:05<5:36:27, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.022315723821520805, 'learning_rate': 1.2619444513257876e-05, 'epoch': 6.32} 63%|██████▎ | 6317/10000 [9:58:05<5:36:27, 5.48s/it][2025-06-19 23:27:50,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.72 [2025-06-19 23:27:50,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.42 | bwd_microstep: 3375.71 | bwd_inner_microstep: 3374.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-19 23:27:50,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.42 | bwd: 3375.72 | bwd_inner: 3374.91 | bwd_allreduce: 0.77 | step: 6.96 
63%|██████▎ | 6318/10000 [9:58:11<5:37:37, 5.50s/it] {'loss': 0.001, 'grad_norm': 0.20221346616744995, 'learning_rate': 1.2613424583104968e-05, 'epoch': 6.32} 63%|██████▎ | 6318/10000 [9:58:11<5:37:37, 5.50s/it][2025-06-19 23:27:55,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:27:55,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.70 | bwd_microstep: 3323.93 | bwd_inner_microstep: 3323.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-19 23:27:55,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.70 | bwd: 3323.95 | bwd_inner: 3323.13 | bwd_allreduce: 0.78 | step: 6.86 63%|██████▎ | 6319/10000 [9:58:16<5:36:50, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005232169292867184, 'learning_rate': 1.2607405427769534e-05, 'epoch': 6.32} 63%|██████▎ | 6319/10000 [9:58:16<5:36:50, 5.49s/it][2025-06-19 23:28:01,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:28:01,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.26 | bwd_microstep: 3321.61 | bwd_inner_microstep: 3320.76 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.44 [2025-06-19 23:28:01,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.27 | bwd: 3321.63 | bwd_inner: 3320.76 | bwd_allreduce: 0.83 | step: 7.45 63%|██████▎ | 6320/10000 [9:58:22<5:36:25, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006075636483728886, 'learning_rate': 1.2601387047882963e-05, 'epoch': 6.32} 63%|██████▎ | 6320/10000 [9:58:22<5:36:25, 5.49s/it][2025-06-19 23:28:06,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:28:06,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.32 | bwd_microstep: 3375.04 | 
bwd_inner_microstep: 3374.21 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.84 [2025-06-19 23:28:06,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.32 | bwd: 3375.06 | bwd_inner: 3374.21 | bwd_allreduce: 0.80 | step: 6.85 63%|██████▎ | 6321/10000 [9:58:27<5:37:30, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0026162387803196907, 'learning_rate': 1.2595369444076553e-05, 'epoch': 6.32} 63%|██████▎ | 6321/10000 [9:58:27<5:37:30, 5.50s/it][2025-06-19 23:28:12,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:28:12,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.72 | bwd_microstep: 3326.22 | bwd_inner_microstep: 3325.10 | bwd_allreduce_microstep: 1.04 | step_microstep: 8.04 [2025-06-19 23:28:12,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.72 | bwd: 3326.24 | bwd_inner: 3325.10 | bwd_allreduce: 1.08 | step: 8.05 63%|██████▎ | 6322/10000 [9:58:33<5:36:54, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0038409275002777576, 'learning_rate': 1.2589352616981525e-05, 'epoch': 6.32} 63%|██████▎ | 6322/10000 [9:58:33<5:36:54, 5.50s/it][2025-06-19 23:28:17,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:28:17,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.97 | bwd_microstep: 3327.18 | bwd_inner_microstep: 3326.20 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.61 [2025-06-19 23:28:17,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.97 | bwd: 3327.19 | bwd_inner: 3326.21 | bwd_allreduce: 0.94 | step: 7.61 63%|██████▎ | 6323/10000 [9:58:38<5:36:32, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.009810029529035091, 'learning_rate': 1.258333656722901e-05, 'epoch': 6.32} 63%|██████▎ | 6323/10000 [9:58:38<5:36:32, 5.49s/it][2025-06-19 23:28:23,478] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:28:23,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.76 | bwd_microstep: 3370.93 | bwd_inner_microstep: 3370.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 23:28:23,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.76 | bwd: 3370.95 | bwd_inner: 3370.15 | bwd_allreduce: 0.75 | step: 6.62 63%|██████▎ | 6324/10000 [9:58:44<5:37:24, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.030133137479424477, 'learning_rate': 1.2577321295450069e-05, 'epoch': 6.32} 63%|██████▎ | 6324/10000 [9:58:44<5:37:24, 5.51s/it][2025-06-19 23:28:29,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:28:29,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.90 | bwd_microstep: 3376.01 | bwd_inner_microstep: 3375.20 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.59 [2025-06-19 23:28:29,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.90 | bwd: 3376.03 | bwd_inner: 3375.20 | bwd_allreduce: 0.78 | step: 7.59 63%|██████▎ | 6325/10000 [9:58:49<5:38:02, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.020503755658864975, 'learning_rate': 1.2571306802275673e-05, 'epoch': 6.33} 63%|██████▎ | 6325/10000 [9:58:49<5:38:02, 5.52s/it][2025-06-19 23:28:34,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:28:34,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.28 | bwd_microstep: 3363.68 | bwd_inner_microstep: 3362.77 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.96 [2025-06-19 23:28:34,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.28 | bwd: 3363.70 | bwd_inner: 3362.77 | bwd_allreduce: 0.89 | step: 6.97 
63%|██████▎ | 6326/10000 [9:58:55<5:38:04, 5.52s/it] {'loss': 0.0005, 'grad_norm': 0.12801890075206757, 'learning_rate': 1.2565293088336716e-05, 'epoch': 6.33} 63%|██████▎ | 6326/10000 [9:58:55<5:38:04, 5.52s/it][2025-06-19 23:28:40,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:28:40,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.61 | bwd_microstep: 3363.80 | bwd_inner_microstep: 3362.99 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-19 23:28:40,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.61 | bwd: 3363.81 | bwd_inner: 3362.99 | bwd_allreduce: 0.78 | step: 6.85 63%|██████▎ | 6327/10000 [9:59:00<5:38:09, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.03415193408727646, 'learning_rate': 1.2559280154264016e-05, 'epoch': 6.33} 63%|██████▎ | 6327/10000 [9:59:00<5:38:09, 5.52s/it][2025-06-19 23:28:45,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:28:45,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.13 | bwd_microstep: 3319.03 | bwd_inner_microstep: 3318.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-19 23:28:45,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.13 | bwd: 3319.05 | bwd_inner: 3318.22 | bwd_allreduce: 0.78 | step: 6.94 63%|██████▎ | 6328/10000 [9:59:06<5:37:08, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.002921751933172345, 'learning_rate': 1.2553268000688289e-05, 'epoch': 6.33} 63%|██████▎ | 6328/10000 [9:59:06<5:37:08, 5.51s/it][2025-06-19 23:28:51,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:28:51,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.75 | bwd_microstep: 3317.73 | 
bwd_inner_microstep: 3316.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-19 23:28:51,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.75 | bwd: 3317.75 | bwd_inner: 3316.94 | bwd_allreduce: 0.77 | step: 6.86 63%|██████▎ | 6329/10000 [9:59:11<5:36:15, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.011032654903829098, 'learning_rate': 1.2547256628240184e-05, 'epoch': 6.33} 63%|██████▎ | 6329/10000 [9:59:11<5:36:15, 5.50s/it][2025-06-19 23:28:56,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:28:56,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.41 | bwd_microstep: 3311.17 | bwd_inner_microstep: 3310.35 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.77 [2025-06-19 23:28:56,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.41 | bwd: 3311.18 | bwd_inner: 3310.35 | bwd_allreduce: 0.79 | step: 6.77 63%|██████▎ | 6330/10000 [9:59:17<5:35:17, 5.48s/it] {'loss': 0.0012, 'grad_norm': 0.2909785807132721, 'learning_rate': 1.254124603755027e-05, 'epoch': 6.33} 63%|██████▎ | 6330/10000 [9:59:17<5:35:17, 5.48s/it][2025-06-19 23:29:01,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:29:01,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.31 | bwd_microstep: 3321.21 | bwd_inner_microstep: 3320.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 23:29:01,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.31 | bwd: 3321.22 | bwd_inner: 3320.42 | bwd_allreduce: 0.76 | step: 6.68 63%|██████▎ | 6331/10000 [9:59:22<5:34:55, 5.48s/it] {'loss': 0.0442, 'grad_norm': 5.481491565704346, 'learning_rate': 1.253523622924903e-05, 'epoch': 6.33} 63%|██████▎ | 6331/10000 [9:59:22<5:34:55, 5.48s/it][2025-06-19 23:29:07,471] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:29:07,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.46 | bwd_microstep: 3364.87 | bwd_inner_microstep: 3363.94 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.39 [2025-06-19 23:29:07,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.46 | bwd: 3364.89 | bwd_inner: 3363.94 | bwd_allreduce: 0.90 | step: 7.40 63%|██████▎ | 6332/10000 [9:59:28<5:35:56, 5.50s/it] {'loss': 0.0032, 'grad_norm': 0.6255280375480652, 'learning_rate': 1.252922720396687e-05, 'epoch': 6.33} 63%|██████▎ | 6332/10000 [9:59:28<5:35:56, 5.50s/it][2025-06-19 23:29:13,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:29:13,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.35 | bwd_microstep: 3369.66 | bwd_inner_microstep: 3368.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 23:29:13,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.35 | bwd: 3369.67 | bwd_inner: 3368.87 | bwd_allreduce: 0.76 | step: 6.74 63%|██████▎ | 6333/10000 [9:59:33<5:36:40, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0027766223065555096, 'learning_rate': 1.2523218962334096e-05, 'epoch': 6.33} 63%|██████▎ | 6333/10000 [9:59:33<5:36:40, 5.51s/it][2025-06-19 23:29:18,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:29:18,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.73 | bwd_microstep: 3311.88 | bwd_inner_microstep: 3310.92 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.34 [2025-06-19 23:29:18,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.73 | bwd: 3311.90 | bwd_inner: 3310.92 | bwd_allreduce: 0.93 | step: 7.35 
63%|██████▎ | 6334/10000 [9:59:39<5:35:31, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.06248479709029198, 'learning_rate': 1.2517211504980954e-05, 'epoch': 6.33} 63%|██████▎ | 6334/10000 [9:59:39<5:35:31, 5.49s/it][2025-06-19 23:29:23,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:29:23,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.82 | bwd_microstep: 3361.71 | bwd_inner_microstep: 3360.87 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.94 [2025-06-19 23:29:23,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.82 | bwd: 3361.72 | bwd_inner: 3360.87 | bwd_allreduce: 0.80 | step: 6.95 63%|██████▎ | 6335/10000 [9:59:44<5:36:06, 5.50s/it] {'loss': 0.0245, 'grad_norm': 6.618879318237305, 'learning_rate': 1.2511204832537599e-05, 'epoch': 6.33} 63%|██████▎ | 6335/10000 [9:59:44<5:36:06, 5.50s/it][2025-06-19 23:29:29,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:29:29,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.04 | bwd_microstep: 3369.79 | bwd_inner_microstep: 3368.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 23:29:29,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.04 | bwd: 3369.80 | bwd_inner: 3368.99 | bwd_allreduce: 0.77 | step: 6.84 63%|██████▎ | 6336/10000 [9:59:50<5:36:41, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0042084683664143085, 'learning_rate': 1.2505198945634099e-05, 'epoch': 6.34} 63%|██████▎ | 6336/10000 [9:59:50<5:36:41, 5.51s/it][2025-06-19 23:29:35,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:29:35,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.91 | bwd_microstep: 3367.44 | 
bwd_inner_microstep: 3366.59 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.23 [2025-06-19 23:29:35,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.91 | bwd: 3367.46 | bwd_inner: 3366.59 | bwd_allreduce: 0.83 | step: 7.23 63%|██████▎ | 6337/10000 [9:59:55<5:37:02, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.021404698491096497, 'learning_rate': 1.2499193844900442e-05, 'epoch': 6.34} 63%|██████▎ | 6337/10000 [9:59:55<5:37:02, 5.52s/it][2025-06-19 23:29:40,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:29:40,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.27 | bwd_microstep: 3363.66 | bwd_inner_microstep: 3362.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 23:29:40,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.27 | bwd: 3363.68 | bwd_inner: 3362.87 | bwd_allreduce: 0.77 | step: 6.69 63%|██████▎ | 6338/10000 [10:00:01<5:37:01, 5.52s/it] {'loss': 0.0013, 'grad_norm': 0.36868229508399963, 'learning_rate': 1.2493189530966548e-05, 'epoch': 6.34} 63%|██████▎ | 6338/10000 [10:00:01<5:37:01, 5.52s/it][2025-06-19 23:29:46,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:29:46,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.06 | bwd_microstep: 3314.25 | bwd_inner_microstep: 3313.43 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.79 [2025-06-19 23:29:46,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.06 | bwd: 3314.27 | bwd_inner: 3313.43 | bwd_allreduce: 0.79 | step: 6.80 63%|██████▎ | 6339/10000 [10:00:06<5:35:34, 5.50s/it] {'loss': 0.0016, 'grad_norm': 0.45201680064201355, 'learning_rate': 1.2487186004462223e-05, 'epoch': 6.34} 63%|██████▎ | 6339/10000 [10:00:06<5:35:34, 5.50s/it][2025-06-19 23:29:51,502] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:29:51,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.96 | bwd_microstep: 3319.63 | bwd_inner_microstep: 3318.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 23:29:51,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.96 | bwd: 3319.64 | bwd_inner: 3318.83 | bwd_allreduce: 0.77 | step: 6.93 63%|██████▎ | 6340/10000 [10:00:12<5:34:45, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.18300165235996246, 'learning_rate': 1.2481183266017221e-05, 'epoch': 6.34} 63%|██████▎ | 6340/10000 [10:00:12<5:34:45, 5.49s/it][2025-06-19 23:29:56,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:29:56,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.75 | bwd_microstep: 3322.83 | bwd_inner_microstep: 3321.84 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.64 [2025-06-19 23:29:56,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.75 | bwd: 3322.85 | bwd_inner: 3321.84 | bwd_allreduce: 0.96 | step: 7.64 63%|██████▎ | 6341/10000 [10:00:17<5:34:23, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004369070287793875, 'learning_rate': 1.2475181316261198e-05, 'epoch': 6.34} 63%|██████▎ | 6341/10000 [10:00:17<5:34:23, 5.48s/it][2025-06-19 23:30:02,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:30:02,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.28 | bwd_microstep: 3366.74 | bwd_inner_microstep: 3365.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.83 [2025-06-19 23:30:02,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.28 | bwd: 3366.75 | bwd_inner: 3365.95 | bwd_allreduce: 0.76 | step: 6.84 
63%|██████▎ | 6342/10000 [10:00:23<5:35:25, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.036293912678956985, 'learning_rate': 1.246918015582373e-05, 'epoch': 6.34} 63%|██████▎ | 6342/10000 [10:00:23<5:35:25, 5.50s/it][2025-06-19 23:30:07,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:30:07,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.52 | bwd_microstep: 3312.02 | bwd_inner_microstep: 3311.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 23:30:07,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.52 | bwd: 3312.03 | bwd_inner: 3311.23 | bwd_allreduce: 0.76 | step: 6.97 63%|██████▎ | 6343/10000 [10:00:28<5:34:26, 5.49s/it] {'loss': 0.001, 'grad_norm': 0.13247251510620117, 'learning_rate': 1.2463179785334316e-05, 'epoch': 6.34} 63%|██████▎ | 6343/10000 [10:00:28<5:34:26, 5.49s/it][2025-06-19 23:30:13,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:30:13,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.73 | bwd_microstep: 3374.42 | bwd_inner_microstep: 3373.58 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.87 [2025-06-19 23:30:13,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.73 | bwd: 3374.44 | bwd_inner: 3373.58 | bwd_allreduce: 0.81 | step: 6.87 63%|██████▎ | 6344/10000 [10:00:34<5:35:17, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.021424364298582077, 'learning_rate': 1.2457180205422361e-05, 'epoch': 6.34} 63%|██████▎ | 6344/10000 [10:00:34<5:35:17, 5.50s/it][2025-06-19 23:30:18,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-19 23:30:18,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.94 | bwd_microstep: 3324.37 | 
bwd_inner_microstep: 3323.15 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.80 [2025-06-19 23:30:18,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.95 | bwd: 3324.40 | bwd_inner: 3323.15 | bwd_allreduce: 1.17 | step: 7.80 63%|██████▎ | 6345/10000 [10:00:39<5:34:32, 5.49s/it] {'loss': 0.0021, 'grad_norm': 0.49389487504959106, 'learning_rate': 1.2451181416717193e-05, 'epoch': 6.34} 63%|██████▎ | 6345/10000 [10:00:39<5:34:32, 5.49s/it][2025-06-19 23:30:24,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:30:24,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.48 | bwd_microstep: 3318.84 | bwd_inner_microstep: 3318.03 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-19 23:30:24,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.48 | bwd: 3318.85 | bwd_inner: 3318.03 | bwd_allreduce: 0.78 | step: 7.07 63%|██████▎ | 6346/10000 [10:00:45<5:34:05, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.01573314145207405, 'learning_rate': 1.244518341984806e-05, 'epoch': 6.35} 63%|██████▎ | 6346/10000 [10:00:45<5:34:05, 5.49s/it][2025-06-19 23:30:29,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:30:29,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.12 | bwd_microstep: 3318.01 | bwd_inner_microstep: 3317.16 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.76 [2025-06-19 23:30:29,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.12 | bwd: 3318.03 | bwd_inner: 3317.16 | bwd_allreduce: 0.83 | step: 6.77 63%|██████▎ | 6347/10000 [10:00:50<5:33:37, 5.48s/it] {'loss': 0.0344, 'grad_norm': 2.7285115718841553, 'learning_rate': 1.2439186215444124e-05, 'epoch': 6.35} 63%|██████▎ | 6347/10000 [10:00:50<5:33:37, 5.48s/it][2025-06-19 23:30:35,373] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:30:35,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.73 | bwd_microstep: 3318.74 | bwd_inner_microstep: 3317.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.65 [2025-06-19 23:30:35,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.73 | bwd: 3318.76 | bwd_inner: 3317.94 | bwd_allreduce: 0.77 | step: 6.66 63%|██████▎ | 6348/10000 [10:00:56<5:33:08, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.020456142723560333, 'learning_rate': 1.2433189804134465e-05, 'epoch': 6.35} 63%|██████▎ | 6348/10000 [10:00:56<5:33:08, 5.47s/it][2025-06-19 23:30:40,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:30:40,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.12 | bwd_microstep: 3329.13 | bwd_inner_microstep: 3328.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-19 23:30:40,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.12 | bwd: 3329.15 | bwd_inner: 3328.34 | bwd_allreduce: 0.76 | step: 6.71 63%|██████▎ | 6349/10000 [10:01:01<5:32:56, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.015467985533177853, 'learning_rate': 1.2427194186548075e-05, 'epoch': 6.35} 63%|██████▎ | 6349/10000 [10:01:01<5:32:56, 5.47s/it][2025-06-19 23:30:46,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.76 [2025-06-19 23:30:46,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.47 | bwd_microstep: 3376.61 | bwd_inner_microstep: 3375.79 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.33 [2025-06-19 23:30:46,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.47 | bwd: 3376.62 | bwd_inner: 3375.79 | bwd_allreduce: 0.79 | step: 
7.33 64%|██████▎ | 6350/10000 [10:01:07<5:34:08, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.1591220647096634, 'learning_rate': 1.2421199363313866e-05, 'epoch': 6.35} 64%|██████▎ | 6350/10000 [10:01:07<5:34:08, 5.49s/it][2025-06-19 23:30:51,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:30:51,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.40 | bwd_microstep: 3364.61 | bwd_inner_microstep: 3363.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-19 23:30:51,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.40 | bwd: 3364.63 | bwd_inner: 3363.81 | bwd_allreduce: 0.78 | step: 6.85 64%|██████▎ | 6351/10000 [10:01:12<5:34:38, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.012641003355383873, 'learning_rate': 1.2415205335060672e-05, 'epoch': 6.35} 64%|██████▎ | 6351/10000 [10:01:12<5:34:38, 5.50s/it][2025-06-19 23:30:57,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:30:57,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.24 | bwd_microstep: 3314.83 | bwd_inner_microstep: 3314.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 23:30:57,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.24 | bwd: 3314.85 | bwd_inner: 3314.03 | bwd_allreduce: 0.77 | step: 6.75 64%|██████▎ | 6352/10000 [10:01:18<5:33:36, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.036260318011045456, 'learning_rate': 1.2409212102417235e-05, 'epoch': 6.35} 64%|██████▎ | 6352/10000 [10:01:18<5:33:36, 5.49s/it][2025-06-19 23:31:02,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:31:02,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.25 | bwd_microstep: 
3307.12 | bwd_inner_microstep: 3306.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-19 23:31:02,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.25 | bwd: 3307.13 | bwd_inner: 3306.31 | bwd_allreduce: 0.77 | step: 6.86 64%|██████▎ | 6353/10000 [10:01:23<5:32:45, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00310147856362164, 'learning_rate': 1.2403219666012225e-05, 'epoch': 6.35} 64%|██████▎ | 6353/10000 [10:01:23<5:32:45, 5.47s/it][2025-06-19 23:31:08,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:31:08,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.23 | bwd_microstep: 3317.92 | bwd_inner_microstep: 3317.10 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.02 [2025-06-19 23:31:08,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.23 | bwd: 3317.93 | bwd_inner: 3317.10 | bwd_allreduce: 0.79 | step: 7.03 64%|██████▎ | 6354/10000 [10:01:29<5:32:19, 5.47s/it] {'loss': 0.0011, 'grad_norm': 0.12995080649852753, 'learning_rate': 1.239722802647421e-05, 'epoch': 6.35} 64%|██████▎ | 6354/10000 [10:01:29<5:32:19, 5.47s/it][2025-06-19 23:31:13,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:31:13,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.93 | bwd_microstep: 3310.33 | bwd_inner_microstep: 3309.33 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.35 [2025-06-19 23:31:13,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.93 | bwd: 3310.35 | bwd_inner: 3309.33 | bwd_allreduce: 0.96 | step: 7.35 64%|██████▎ | 6355/10000 [10:01:34<5:31:54, 5.46s/it] {'loss': 0.0009, 'grad_norm': 0.21053847670555115, 'learning_rate': 1.2391237184431689e-05, 'epoch': 6.36} 64%|██████▎ | 6355/10000 [10:01:34<5:31:54, 5.46s/it][2025-06-19 23:31:19,239] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:31:19,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.06 | bwd_microstep: 3362.47 | bwd_inner_microstep: 3361.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.03 [2025-06-19 23:31:19,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.06 | bwd: 3362.48 | bwd_inner: 3361.68 | bwd_allreduce: 0.76 | step: 7.03 64%|██████▎ | 6356/10000 [10:01:40<5:32:59, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.008674644865095615, 'learning_rate': 1.238524714051307e-05, 'epoch': 6.36} 64%|██████▎ | 6356/10000 [10:01:40<5:32:59, 5.48s/it][2025-06-19 23:31:24,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:31:24,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.31 | bwd_microstep: 3315.52 | bwd_inner_microstep: 3314.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 23:31:24,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.32 | bwd: 3315.53 | bwd_inner: 3314.73 | bwd_allreduce: 0.76 | step: 6.68 64%|██████▎ | 6357/10000 [10:01:45<5:32:20, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.006204716395586729, 'learning_rate': 1.2379257895346688e-05, 'epoch': 6.36} 64%|██████▎ | 6357/10000 [10:01:45<5:32:20, 5.47s/it][2025-06-19 23:31:30,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:31:30,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.19 | bwd_microstep: 3317.08 | bwd_inner_microstep: 3316.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:31:30,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.19 | bwd: 3317.09 | bwd_inner: 3316.29 | bwd_allreduce: 0.76 | step: 6.72 
64%|██████▎ | 6358/10000 [10:01:50<5:32:04, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.007516913115978241, 'learning_rate': 1.2373269449560788e-05, 'epoch': 6.36} 64%|██████▎ | 6358/10000 [10:01:50<5:32:04, 5.47s/it][2025-06-19 23:31:35,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:31:35,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.12 | bwd_microstep: 3396.38 | bwd_inner_microstep: 3395.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 23:31:35,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.12 | bwd: 3396.39 | bwd_inner: 3395.57 | bwd_allreduce: 0.78 | step: 7.12 64%|██████▎ | 6359/10000 [10:01:56<5:33:43, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.1031750813126564, 'learning_rate': 1.236728180378352e-05, 'epoch': 6.36} 64%|██████▎ | 6359/10000 [10:01:56<5:33:43, 5.50s/it][2025-06-19 23:31:41,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:31:41,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.42 | bwd_microstep: 3320.81 | bwd_inner_microstep: 3320.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-19 23:31:41,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.42 | bwd: 3320.82 | bwd_inner: 3320.02 | bwd_allreduce: 0.76 | step: 6.86 64%|██████▎ | 6360/10000 [10:02:01<5:33:01, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.020958073437213898, 'learning_rate': 1.2361294958642969e-05, 'epoch': 6.36} 64%|██████▎ | 6360/10000 [10:02:01<5:33:01, 5.49s/it][2025-06-19 23:31:46,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:31:46,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.25 | bwd_microstep: 3361.67 | 
bwd_inner_microstep: 3360.63 | bwd_allreduce_microstep: 0.99 | step_microstep: 6.85 [2025-06-19 23:31:46,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.26 | bwd: 3361.69 | bwd_inner: 3360.63 | bwd_allreduce: 1.01 | step: 6.86 64%|██████▎ | 6361/10000 [10:02:07<5:33:36, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.016547461971640587, 'learning_rate': 1.2355308914767121e-05, 'epoch': 6.36} 64%|██████▎ | 6361/10000 [10:02:07<5:33:36, 5.50s/it][2025-06-19 23:31:52,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:31:52,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.59 | bwd_microstep: 3369.71 | bwd_inner_microstep: 3368.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 23:31:52,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.58 | bwd: 3369.72 | bwd_inner: 3368.92 | bwd_allreduce: 0.75 | step: 6.65 64%|██████▎ | 6362/10000 [10:02:13<5:34:15, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.03289628401398659, 'learning_rate': 1.234932367278389e-05, 'epoch': 6.36} 64%|██████▎ | 6362/10000 [10:02:13<5:34:15, 5.51s/it][2025-06-19 23:31:57,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:31:57,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.67 | bwd_microstep: 3370.71 | bwd_inner_microstep: 3369.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-19 23:31:57,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.67 | bwd: 3370.73 | bwd_inner: 3369.91 | bwd_allreduce: 0.77 | step: 6.84 64%|██████▎ | 6363/10000 [10:02:18<5:34:31, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.009462332352995872, 'learning_rate': 1.2343339233321106e-05, 'epoch': 6.36} 64%|██████▎ | 6363/10000 [10:02:18<5:34:31, 5.52s/it][2025-06-19 23:32:03,244] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:32:03,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.64 | bwd_microstep: 3314.27 | bwd_inner_microstep: 3313.46 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-19 23:32:03,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.64 | bwd: 3314.29 | bwd_inner: 3313.46 | bwd_allreduce: 0.78 | step: 7.04 64%|██████▎ | 6364/10000 [10:02:24<5:33:18, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0036586516071110964, 'learning_rate': 1.2337355597006492e-05, 'epoch': 6.36} 64%|██████▎ | 6364/10000 [10:02:24<5:33:18, 5.50s/it][2025-06-19 23:32:08,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:32:08,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.68 | bwd_microstep: 3321.69 | bwd_inner_microstep: 3320.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 23:32:08,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.68 | bwd: 3321.71 | bwd_inner: 3320.90 | bwd_allreduce: 0.76 | step: 6.67 64%|██████▎ | 6365/10000 [10:02:29<5:32:28, 5.49s/it] {'loss': 0.0015, 'grad_norm': 0.33815857768058777, 'learning_rate': 1.2331372764467711e-05, 'epoch': 6.37} 64%|██████▎ | 6365/10000 [10:02:29<5:32:28, 5.49s/it][2025-06-19 23:32:14,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:32:14,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.94 | bwd_microstep: 3319.38 | bwd_inner_microstep: 3318.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 23:32:14,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.94 | bwd: 3319.40 | bwd_inner: 3318.59 | bwd_allreduce: 0.76 | step: 6.79 
64%|██████▎ | 6366/10000 [10:02:34<5:31:56, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002765149809420109, 'learning_rate': 1.232539073633234e-05, 'epoch': 6.37} 64%|██████▎ | 6366/10000 [10:02:34<5:31:56, 5.48s/it][2025-06-19 23:32:19,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:32:19,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.69 | bwd_microstep: 3321.79 | bwd_inner_microstep: 3320.98 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-19 23:32:19,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.69 | bwd: 3321.80 | bwd_inner: 3320.98 | bwd_allreduce: 0.78 | step: 7.15 64%|██████▎ | 6367/10000 [10:02:40<5:31:28, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.012363762594759464, 'learning_rate': 1.2319409513227858e-05, 'epoch': 6.37} 64%|██████▎ | 6367/10000 [10:02:40<5:31:28, 5.47s/it][2025-06-19 23:32:25,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:32:25,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.40 | bwd_microstep: 3318.92 | bwd_inner_microstep: 3318.01 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.96 [2025-06-19 23:32:25,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.40 | bwd: 3318.93 | bwd_inner: 3318.01 | bwd_allreduce: 0.88 | step: 6.97 64%|██████▎ | 6368/10000 [10:02:45<5:31:04, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0014980531996116042, 'learning_rate': 1.2313429095781677e-05, 'epoch': 6.37} 64%|██████▎ | 6368/10000 [10:02:45<5:31:04, 5.47s/it][2025-06-19 23:32:30,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:32:30,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.02 | bwd_microstep: 3366.89 | 
bwd_inner_microstep: 3366.01 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.00 [2025-06-19 23:32:30,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.02 | bwd: 3366.91 | bwd_inner: 3366.01 | bwd_allreduce: 0.85 | step: 7.00 64%|██████▎ | 6369/10000 [10:02:51<5:32:09, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.03156045079231262, 'learning_rate': 1.2307449484621103e-05, 'epoch': 6.37} 64%|██████▎ | 6369/10000 [10:02:51<5:32:09, 5.49s/it][2025-06-19 23:32:36,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:32:36,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.29 | bwd_microstep: 3319.41 | bwd_inner_microstep: 3318.59 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.80 [2025-06-19 23:32:36,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.29 | bwd: 3319.42 | bwd_inner: 3318.59 | bwd_allreduce: 0.79 | step: 6.80 64%|██████▎ | 6370/10000 [10:02:56<5:31:33, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.020227432250976562, 'learning_rate': 1.2301470680373379e-05, 'epoch': 6.37} 64%|██████▎ | 6370/10000 [10:02:56<5:31:33, 5.48s/it][2025-06-19 23:32:41,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:32:41,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.77 | bwd_microstep: 3358.71 | bwd_inner_microstep: 3357.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-19 23:32:41,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.77 | bwd: 3358.73 | bwd_inner: 3357.91 | bwd_allreduce: 0.78 | step: 7.00 64%|██████▎ | 6371/10000 [10:03:02<5:32:10, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.024433212354779243, 'learning_rate': 1.2295492683665651e-05, 'epoch': 6.37} 64%|██████▎ | 6371/10000 [10:03:02<5:32:10, 5.49s/it][2025-06-19 23:32:47,061] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:32:47,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.58 | bwd_microstep: 3319.70 | bwd_inner_microstep: 3318.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 23:32:47,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.59 | bwd: 3319.71 | bwd_inner: 3318.89 | bwd_allreduce: 0.78 | step: 7.17 64%|██████▎ | 6372/10000 [10:03:07<5:31:33, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.01654575951397419, 'learning_rate': 1.228951549512498e-05, 'epoch': 6.37} 64%|██████▎ | 6372/10000 [10:03:07<5:31:33, 5.48s/it][2025-06-19 23:32:52,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:32:52,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.13 | bwd_microstep: 3313.83 | bwd_inner_microstep: 3313.02 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-19 23:32:52,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.13 | bwd: 3313.85 | bwd_inner: 3313.02 | bwd_allreduce: 0.78 | step: 6.80 64%|██████▎ | 6373/10000 [10:03:13<5:30:54, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.020592963322997093, 'learning_rate': 1.2283539115378358e-05, 'epoch': 6.37} 64%|██████▎ | 6373/10000 [10:03:13<5:30:54, 5.47s/it][2025-06-19 23:32:58,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.85 [2025-06-19 23:32:58,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.47 | bwd_microstep: 3386.68 | bwd_inner_microstep: 3385.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.33 [2025-06-19 23:32:58,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.48 | bwd: 3386.70 | bwd_inner: 3385.87 | bwd_allreduce: 0.78 | step: 7.33 
64%|██████▎ | 6374/10000 [10:03:18<5:32:24, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.020169144496321678, 'learning_rate': 1.227756354505266e-05, 'epoch': 6.37} 64%|██████▎ | 6374/10000 [10:03:18<5:32:24, 5.50s/it][2025-06-19 23:33:03,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:33:03,525] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.25 | bwd_microstep: 3310.74 | bwd_inner_microstep: 3309.96 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.71 [2025-06-19 23:33:03,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.25 | bwd: 3310.75 | bwd_inner: 3309.96 | bwd_allreduce: 0.75 | step: 6.72 64%|██████▍ | 6375/10000 [10:03:24<5:31:22, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.02370418980717659, 'learning_rate': 1.2271588784774706e-05, 'epoch': 6.38} 64%|██████▍ | 6375/10000 [10:03:24<5:31:22, 5.48s/it][2025-06-19 23:33:08,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:33:08,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.52 | bwd_microstep: 3306.37 | bwd_inner_microstep: 3305.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-19 23:33:08,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.52 | bwd: 3306.38 | bwd_inner: 3305.57 | bwd_allreduce: 0.77 | step: 6.88 64%|██████▍ | 6376/10000 [10:03:29<5:30:39, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.000996636226773262, 'learning_rate': 1.226561483517122e-05, 'epoch': 6.38} 64%|██████▍ | 6376/10000 [10:03:29<5:30:39, 5.47s/it][2025-06-19 23:33:14,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:33:14,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.71 | bwd_microstep: 3317.06 | 
bwd_inner_microstep: 3316.13 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.04 [2025-06-19 23:33:14,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.71 | bwd: 3317.08 | bwd_inner: 3316.13 | bwd_allreduce: 0.91 | step: 7.04 64%|██████▍ | 6377/10000 [10:03:35<5:30:17, 5.47s/it] {'loss': 0.0004, 'grad_norm': 0.06461109966039658, 'learning_rate': 1.225964169686884e-05, 'epoch': 6.38} 64%|██████▍ | 6377/10000 [10:03:35<5:30:17, 5.47s/it][2025-06-19 23:33:19,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:33:19,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.95 | bwd_microstep: 3308.01 | bwd_inner_microstep: 3307.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 23:33:19,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.95 | bwd: 3308.02 | bwd_inner: 3307.23 | bwd_allreduce: 0.75 | step: 6.65 64%|██████▍ | 6378/10000 [10:03:40<5:29:47, 5.46s/it] {'loss': 0.0033, 'grad_norm': 1.1545404195785522, 'learning_rate': 1.225366937049413e-05, 'epoch': 6.38} 64%|██████▍ | 6378/10000 [10:03:40<5:29:47, 5.46s/it][2025-06-19 23:33:25,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:33:25,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.87 | bwd_microstep: 3310.70 | bwd_inner_microstep: 3309.86 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.35 [2025-06-19 23:33:25,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.87 | bwd: 3310.71 | bwd_inner: 3309.86 | bwd_allreduce: 0.81 | step: 7.36 64%|██████▍ | 6379/10000 [10:03:46<5:29:24, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.004410408902913332, 'learning_rate': 1.2247697856673542e-05, 'epoch': 6.38} 64%|██████▍ | 6379/10000 [10:03:46<5:29:24, 5.46s/it][2025-06-19 23:33:30,849] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 23:33:30,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.13 | bwd_microstep: 3358.13 | bwd_inner_microstep: 3357.11 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.35 [2025-06-19 23:33:30,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.13 | bwd: 3358.15 | bwd_inner: 3357.11 | bwd_allreduce: 0.99 | step: 7.35 64%|██████▍ | 6380/10000 [10:03:51<5:30:27, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.011963802389800549, 'learning_rate': 1.2241727156033471e-05, 'epoch': 6.38} 64%|██████▍ | 6380/10000 [10:03:51<5:30:27, 5.48s/it][2025-06-19 23:33:36,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:33:36,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.91 | bwd_microstep: 3307.69 | bwd_inner_microstep: 3306.65 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.82 [2025-06-19 23:33:36,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.91 | bwd: 3307.71 | bwd_inner: 3306.65 | bwd_allreduce: 1.00 | step: 7.82 64%|██████▍ | 6381/10000 [10:03:57<5:30:00, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.002685205079615116, 'learning_rate': 1.2235757269200215e-05, 'epoch': 6.38} 64%|██████▍ | 6381/10000 [10:03:57<5:30:00, 5.47s/it][2025-06-19 23:33:41,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 23:33:41,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.73 | bwd_microstep: 3310.35 | bwd_inner_microstep: 3309.40 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.42 [2025-06-19 23:33:41,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.73 | bwd: 3310.37 | bwd_inner: 3309.40 | bwd_allreduce: 0.92 | step: 7.43 
64%|██████▍ | 6382/10000 [10:04:02<5:29:40, 5.47s/it] {'loss': 0.001, 'grad_norm': 0.20590625703334808, 'learning_rate': 1.2229788196799987e-05, 'epoch': 6.38} 64%|██████▍ | 6382/10000 [10:04:02<5:29:40, 5.47s/it][2025-06-19 23:33:47,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:33:47,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.86 | bwd_microstep: 3305.37 | bwd_inner_microstep: 3304.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-19 23:33:47,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.86 | bwd: 3305.39 | bwd_inner: 3304.57 | bwd_allreduce: 0.77 | step: 6.89 64%|██████▍ | 6383/10000 [10:04:08<5:29:25, 5.46s/it] {'loss': 0.0033, 'grad_norm': 0.6954007148742676, 'learning_rate': 1.222381993945892e-05, 'epoch': 6.38} 64%|██████▍ | 6383/10000 [10:04:08<5:29:25, 5.46s/it][2025-06-19 23:33:52,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:33:52,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.11 | bwd_microstep: 3322.61 | bwd_inner_microstep: 3321.74 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.40 [2025-06-19 23:33:52,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.11 | bwd: 3322.63 | bwd_inner: 3321.74 | bwd_allreduce: 0.85 | step: 7.41 64%|██████▍ | 6384/10000 [10:04:13<5:29:15, 5.46s/it] {'loss': 0.0425, 'grad_norm': 9.313397407531738, 'learning_rate': 1.2217852497803043e-05, 'epoch': 6.38} 64%|██████▍ | 6384/10000 [10:04:13<5:29:15, 5.46s/it][2025-06-19 23:33:58,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:33:58,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.11 | bwd_microstep: 3316.14 | 
bwd_inner_microstep: 3315.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-19 23:33:58,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.11 | bwd: 3316.15 | bwd_inner: 3315.34 | bwd_allreduce: 0.77 | step: 6.94 64%|██████▍ | 6385/10000 [10:04:18<5:29:05, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.013285251334309578, 'learning_rate': 1.2211885872458323e-05, 'epoch': 6.38} 64%|██████▍ | 6385/10000 [10:04:18<5:29:05, 5.46s/it][2025-06-19 23:34:03,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.73 | optimizer_step: 2.72 [2025-06-19 23:34:03,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.66 | bwd_microstep: 3320.14 | bwd_inner_microstep: 3318.80 | bwd_allreduce_microstep: 1.27 | step_microstep: 8.52 [2025-06-19 23:34:03,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.66 | bwd: 3320.16 | bwd_inner: 3318.80 | bwd_allreduce: 1.31 | step: 8.53 64%|██████▍ | 6386/10000 [10:04:24<5:29:09, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0006939462618902326, 'learning_rate': 1.2205920064050627e-05, 'epoch': 6.39} 64%|██████▍ | 6386/10000 [10:04:24<5:29:09, 5.46s/it][2025-06-19 23:34:09,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:34:09,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.08 | bwd_microstep: 3357.39 | bwd_inner_microstep: 3356.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 23:34:09,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.08 | bwd: 3357.41 | bwd_inner: 3356.60 | bwd_allreduce: 0.76 | step: 6.69 64%|██████▍ | 6387/10000 [10:04:29<5:30:13, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001749504473991692, 'learning_rate': 1.2199955073205744e-05, 'epoch': 6.39} 64%|██████▍ | 6387/10000 [10:04:29<5:30:13, 5.48s/it][2025-06-19 23:34:14,600] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:34:14,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.23 | bwd_microstep: 3319.24 | bwd_inner_microstep: 3318.27 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.40 [2025-06-19 23:34:14,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.23 | bwd: 3319.26 | bwd_inner: 3318.27 | bwd_allreduce: 0.94 | step: 7.40 64%|██████▍ | 6388/10000 [10:04:35<5:29:39, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.014727214351296425, 'learning_rate': 1.2193990900549383e-05, 'epoch': 6.39} 64%|██████▍ | 6388/10000 [10:04:35<5:29:39, 5.48s/it][2025-06-19 23:34:20,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:34:20,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.23 | bwd_microstep: 3321.77 | bwd_inner_microstep: 3320.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 23:34:20,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.23 | bwd: 3321.78 | bwd_inner: 3320.97 | bwd_allreduce: 0.77 | step: 6.72 64%|██████▍ | 6389/10000 [10:04:40<5:29:28, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.04497986659407616, 'learning_rate': 1.2188027546707138e-05, 'epoch': 6.39} 64%|██████▍ | 6389/10000 [10:04:40<5:29:28, 5.47s/it][2025-06-19 23:34:25,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.75 [2025-06-19 23:34:25,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.51 | bwd_microstep: 3364.10 | bwd_inner_microstep: 3363.19 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.42 [2025-06-19 23:34:25,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.51 | bwd: 3364.11 | bwd_inner: 3363.19 | bwd_allreduce: 0.88 | step: 
7.42 64%|██████▍ | 6390/10000 [10:04:46<5:30:21, 5.49s/it] {'loss': 0.001, 'grad_norm': 0.16702555119991302, 'learning_rate': 1.2182065012304549e-05, 'epoch': 6.39} 64%|██████▍ | 6390/10000 [10:04:46<5:30:21, 5.49s/it][2025-06-19 23:34:31,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:34:31,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.12 | bwd_microstep: 3309.21 | bwd_inner_microstep: 3308.37 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.97 [2025-06-19 23:34:31,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.12 | bwd: 3309.23 | bwd_inner: 3308.37 | bwd_allreduce: 0.81 | step: 6.97 64%|██████▍ | 6391/10000 [10:04:51<5:29:32, 5.48s/it] {'loss': 0.0244, 'grad_norm': 6.942126750946045, 'learning_rate': 1.2176103297967053e-05, 'epoch': 6.39} 64%|██████▍ | 6391/10000 [10:04:51<5:29:32, 5.48s/it][2025-06-19 23:34:36,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:34:36,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.93 | bwd_microstep: 3310.43 | bwd_inner_microstep: 3309.46 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.54 [2025-06-19 23:34:36,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.93 | bwd: 3310.44 | bwd_inner: 3309.46 | bwd_allreduce: 0.93 | step: 7.55 64%|██████▍ | 6392/10000 [10:04:57<5:29:08, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0013542448868975043, 'learning_rate': 1.2170142404320009e-05, 'epoch': 6.39} 64%|██████▍ | 6392/10000 [10:04:57<5:29:08, 5.47s/it][2025-06-19 23:34:42,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:34:42,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.44 | bwd_microstep: 3352.59 | 
bwd_inner_microstep: 3351.50 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.06 [2025-06-19 23:34:42,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.44 | bwd: 3352.61 | bwd_inner: 3351.50 | bwd_allreduce: 1.05 | step: 7.06 64%|██████▍ | 6393/10000 [10:05:02<5:30:01, 5.49s/it] {'loss': 0.0426, 'grad_norm': 3.604325294494629, 'learning_rate': 1.216418233198869e-05, 'epoch': 6.39} 64%|██████▍ | 6393/10000 [10:05:02<5:30:01, 5.49s/it][2025-06-19 23:34:47,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:34:47,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.86 | bwd_microstep: 3315.19 | bwd_inner_microstep: 3314.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 23:34:47,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.86 | bwd: 3315.21 | bwd_inner: 3314.38 | bwd_allreduce: 0.78 | step: 6.86 64%|██████▍ | 6394/10000 [10:05:08<5:29:21, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.015167058445513248, 'learning_rate': 1.215822308159828e-05, 'epoch': 6.39} 64%|██████▍ | 6394/10000 [10:05:08<5:29:21, 5.48s/it][2025-06-19 23:34:52,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.74 [2025-06-19 23:34:52,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.68 | bwd_microstep: 3312.74 | bwd_inner_microstep: 3311.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.88 [2025-06-19 23:34:52,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.68 | bwd: 3312.75 | bwd_inner: 3311.95 | bwd_allreduce: 0.76 | step: 6.88 64%|██████▍ | 6395/10000 [10:05:13<5:28:44, 5.47s/it] {'loss': 0.005, 'grad_norm': 1.3180968761444092, 'learning_rate': 1.2152264653773868e-05, 'epoch': 6.39} 64%|██████▍ | 6395/10000 [10:05:13<5:28:44, 5.47s/it][2025-06-19 23:34:58,425] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:34:58,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.57 | bwd_microstep: 3317.13 | bwd_inner_microstep: 3316.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 23:34:58,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.57 | bwd: 3317.14 | bwd_inner: 3316.33 | bwd_allreduce: 0.76 | step: 6.68 64%|██████▍ | 6396/10000 [10:05:19<5:28:43, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0006938898586668074, 'learning_rate': 1.2146307049140472e-05, 'epoch': 6.4} 64%|██████▍ | 6396/10000 [10:05:19<5:28:43, 5.47s/it][2025-06-19 23:35:03,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:35:03,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.99 | bwd_microstep: 3320.56 | bwd_inner_microstep: 3319.76 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-19 23:35:03,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.99 | bwd: 3320.57 | bwd_inner: 3319.76 | bwd_allreduce: 0.77 | step: 7.13 64%|██████▍ | 6397/10000 [10:05:24<5:28:32, 5.47s/it] {'loss': 0.0425, 'grad_norm': 11.161174774169922, 'learning_rate': 1.2140350268323017e-05, 'epoch': 6.4} 64%|██████▍ | 6397/10000 [10:05:24<5:28:32, 5.47s/it][2025-06-19 23:35:09,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:35:09,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.29 | bwd_microstep: 3361.01 | bwd_inner_microstep: 3360.12 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.44 [2025-06-19 23:35:09,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.29 | bwd: 3361.02 | bwd_inner: 3360.12 | bwd_allreduce: 0.86 | step: 7.44 
64%|██████▍ | 6398/10000 [10:05:30<5:29:37, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.012203153222799301, 'learning_rate': 1.2134394311946345e-05, 'epoch': 6.4} 64%|██████▍ | 6398/10000 [10:05:30<5:29:37, 5.49s/it][2025-06-19 23:35:14,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:35:14,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.89 | bwd_microstep: 3323.17 | bwd_inner_microstep: 3322.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 23:35:14,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.89 | bwd: 3323.19 | bwd_inner: 3322.38 | bwd_allreduce: 0.77 | step: 6.75 64%|██████▍ | 6399/10000 [10:05:35<5:29:02, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0010333484970033169, 'learning_rate': 1.2128439180635208e-05, 'epoch': 6.4} 64%|██████▍ | 6399/10000 [10:05:35<5:29:02, 5.48s/it][2025-06-19 23:35:20,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 23:35:20,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.67 | bwd_microstep: 3366.35 | bwd_inner_microstep: 3365.29 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.84 [2025-06-19 23:35:20,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.67 | bwd: 3366.37 | bwd_inner: 3365.29 | bwd_allreduce: 1.03 | step: 7.85 64%|██████▍ | 6400/10000 [10:05:41<5:29:56, 5.50s/it] {'loss': 0.0079, 'grad_norm': 3.855041980743408, 'learning_rate': 1.2122484875014261e-05, 'epoch': 6.4} 64%|██████▍ | 6400/10000 [10:05:41<5:29:56, 5.50s/it][2025-06-19 23:35:25,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:35:25,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.73 | bwd_microstep: 3324.98 | 
bwd_inner_microstep: 3324.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-19 23:35:25,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.73 | bwd: 3325.00 | bwd_inner: 3324.19 | bwd_allreduce: 0.77 | step: 7.08 64%|██████▍ | 6401/10000 [10:05:46<5:29:36, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.014032498002052307, 'learning_rate': 1.2116531395708089e-05, 'epoch': 6.4} 64%|██████▍ | 6401/10000 [10:05:46<5:29:36, 5.50s/it][2025-06-19 23:35:31,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:35:31,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.07 | bwd_microstep: 3322.07 | bwd_inner_microstep: 3321.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.65 [2025-06-19 23:35:31,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.07 | bwd: 3322.09 | bwd_inner: 3321.27 | bwd_allreduce: 0.78 | step: 7.66 64%|██████▍ | 6402/10000 [10:05:52<5:28:59, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.011903352104127407, 'learning_rate': 1.2110578743341185e-05, 'epoch': 6.4} 64%|██████▍ | 6402/10000 [10:05:52<5:28:59, 5.49s/it][2025-06-19 23:35:36,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:35:36,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.45 | bwd_microstep: 3380.17 | bwd_inner_microstep: 3379.06 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.84 [2025-06-19 23:35:36,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.45 | bwd: 3380.19 | bwd_inner: 3379.06 | bwd_allreduce: 1.07 | step: 7.85 64%|██████▍ | 6403/10000 [10:05:57<5:30:15, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.001991051947697997, 'learning_rate': 1.2104626918537958e-05, 'epoch': 6.4} 64%|██████▍ | 6403/10000 [10:05:57<5:30:15, 5.51s/it][2025-06-19 23:35:42,418] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:35:42,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.08 | bwd_microstep: 3325.19 | bwd_inner_microstep: 3324.28 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.39 [2025-06-19 23:35:42,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.08 | bwd: 3325.20 | bwd_inner: 3324.28 | bwd_allreduce: 0.87 | step: 7.39 64%|██████▍ | 6404/10000 [10:06:03<5:29:34, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.03130911663174629, 'learning_rate': 1.2098675921922725e-05, 'epoch': 6.4} 64%|██████▍ | 6404/10000 [10:06:03<5:29:34, 5.50s/it][2025-06-19 23:35:47,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:35:47,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.16 | bwd_microstep: 3325.05 | bwd_inner_microstep: 3324.25 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-19 23:35:47,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.16 | bwd: 3325.06 | bwd_inner: 3324.25 | bwd_allreduce: 0.76 | step: 6.86 64%|██████▍ | 6405/10000 [10:06:08<5:29:11, 5.49s/it] {'loss': 0.0017, 'grad_norm': 0.2486574947834015, 'learning_rate': 1.2092725754119715e-05, 'epoch': 6.41} 64%|██████▍ | 6405/10000 [10:06:08<5:29:11, 5.49s/it][2025-06-19 23:35:53,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:35:53,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.00 | bwd_microstep: 3322.73 | bwd_inner_microstep: 3321.91 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.19 [2025-06-19 23:35:53,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.00 | bwd: 3322.74 | bwd_inner: 3321.91 | bwd_allreduce: 0.79 | step: 7.19 
64%|██████▍ | 6406/10000 [10:06:14<5:28:43, 5.49s/it] {'loss': 0.0079, 'grad_norm': 1.041339635848999, 'learning_rate': 1.208677641575307e-05, 'epoch': 6.41} 64%|██████▍ | 6406/10000 [10:06:14<5:28:43, 5.49s/it][2025-06-19 23:35:58,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:35:58,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.14 | bwd_microstep: 3321.28 | bwd_inner_microstep: 3320.28 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.18 [2025-06-19 23:35:58,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.14 | bwd: 3321.30 | bwd_inner: 3320.28 | bwd_allreduce: 0.97 | step: 7.18 64%|██████▍ | 6407/10000 [10:06:19<5:28:19, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.009057898074388504, 'learning_rate': 1.2080827907446854e-05, 'epoch': 6.41} 64%|██████▍ | 6407/10000 [10:06:19<5:28:19, 5.48s/it][2025-06-19 23:36:04,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:36:04,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.40 | bwd_microstep: 3320.34 | bwd_inner_microstep: 3319.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 23:36:04,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.40 | bwd: 3320.36 | bwd_inner: 3319.53 | bwd_allreduce: 0.78 | step: 7.16 64%|██████▍ | 6408/10000 [10:06:25<5:27:57, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.05985887721180916, 'learning_rate': 1.2074880229825042e-05, 'epoch': 6.41} 64%|██████▍ | 6408/10000 [10:06:25<5:27:57, 5.48s/it][2025-06-19 23:36:09,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:36:09,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.13 | bwd_microstep: 3325.22 | 
bwd_inner_microstep: 3324.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 23:36:09,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.13 | bwd: 3325.24 | bwd_inner: 3324.44 | bwd_allreduce: 0.76 | step: 6.82 64%|██████▍ | 6409/10000 [10:06:30<5:27:38, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.03611208498477936, 'learning_rate': 1.2068933383511513e-05, 'epoch': 6.41} 64%|██████▍ | 6409/10000 [10:06:30<5:27:38, 5.47s/it][2025-06-19 23:36:15,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:36:15,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.95 | bwd_microstep: 3322.05 | bwd_inner_microstep: 3321.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-19 23:36:15,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.95 | bwd: 3322.06 | bwd_inner: 3321.26 | bwd_allreduce: 0.76 | step: 6.86 64%|██████▍ | 6410/10000 [10:06:36<5:27:23, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.024650052189826965, 'learning_rate': 1.2062987369130059e-05, 'epoch': 6.41} 64%|██████▍ | 6410/10000 [10:06:36<5:27:23, 5.47s/it][2025-06-19 23:36:20,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:36:20,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.12 | bwd_microstep: 3372.55 | bwd_inner_microstep: 3371.73 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.27 [2025-06-19 23:36:20,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.12 | bwd: 3372.57 | bwd_inner: 3371.73 | bwd_allreduce: 0.79 | step: 7.28 64%|██████▍ | 6411/10000 [10:06:41<5:28:34, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.1786855012178421, 'learning_rate': 1.205704218730439e-05, 'epoch': 6.41} 64%|██████▍ | 6411/10000 [10:06:41<5:28:34, 5.49s/it][2025-06-19 23:36:26,315] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:36:26,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.20 | bwd_microstep: 3364.35 | bwd_inner_microstep: 3363.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-19 23:36:26,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.20 | bwd: 3364.37 | bwd_inner: 3363.55 | bwd_allreduce: 0.77 | step: 6.79 64%|██████▍ | 6412/10000 [10:06:47<5:29:06, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.009137365967035294, 'learning_rate': 1.2051097838658135e-05, 'epoch': 6.41} 64%|██████▍ | 6412/10000 [10:06:47<5:29:06, 5.50s/it][2025-06-19 23:36:31,773] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:36:31,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.72 | bwd_microstep: 3315.67 | bwd_inner_microstep: 3314.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-19 23:36:31,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.72 | bwd: 3315.69 | bwd_inner: 3314.87 | bwd_allreduce: 0.77 | step: 6.83 64%|██████▍ | 6413/10000 [10:06:52<5:28:12, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.013699277304112911, 'learning_rate': 1.2045154323814821e-05, 'epoch': 6.41} 64%|██████▍ | 6413/10000 [10:06:52<5:28:12, 5.49s/it][2025-06-19 23:36:37,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:36:37,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.38 | bwd_microstep: 3322.51 | bwd_inner_microstep: 3321.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 23:36:37,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.38 | bwd: 3322.52 | bwd_inner: 3321.70 | bwd_allreduce: 0.78 | step: 
7.11 64%|██████▍ | 6414/10000 [10:06:58<5:27:38, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.010483413003385067, 'learning_rate': 1.2039211643397905e-05, 'epoch': 6.41} 64%|██████▍ | 6414/10000 [10:06:58<5:27:38, 5.48s/it][2025-06-19 23:36:42,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:36:42,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.36 | bwd_microstep: 3367.53 | bwd_inner_microstep: 3366.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:36:42,772] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.36 | bwd: 3367.54 | bwd_inner: 3366.75 | bwd_allreduce: 0.75 | step: 6.67 64%|██████▍ | 6415/10000 [10:07:03<5:28:28, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.0371391624212265, 'learning_rate': 1.2033269798030738e-05, 'epoch': 6.42} 64%|██████▍ | 6415/10000 [10:07:03<5:28:28, 5.50s/it][2025-06-19 23:36:48,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 23:36:48,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.64 | bwd_microstep: 3330.25 | bwd_inner_microstep: 3329.25 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.48 [2025-06-19 23:36:48,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.64 | bwd: 3330.27 | bwd_inner: 3329.25 | bwd_allreduce: 0.96 | step: 7.49 64%|██████▍ | 6416/10000 [10:07:09<5:27:58, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006297469139099121, 'learning_rate': 1.2027328788336594e-05, 'epoch': 6.42} 64%|██████▍ | 6416/10000 [10:07:09<5:27:58, 5.49s/it][2025-06-19 23:36:53,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:36:53,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.52 | bwd_microstep: 3328.55 
| bwd_inner_microstep: 3327.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.46 [2025-06-19 23:36:53,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.52 | bwd: 3328.57 | bwd_inner: 3327.75 | bwd_allreduce: 0.78 | step: 7.46 64%|██████▍ | 6417/10000 [10:07:14<5:27:59, 5.49s/it] {'loss': 0.0032, 'grad_norm': 0.46361541748046875, 'learning_rate': 1.2021388614938654e-05, 'epoch': 6.42} 64%|██████▍ | 6417/10000 [10:07:14<5:27:59, 5.49s/it][2025-06-19 23:36:59,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:36:59,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.97 | bwd_microstep: 3319.36 | bwd_inner_microstep: 3318.39 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.19 [2025-06-19 23:36:59,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.97 | bwd: 3319.37 | bwd_inner: 3318.39 | bwd_allreduce: 0.93 | step: 7.19 64%|██████▍ | 6418/10000 [10:07:20<5:27:25, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00015521346358582377, 'learning_rate': 1.201544927846002e-05, 'epoch': 6.42} 64%|██████▍ | 6418/10000 [10:07:20<5:27:25, 5.48s/it][2025-06-19 23:37:04,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 23:37:04,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.11 | bwd_microstep: 3319.65 | bwd_inner_microstep: 3318.74 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.83 [2025-06-19 23:37:04,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.11 | bwd: 3319.67 | bwd_inner: 3318.74 | bwd_allreduce: 0.87 | step: 7.83 64%|██████▍ | 6419/10000 [10:07:25<5:27:18, 5.48s/it] {'loss': 0.0032, 'grad_norm': 1.4194663763046265, 'learning_rate': 1.2009510779523708e-05, 'epoch': 6.42} 64%|██████▍ | 6419/10000 [10:07:25<5:27:18, 5.48s/it][2025-06-19 23:37:10,158] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:37:10,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.08 | bwd_microstep: 3320.58 | bwd_inner_microstep: 3319.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 23:37:10,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.08 | bwd: 3320.60 | bwd_inner: 3319.78 | bwd_allreduce: 0.77 | step: 6.92 64%|██████▍ | 6420/10000 [10:07:30<5:26:55, 5.48s/it] {'loss': 0.001, 'grad_norm': 0.17230872809886932, 'learning_rate': 1.200357311875262e-05, 'epoch': 6.42} 64%|██████▍ | 6420/10000 [10:07:30<5:26:55, 5.48s/it][2025-06-19 23:37:15,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:37:15,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.50 | bwd_microstep: 3319.76 | bwd_inner_microstep: 3318.91 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.91 [2025-06-19 23:37:15,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.50 | bwd: 3319.77 | bwd_inner: 3318.91 | bwd_allreduce: 0.82 | step: 6.91 64%|██████▍ | 6421/10000 [10:07:36<5:26:38, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.003410206874832511, 'learning_rate': 1.1997636296769603e-05, 'epoch': 6.42} 64%|██████▍ | 6421/10000 [10:07:36<5:26:38, 5.48s/it][2025-06-19 23:37:21,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:37:21,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.32 | bwd_microstep: 3368.91 | bwd_inner_microstep: 3368.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 23:37:21,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.32 | bwd: 3368.93 | bwd_inner: 3368.11 | bwd_allreduce: 0.77 | step: 6.81 
64%|██████▍ | 6422/10000 [10:07:41<5:27:40, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.05127483233809471, 'learning_rate': 1.1991700314197396e-05, 'epoch': 6.42} 64%|██████▍ | 6422/10000 [10:07:41<5:27:40, 5.49s/it][2025-06-19 23:37:26,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:37:26,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.67 | bwd_microstep: 3330.92 | bwd_inner_microstep: 3330.10 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.38 [2025-06-19 23:37:26,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.67 | bwd: 3330.94 | bwd_inner: 3330.10 | bwd_allreduce: 0.79 | step: 7.39 64%|██████▍ | 6423/10000 [10:07:47<5:27:17, 5.49s/it] {'loss': 0.0017, 'grad_norm': 0.22151970863342285, 'learning_rate': 1.198576517165866e-05, 'epoch': 6.42} 64%|██████▍ | 6423/10000 [10:07:47<5:27:17, 5.49s/it][2025-06-19 23:37:32,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:37:32,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.56 | bwd_microstep: 3324.77 | bwd_inner_microstep: 3323.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 23:37:32,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.56 | bwd: 3324.79 | bwd_inner: 3323.98 | bwd_allreduce: 0.76 | step: 6.80 64%|██████▍ | 6424/10000 [10:07:52<5:26:55, 5.49s/it] {'loss': 0.0425, 'grad_norm': 8.649184226989746, 'learning_rate': 1.197983086977597e-05, 'epoch': 6.42} 64%|██████▍ | 6424/10000 [10:07:52<5:26:55, 5.49s/it][2025-06-19 23:37:37,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:37:37,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.10 | bwd_microstep: 3325.43 | 
bwd_inner_microstep: 3324.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-19 23:37:37,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.10 | bwd: 3325.45 | bwd_inner: 3324.63 | bwd_allreduce: 0.77 | step: 6.77 64%|██████▍ | 6425/10000 [10:07:58<5:26:33, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.41684821248054504, 'learning_rate': 1.197389740917179e-05, 'epoch': 6.42} 64%|██████▍ | 6425/10000 [10:07:58<5:26:33, 5.48s/it][2025-06-19 23:37:43,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.83 [2025-06-19 23:37:43,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.74 | bwd_microstep: 3320.94 | bwd_inner_microstep: 3320.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 23:37:43,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.75 | bwd: 3320.95 | bwd_inner: 3320.14 | bwd_allreduce: 0.77 | step: 6.80 64%|██████▍ | 6426/10000 [10:08:03<5:26:10, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0016243079444393516, 'learning_rate': 1.1967964790468522e-05, 'epoch': 6.43} 64%|██████▍ | 6426/10000 [10:08:03<5:26:10, 5.48s/it][2025-06-19 23:37:48,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:37:48,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.01 | bwd_microstep: 3319.11 | bwd_inner_microstep: 3318.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 23:37:48,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.01 | bwd: 3319.12 | bwd_inner: 3318.31 | bwd_allreduce: 0.76 | step: 6.73 64%|██████▍ | 6427/10000 [10:08:09<5:25:53, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0009664274984970689, 'learning_rate': 1.196203301428847e-05, 'epoch': 6.43} 64%|██████▍ | 6427/10000 [10:08:09<5:25:53, 5.47s/it][2025-06-19 23:37:53,986] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:37:53,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.08 | bwd_microstep: 3321.49 | bwd_inner_microstep: 3320.68 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 23:37:53,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.08 | bwd: 3321.51 | bwd_inner: 3320.68 | bwd_allreduce: 0.78 | step: 7.21 64%|██████▍ | 6428/10000 [10:08:14<5:25:44, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.000416987226344645, 'learning_rate': 1.1956102081253849e-05, 'epoch': 6.43} 64%|██████▍ | 6428/10000 [10:08:14<5:25:44, 5.47s/it][2025-06-19 23:37:59,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:37:59,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.05 | bwd_microstep: 3376.93 | bwd_inner_microstep: 3376.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 23:37:59,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.05 | bwd: 3376.95 | bwd_inner: 3376.14 | bwd_allreduce: 0.76 | step: 6.79 64%|██████▍ | 6429/10000 [10:08:20<5:27:00, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00131887081079185, 'learning_rate': 1.1950171991986792e-05, 'epoch': 6.43} 64%|██████▍ | 6429/10000 [10:08:20<5:27:00, 5.49s/it][2025-06-19 23:38:05,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-19 23:38:05,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.28 | bwd_microstep: 3322.16 | bwd_inner_microstep: 3321.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-19 23:38:05,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.28 | bwd: 3322.17 | bwd_inner: 3321.38 | bwd_allreduce: 0.75 | step: 6.54 
64%|██████▍ | 6430/10000 [10:08:25<5:26:30, 5.49s/it] {'loss': 0.0131, 'grad_norm': 1.767699122428894, 'learning_rate': 1.194424274710933e-05, 'epoch': 6.43} 64%|██████▍ | 6430/10000 [10:08:25<5:26:30, 5.49s/it][2025-06-19 23:38:10,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:38:10,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.49 | bwd_microstep: 3326.91 | bwd_inner_microstep: 3326.08 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.47 [2025-06-19 23:38:10,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.49 | bwd: 3326.93 | bwd_inner: 3326.08 | bwd_allreduce: 0.80 | step: 7.48 64%|██████▍ | 6431/10000 [10:08:31<5:26:06, 5.48s/it] {'loss': 0.002, 'grad_norm': 1.2667886018753052, 'learning_rate': 1.1938314347243412e-05, 'epoch': 6.43} 64%|██████▍ | 6431/10000 [10:08:31<5:26:06, 5.48s/it][2025-06-19 23:38:15,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:38:15,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.16 | bwd_microstep: 3322.69 | bwd_inner_microstep: 3321.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-19 23:38:15,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.16 | bwd: 3322.71 | bwd_inner: 3321.89 | bwd_allreduce: 0.77 | step: 6.90 64%|██████▍ | 6432/10000 [10:08:36<5:25:57, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.003660912159830332, 'learning_rate': 1.1932386793010903e-05, 'epoch': 6.43} 64%|██████▍ | 6432/10000 [10:08:36<5:25:57, 5.48s/it][2025-06-19 23:38:21,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:38:21,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.39 | bwd_microstep: 3313.46 | 
bwd_inner_microstep: 3312.64 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.76 [2025-06-19 23:38:21,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.39 | bwd: 3313.48 | bwd_inner: 3312.64 | bwd_allreduce: 0.79 | step: 6.76 64%|██████▍ | 6433/10000 [10:08:42<5:25:37, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.040692925453186035, 'learning_rate': 1.192646008503358e-05, 'epoch': 6.43} 64%|██████▍ | 6433/10000 [10:08:42<5:25:37, 5.48s/it][2025-06-19 23:38:26,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:38:26,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.82 | bwd_microstep: 3366.01 | bwd_inner_microstep: 3365.19 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.99 [2025-06-19 23:38:26,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.82 | bwd: 3366.02 | bwd_inner: 3365.19 | bwd_allreduce: 0.79 | step: 7.00 64%|██████▍ | 6434/10000 [10:08:47<5:26:38, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01964428648352623, 'learning_rate': 1.192053422393313e-05, 'epoch': 6.43} 64%|██████▍ | 6434/10000 [10:08:47<5:26:38, 5.50s/it][2025-06-19 23:38:32,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:38:32,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.06 | bwd_microstep: 3320.37 | bwd_inner_microstep: 3319.45 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.34 [2025-06-19 23:38:32,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.06 | bwd: 3320.38 | bwd_inner: 3319.45 | bwd_allreduce: 0.89 | step: 7.34 64%|██████▍ | 6435/10000 [10:08:53<5:26:00, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.05060700699687004, 'learning_rate': 1.1914609210331134e-05, 'epoch': 6.43} 64%|██████▍ | 6435/10000 [10:08:53<5:26:00, 5.49s/it][2025-06-19 23:38:37,968] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:38:37,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.79 | bwd_microstep: 3368.12 | bwd_inner_microstep: 3367.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-19 23:38:37,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.79 | bwd: 3368.13 | bwd_inner: 3367.33 | bwd_allreduce: 0.76 | step: 6.62 64%|██████▍ | 6436/10000 [10:08:58<5:26:49, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.008030487224459648, 'learning_rate': 1.1908685044849106e-05, 'epoch': 6.44} 64%|██████▍ | 6436/10000 [10:08:58<5:26:49, 5.50s/it][2025-06-19 23:38:43,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:38:43,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.73 | bwd_microstep: 3377.10 | bwd_inner_microstep: 3376.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 23:38:43,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.73 | bwd: 3377.12 | bwd_inner: 3376.32 | bwd_allreduce: 0.75 | step: 6.77 64%|██████▍ | 6437/10000 [10:09:04<5:27:36, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0016402992187067866, 'learning_rate': 1.1902761728108465e-05, 'epoch': 6.44} 64%|██████▍ | 6437/10000 [10:09:04<5:27:36, 5.52s/it][2025-06-19 23:38:49,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:38:49,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.12 | bwd_microstep: 3362.96 | bwd_inner_microstep: 3362.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-19 23:38:49,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.12 | bwd: 3362.98 | bwd_inner: 3362.17 | bwd_allreduce: 0.77 | step: 7.06 
64%|██████▍ | 6438/10000 [10:09:09<5:27:41, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0020055966451764107, 'learning_rate': 1.1896839260730539e-05, 'epoch': 6.44} 64%|██████▍ | 6438/10000 [10:09:09<5:27:41, 5.52s/it][2025-06-19 23:38:54,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:38:54,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.20 | bwd_microstep: 3373.34 | bwd_inner_microstep: 3372.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 23:38:54,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.20 | bwd: 3373.35 | bwd_inner: 3372.55 | bwd_allreduce: 0.76 | step: 6.59 64%|██████▍ | 6439/10000 [10:09:15<5:28:00, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.008872386999428272, 'learning_rate': 1.1890917643336568e-05, 'epoch': 6.44} 64%|██████▍ | 6439/10000 [10:09:15<5:28:00, 5.53s/it][2025-06-19 23:39:00,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:39:00,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.43 | bwd_microstep: 3331.96 | bwd_inner_microstep: 3330.94 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.26 [2025-06-19 23:39:00,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.43 | bwd: 3331.98 | bwd_inner: 3330.94 | bwd_allreduce: 0.99 | step: 7.27 64%|██████▍ | 6440/10000 [10:09:20<5:27:01, 5.51s/it] {'loss': 0.0026, 'grad_norm': 0.40036875009536743, 'learning_rate': 1.1884996876547698e-05, 'epoch': 6.44} 64%|██████▍ | 6440/10000 [10:09:20<5:27:01, 5.51s/it][2025-06-19 23:39:05,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:39:05,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.97 | bwd_microstep: 3368.48 | 
bwd_inner_microstep: 3367.66 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.81 [2025-06-19 23:39:05,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.97 | bwd: 3368.49 | bwd_inner: 3367.66 | bwd_allreduce: 0.79 | step: 6.81 64%|██████▍ | 6441/10000 [10:09:26<5:27:34, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.007964878343045712, 'learning_rate': 1.1879076960984995e-05, 'epoch': 6.44} 64%|██████▍ | 6441/10000 [10:09:26<5:27:34, 5.52s/it][2025-06-19 23:39:11,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 23:39:11,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.67 | bwd_microstep: 3319.27 | bwd_inner_microstep: 3318.35 | bwd_allreduce_microstep: 0.86 | step_microstep: 8.04 [2025-06-19 23:39:11,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.67 | bwd: 3319.29 | bwd_inner: 3318.35 | bwd_allreduce: 0.89 | step: 8.04 64%|██████▍ | 6442/10000 [10:09:31<5:26:28, 5.51s/it] {'loss': 0.0079, 'grad_norm': 1.6872512102127075, 'learning_rate': 1.1873157897269425e-05, 'epoch': 6.44} 64%|██████▍ | 6442/10000 [10:09:31<5:26:28, 5.51s/it][2025-06-19 23:39:16,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:39:16,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.87 | bwd_microstep: 3309.64 | bwd_inner_microstep: 3308.85 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 23:39:16,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.87 | bwd: 3309.65 | bwd_inner: 3308.85 | bwd_allreduce: 0.76 | step: 6.62 64%|██████▍ | 6443/10000 [10:09:37<5:25:38, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005864274222403765, 'learning_rate': 1.1867239686021874e-05, 'epoch': 6.44} 64%|██████▍ | 6443/10000 [10:09:37<5:25:38, 5.49s/it][2025-06-19 23:39:22,013] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:39:22,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.57 | bwd_microstep: 3323.95 | bwd_inner_microstep: 3323.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-19 23:39:22,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.57 | bwd: 3323.96 | bwd_inner: 3323.16 | bwd_allreduce: 0.76 | step: 6.68 64%|██████▍ | 6444/10000 [10:09:42<5:25:09, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.0525861456990242, 'learning_rate': 1.1861322327863145e-05, 'epoch': 6.44} 64%|██████▍ | 6444/10000 [10:09:42<5:25:09, 5.49s/it][2025-06-19 23:39:27,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.91 [2025-06-19 23:39:27,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.39 | bwd_microstep: 3321.86 | bwd_inner_microstep: 3320.93 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.25 [2025-06-19 23:39:27,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.40 | bwd: 3321.87 | bwd_inner: 3320.93 | bwd_allreduce: 0.90 | step: 7.25 64%|██████▍ | 6445/10000 [10:09:48<5:24:39, 5.48s/it] {'loss': 0.004, 'grad_norm': 0.5261854529380798, 'learning_rate': 1.1855405823413922e-05, 'epoch': 6.45} 64%|██████▍ | 6445/10000 [10:09:48<5:24:39, 5.48s/it][2025-06-19 23:39:32,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:39:32,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.96 | bwd_microstep: 3319.89 | bwd_inner_microstep: 3319.06 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.77 [2025-06-19 23:39:32,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.96 | bwd: 3319.91 | bwd_inner: 3319.06 | bwd_allreduce: 0.80 | step: 6.77 
64%|██████▍ | 6446/10000 [10:09:53<5:24:22, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.09461753070354462, 'learning_rate': 1.1849490173294826e-05, 'epoch': 6.45} 64%|██████▍ | 6446/10000 [10:09:53<5:24:22, 5.48s/it][2025-06-19 23:39:38,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:39:38,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.66 | bwd_microstep: 3323.86 | bwd_inner_microstep: 3323.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-19 23:39:38,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.66 | bwd: 3323.88 | bwd_inner: 3323.05 | bwd_allreduce: 0.78 | step: 7.11 64%|██████▍ | 6447/10000 [10:09:59<5:24:12, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.024867504835128784, 'learning_rate': 1.1843575378126385e-05, 'epoch': 6.45} 64%|██████▍ | 6447/10000 [10:09:59<5:24:12, 5.47s/it][2025-06-19 23:39:43,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:39:43,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.33 | bwd_microstep: 3314.62 | bwd_inner_microstep: 3313.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 23:39:43,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.33 | bwd: 3314.63 | bwd_inner: 3313.83 | bwd_allreduce: 0.76 | step: 6.79 64%|██████▍ | 6448/10000 [10:10:04<5:23:53, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0011739383917301893, 'learning_rate': 1.1837661438529028e-05, 'epoch': 6.45} 64%|██████▍ | 6448/10000 [10:10:04<5:23:53, 5.47s/it][2025-06-19 23:39:49,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:39:49,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.30 | bwd_microstep: 3317.22 | 
bwd_inner_microstep: 3316.03 | bwd_allreduce_microstep: 1.11 | step_microstep: 7.89 [2025-06-19 23:39:49,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.30 | bwd: 3317.24 | bwd_inner: 3316.03 | bwd_allreduce: 1.14 | step: 7.90 64%|██████▍ | 6449/10000 [10:10:10<5:23:38, 5.47s/it] {'loss': 0.0036, 'grad_norm': 1.4772155284881592, 'learning_rate': 1.1831748355123115e-05, 'epoch': 6.45} 64%|██████▍ | 6449/10000 [10:10:10<5:23:38, 5.47s/it][2025-06-19 23:39:54,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:39:54,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.59 | bwd_microstep: 3312.94 | bwd_inner_microstep: 3312.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-19 23:39:54,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.59 | bwd: 3312.96 | bwd_inner: 3312.14 | bwd_allreduce: 0.77 | step: 6.94 64%|██████▍ | 6450/10000 [10:10:15<5:23:35, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.026123670861124992, 'learning_rate': 1.1825836128528882e-05, 'epoch': 6.45} 64%|██████▍ | 6450/10000 [10:10:15<5:23:35, 5.47s/it][2025-06-19 23:40:00,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:40:00,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.03 | bwd_microstep: 3366.06 | bwd_inner_microstep: 3365.26 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-19 23:40:00,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.03 | bwd: 3366.08 | bwd_inner: 3365.26 | bwd_allreduce: 0.77 | step: 7.05 65%|██████▍ | 6451/10000 [10:10:21<5:24:39, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0024424842558801174, 'learning_rate': 1.1819924759366498e-05, 'epoch': 6.45} 65%|██████▍ | 6451/10000 [10:10:21<5:24:39, 5.49s/it][2025-06-19 23:40:05,808] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:40:05,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.20 | bwd_microstep: 3315.58 | bwd_inner_microstep: 3314.73 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.02 [2025-06-19 23:40:05,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.20 | bwd: 3315.59 | bwd_inner: 3314.74 | bwd_allreduce: 0.82 | step: 7.02 65%|██████▍ | 6452/10000 [10:10:26<5:24:04, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.06944204121828079, 'learning_rate': 1.1814014248256047e-05, 'epoch': 6.45} 65%|██████▍ | 6452/10000 [10:10:26<5:24:04, 5.48s/it][2025-06-19 23:40:11,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:40:11,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.79 | bwd_microstep: 3313.32 | bwd_inner_microstep: 3312.32 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.89 [2025-06-19 23:40:11,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.79 | bwd: 3313.34 | bwd_inner: 3312.32 | bwd_allreduce: 0.97 | step: 7.90 65%|██████▍ | 6453/10000 [10:10:32<5:23:36, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.036567505449056625, 'learning_rate': 1.1808104595817507e-05, 'epoch': 6.45} 65%|██████▍ | 6453/10000 [10:10:32<5:23:36, 5.47s/it][2025-06-19 23:40:16,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:40:16,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.86 | bwd_microstep: 3369.53 | bwd_inner_microstep: 3368.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:40:16,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.86 | bwd: 3369.55 | bwd_inner: 3368.74 | bwd_allreduce: 0.76 | step: 
6.71 65%|██████▍ | 6454/10000 [10:10:37<5:24:38, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.007392600644379854, 'learning_rate': 1.1802195802670776e-05, 'epoch': 6.45} 65%|██████▍ | 6454/10000 [10:10:37<5:24:38, 5.49s/it][2025-06-19 23:40:22,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:40:22,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.91 | bwd_microstep: 3361.81 | bwd_inner_microstep: 3360.87 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.06 [2025-06-19 23:40:22,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.90 | bwd: 3361.83 | bwd_inner: 3360.87 | bwd_allreduce: 0.92 | step: 7.06 65%|██████▍ | 6455/10000 [10:10:43<5:25:12, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.013660364784300327, 'learning_rate': 1.1796287869435661e-05, 'epoch': 6.46} 65%|██████▍ | 6455/10000 [10:10:43<5:25:12, 5.50s/it][2025-06-19 23:40:27,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:40:27,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.62 | bwd_microstep: 3307.93 | bwd_inner_microstep: 3307.12 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-19 23:40:27,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.62 | bwd: 3307.95 | bwd_inner: 3307.12 | bwd_allreduce: 0.78 | step: 7.14 65%|██████▍ | 6456/10000 [10:10:48<5:24:14, 5.49s/it] {'loss': 0.0009, 'grad_norm': 0.158786341547966, 'learning_rate': 1.179038079673187e-05, 'epoch': 6.46} 65%|██████▍ | 6456/10000 [10:10:48<5:24:14, 5.49s/it][2025-06-19 23:40:33,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:40:33,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.19 | bwd_microstep: 3319.16 | 
bwd_inner_microstep: 3318.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 23:40:33,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.19 | bwd: 3319.18 | bwd_inner: 3318.38 | bwd_allreduce: 0.76 | step: 6.77 65%|██████▍ | 6457/10000 [10:10:54<5:23:42, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.05898981913924217, 'learning_rate': 1.1784474585179031e-05, 'epoch': 6.46} 65%|██████▍ | 6457/10000 [10:10:54<5:23:42, 5.48s/it][2025-06-19 23:40:38,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:40:38,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.14 | bwd_microstep: 3321.00 | bwd_inner_microstep: 3320.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-19 23:40:38,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.14 | bwd: 3321.01 | bwd_inner: 3320.21 | bwd_allreduce: 0.76 | step: 6.77 65%|██████▍ | 6458/10000 [10:10:59<5:23:18, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00559985963627696, 'learning_rate': 1.1778569235396678e-05, 'epoch': 6.46} 65%|██████▍ | 6458/10000 [10:10:59<5:23:18, 5.48s/it][2025-06-19 23:40:44,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:40:44,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.12 | bwd_microstep: 3366.61 | bwd_inner_microstep: 3365.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 23:40:44,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.12 | bwd: 3366.62 | bwd_inner: 3365.82 | bwd_allreduce: 0.76 | step: 6.62 65%|██████▍ | 6459/10000 [10:11:05<5:24:08, 5.49s/it] {'loss': 0.0012, 'grad_norm': 0.4858606159687042, 'learning_rate': 1.1772664748004257e-05, 'epoch': 6.46} 65%|██████▍ | 6459/10000 [10:11:05<5:24:08, 5.49s/it][2025-06-19 23:40:49,698] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:40:49,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.73 | bwd_microstep: 3316.79 | bwd_inner_microstep: 3316.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 23:40:49,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.73 | bwd: 3316.81 | bwd_inner: 3316.00 | bwd_allreduce: 0.77 | step: 6.89 65%|██████▍ | 6460/10000 [10:11:10<5:23:19, 5.48s/it] {'loss': 0.0104, 'grad_norm': 1.2831151485443115, 'learning_rate': 1.1766761123621127e-05, 'epoch': 6.46} 65%|██████▍ | 6460/10000 [10:11:10<5:23:19, 5.48s/it][2025-06-19 23:40:55,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:40:55,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.32 | bwd_microstep: 3395.77 | bwd_inner_microstep: 3394.80 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.44 [2025-06-19 23:40:55,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.32 | bwd: 3395.78 | bwd_inner: 3394.80 | bwd_allreduce: 0.93 | step: 7.44 65%|██████▍ | 6461/10000 [10:11:16<5:24:53, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.0763653814792633, 'learning_rate': 1.1760858362866536e-05, 'epoch': 6.46} 65%|██████▍ | 6461/10000 [10:11:16<5:24:53, 5.51s/it][2025-06-19 23:41:00,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:41:00,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.63 | bwd_microstep: 3314.01 | bwd_inner_microstep: 3313.01 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.33 [2025-06-19 23:41:00,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.63 | bwd: 3314.03 | bwd_inner: 3313.01 | bwd_allreduce: 0.96 | step: 7.33 
65%|██████▍ | 6462/10000 [10:11:21<5:23:56, 5.49s/it] {'loss': 0.0006, 'grad_norm': 0.15399661660194397, 'learning_rate': 1.1754956466359664e-05, 'epoch': 6.46} 65%|██████▍ | 6462/10000 [10:11:21<5:23:56, 5.49s/it][2025-06-19 23:41:06,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:41:06,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.60 | bwd_microstep: 3364.47 | bwd_inner_microstep: 3363.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 23:41:06,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.60 | bwd: 3364.48 | bwd_inner: 3363.67 | bwd_allreduce: 0.77 | step: 6.77 65%|██████▍ | 6463/10000 [10:11:27<5:24:35, 5.51s/it] {'loss': 0.0022, 'grad_norm': 0.7380267977714539, 'learning_rate': 1.1749055434719592e-05, 'epoch': 6.46} 65%|██████▍ | 6463/10000 [10:11:27<5:24:35, 5.51s/it][2025-06-19 23:41:11,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:41:11,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.03 | bwd_microstep: 3363.88 | bwd_inner_microstep: 3363.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 23:41:11,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.03 | bwd: 3363.89 | bwd_inner: 3363.09 | bwd_allreduce: 0.76 | step: 6.70 65%|██████▍ | 6464/10000 [10:11:32<5:24:51, 5.51s/it] {'loss': 0.0426, 'grad_norm': 6.77286958694458, 'learning_rate': 1.174315526856531e-05, 'epoch': 6.46} 65%|██████▍ | 6464/10000 [10:11:32<5:24:51, 5.51s/it][2025-06-19 23:41:17,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:41:17,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.21 | bwd_microstep: 3324.08 | 
bwd_inner_microstep: 3323.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 23:41:17,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.21 | bwd: 3324.09 | bwd_inner: 3323.28 | bwd_allreduce: 0.77 | step: 7.11 65%|██████▍ | 6465/10000 [10:11:38<5:23:59, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.36197802424430847, 'learning_rate': 1.1737255968515724e-05, 'epoch': 6.46} 65%|██████▍ | 6465/10000 [10:11:38<5:23:59, 5.50s/it][2025-06-19 23:41:22,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:41:22,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.76 | bwd_microstep: 3317.55 | bwd_inner_microstep: 3316.68 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.84 [2025-06-19 23:41:22,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.75 | bwd: 3317.56 | bwd_inner: 3316.68 | bwd_allreduce: 0.84 | step: 6.85 65%|██████▍ | 6466/10000 [10:11:43<5:23:12, 5.49s/it] {'loss': 0.0329, 'grad_norm': 6.200434684753418, 'learning_rate': 1.1731357535189634e-05, 'epoch': 6.47} 65%|██████▍ | 6466/10000 [10:11:43<5:23:12, 5.49s/it][2025-06-19 23:41:28,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:41:28,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.77 | bwd_microstep: 3316.65 | bwd_inner_microstep: 3315.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 23:41:28,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.77 | bwd: 3316.67 | bwd_inner: 3315.87 | bwd_allreduce: 0.76 | step: 6.62 65%|██████▍ | 6467/10000 [10:11:48<5:22:31, 5.48s/it] {'loss': 0.0646, 'grad_norm': 6.00584602355957, 'learning_rate': 1.1725459969205764e-05, 'epoch': 6.47} 65%|██████▍ | 6467/10000 [10:11:48<5:22:31, 5.48s/it][2025-06-19 23:41:33,729] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:41:33,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.99 | bwd_microstep: 3378.39 | bwd_inner_microstep: 3377.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 23:41:33,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.99 | bwd: 3378.40 | bwd_inner: 3377.60 | bwd_allreduce: 0.76 | step: 6.63 65%|██████▍ | 6468/10000 [10:11:54<5:23:44, 5.50s/it] {'loss': 0.004, 'grad_norm': 1.5869226455688477, 'learning_rate': 1.1719563271182737e-05, 'epoch': 6.47} 65%|██████▍ | 6468/10000 [10:11:54<5:23:44, 5.50s/it][2025-06-19 23:41:39,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:41:39,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.89 | bwd_microstep: 3322.32 | bwd_inner_microstep: 3321.48 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.97 [2025-06-19 23:41:39,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.89 | bwd: 3322.33 | bwd_inner: 3321.48 | bwd_allreduce: 0.81 | step: 6.97 65%|██████▍ | 6469/10000 [10:11:59<5:22:59, 5.49s/it] {'loss': 0.002, 'grad_norm': 0.8749560117721558, 'learning_rate': 1.1713667441739094e-05, 'epoch': 6.47} 65%|██████▍ | 6469/10000 [10:11:59<5:22:59, 5.49s/it][2025-06-19 23:41:44,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:41:44,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.74 | bwd_microstep: 3364.75 | bwd_inner_microstep: 3363.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-19 23:41:44,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.74 | bwd: 3364.77 | bwd_inner: 3363.95 | bwd_allreduce: 0.77 | step: 7.27 
65%|██████▍ | 6470/10000 [10:12:05<5:23:43, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.014080766588449478, 'learning_rate': 1.1707772481493285e-05, 'epoch': 6.47} 65%|██████▍ | 6470/10000 [10:12:05<5:23:43, 5.50s/it][2025-06-19 23:41:50,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:41:50,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.44 | bwd_microstep: 3363.42 | bwd_inner_microstep: 3362.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 23:41:50,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.44 | bwd: 3363.43 | bwd_inner: 3362.63 | bwd_allreduce: 0.76 | step: 6.74 65%|██████▍ | 6471/10000 [10:12:11<5:24:06, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.020397763699293137, 'learning_rate': 1.1701878391063652e-05, 'epoch': 6.47} 65%|██████▍ | 6471/10000 [10:12:11<5:24:06, 5.51s/it][2025-06-19 23:41:55,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:41:55,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.89 | bwd_microstep: 3359.94 | bwd_inner_microstep: 3359.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 23:41:55,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.89 | bwd: 3359.95 | bwd_inner: 3359.13 | bwd_allreduce: 0.78 | step: 6.75 65%|██████▍ | 6472/10000 [10:12:16<5:24:18, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.09515430778265, 'learning_rate': 1.169598517106846e-05, 'epoch': 6.47} 65%|██████▍ | 6472/10000 [10:12:16<5:24:18, 5.52s/it][2025-06-19 23:42:01,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:42:01,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.61 | bwd_microstep: 3369.23 | 
bwd_inner_microstep: 3368.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-19 23:42:01,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.61 | bwd: 3369.25 | bwd_inner: 3368.43 | bwd_allreduce: 0.78 | step: 6.86 65%|██████▍ | 6473/10000 [10:12:22<5:24:38, 5.52s/it] {'loss': 0.0008, 'grad_norm': 0.10476254671812057, 'learning_rate': 1.1690092822125884e-05, 'epoch': 6.47} 65%|██████▍ | 6473/10000 [10:12:22<5:24:38, 5.52s/it][2025-06-19 23:42:06,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:42:06,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.57 | bwd_microstep: 3360.05 | bwd_inner_microstep: 3359.06 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.27 [2025-06-19 23:42:06,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.57 | bwd: 3360.07 | bwd_inner: 3359.06 | bwd_allreduce: 0.96 | step: 7.27 65%|██████▍ | 6474/10000 [10:12:27<5:24:40, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0070408848114311695, 'learning_rate': 1.1684201344854005e-05, 'epoch': 6.47} 65%|██████▍ | 6474/10000 [10:12:27<5:24:40, 5.52s/it][2025-06-19 23:42:12,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:42:12,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.45 | bwd_microstep: 3322.13 | bwd_inner_microstep: 3321.29 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.33 [2025-06-19 23:42:12,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.45 | bwd: 3322.16 | bwd_inner: 3321.29 | bwd_allreduce: 0.81 | step: 7.33 65%|██████▍ | 6475/10000 [10:12:33<5:23:39, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0030091358348727226, 'learning_rate': 1.1678310739870815e-05, 'epoch': 6.47} 65%|██████▍ | 6475/10000 [10:12:33<5:23:39, 5.51s/it][2025-06-19 23:42:17,784] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:42:17,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.28 | bwd_microstep: 3310.12 | bwd_inner_microstep: 3309.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:42:17,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.28 | bwd: 3310.14 | bwd_inner: 3309.33 | bwd_allreduce: 0.76 | step: 6.72 65%|██████▍ | 6476/10000 [10:12:38<5:22:42, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.07584622502326965, 'learning_rate': 1.1672421007794198e-05, 'epoch': 6.48} 65%|██████▍ | 6476/10000 [10:12:38<5:22:42, 5.49s/it][2025-06-19 23:42:23,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:42:23,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.37 | bwd_microstep: 3311.62 | bwd_inner_microstep: 3310.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-19 23:42:23,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.37 | bwd: 3311.63 | bwd_inner: 3310.83 | bwd_allreduce: 0.76 | step: 6.70 65%|██████▍ | 6477/10000 [10:12:44<5:22:04, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.1030023992061615, 'learning_rate': 1.1666532149241974e-05, 'epoch': 6.48} 65%|██████▍ | 6477/10000 [10:12:44<5:22:04, 5.49s/it][2025-06-19 23:42:28,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:42:28,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.76 | bwd_microstep: 3323.91 | bwd_inner_microstep: 3323.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 23:42:28,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.76 | bwd: 3323.92 | bwd_inner: 3323.12 | bwd_allreduce: 0.76 | step: 6.69 
65%|██████▍ | 6478/10000 [10:12:49<5:21:41, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.006675553973764181, 'learning_rate': 1.1660644164831844e-05, 'epoch': 6.48} 65%|██████▍ | 6478/10000 [10:12:49<5:21:41, 5.48s/it][2025-06-19 23:42:34,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.85 [2025-06-19 23:42:34,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.98 | bwd_microstep: 3310.62 | bwd_inner_microstep: 3309.72 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.28 [2025-06-19 23:42:34,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.98 | bwd: 3310.64 | bwd_inner: 3309.72 | bwd_allreduce: 0.87 | step: 7.28 65%|██████▍ | 6479/10000 [10:12:54<5:21:01, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.01870441623032093, 'learning_rate': 1.1654757055181443e-05, 'epoch': 6.48} 65%|██████▍ | 6479/10000 [10:12:54<5:21:01, 5.47s/it][2025-06-19 23:42:39,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:42:39,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.94 | bwd_microstep: 3323.76 | bwd_inner_microstep: 3322.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 23:42:39,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.94 | bwd: 3323.77 | bwd_inner: 3322.97 | bwd_allreduce: 0.76 | step: 6.68 65%|██████▍ | 6480/10000 [10:13:00<5:20:52, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.012072940357029438, 'learning_rate': 1.16488708209083e-05, 'epoch': 6.48} 65%|██████▍ | 6480/10000 [10:13:00<5:20:52, 5.47s/it][2025-06-19 23:42:45,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:42:45,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.30 | bwd_microstep: 3366.70 | 
bwd_inner_microstep: 3365.88 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.90 [2025-06-19 23:42:45,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.30 | bwd: 3366.71 | bwd_inner: 3365.88 | bwd_allreduce: 0.79 | step: 6.90 65%|██████▍ | 6481/10000 [10:13:05<5:21:56, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0007849290850572288, 'learning_rate': 1.164298546262984e-05, 'epoch': 6.48} 65%|██████▍ | 6481/10000 [10:13:05<5:21:56, 5.49s/it][2025-06-19 23:42:50,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:42:50,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.44 | bwd_microstep: 3373.45 | bwd_inner_microstep: 3372.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.83 [2025-06-19 23:42:50,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.44 | bwd: 3373.46 | bwd_inner: 3372.66 | bwd_allreduce: 0.76 | step: 6.84 65%|██████▍ | 6482/10000 [10:13:11<5:22:44, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.013422077521681786, 'learning_rate': 1.1637100980963423e-05, 'epoch': 6.48} 65%|██████▍ | 6482/10000 [10:13:11<5:22:44, 5.50s/it][2025-06-19 23:42:56,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:42:56,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.93 | bwd_microstep: 3309.98 | bwd_inner_microstep: 3309.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-19 23:42:56,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.93 | bwd: 3309.99 | bwd_inner: 3309.19 | bwd_allreduce: 0.76 | step: 6.64 65%|██████▍ | 6483/10000 [10:13:16<5:21:40, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00493178004398942, 'learning_rate': 1.1631217376526296e-05, 'epoch': 6.48} 65%|██████▍ | 6483/10000 [10:13:16<5:21:40, 5.49s/it][2025-06-19 23:43:01,607] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:43:01,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.87 | bwd_microstep: 3316.83 | bwd_inner_microstep: 3315.90 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.84 [2025-06-19 23:43:01,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.87 | bwd: 3316.85 | bwd_inner: 3315.90 | bwd_allreduce: 0.89 | step: 6.85 65%|██████▍ | 6484/10000 [10:13:22<5:20:57, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.25276699662208557, 'learning_rate': 1.1625334649935626e-05, 'epoch': 6.48} 65%|██████▍ | 6484/10000 [10:13:22<5:20:57, 5.48s/it][2025-06-19 23:43:07,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:43:07,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.74 | bwd_microstep: 3370.92 | bwd_inner_microstep: 3370.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:43:07,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.74 | bwd: 3370.93 | bwd_inner: 3370.14 | bwd_allreduce: 0.75 | step: 6.67 65%|██████▍ | 6485/10000 [10:13:27<5:21:51, 5.49s/it] {'loss': 0.0328, 'grad_norm': 6.705176830291748, 'learning_rate': 1.1619452801808484e-05, 'epoch': 6.49} 65%|██████▍ | 6485/10000 [10:13:27<5:21:51, 5.49s/it][2025-06-19 23:43:12,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:43:12,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.62 | bwd_microstep: 3309.29 | bwd_inner_microstep: 3308.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-19 23:43:12,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.62 | bwd: 3309.31 | bwd_inner: 3308.50 | bwd_allreduce: 0.76 | step: 6.75 
65%|██████▍ | 6486/10000 [10:13:33<5:20:55, 5.48s/it] {'loss': 0.0006, 'grad_norm': 0.1298462301492691, 'learning_rate': 1.1613571832761847e-05, 'epoch': 6.49} 65%|██████▍ | 6486/10000 [10:13:33<5:20:55, 5.48s/it][2025-06-19 23:43:18,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:43:18,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.76 | bwd_microstep: 3304.78 | bwd_inner_microstep: 3303.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 23:43:18,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.76 | bwd: 3304.80 | bwd_inner: 3303.97 | bwd_allreduce: 0.78 | step: 7.21 65%|██████▍ | 6487/10000 [10:13:38<5:20:13, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.006027619820088148, 'learning_rate': 1.1607691743412604e-05, 'epoch': 6.49} 65%|██████▍ | 6487/10000 [10:13:38<5:20:13, 5.47s/it][2025-06-19 23:43:23,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:43:23,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.09 | bwd_microstep: 3312.53 | bwd_inner_microstep: 3311.57 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.14 [2025-06-19 23:43:23,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.09 | bwd: 3312.55 | bwd_inner: 3311.57 | bwd_allreduce: 0.93 | step: 7.14 65%|██████▍ | 6488/10000 [10:13:44<5:19:54, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.003815564326941967, 'learning_rate': 1.1601812534377546e-05, 'epoch': 6.49} 65%|██████▍ | 6488/10000 [10:13:44<5:19:54, 5.47s/it][2025-06-19 23:43:29,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:43:29,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.86 | bwd_microstep: 3370.86 | 
bwd_inner_microstep: 3369.89 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.11 [2025-06-19 23:43:29,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.86 | bwd: 3370.88 | bwd_inner: 3369.89 | bwd_allreduce: 0.94 | step: 7.12 65%|██████▍ | 6489/10000 [10:13:49<5:21:05, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.029335658997297287, 'learning_rate': 1.1595934206273377e-05, 'epoch': 6.49} 65%|██████▍ | 6489/10000 [10:13:49<5:21:05, 5.49s/it][2025-06-19 23:43:34,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:43:34,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.32 | bwd_microstep: 3315.10 | bwd_inner_microstep: 3314.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-19 23:43:34,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.32 | bwd: 3315.12 | bwd_inner: 3314.29 | bwd_allreduce: 0.78 | step: 7.21 65%|██████▍ | 6490/10000 [10:13:55<5:20:23, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.06600036472082138, 'learning_rate': 1.159005675971671e-05, 'epoch': 6.49} 65%|██████▍ | 6490/10000 [10:13:55<5:20:23, 5.48s/it][2025-06-19 23:43:39,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:43:39,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.02 | bwd_microstep: 3309.61 | bwd_inner_microstep: 3308.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-19 23:43:39,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.02 | bwd: 3309.62 | bwd_inner: 3308.83 | bwd_allreduce: 0.76 | step: 6.78 65%|██████▍ | 6491/10000 [10:14:00<5:19:49, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.08406460285186768, 'learning_rate': 1.1584180195324053e-05, 'epoch': 6.49} 65%|██████▍ | 6491/10000 [10:14:00<5:19:49, 5.47s/it][2025-06-19 23:43:45,380] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:43:45,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.35 | bwd_microstep: 3309.52 | bwd_inner_microstep: 3308.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 23:43:45,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.35 | bwd: 3309.53 | bwd_inner: 3308.74 | bwd_allreduce: 0.75 | step: 6.65 65%|██████▍ | 6492/10000 [10:14:06<5:19:25, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.021982116624712944, 'learning_rate': 1.1578304513711833e-05, 'epoch': 6.49} 65%|██████▍ | 6492/10000 [10:14:06<5:19:25, 5.46s/it][2025-06-19 23:43:50,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:43:50,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.91 | bwd_microstep: 3367.11 | bwd_inner_microstep: 3366.24 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.44 [2025-06-19 23:43:50,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.91 | bwd: 3367.13 | bwd_inner: 3366.24 | bwd_allreduce: 0.84 | step: 7.45 65%|██████▍ | 6493/10000 [10:14:11<5:20:42, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0020568666514009237, 'learning_rate': 1.1572429715496383e-05, 'epoch': 6.49} 65%|██████▍ | 6493/10000 [10:14:11<5:20:42, 5.49s/it][2025-06-19 23:43:56,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:43:56,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.53 | bwd_microstep: 3355.52 | bwd_inner_microstep: 3354.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-19 23:43:56,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.53 | bwd: 3355.53 | bwd_inner: 3354.73 | bwd_allreduce: 0.76 | step: 6.78 
65%|██████▍ | 6494/10000 [10:14:17<5:21:12, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005266787949949503, 'learning_rate': 1.1566555801293941e-05, 'epoch': 6.49} 65%|██████▍ | 6494/10000 [10:14:17<5:21:12, 5.50s/it][2025-06-19 23:44:01,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:44:01,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.95 | bwd_microstep: 3370.79 | bwd_inner_microstep: 3369.87 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.07 [2025-06-19 23:44:01,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.95 | bwd: 3370.81 | bwd_inner: 3369.87 | bwd_allreduce: 0.89 | step: 7.07 65%|██████▍ | 6495/10000 [10:14:22<5:21:58, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.003446722635999322, 'learning_rate': 1.1560682771720666e-05, 'epoch': 6.5} 65%|██████▍ | 6495/10000 [10:14:22<5:21:58, 5.51s/it][2025-06-19 23:44:07,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:44:07,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.14 | bwd_microstep: 3314.50 | bwd_inner_microstep: 3313.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-19 23:44:07,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.14 | bwd: 3314.51 | bwd_inner: 3313.71 | bwd_allreduce: 0.76 | step: 6.79 65%|██████▍ | 6496/10000 [10:14:28<5:21:03, 5.50s/it] {'loss': 0.0091, 'grad_norm': 2.286240577697754, 'learning_rate': 1.1554810627392588e-05, 'epoch': 6.5} 65%|██████▍ | 6496/10000 [10:14:28<5:21:03, 5.50s/it][2025-06-19 23:44:13,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:44:13,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.02 | bwd_microstep: 3394.06 | 
bwd_inner_microstep: 3393.11 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.32 [2025-06-19 23:44:13,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.02 | bwd: 3394.08 | bwd_inner: 3393.11 | bwd_allreduce: 0.92 | step: 7.33 65%|██████▍ | 6497/10000 [10:14:33<5:22:12, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.05720291659235954, 'learning_rate': 1.1548939368925682e-05, 'epoch': 6.5} 65%|██████▍ | 6497/10000 [10:14:33<5:22:12, 5.52s/it][2025-06-19 23:44:18,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.89 [2025-06-19 23:44:18,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.41 | bwd_microstep: 3364.11 | bwd_inner_microstep: 3363.29 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.97 [2025-06-19 23:44:18,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.41 | bwd: 3364.13 | bwd_inner: 3363.29 | bwd_allreduce: 0.79 | step: 6.98 65%|██████▍ | 6498/10000 [10:14:39<5:22:16, 5.52s/it] {'loss': 0.0425, 'grad_norm': 4.55482292175293, 'learning_rate': 1.1543068996935816e-05, 'epoch': 6.5} 65%|██████▍ | 6498/10000 [10:14:39<5:22:16, 5.52s/it][2025-06-19 23:44:24,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:44:24,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.74 | bwd_microstep: 3316.16 | bwd_inner_microstep: 3315.22 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.60 [2025-06-19 23:44:24,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.74 | bwd: 3316.18 | bwd_inner: 3315.22 | bwd_allreduce: 0.91 | step: 7.61 65%|██████▍ | 6499/10000 [10:14:44<5:22:54, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.007066513877362013, 'learning_rate': 1.153719951203876e-05, 'epoch': 6.5} 65%|██████▍ | 6499/10000 [10:14:44<5:22:54, 5.53s/it][2025-06-19 23:44:29,602] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:44:29,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.93 | bwd_microstep: 3324.13 | bwd_inner_microstep: 3323.21 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.35 [2025-06-19 23:44:29,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.93 | bwd: 3324.14 | bwd_inner: 3323.21 | bwd_allreduce: 0.89 | step: 7.35 65%|██████▌ | 6500/10000 [10:14:50<5:22:04, 5.52s/it] {'loss': 0.0433, 'grad_norm': 6.278965950012207, 'learning_rate': 1.1531330914850204e-05, 'epoch': 6.5} 65%|██████▌ | 6500/10000 [10:14:50<5:22:04, 5.52s/it][2025-06-19 23:44:35,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:44:35,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.26 | bwd_microstep: 3324.37 | bwd_inner_microstep: 3323.27 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.30 [2025-06-19 23:44:35,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.26 | bwd: 3324.39 | bwd_inner: 3323.27 | bwd_allreduce: 1.06 | step: 7.31 65%|██████▌ | 6501/10000 [10:14:55<5:21:15, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.01657424308359623, 'learning_rate': 1.1525463205985725e-05, 'epoch': 6.5} 65%|██████▌ | 6501/10000 [10:14:55<5:21:15, 5.51s/it][2025-06-19 23:44:40,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:44:40,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.56 | bwd_microstep: 3310.49 | bwd_inner_microstep: 3309.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:44:40,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.56 | bwd: 3310.50 | bwd_inner: 3309.70 | bwd_allreduce: 0.76 | step: 6.66 
65%|██████▌ | 6502/10000 [10:15:01<5:20:08, 5.49s/it] {'loss': 0.0425, 'grad_norm': 5.9177093505859375, 'learning_rate': 1.1519596386060825e-05, 'epoch': 6.5} 65%|██████▌ | 6502/10000 [10:15:01<5:20:08, 5.49s/it][2025-06-19 23:44:46,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:44:46,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.94 | bwd_microstep: 3363.96 | bwd_inner_microstep: 3363.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 23:44:46,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.94 | bwd: 3363.98 | bwd_inner: 3363.18 | bwd_allreduce: 0.76 | step: 6.57 65%|██████▌ | 6503/10000 [10:15:06<5:20:45, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.06389248371124268, 'learning_rate': 1.1513730455690903e-05, 'epoch': 6.5} 65%|██████▌ | 6503/10000 [10:15:06<5:20:45, 5.50s/it][2025-06-19 23:44:51,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:44:51,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.80 | bwd_microstep: 3316.22 | bwd_inner_microstep: 3315.40 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.96 [2025-06-19 23:44:51,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.80 | bwd: 3316.24 | bwd_inner: 3315.40 | bwd_allreduce: 0.79 | step: 6.97 65%|██████▌ | 6504/10000 [10:15:12<5:19:52, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006544006057083607, 'learning_rate': 1.150786541549127e-05, 'epoch': 6.5} 65%|██████▌ | 6504/10000 [10:15:12<5:19:52, 5.49s/it][2025-06-19 23:44:57,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:44:57,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.92 | bwd_microstep: 3398.12 | 
bwd_inner_microstep: 3397.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-19 23:44:57,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.92 | bwd: 3398.13 | bwd_inner: 3397.33 | bwd_allreduce: 0.76 | step: 6.90 65%|██████▌ | 6505/10000 [10:15:17<5:21:27, 5.52s/it] {'loss': 0.0157, 'grad_norm': 2.5086545944213867, 'learning_rate': 1.1502001266077146e-05, 'epoch': 6.5} 65%|██████▌ | 6505/10000 [10:15:17<5:21:27, 5.52s/it][2025-06-19 23:45:02,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 23:45:02,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.44 | bwd_microstep: 3319.45 | bwd_inner_microstep: 3318.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 23:45:02,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.44 | bwd: 3319.46 | bwd_inner: 3318.67 | bwd_allreduce: 0.76 | step: 6.57 65%|██████▌ | 6506/10000 [10:15:23<5:20:25, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0011090808548033237, 'learning_rate': 1.1496138008063644e-05, 'epoch': 6.51} 65%|██████▌ | 6506/10000 [10:15:23<5:20:25, 5.50s/it][2025-06-19 23:45:08,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:45:08,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.58 | bwd_microstep: 3371.49 | bwd_inner_microstep: 3370.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-19 23:45:08,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.58 | bwd: 3371.51 | bwd_inner: 3370.72 | bwd_allreduce: 0.75 | step: 6.54 65%|██████▌ | 6507/10000 [10:15:28<5:20:58, 5.51s/it] {'loss': 0.0109, 'grad_norm': 1.6659013032913208, 'learning_rate': 1.14902756420658e-05, 'epoch': 6.51} 65%|██████▌ | 6507/10000 [10:15:28<5:20:58, 5.51s/it][2025-06-19 23:45:13,580] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.77 [2025-06-19 23:45:13,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.62 | bwd_microstep: 3323.55 | bwd_inner_microstep: 3322.67 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.96 [2025-06-19 23:45:13,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.62 | bwd: 3323.57 | bwd_inner: 3322.67 | bwd_allreduce: 0.86 | step: 6.97 65%|██████▌ | 6508/10000 [10:15:34<5:20:04, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.21182045340538025, 'learning_rate': 1.1484414168698547e-05, 'epoch': 6.51} 65%|██████▌ | 6508/10000 [10:15:34<5:20:04, 5.50s/it][2025-06-19 23:45:19,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:45:19,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.78 | bwd_microstep: 3320.94 | bwd_inner_microstep: 3320.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 23:45:19,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.78 | bwd: 3320.95 | bwd_inner: 3320.15 | bwd_allreduce: 0.76 | step: 6.66 65%|██████▌ | 6509/10000 [10:15:39<5:19:31, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.035006020218133926, 'learning_rate': 1.1478553588576724e-05, 'epoch': 6.51} 65%|██████▌ | 6509/10000 [10:15:39<5:19:31, 5.49s/it][2025-06-19 23:45:24,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:45:24,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.50 | bwd_microstep: 3334.58 | bwd_inner_microstep: 3333.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 23:45:24,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.50 | bwd: 3334.60 | bwd_inner: 3333.78 | bwd_allreduce: 0.78 | step: 
7.09 65%|██████▌ | 6510/10000 [10:15:45<5:19:17, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0003242971433792263, 'learning_rate': 1.1472693902315088e-05, 'epoch': 6.51} 65%|██████▌ | 6510/10000 [10:15:45<5:19:17, 5.49s/it][2025-06-19 23:45:30,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:45:30,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.28 | bwd_microstep: 3388.39 | bwd_inner_microstep: 3387.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 23:45:30,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.29 | bwd: 3388.41 | bwd_inner: 3387.61 | bwd_allreduce: 0.76 | step: 6.69 65%|██████▌ | 6511/10000 [10:15:50<5:20:22, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.021369455382227898, 'learning_rate': 1.1466835110528278e-05, 'epoch': 6.51} 65%|██████▌ | 6511/10000 [10:15:50<5:20:22, 5.51s/it][2025-06-19 23:45:35,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.86 [2025-06-19 23:45:35,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.69 | bwd_microstep: 3324.47 | bwd_inner_microstep: 3323.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.10 [2025-06-19 23:45:35,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.69 | bwd: 3324.48 | bwd_inner: 3323.68 | bwd_allreduce: 0.76 | step: 7.11 65%|██████▌ | 6512/10000 [10:15:56<5:19:30, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.02436273731291294, 'learning_rate': 1.146097721383086e-05, 'epoch': 6.51} 65%|██████▌ | 6512/10000 [10:15:56<5:19:30, 5.50s/it][2025-06-19 23:45:41,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:45:41,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.82 | bwd_microstep: 
3379.97 | bwd_inner_microstep: 3379.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.69 [2025-06-19 23:45:41,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.82 | bwd: 3379.99 | bwd_inner: 3379.17 | bwd_allreduce: 0.78 | step: 6.70 65%|██████▌ | 6513/10000 [10:16:01<5:20:22, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00024897680850699544, 'learning_rate': 1.1455120212837302e-05, 'epoch': 6.51} 65%|██████▌ | 6513/10000 [10:16:01<5:20:22, 5.51s/it][2025-06-19 23:45:46,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:45:46,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.50 | bwd_microstep: 3338.20 | bwd_inner_microstep: 3337.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-19 23:45:46,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.50 | bwd: 3338.22 | bwd_inner: 3337.37 | bwd_allreduce: 0.79 | step: 6.74 65%|██████▌ | 6514/10000 [10:16:07<5:19:52, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.024837002158164978, 'learning_rate': 1.1449264108161978e-05, 'epoch': 6.51} 65%|██████▌ | 6514/10000 [10:16:07<5:19:52, 5.51s/it][2025-06-19 23:45:52,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:45:52,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.77 | bwd_microstep: 3329.89 | bwd_inner_microstep: 3329.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.86 [2025-06-19 23:45:52,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.77 | bwd: 3329.90 | bwd_inner: 3329.10 | bwd_allreduce: 0.76 | step: 6.86 65%|██████▌ | 6515/10000 [10:16:12<5:19:15, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.004499329719692469, 'learning_rate': 1.1443408900419162e-05, 'epoch': 6.51} 65%|██████▌ | 6515/10000 [10:16:12<5:19:15, 5.50s/it][2025-06-19 23:45:57,619] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:45:57,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.53 | bwd_microstep: 3371.36 | bwd_inner_microstep: 3370.44 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.03 [2025-06-19 23:45:57,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.53 | bwd: 3371.37 | bwd_inner: 3370.44 | bwd_allreduce: 0.88 | step: 7.03 65%|██████▌ | 6516/10000 [10:16:18<5:20:00, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0005326365353539586, 'learning_rate': 1.1437554590223049e-05, 'epoch': 6.52} 65%|██████▌ | 6516/10000 [10:16:18<5:20:00, 5.51s/it][2025-06-19 23:46:03,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:46:03,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.49 | bwd_microstep: 3332.66 | bwd_inner_microstep: 3331.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 23:46:03,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.49 | bwd: 3332.68 | bwd_inner: 3331.87 | bwd_allreduce: 0.76 | step: 6.73 65%|██████▌ | 6517/10000 [10:16:23<5:19:31, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.15228921175003052, 'learning_rate': 1.143170117818771e-05, 'epoch': 6.52} 65%|██████▌ | 6517/10000 [10:16:23<5:19:31, 5.50s/it][2025-06-19 23:46:08,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-19 23:46:08,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.89 | bwd_microstep: 3321.26 | bwd_inner_microstep: 3320.22 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.96 [2025-06-19 23:46:08,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.89 | bwd: 3321.28 | bwd_inner: 3320.22 | bwd_allreduce: 1.01 | step: 
7.97 65%|██████▌ | 6518/10000 [10:16:29<5:19:01, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0001591219479450956, 'learning_rate': 1.142584866492715e-05, 'epoch': 6.52} 65%|██████▌ | 6518/10000 [10:16:29<5:19:01, 5.50s/it][2025-06-19 23:46:14,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:46:14,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.90 | bwd_microstep: 3330.37 | bwd_inner_microstep: 3329.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 23:46:14,085] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.90 | bwd: 3330.39 | bwd_inner: 3329.57 | bwd_allreduce: 0.77 | step: 7.02 65%|██████▌ | 6519/10000 [10:16:34<5:18:55, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.07229803502559662, 'learning_rate': 1.1419997051055274e-05, 'epoch': 6.52} 65%|██████▌ | 6519/10000 [10:16:34<5:18:55, 5.50s/it][2025-06-19 23:46:19,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:46:19,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.58 | bwd_microstep: 3332.35 | bwd_inner_microstep: 3331.43 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.90 [2025-06-19 23:46:19,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.58 | bwd: 3332.37 | bwd_inner: 3331.43 | bwd_allreduce: 0.90 | step: 6.90 65%|██████▌ | 6520/10000 [10:16:40<5:18:31, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.06300389021635056, 'learning_rate': 1.1414146337185885e-05, 'epoch': 6.52} 65%|██████▌ | 6520/10000 [10:16:40<5:18:31, 5.49s/it][2025-06-19 23:46:25,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:46:25,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.17 | bwd_microstep: 3319.34 
| bwd_inner_microstep: 3318.39 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.29 [2025-06-19 23:46:25,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.17 | bwd: 3319.36 | bwd_inner: 3318.39 | bwd_allreduce: 0.92 | step: 7.30 65%|██████▌ | 6521/10000 [10:16:45<5:18:02, 5.49s/it] {'loss': 0.0035, 'grad_norm': 1.0477592945098877, 'learning_rate': 1.1408296523932698e-05, 'epoch': 6.52} 65%|██████▌ | 6521/10000 [10:16:45<5:18:02, 5.49s/it][2025-06-19 23:46:30,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:46:30,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.12 | bwd_microstep: 3390.99 | bwd_inner_microstep: 3390.11 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.87 [2025-06-19 23:46:30,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.12 | bwd: 3391.00 | bwd_inner: 3390.11 | bwd_allreduce: 0.86 | step: 6.88 65%|██████▌ | 6522/10000 [10:16:51<5:19:33, 5.51s/it] {'loss': 0.0008, 'grad_norm': 0.412291020154953, 'learning_rate': 1.1402447611909328e-05, 'epoch': 6.52} 65%|██████▌ | 6522/10000 [10:16:51<5:19:33, 5.51s/it][2025-06-19 23:46:36,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:46:36,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.90 | bwd_microstep: 3379.37 | bwd_inner_microstep: 3378.46 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.08 [2025-06-19 23:46:36,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.90 | bwd: 3379.39 | bwd_inner: 3378.46 | bwd_allreduce: 0.88 | step: 7.09 65%|██████▌ | 6523/10000 [10:16:56<5:20:14, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.00284015154466033, 'learning_rate': 1.1396599601729301e-05, 'epoch': 6.52} 65%|██████▌ | 6523/10000 [10:16:56<5:20:14, 5.53s/it][2025-06-19 23:46:41,727] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.83 [2025-06-19 23:46:41,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.43 | bwd_microstep: 3376.71 | bwd_inner_microstep: 3375.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 23:46:41,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.43 | bwd: 3376.73 | bwd_inner: 3375.93 | bwd_allreduce: 0.76 | step: 6.69 65%|██████▌ | 6524/10000 [10:17:02<5:20:41, 5.54s/it] {'loss': 0.0004, 'grad_norm': 0.03876524418592453, 'learning_rate': 1.139075249400605e-05, 'epoch': 6.52} 65%|██████▌ | 6524/10000 [10:17:02<5:20:41, 5.54s/it][2025-06-19 23:46:47,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:46:47,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.59 | bwd_microstep: 3325.43 | bwd_inner_microstep: 3324.49 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.20 [2025-06-19 23:46:47,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.59 | bwd: 3325.44 | bwd_inner: 3324.49 | bwd_allreduce: 0.91 | step: 7.21 65%|██████▌ | 6525/10000 [10:17:07<5:19:31, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.08448518067598343, 'learning_rate': 1.1384906289352902e-05, 'epoch': 6.53} 65%|██████▌ | 6525/10000 [10:17:08<5:19:31, 5.52s/it][2025-06-19 23:46:52,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:46:52,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.33 | bwd_microstep: 3334.75 | bwd_inner_microstep: 3333.86 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.12 [2025-06-19 23:46:52,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.33 | bwd: 3334.77 | bwd_inner: 3333.86 | bwd_allreduce: 0.86 | step: 7.12 
65%|██████▌ | 6526/10000 [10:17:13<5:18:55, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.002442165045067668, 'learning_rate': 1.137906098838311e-05, 'epoch': 6.53} 65%|██████▌ | 6526/10000 [10:17:13<5:18:55, 5.51s/it][2025-06-19 23:46:58,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:46:58,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.84 | bwd_microstep: 3336.71 | bwd_inner_microstep: 3335.69 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.56 [2025-06-19 23:46:58,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.84 | bwd: 3336.73 | bwd_inner: 3335.69 | bwd_allreduce: 0.99 | step: 7.57 65%|██████▌ | 6527/10000 [10:17:18<5:18:35, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0029847463592886925, 'learning_rate': 1.1373216591709802e-05, 'epoch': 6.53} 65%|██████▌ | 6527/10000 [10:17:18<5:18:35, 5.50s/it][2025-06-19 23:47:03,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-19 23:47:03,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.83 | bwd_microstep: 3384.48 | bwd_inner_microstep: 3383.68 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.70 [2025-06-19 23:47:03,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.83 | bwd: 3384.50 | bwd_inner: 3383.68 | bwd_allreduce: 0.78 | step: 6.71 65%|██████▌ | 6528/10000 [10:17:24<5:19:34, 5.52s/it] {'loss': 0.0246, 'grad_norm': 3.0105292797088623, 'learning_rate': 1.1367373099946036e-05, 'epoch': 6.53} 65%|██████▌ | 6528/10000 [10:17:24<5:19:34, 5.52s/it][2025-06-19 23:47:09,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:47:09,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.67 | bwd_microstep: 3325.84 | 
bwd_inner_microstep: 3325.07 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.54 [2025-06-19 23:47:09,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.67 | bwd: 3325.85 | bwd_inner: 3325.06 | bwd_allreduce: 0.75 | step: 6.55 65%|██████▌ | 6529/10000 [10:17:30<5:18:51, 5.51s/it] {'loss': 0.0012, 'grad_norm': 0.2746887505054474, 'learning_rate': 1.1361530513704762e-05, 'epoch': 6.53} 65%|██████▌ | 6529/10000 [10:17:30<5:18:51, 5.51s/it][2025-06-19 23:47:14,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-19 23:47:14,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.74 | bwd_microstep: 3316.41 | bwd_inner_microstep: 3315.64 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.51 [2025-06-19 23:47:14,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.74 | bwd: 3316.43 | bwd_inner: 3315.64 | bwd_allreduce: 0.75 | step: 6.52 65%|██████▌ | 6530/10000 [10:17:35<5:17:52, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.731124997138977, 'learning_rate': 1.1355688833598848e-05, 'epoch': 6.53} 65%|██████▌ | 6530/10000 [10:17:35<5:17:52, 5.50s/it][2025-06-19 23:47:20,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-19 23:47:20,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.83 | bwd_microstep: 3384.93 | bwd_inner_microstep: 3383.82 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.85 [2025-06-19 23:47:20,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.83 | bwd: 3384.94 | bwd_inner: 3383.82 | bwd_allreduce: 1.07 | step: 7.86 65%|██████▌ | 6531/10000 [10:17:41<5:18:57, 5.52s/it] {'loss': 0.0763, 'grad_norm': 13.974781036376953, 'learning_rate': 1.1349848060241065e-05, 'epoch': 6.53} 65%|██████▌ | 6531/10000 [10:17:41<5:18:57, 5.52s/it][2025-06-19 23:47:25,744] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:47:25,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.19 | bwd_microstep: 3328.81 | bwd_inner_microstep: 3328.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-19 23:47:25,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.19 | bwd: 3328.82 | bwd_inner: 3328.01 | bwd_allreduce: 0.77 | step: 6.73 65%|██████▌ | 6532/10000 [10:17:46<5:18:17, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.033641036599874496, 'learning_rate': 1.1344008194244066e-05, 'epoch': 6.53} 65%|██████▌ | 6532/10000 [10:17:46<5:18:17, 5.51s/it][2025-06-19 23:47:31,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:47:31,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.45 | bwd_microstep: 3382.47 | bwd_inner_microstep: 3381.54 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.94 [2025-06-19 23:47:31,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.45 | bwd: 3382.49 | bwd_inner: 3381.54 | bwd_allreduce: 0.90 | step: 6.94 65%|██████▌ | 6533/10000 [10:17:52<5:19:05, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0004934954340569675, 'learning_rate': 1.133816923622043e-05, 'epoch': 6.53} 65%|██████▌ | 6533/10000 [10:17:52<5:19:05, 5.52s/it][2025-06-19 23:47:36,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:47:36,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.75 | bwd_microstep: 3328.63 | bwd_inner_microstep: 3327.78 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.79 [2025-06-19 23:47:36,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.75 | bwd: 3328.65 | bwd_inner: 3327.78 | bwd_allreduce: 0.81 | step: 6.80 
65%|██████▌ | 6534/10000 [10:17:57<5:18:20, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0009896887931972742, 'learning_rate': 1.1332331186782643e-05, 'epoch': 6.53} 65%|██████▌ | 6534/10000 [10:17:57<5:18:20, 5.51s/it][2025-06-19 23:47:42,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:47:42,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.65 | bwd_microstep: 3318.08 | bwd_inner_microstep: 3317.02 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.73 [2025-06-19 23:47:42,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.65 | bwd: 3318.10 | bwd_inner: 3317.02 | bwd_allreduce: 1.02 | step: 7.74 65%|██████▌ | 6535/10000 [10:18:03<5:17:28, 5.50s/it] {'loss': 0.0884, 'grad_norm': 17.27303123474121, 'learning_rate': 1.1326494046543086e-05, 'epoch': 6.54} 65%|██████▌ | 6535/10000 [10:18:03<5:17:28, 5.50s/it][2025-06-19 23:47:47,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:47:47,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.81 | bwd_microstep: 3386.37 | bwd_inner_microstep: 3385.41 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.28 [2025-06-19 23:47:47,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.81 | bwd: 3386.39 | bwd_inner: 3385.41 | bwd_allreduce: 0.93 | step: 7.29 65%|██████▌ | 6536/10000 [10:18:08<5:18:36, 5.52s/it] {'loss': 0.0004, 'grad_norm': 0.1253587156534195, 'learning_rate': 1.1320657816114055e-05, 'epoch': 6.54} 65%|██████▌ | 6536/10000 [10:18:08<5:18:36, 5.52s/it][2025-06-19 23:47:53,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:47:53,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.74 | bwd_microstep: 3330.00 | 
bwd_inner_microstep: 3329.03 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.13 [2025-06-19 23:47:53,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.74 | bwd: 3330.02 | bwd_inner: 3329.03 | bwd_allreduce: 0.94 | step: 7.13 65%|██████▌ | 6537/10000 [10:18:14<5:17:49, 5.51s/it] {'loss': 0.0016, 'grad_norm': 0.3188154101371765, 'learning_rate': 1.1314822496107732e-05, 'epoch': 6.54} 65%|██████▌ | 6537/10000 [10:18:14<5:17:49, 5.51s/it][2025-06-19 23:47:58,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:47:58,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.82 | bwd_microstep: 3330.54 | bwd_inner_microstep: 3329.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-19 23:47:58,785] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.82 | bwd: 3330.56 | bwd_inner: 3329.75 | bwd_allreduce: 0.76 | step: 6.65 65%|██████▌ | 6538/10000 [10:18:19<5:17:21, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.22293740510940552, 'learning_rate': 1.1308988087136217e-05, 'epoch': 6.54} 65%|██████▌ | 6538/10000 [10:18:19<5:17:21, 5.50s/it][2025-06-19 23:48:04,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:48:04,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.21 | bwd_microstep: 3372.07 | bwd_inner_microstep: 3371.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 23:48:04,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.21 | bwd: 3372.09 | bwd_inner: 3371.28 | bwd_allreduce: 0.76 | step: 6.69 65%|██████▌ | 6539/10000 [10:18:25<5:18:01, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.006518394220620394, 'learning_rate': 1.1303154589811518e-05, 'epoch': 6.54} 65%|██████▌ | 6539/10000 [10:18:25<5:18:01, 5.51s/it][2025-06-19 23:48:09,807] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.76 [2025-06-19 23:48:09,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.44 | bwd_microstep: 3338.48 | bwd_inner_microstep: 3337.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 23:48:09,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.44 | bwd: 3338.49 | bwd_inner: 3337.68 | bwd_allreduce: 0.77 | step: 6.74 65%|██████▌ | 6540/10000 [10:18:30<5:17:20, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01692919060587883, 'learning_rate': 1.1297322004745537e-05, 'epoch': 6.54} 65%|██████▌ | 6540/10000 [10:18:30<5:17:20, 5.50s/it][2025-06-19 23:48:15,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:48:15,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.54 | bwd_microstep: 3334.28 | bwd_inner_microstep: 3333.46 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.07 [2025-06-19 23:48:15,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.54 | bwd: 3334.30 | bwd_inner: 3333.47 | bwd_allreduce: 0.79 | step: 7.07 65%|██████▌ | 6541/10000 [10:18:36<5:16:55, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.045343369245529175, 'learning_rate': 1.129149033255009e-05, 'epoch': 6.54} 65%|██████▌ | 6541/10000 [10:18:36<5:16:55, 5.50s/it][2025-06-19 23:48:20,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:48:20,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.44 | bwd_microstep: 3323.92 | bwd_inner_microstep: 3323.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-19 23:48:20,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.44 | bwd: 3323.94 | bwd_inner: 3323.13 | bwd_allreduce: 0.76 | step: 6.74 
65%|██████▌ | 6542/10000 [10:18:41<5:16:15, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.0225912407040596, 'learning_rate': 1.1285659573836889e-05, 'epoch': 6.54} 65%|██████▌ | 6542/10000 [10:18:41<5:16:15, 5.49s/it][2025-06-19 23:48:26,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:48:26,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.54 | bwd_microstep: 3316.64 | bwd_inner_microstep: 3315.84 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-19 23:48:26,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.54 | bwd: 3316.65 | bwd_inner: 3315.84 | bwd_allreduce: 0.76 | step: 6.65 65%|██████▌ | 6543/10000 [10:18:47<5:15:42, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.09107423573732376, 'learning_rate': 1.1279829729217553e-05, 'epoch': 6.54} 65%|██████▌ | 6543/10000 [10:18:47<5:15:42, 5.48s/it][2025-06-19 23:48:31,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:48:31,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.39 | bwd_microstep: 3310.99 | bwd_inner_microstep: 3310.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 23:48:31,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.39 | bwd: 3311.01 | bwd_inner: 3310.20 | bwd_allreduce: 0.76 | step: 6.69 65%|██████▌ | 6544/10000 [10:18:52<5:15:10, 5.47s/it] {'loss': 0.0008, 'grad_norm': 0.12096560746431351, 'learning_rate': 1.1274000799303609e-05, 'epoch': 6.54} 65%|██████▌ | 6544/10000 [10:18:52<5:15:10, 5.47s/it][2025-06-19 23:48:37,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:48:37,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.76 | bwd_microstep: 3310.35 | 
bwd_inner_microstep: 3309.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-19 23:48:37,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.76 | bwd: 3310.36 | bwd_inner: 3309.55 | bwd_allreduce: 0.76 | step: 6.72 65%|██████▌ | 6545/10000 [10:18:57<5:14:45, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.003923557233065367, 'learning_rate': 1.126817278470648e-05, 'epoch': 6.54} 65%|██████▌ | 6545/10000 [10:18:57<5:14:45, 5.47s/it][2025-06-19 23:48:42,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:48:42,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.19 | bwd_microstep: 3379.27 | bwd_inner_microstep: 3378.18 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.36 [2025-06-19 23:48:42,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.19 | bwd: 3379.29 | bwd_inner: 3378.18 | bwd_allreduce: 1.05 | step: 7.37 65%|██████▌ | 6546/10000 [10:19:03<5:16:17, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.03863684460520744, 'learning_rate': 1.1262345686037509e-05, 'epoch': 6.55} 65%|██████▌ | 6546/10000 [10:19:03<5:16:17, 5.49s/it][2025-06-19 23:48:48,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:48:48,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.27 | bwd_microstep: 3314.77 | bwd_inner_microstep: 3313.86 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.35 [2025-06-19 23:48:48,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.27 | bwd: 3314.79 | bwd_inner: 3313.86 | bwd_allreduce: 0.88 | step: 7.36 65%|██████▌ | 6547/10000 [10:19:08<5:15:41, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006498406641185284, 'learning_rate': 1.1256519503907915e-05, 'epoch': 6.55} 65%|██████▌ | 6547/10000 [10:19:08<5:15:41, 5.49s/it][2025-06-19 23:48:53,618] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:48:53,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.69 | bwd_microstep: 3325.09 | bwd_inner_microstep: 3324.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:48:53,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.69 | bwd: 3325.10 | bwd_inner: 3324.30 | bwd_allreduce: 0.76 | step: 6.66 65%|██████▌ | 6548/10000 [10:19:14<5:15:19, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.014860830269753933, 'learning_rate': 1.1250694238928844e-05, 'epoch': 6.55} 65%|██████▌ | 6548/10000 [10:19:14<5:15:19, 5.48s/it][2025-06-19 23:48:59,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:48:59,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.04 | bwd_microstep: 3318.77 | bwd_inner_microstep: 3317.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-19 23:48:59,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.04 | bwd: 3318.79 | bwd_inner: 3317.97 | bwd_allreduce: 0.78 | step: 6.81 65%|██████▌ | 6549/10000 [10:19:19<5:14:51, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.013658812269568443, 'learning_rate': 1.1244869891711342e-05, 'epoch': 6.55} 65%|██████▌ | 6549/10000 [10:19:19<5:14:51, 5.47s/it][2025-06-19 23:49:04,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:49:04,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.14 | bwd_microstep: 3373.15 | bwd_inner_microstep: 3372.21 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.28 [2025-06-19 23:49:04,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.14 | bwd: 3373.17 | bwd_inner: 3372.21 | bwd_allreduce: 0.91 | step: 
7.29 66%|██████▌ | 6550/10000 [10:19:25<5:15:58, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.0507299043238163, 'learning_rate': 1.1239046462866354e-05, 'epoch': 6.55} 66%|██████▌ | 6550/10000 [10:19:25<5:15:58, 5.50s/it][2025-06-19 23:49:10,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:49:10,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.95 | bwd_microstep: 3371.96 | bwd_inner_microstep: 3370.91 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.76 [2025-06-19 23:49:10,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.95 | bwd: 3371.97 | bwd_inner: 3370.91 | bwd_allreduce: 1.02 | step: 7.77 66%|██████▌ | 6551/10000 [10:19:30<5:16:41, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.0675370842218399, 'learning_rate': 1.1233223953004739e-05, 'epoch': 6.55} 66%|██████▌ | 6551/10000 [10:19:30<5:16:41, 5.51s/it][2025-06-19 23:49:15,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:49:15,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.41 | bwd_microstep: 3313.31 | bwd_inner_microstep: 3312.53 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.86 [2025-06-19 23:49:15,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.41 | bwd: 3313.32 | bwd_inner: 3312.53 | bwd_allreduce: 0.75 | step: 6.86 66%|██████▌ | 6552/10000 [10:19:36<5:15:43, 5.49s/it] {'loss': 0.0012, 'grad_norm': 0.18643838167190552, 'learning_rate': 1.1227402362737236e-05, 'epoch': 6.55} 66%|██████▌ | 6552/10000 [10:19:36<5:15:43, 5.49s/it][2025-06-19 23:49:21,086] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:49:21,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.89 | bwd_microstep: 3324.96 
| bwd_inner_microstep: 3324.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:49:21,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.89 | bwd: 3324.97 | bwd_inner: 3324.17 | bwd_allreduce: 0.76 | step: 6.67 66%|██████▌ | 6553/10000 [10:19:41<5:15:07, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.030058979988098145, 'learning_rate': 1.122158169267451e-05, 'epoch': 6.55} 66%|██████▌ | 6553/10000 [10:19:41<5:15:07, 5.49s/it][2025-06-19 23:49:26,634] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:49:26,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.83 | bwd_microstep: 3375.51 | bwd_inner_microstep: 3374.71 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-19 23:49:26,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.83 | bwd: 3375.53 | bwd_inner: 3374.71 | bwd_allreduce: 0.77 | step: 7.20 66%|██████▌ | 6554/10000 [10:19:47<5:16:08, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.047753795981407166, 'learning_rate': 1.1215761943427123e-05, 'epoch': 6.55} 66%|██████▌ | 6554/10000 [10:19:47<5:16:08, 5.50s/it][2025-06-19 23:49:32,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:49:32,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.92 | bwd_microstep: 3313.03 | bwd_inner_microstep: 3312.11 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.14 [2025-06-19 23:49:32,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.92 | bwd: 3313.05 | bwd_inner: 3312.11 | bwd_allreduce: 0.90 | step: 7.15 66%|██████▌ | 6555/10000 [10:19:52<5:15:16, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.020326245576143265, 'learning_rate': 1.1209943115605539e-05, 'epoch': 6.55} 66%|██████▌ | 6555/10000 [10:19:52<5:15:16, 5.49s/it][2025-06-19 23:49:37,545] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:49:37,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.86 | bwd_microstep: 3306.20 | bwd_inner_microstep: 3305.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-19 23:49:37,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.86 | bwd: 3306.21 | bwd_inner: 3305.41 | bwd_allreduce: 0.76 | step: 6.69 66%|██████▌ | 6556/10000 [10:19:58<5:14:29, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.022273676469922066, 'learning_rate': 1.1204125209820127e-05, 'epoch': 6.56} 66%|██████▌ | 6556/10000 [10:19:58<5:14:29, 5.48s/it][2025-06-19 23:49:43,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:49:43,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.92 | bwd_microstep: 3317.66 | bwd_inner_microstep: 3316.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 23:49:43,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.92 | bwd: 3317.68 | bwd_inner: 3316.86 | bwd_allreduce: 0.77 | step: 6.88 66%|██████▌ | 6557/10000 [10:20:03<5:14:10, 5.48s/it] {'loss': 0.0066, 'grad_norm': 2.1581850051879883, 'learning_rate': 1.1198308226681156e-05, 'epoch': 6.56} 66%|██████▌ | 6557/10000 [10:20:03<5:14:10, 5.48s/it][2025-06-19 23:49:48,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 23:49:48,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.66 | bwd_microstep: 3360.48 | bwd_inner_microstep: 3359.42 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.95 [2025-06-19 23:49:48,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.66 | bwd: 3360.50 | bwd_inner: 3359.42 | bwd_allreduce: 1.02 | step: 7.94 
66%|██████▌ | 6558/10000 [10:20:09<5:14:54, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.02587055042386055, 'learning_rate': 1.1192492166798803e-05, 'epoch': 6.56} 66%|██████▌ | 6558/10000 [10:20:09<5:14:54, 5.49s/it][2025-06-19 23:49:54,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:49:54,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.24 | bwd_microstep: 3374.45 | bwd_inner_microstep: 3373.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:49:54,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.24 | bwd: 3374.47 | bwd_inner: 3373.66 | bwd_allreduce: 0.76 | step: 6.71 66%|██████▌ | 6559/10000 [10:20:14<5:15:46, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.02876402623951435, 'learning_rate': 1.1186677030783147e-05, 'epoch': 6.56} 66%|██████▌ | 6559/10000 [10:20:14<5:15:46, 5.51s/it][2025-06-19 23:49:59,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:49:59,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.99 | bwd_microstep: 3313.47 | bwd_inner_microstep: 3312.49 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.38 [2025-06-19 23:49:59,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.99 | bwd: 3313.49 | bwd_inner: 3312.49 | bwd_allreduce: 0.95 | step: 7.38 66%|██████▌ | 6560/10000 [10:20:20<5:14:59, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.05144985020160675, 'learning_rate': 1.1180862819244162e-05, 'epoch': 6.56} 66%|██████▌ | 6560/10000 [10:20:20<5:14:59, 5.49s/it][2025-06-19 23:50:05,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:50:05,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.89 | bwd_microstep: 3359.21 | 
bwd_inner_microstep: 3358.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-19 23:50:05,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.89 | bwd: 3359.23 | bwd_inner: 3358.41 | bwd_allreduce: 0.78 | step: 6.90 66%|██████▌ | 6561/10000 [10:20:25<5:15:24, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001068762387149036, 'learning_rate': 1.1175049532791747e-05, 'epoch': 6.56} 66%|██████▌ | 6561/10000 [10:20:25<5:15:24, 5.50s/it][2025-06-19 23:50:10,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:50:10,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.17 | bwd_microstep: 3312.44 | bwd_inner_microstep: 3311.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-19 23:50:10,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.17 | bwd: 3312.45 | bwd_inner: 3311.64 | bwd_allreduce: 0.76 | step: 6.82 66%|██████▌ | 6562/10000 [10:20:31<5:14:27, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0016696812817826867, 'learning_rate': 1.1169237172035669e-05, 'epoch': 6.56} 66%|██████▌ | 6562/10000 [10:20:31<5:14:27, 5.49s/it][2025-06-19 23:50:15,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:50:15,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.30 | bwd_microstep: 3311.29 | bwd_inner_microstep: 3310.28 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.40 [2025-06-19 23:50:15,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.30 | bwd: 3311.31 | bwd_inner: 3310.28 | bwd_allreduce: 0.97 | step: 7.41 66%|██████▌ | 6563/10000 [10:20:36<5:13:50, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.02538730390369892, 'learning_rate': 1.1163425737585629e-05, 'epoch': 6.56} 66%|██████▌ | 6563/10000 [10:20:36<5:13:50, 5.48s/it][2025-06-19 23:50:21,447] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-19 23:50:21,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.20 | bwd_microstep: 3315.18 | bwd_inner_microstep: 3314.04 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.68 [2025-06-19 23:50:21,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.20 | bwd: 3315.20 | bwd_inner: 3314.04 | bwd_allreduce: 1.09 | step: 7.69 66%|██████▌ | 6564/10000 [10:20:42<5:13:40, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.13270176947116852, 'learning_rate': 1.1157615230051215e-05, 'epoch': 6.56} 66%|██████▌ | 6564/10000 [10:20:42<5:13:40, 5.48s/it][2025-06-19 23:50:27,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:50:27,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.95 | bwd_microstep: 3374.93 | bwd_inner_microstep: 3373.95 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.46 [2025-06-19 23:50:27,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.95 | bwd: 3374.95 | bwd_inner: 3373.95 | bwd_allreduce: 0.95 | step: 7.46 66%|██████▌ | 6565/10000 [10:20:47<5:14:48, 5.50s/it] {'loss': 0.0006, 'grad_norm': 0.08027611672878265, 'learning_rate': 1.1151805650041924e-05, 'epoch': 6.56} 66%|██████▌ | 6565/10000 [10:20:47<5:14:48, 5.50s/it][2025-06-19 23:50:32,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:50:32,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.63 | bwd_microstep: 3387.70 | bwd_inner_microstep: 3386.63 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.30 [2025-06-19 23:50:32,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.63 | bwd: 3387.72 | bwd_inner: 3386.63 | bwd_allreduce: 1.03 | step: 7.30 
66%|██████▌ | 6566/10000 [10:20:53<5:16:01, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.001315712695941329, 'learning_rate': 1.1145996998167166e-05, 'epoch': 6.57} 66%|██████▌ | 6566/10000 [10:20:53<5:16:01, 5.52s/it][2025-06-19 23:50:38,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-19 23:50:38,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.39 | bwd_microstep: 3364.45 | bwd_inner_microstep: 3363.31 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.39 [2025-06-19 23:50:38,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.39 | bwd: 3364.47 | bwd_inner: 3363.31 | bwd_allreduce: 1.09 | step: 7.39 66%|██████▌ | 6567/10000 [10:20:58<5:16:02, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0006938516744412482, 'learning_rate': 1.1140189275036221e-05, 'epoch': 6.57} 66%|██████▌ | 6567/10000 [10:20:58<5:16:02, 5.52s/it][2025-06-19 23:50:43,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:50:43,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.80 | bwd_microstep: 3321.94 | bwd_inner_microstep: 3321.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-19 23:50:43,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.80 | bwd: 3321.95 | bwd_inner: 3321.14 | bwd_allreduce: 0.77 | step: 6.86 66%|██████▌ | 6568/10000 [10:21:04<5:15:06, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0012793553760275245, 'learning_rate': 1.1134382481258302e-05, 'epoch': 6.57} 66%|██████▌ | 6568/10000 [10:21:04<5:15:06, 5.51s/it][2025-06-19 23:50:49,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 23:50:49,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.35 | bwd_microstep: 3319.35 | 
bwd_inner_microstep: 3318.30 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.37 [2025-06-19 23:50:49,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.35 | bwd: 3319.37 | bwd_inner: 3318.30 | bwd_allreduce: 1.02 | step: 7.38 66%|██████▌ | 6569/10000 [10:21:09<5:14:17, 5.50s/it] {'loss': 0.0079, 'grad_norm': 2.088819742202759, 'learning_rate': 1.112857661744251e-05, 'epoch': 6.57} 66%|██████▌ | 6569/10000 [10:21:09<5:14:17, 5.50s/it][2025-06-19 23:50:54,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:50:54,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.30 | bwd_microstep: 3311.06 | bwd_inner_microstep: 3310.22 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.83 [2025-06-19 23:50:54,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.30 | bwd: 3311.08 | bwd_inner: 3310.22 | bwd_allreduce: 0.80 | step: 6.84 66%|██████▌ | 6570/10000 [10:21:15<5:13:33, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.02365155704319477, 'learning_rate': 1.112277168419786e-05, 'epoch': 6.57} 66%|██████▌ | 6570/10000 [10:21:15<5:13:33, 5.49s/it][2025-06-19 23:51:00,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-19 23:51:00,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.43 | bwd_microstep: 3362.62 | bwd_inner_microstep: 3361.57 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.97 [2025-06-19 23:51:00,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.43 | bwd: 3362.64 | bwd_inner: 3361.57 | bwd_allreduce: 1.01 | step: 7.98 66%|██████▌ | 6571/10000 [10:21:20<5:14:11, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.1778372824192047, 'learning_rate': 1.1116967682133271e-05, 'epoch': 6.57} 66%|██████▌ | 6571/10000 [10:21:20<5:14:11, 5.50s/it][2025-06-19 23:51:05,568] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:51:05,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.51 | bwd_microstep: 3360.02 | bwd_inner_microstep: 3359.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-19 23:51:05,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.51 | bwd: 3360.04 | bwd_inner: 3359.21 | bwd_allreduce: 0.78 | step: 6.90 66%|██████▌ | 6572/10000 [10:21:26<5:14:45, 5.51s/it] {'loss': 0.002, 'grad_norm': 0.3631655275821686, 'learning_rate': 1.1111164611857532e-05, 'epoch': 6.57} 66%|██████▌ | 6572/10000 [10:21:26<5:14:45, 5.51s/it][2025-06-19 23:51:11,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:51:11,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.48 | bwd_microstep: 3320.58 | bwd_inner_microstep: 3319.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 23:51:11,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.48 | bwd: 3320.60 | bwd_inner: 3319.79 | bwd_allreduce: 0.77 | step: 7.02 66%|██████▌ | 6573/10000 [10:21:31<5:13:55, 5.50s/it] {'loss': 0.0055, 'grad_norm': 1.080736517906189, 'learning_rate': 1.1105362473979371e-05, 'epoch': 6.57} 66%|██████▌ | 6573/10000 [10:21:31<5:13:55, 5.50s/it][2025-06-19 23:51:16,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:51:16,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.14 | bwd_microstep: 3365.12 | bwd_inner_microstep: 3364.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 23:51:16,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.14 | bwd: 3365.13 | bwd_inner: 3364.31 | bwd_allreduce: 0.77 | step: 7.17 
66%|██████▌ | 6574/10000 [10:21:37<5:14:29, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.020052509382367134, 'learning_rate': 1.1099561269107408e-05, 'epoch': 6.57} 66%|██████▌ | 6574/10000 [10:21:37<5:14:29, 5.51s/it][2025-06-19 23:51:22,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:51:22,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.23 | bwd_microstep: 3367.98 | bwd_inner_microstep: 3367.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-19 23:51:22,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.23 | bwd: 3368.00 | bwd_inner: 3367.17 | bwd_allreduce: 0.78 | step: 7.17 66%|██████▌ | 6575/10000 [10:21:42<5:14:52, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.009137452580034733, 'learning_rate': 1.1093760997850158e-05, 'epoch': 6.58} 66%|██████▌ | 6575/10000 [10:21:42<5:14:52, 5.52s/it][2025-06-19 23:51:27,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:51:27,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.49 | bwd_microstep: 3307.97 | bwd_inner_microstep: 3307.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 23:51:27,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.49 | bwd: 3307.98 | bwd_inner: 3307.18 | bwd_allreduce: 0.76 | step: 6.68 66%|██████▌ | 6576/10000 [10:21:48<5:13:33, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.028469813987612724, 'learning_rate': 1.1087961660816045e-05, 'epoch': 6.58} 66%|██████▌ | 6576/10000 [10:21:48<5:13:33, 5.49s/it][2025-06-19 23:51:33,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:51:33,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.46 | bwd_microstep: 3320.77 
| bwd_inner_microstep: 3319.84 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.63 [2025-06-19 23:51:33,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.46 | bwd: 3320.78 | bwd_inner: 3319.84 | bwd_allreduce: 0.90 | step: 7.63 66%|██████▌ | 6577/10000 [10:21:53<5:12:53, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004511710721999407, 'learning_rate': 1.108216325861339e-05, 'epoch': 6.58} 66%|██████▌ | 6577/10000 [10:21:53<5:12:53, 5.48s/it][2025-06-19 23:51:38,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:51:38,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.49 | bwd_microstep: 3360.36 | bwd_inner_microstep: 3359.55 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-19 23:51:38,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.49 | bwd: 3360.37 | bwd_inner: 3359.55 | bwd_allreduce: 0.78 | step: 6.77 66%|██████▌ | 6578/10000 [10:21:59<5:13:29, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0015180397313088179, 'learning_rate': 1.1076365791850423e-05, 'epoch': 6.58} 66%|██████▌ | 6578/10000 [10:21:59<5:13:29, 5.50s/it][2025-06-19 23:51:44,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:51:44,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.59 | bwd_microstep: 3368.56 | bwd_inner_microstep: 3367.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 23:51:44,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.59 | bwd: 3368.57 | bwd_inner: 3367.75 | bwd_allreduce: 0.77 | step: 7.11 66%|██████▌ | 6579/10000 [10:22:04<5:14:10, 5.51s/it] {'loss': 0.0032, 'grad_norm': 0.7043624520301819, 'learning_rate': 1.1070569261135267e-05, 'epoch': 6.58} 66%|██████▌ | 6579/10000 [10:22:04<5:14:10, 5.51s/it][2025-06-19 23:51:49,521] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:51:49,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.33 | bwd_microstep: 3308.20 | bwd_inner_microstep: 3307.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 23:51:49,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.33 | bwd: 3308.21 | bwd_inner: 3307.41 | bwd_allreduce: 0.76 | step: 6.72 66%|██████▌ | 6580/10000 [10:22:10<5:12:56, 5.49s/it] {'loss': 0.0119, 'grad_norm': 4.8783793449401855, 'learning_rate': 1.1064773667075955e-05, 'epoch': 6.58} 66%|██████▌ | 6580/10000 [10:22:10<5:12:56, 5.49s/it][2025-06-19 23:51:54,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:51:54,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.34 | bwd_microstep: 3308.61 | bwd_inner_microstep: 3307.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 23:51:54,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.34 | bwd: 3308.63 | bwd_inner: 3307.81 | bwd_allreduce: 0.77 | step: 6.76 66%|██████▌ | 6581/10000 [10:22:15<5:12:09, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0006879952270537615, 'learning_rate': 1.1058979010280411e-05, 'epoch': 6.58} 66%|██████▌ | 6581/10000 [10:22:15<5:12:09, 5.48s/it][2025-06-19 23:52:00,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:52:00,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.63 | bwd_microstep: 3314.67 | bwd_inner_microstep: 3313.81 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.25 [2025-06-19 23:52:00,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.63 | bwd: 3314.69 | bwd_inner: 3313.81 | bwd_allreduce: 0.83 | step: 7.26 
66%|██████▌ | 6582/10000 [10:22:21<5:11:41, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.013344816863536835, 'learning_rate': 1.1053185291356483e-05, 'epoch': 6.58} 66%|██████▌ | 6582/10000 [10:22:21<5:11:41, 5.47s/it][2025-06-19 23:52:05,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:52:05,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.61 | bwd_microstep: 3364.86 | bwd_inner_microstep: 3364.04 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.85 [2025-06-19 23:52:05,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.61 | bwd: 3364.87 | bwd_inner: 3364.04 | bwd_allreduce: 0.78 | step: 6.86 66%|██████▌ | 6583/10000 [10:22:26<5:12:33, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0021741304080933332, 'learning_rate': 1.1047392510911888e-05, 'epoch': 6.58} 66%|██████▌ | 6583/10000 [10:22:26<5:12:33, 5.49s/it][2025-06-19 23:52:11,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:52:11,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.39 | bwd_microstep: 3320.47 | bwd_inner_microstep: 3319.59 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.94 [2025-06-19 23:52:11,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.39 | bwd: 3320.48 | bwd_inner: 3319.59 | bwd_allreduce: 0.84 | step: 6.94 66%|██████▌ | 6584/10000 [10:22:32<5:12:03, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.030147472396492958, 'learning_rate': 1.1041600669554264e-05, 'epoch': 6.58} 66%|██████▌ | 6584/10000 [10:22:32<5:12:03, 5.48s/it][2025-06-19 23:52:16,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:52:16,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.59 | bwd_microstep: 3310.01 | 
bwd_inner_microstep: 3309.07 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.49 [2025-06-19 23:52:16,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.59 | bwd: 3310.03 | bwd_inner: 3309.07 | bwd_allreduce: 0.91 | step: 7.49 66%|██████▌ | 6585/10000 [10:22:37<5:11:32, 5.47s/it] {'loss': 0.0012, 'grad_norm': 0.45016443729400635, 'learning_rate': 1.1035809767891154e-05, 'epoch': 6.58} 66%|██████▌ | 6585/10000 [10:22:37<5:11:32, 5.47s/it][2025-06-19 23:52:22,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:52:22,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.67 | bwd_microstep: 3313.66 | bwd_inner_microstep: 3312.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 23:52:22,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.67 | bwd: 3313.67 | bwd_inner: 3312.86 | bwd_allreduce: 0.77 | step: 6.90 66%|██████▌ | 6586/10000 [10:22:43<5:11:09, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0056857033632695675, 'learning_rate': 1.1030019806529996e-05, 'epoch': 6.59} 66%|██████▌ | 6586/10000 [10:22:43<5:11:09, 5.47s/it][2025-06-19 23:52:27,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:52:27,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.06 | bwd_microstep: 3351.20 | bwd_inner_microstep: 3350.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 23:52:27,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.06 | bwd: 3351.22 | bwd_inner: 3350.42 | bwd_allreduce: 0.75 | step: 6.62 66%|██████▌ | 6587/10000 [10:22:48<5:11:48, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.07950493693351746, 'learning_rate': 1.1024230786078141e-05, 'epoch': 6.59} 66%|██████▌ | 6587/10000 [10:22:48<5:11:48, 5.48s/it][2025-06-19 23:52:33,288] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 23:52:33,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.58 | bwd_microstep: 3310.01 | bwd_inner_microstep: 3309.06 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.75 [2025-06-19 23:52:33,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.58 | bwd: 3310.03 | bwd_inner: 3309.06 | bwd_allreduce: 0.92 | step: 7.75 66%|██████▌ | 6588/10000 [10:22:54<5:11:10, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.026108331978321075, 'learning_rate': 1.1018442707142807e-05, 'epoch': 6.59} 66%|██████▌ | 6588/10000 [10:22:54<5:11:10, 5.47s/it][2025-06-19 23:52:38,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.82 [2025-06-19 23:52:38,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.56 | bwd_microstep: 3312.21 | bwd_inner_microstep: 3311.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 23:52:38,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.56 | bwd: 3312.22 | bwd_inner: 3311.43 | bwd_allreduce: 0.76 | step: 6.69 66%|██████▌ | 6589/10000 [10:22:59<5:10:49, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0012427791953086853, 'learning_rate': 1.1012655570331148e-05, 'epoch': 6.59} 66%|██████▌ | 6589/10000 [10:22:59<5:10:49, 5.47s/it][2025-06-19 23:52:44,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:52:44,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.41 | bwd_microstep: 3318.56 | bwd_inner_microstep: 3317.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-19 23:52:44,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.41 | bwd: 3318.57 | bwd_inner: 3317.78 | bwd_allreduce: 0.75 | step: 6.56 
66%|██████▌ | 6590/10000 [10:23:05<5:10:33, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0019313397351652384, 'learning_rate': 1.1006869376250209e-05, 'epoch': 6.59} 66%|██████▌ | 6590/10000 [10:23:05<5:10:33, 5.46s/it][2025-06-19 23:52:49,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.91 [2025-06-19 23:52:49,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.16 | bwd_microstep: 3303.72 | bwd_inner_microstep: 3302.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.29 [2025-06-19 23:52:49,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.16 | bwd: 3303.73 | bwd_inner: 3302.92 | bwd_allreduce: 0.77 | step: 7.29 66%|██████▌ | 6591/10000 [10:23:10<5:10:04, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.006034405902028084, 'learning_rate': 1.1001084125506933e-05, 'epoch': 6.59} 66%|██████▌ | 6591/10000 [10:23:10<5:10:04, 5.46s/it][2025-06-19 23:52:55,175] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:52:55,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.59 | bwd_microstep: 3358.63 | bwd_inner_microstep: 3357.82 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-19 23:52:55,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.59 | bwd: 3358.64 | bwd_inner: 3357.82 | bwd_allreduce: 0.78 | step: 6.87 66%|██████▌ | 6592/10000 [10:23:15<5:11:11, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00012253577006049454, 'learning_rate': 1.0995299818708167e-05, 'epoch': 6.59} 66%|██████▌ | 6592/10000 [10:23:15<5:11:11, 5.48s/it][2025-06-19 23:53:00,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.98 [2025-06-19 23:53:00,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.88 | bwd_microstep: 3310.81 | 
bwd_inner_microstep: 3309.95 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.19 [2025-06-19 23:53:00,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.88 | bwd: 3310.83 | bwd_inner: 3309.95 | bwd_allreduce: 0.83 | step: 7.19 66%|██████▌ | 6593/10000 [10:23:21<5:10:41, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.025868551805615425, 'learning_rate': 1.098951645646066e-05, 'epoch': 6.59} 66%|██████▌ | 6593/10000 [10:23:21<5:10:41, 5.47s/it][2025-06-19 23:53:06,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:53:06,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.83 | bwd_microstep: 3388.26 | bwd_inner_microstep: 3387.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-19 23:53:06,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.83 | bwd: 3388.28 | bwd_inner: 3387.47 | bwd_allreduce: 0.76 | step: 6.66 66%|██████▌ | 6594/10000 [10:23:26<5:12:06, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0012316048378124833, 'learning_rate': 1.0983734039371056e-05, 'epoch': 6.59} 66%|██████▌ | 6594/10000 [10:23:26<5:12:06, 5.50s/it][2025-06-19 23:53:11,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:53:11,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.20 | bwd_microstep: 3365.04 | bwd_inner_microstep: 3364.15 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.97 [2025-06-19 23:53:11,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.20 | bwd: 3365.06 | bwd_inner: 3364.15 | bwd_allreduce: 0.86 | step: 6.98 66%|██████▌ | 6595/10000 [10:23:32<5:12:26, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0022675616201013327, 'learning_rate': 1.0977952568045906e-05, 'epoch': 6.59} 66%|██████▌ | 6595/10000 [10:23:32<5:12:26, 5.51s/it][2025-06-19 23:53:17,182] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:53:17,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.38 | bwd_microstep: 3325.51 | bwd_inner_microstep: 3324.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-19 23:53:17,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.38 | bwd: 3325.53 | bwd_inner: 3324.72 | bwd_allreduce: 0.77 | step: 6.69 66%|██████▌ | 6596/10000 [10:23:37<5:11:42, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0036718356423079967, 'learning_rate': 1.0972172043091658e-05, 'epoch': 6.6} 66%|██████▌ | 6596/10000 [10:23:37<5:11:42, 5.49s/it][2025-06-19 23:53:22,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:53:22,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.78 | bwd_microstep: 3366.96 | bwd_inner_microstep: 3366.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-19 23:53:22,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.78 | bwd: 3366.97 | bwd_inner: 3366.15 | bwd_allreduce: 0.78 | step: 7.10 66%|██████▌ | 6597/10000 [10:23:43<5:12:11, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.009797850623726845, 'learning_rate': 1.0966392465114675e-05, 'epoch': 6.6} 66%|██████▌ | 6597/10000 [10:23:43<5:12:11, 5.50s/it][2025-06-19 23:53:28,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:53:28,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.97 | bwd_microstep: 3362.21 | bwd_inner_microstep: 3361.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 23:53:28,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.97 | bwd: 3362.23 | bwd_inner: 3361.41 | bwd_allreduce: 0.77 | step: 6.69 
66%|██████▌ | 6598/10000 [10:23:49<5:12:35, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.012129060924053192, 'learning_rate': 1.0960613834721183e-05, 'epoch': 6.6} 66%|██████▌ | 6598/10000 [10:23:49<5:12:35, 5.51s/it][2025-06-19 23:53:33,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:53:33,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.13 | bwd_microstep: 3313.06 | bwd_inner_microstep: 3312.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-19 23:53:33,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.13 | bwd: 3313.08 | bwd_inner: 3312.27 | bwd_allreduce: 0.76 | step: 6.80 66%|██████▌ | 6599/10000 [10:23:54<5:11:31, 5.50s/it] {'loss': 0.0052, 'grad_norm': 1.2560462951660156, 'learning_rate': 1.095483615251735e-05, 'epoch': 6.6} 66%|██████▌ | 6599/10000 [10:23:54<5:11:31, 5.50s/it][2025-06-19 23:53:39,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:53:39,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.78 | bwd_microstep: 3310.78 | bwd_inner_microstep: 3309.98 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-19 23:53:39,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.78 | bwd: 3310.80 | bwd_inner: 3309.98 | bwd_allreduce: 0.77 | step: 6.72 66%|██████▌ | 6600/10000 [10:23:59<5:10:42, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.0608844980597496, 'learning_rate': 1.0949059419109225e-05, 'epoch': 6.6} 66%|██████▌ | 6600/10000 [10:23:59<5:10:42, 5.48s/it][2025-06-19 23:53:44,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:53:44,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.33 | bwd_microstep: 3318.17 | 
bwd_inner_microstep: 3317.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-19 23:53:44,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.33 | bwd: 3318.19 | bwd_inner: 3317.38 | bwd_allreduce: 0.77 | step: 6.88 66%|██████▌ | 6601/10000 [10:24:05<5:10:26, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.007024099584668875, 'learning_rate': 1.0943283635102757e-05, 'epoch': 6.6} 66%|██████▌ | 6601/10000 [10:24:05<5:10:26, 5.48s/it][2025-06-19 23:53:50,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:53:50,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.55 | bwd_microstep: 3320.86 | bwd_inner_microstep: 3320.06 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-19 23:53:50,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.55 | bwd: 3320.88 | bwd_inner: 3320.06 | bwd_allreduce: 0.77 | step: 7.09 66%|██████▌ | 6602/10000 [10:24:10<5:10:07, 5.48s/it] {'loss': 0.0628, 'grad_norm': 4.890285491943359, 'learning_rate': 1.093750880110381e-05, 'epoch': 6.6} 66%|██████▌ | 6602/10000 [10:24:10<5:10:07, 5.48s/it][2025-06-19 23:53:55,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:53:55,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.97 | bwd_microstep: 3318.23 | bwd_inner_microstep: 3317.26 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.44 [2025-06-19 23:53:55,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.97 | bwd: 3318.24 | bwd_inner: 3317.26 | bwd_allreduce: 0.94 | step: 7.45 66%|██████▌ | 6603/10000 [10:24:16<5:09:47, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00046644770191051066, 'learning_rate': 1.0931734917718125e-05, 'epoch': 6.6} 66%|██████▌ | 6603/10000 [10:24:16<5:09:47, 5.47s/it][2025-06-19 23:54:01,080] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:54:01,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.84 | bwd_microstep: 3364.01 | bwd_inner_microstep: 3363.22 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 23:54:01,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.84 | bwd: 3364.03 | bwd_inner: 3363.22 | bwd_allreduce: 0.76 | step: 6.67 66%|██████▌ | 6604/10000 [10:24:21<5:10:37, 5.49s/it] {'loss': 0.0007, 'grad_norm': 0.18352390825748444, 'learning_rate': 1.0925961985551359e-05, 'epoch': 6.6} 66%|██████▌ | 6604/10000 [10:24:21<5:10:37, 5.49s/it][2025-06-19 23:54:06,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:54:06,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.92 | bwd_microstep: 3367.31 | bwd_inner_microstep: 3366.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-19 23:54:06,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.92 | bwd: 3367.33 | bwd_inner: 3366.51 | bwd_allreduce: 0.77 | step: 7.01 66%|██████▌ | 6605/10000 [10:24:27<5:11:21, 5.50s/it] {'loss': 0.0009, 'grad_norm': 0.16243959963321686, 'learning_rate': 1.0920190005209066e-05, 'epoch': 6.61} 66%|██████▌ | 6605/10000 [10:24:27<5:11:21, 5.50s/it][2025-06-19 23:54:12,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:54:12,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.40 | bwd_microstep: 3319.25 | bwd_inner_microstep: 3318.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-19 23:54:12,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.40 | bwd: 3319.27 | bwd_inner: 3318.46 | bwd_allreduce: 0.77 | step: 6.99 
66%|██████▌ | 6606/10000 [10:24:32<5:10:38, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0012602756032720208, 'learning_rate': 1.0914418977296702e-05, 'epoch': 6.61} 66%|██████▌ | 6606/10000 [10:24:32<5:10:38, 5.49s/it][2025-06-19 23:54:17,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:54:17,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.04 | bwd_microstep: 3319.83 | bwd_inner_microstep: 3319.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:54:17,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.04 | bwd: 3319.85 | bwd_inner: 3319.04 | bwd_allreduce: 0.76 | step: 6.71 66%|██████▌ | 6607/10000 [10:24:38<5:10:08, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.033537011593580246, 'learning_rate': 1.0908648902419628e-05, 'epoch': 6.61} 66%|██████▌ | 6607/10000 [10:24:38<5:10:08, 5.48s/it][2025-06-19 23:54:23,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:54:23,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.22 | bwd_microstep: 3370.04 | bwd_inner_microstep: 3369.25 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:54:23,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.22 | bwd: 3370.05 | bwd_inner: 3369.25 | bwd_allreduce: 0.76 | step: 6.71 66%|██████▌ | 6608/10000 [10:24:43<5:10:50, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.349128395318985, 'learning_rate': 1.0902879781183077e-05, 'epoch': 6.61} 66%|██████▌ | 6608/10000 [10:24:43<5:10:50, 5.50s/it][2025-06-19 23:54:28,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:54:28,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.44 | bwd_microstep: 3322.15 | 
bwd_inner_microstep: 3321.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-19 23:54:28,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.44 | bwd: 3322.16 | bwd_inner: 3321.34 | bwd_allreduce: 0.78 | step: 7.08 66%|██████▌ | 6609/10000 [10:24:49<5:10:18, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0002770178543869406, 'learning_rate': 1.0897111614192223e-05, 'epoch': 6.61} 66%|██████▌ | 6609/10000 [10:24:49<5:10:18, 5.49s/it][2025-06-19 23:54:34,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:54:34,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.54 | bwd_microstep: 3322.90 | bwd_inner_microstep: 3321.94 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.04 [2025-06-19 23:54:34,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.54 | bwd: 3322.91 | bwd_inner: 3321.94 | bwd_allreduce: 0.92 | step: 7.05 66%|██████▌ | 6610/10000 [10:24:54<5:09:50, 5.48s/it] {'loss': 0.0119, 'grad_norm': 2.7418901920318604, 'learning_rate': 1.0891344402052109e-05, 'epoch': 6.61} 66%|██████▌ | 6610/10000 [10:24:54<5:09:50, 5.48s/it][2025-06-19 23:54:39,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:54:39,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.35 | bwd_microstep: 3374.07 | bwd_inner_microstep: 3373.28 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 23:54:39,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.35 | bwd: 3374.08 | bwd_inner: 3373.28 | bwd_allreduce: 0.76 | step: 6.73 66%|██████▌ | 6611/10000 [10:25:00<5:10:45, 5.50s/it] {'loss': 0.0647, 'grad_norm': 7.213029861450195, 'learning_rate': 1.0885578145367692e-05, 'epoch': 6.61} 66%|██████▌ | 6611/10000 [10:25:00<5:10:45, 5.50s/it][2025-06-19 23:54:45,028] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:54:45,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.00 | bwd_microstep: 3320.84 | bwd_inner_microstep: 3320.01 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.20 [2025-06-19 23:54:45,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.00 | bwd: 3320.85 | bwd_inner: 3320.01 | bwd_allreduce: 0.80 | step: 7.20 66%|██████▌ | 6612/10000 [10:25:05<5:10:02, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.007761984597891569, 'learning_rate': 1.0879812844743829e-05, 'epoch': 6.61} 66%|██████▌ | 6612/10000 [10:25:05<5:10:02, 5.49s/it][2025-06-19 23:54:50,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:54:50,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.89 | bwd_microstep: 3321.27 | bwd_inner_microstep: 3320.48 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-19 23:54:50,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.89 | bwd: 3321.28 | bwd_inner: 3320.48 | bwd_allreduce: 0.76 | step: 6.69 66%|██████▌ | 6613/10000 [10:25:11<5:09:32, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.017939304932951927, 'learning_rate': 1.0874048500785268e-05, 'epoch': 6.61} 66%|██████▌ | 6613/10000 [10:25:11<5:09:32, 5.48s/it][2025-06-19 23:54:55,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:54:55,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.16 | bwd_microstep: 3323.26 | bwd_inner_microstep: 3322.48 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.57 [2025-06-19 23:54:55,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.16 | bwd: 3323.27 | bwd_inner: 3322.48 | bwd_allreduce: 0.75 | step: 6.58 
66%|██████▌ | 6614/10000 [10:25:16<5:09:15, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.16278013586997986, 'learning_rate': 1.0868285114096664e-05, 'epoch': 6.61} 66%|██████▌ | 6614/10000 [10:25:16<5:09:15, 5.48s/it][2025-06-19 23:55:01,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:55:01,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.80 | bwd_microstep: 3325.63 | bwd_inner_microstep: 3324.77 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.48 [2025-06-19 23:55:01,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.80 | bwd: 3325.65 | bwd_inner: 3324.77 | bwd_allreduce: 0.82 | step: 7.48 66%|██████▌ | 6615/10000 [10:25:22<5:09:12, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0019584388937801123, 'learning_rate': 1.0862522685282573e-05, 'epoch': 6.62} 66%|██████▌ | 6615/10000 [10:25:22<5:09:12, 5.48s/it][2025-06-19 23:55:06,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:55:06,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.25 | bwd_microstep: 3320.13 | bwd_inner_microstep: 3319.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-19 23:55:06,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.25 | bwd: 3320.14 | bwd_inner: 3319.35 | bwd_allreduce: 0.75 | step: 6.65 66%|██████▌ | 6616/10000 [10:25:27<5:08:53, 5.48s/it] {'loss': 0.0025, 'grad_norm': 0.668233335018158, 'learning_rate': 1.085676121494744e-05, 'epoch': 6.62} 66%|██████▌ | 6616/10000 [10:25:27<5:08:53, 5.48s/it][2025-06-19 23:55:12,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:55:12,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.05 | bwd_microstep: 3321.59 | 
bwd_inner_microstep: 3320.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-19 23:55:12,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.05 | bwd: 3321.60 | bwd_inner: 3320.80 | bwd_allreduce: 0.75 | step: 6.58 66%|██████▌ | 6617/10000 [10:25:33<5:08:42, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.07729045301675797, 'learning_rate': 1.0851000703695633e-05, 'epoch': 6.62} 66%|██████▌ | 6617/10000 [10:25:33<5:08:42, 5.48s/it][2025-06-19 23:55:17,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:55:17,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.15 | bwd_microstep: 3326.43 | bwd_inner_microstep: 3325.34 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.51 [2025-06-19 23:55:17,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.15 | bwd: 3326.45 | bwd_inner: 3325.34 | bwd_allreduce: 1.06 | step: 7.52 66%|██████▌ | 6618/10000 [10:25:38<5:08:34, 5.47s/it] {'loss': 0.0008, 'grad_norm': 0.15854744613170624, 'learning_rate': 1.0845241152131377e-05, 'epoch': 6.62} 66%|██████▌ | 6618/10000 [10:25:38<5:08:34, 5.47s/it][2025-06-19 23:55:23,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:55:23,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.16 | bwd_microstep: 3322.04 | bwd_inner_microstep: 3321.07 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.36 [2025-06-19 23:55:23,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.16 | bwd: 3322.06 | bwd_inner: 3321.07 | bwd_allreduce: 0.95 | step: 7.37 66%|██████▌ | 6619/10000 [10:25:44<5:08:26, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004521416500210762, 'learning_rate': 1.083948256085884e-05, 'epoch': 6.62} 66%|██████▌ | 6619/10000 [10:25:44<5:08:26, 5.47s/it][2025-06-19 23:55:28,817] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:55:28,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.26 | bwd_microstep: 3328.41 | bwd_inner_microstep: 3327.61 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-19 23:55:28,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.26 | bwd: 3328.43 | bwd_inner: 3327.61 | bwd_allreduce: 0.77 | step: 7.17 66%|██████▌ | 6620/10000 [10:25:49<5:08:31, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.05092570185661316, 'learning_rate': 1.0833724930482066e-05, 'epoch': 6.62} 66%|██████▌ | 6620/10000 [10:25:49<5:08:31, 5.48s/it][2025-06-19 23:55:34,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:55:34,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.29 | bwd_microstep: 3324.40 | bwd_inner_microstep: 3323.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-19 23:55:34,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.29 | bwd: 3324.42 | bwd_inner: 3323.61 | bwd_allreduce: 0.76 | step: 6.73 66%|██████▌ | 6621/10000 [10:25:55<5:08:19, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0008271734695881605, 'learning_rate': 1.0827968261605007e-05, 'epoch': 6.62} 66%|██████▌ | 6621/10000 [10:25:55<5:08:19, 5.47s/it][2025-06-19 23:55:39,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:55:39,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.29 | bwd_microstep: 3377.51 | bwd_inner_microstep: 3376.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-19 23:55:39,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.29 | bwd: 3377.52 | bwd_inner: 3376.73 | bwd_allreduce: 0.75 | step: 6.63 
66%|██████▌ | 6622/10000 [10:26:00<5:09:26, 5.50s/it] {'loss': 0.1091, 'grad_norm': 9.562271118164062, 'learning_rate': 1.0822212554831513e-05, 'epoch': 6.62} 66%|██████▌ | 6622/10000 [10:26:00<5:09:26, 5.50s/it][2025-06-19 23:55:45,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:55:45,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.06 | bwd_microstep: 3323.92 | bwd_inner_microstep: 3322.97 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.52 [2025-06-19 23:55:45,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.06 | bwd: 3323.94 | bwd_inner: 3322.97 | bwd_allreduce: 0.92 | step: 7.52 66%|██████▌ | 6623/10000 [10:26:06<5:08:53, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006654575001448393, 'learning_rate': 1.081645781076532e-05, 'epoch': 6.62} 66%|██████▌ | 6623/10000 [10:26:06<5:08:53, 5.49s/it][2025-06-19 23:55:50,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:55:50,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.07 | bwd_microstep: 3328.01 | bwd_inner_microstep: 3327.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-19 23:55:50,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.07 | bwd: 3328.02 | bwd_inner: 3327.23 | bwd_allreduce: 0.75 | step: 6.68 66%|██████▌ | 6624/10000 [10:26:11<5:08:37, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.013812638819217682, 'learning_rate': 1.0810704030010078e-05, 'epoch': 6.62} 66%|██████▌ | 6624/10000 [10:26:11<5:08:37, 5.49s/it][2025-06-19 23:55:56,260] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:55:56,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.48 | bwd_microstep: 3333.39 | 
bwd_inner_microstep: 3332.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-19 23:55:56,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.48 | bwd: 3333.40 | bwd_inner: 3332.60 | bwd_allreduce: 0.75 | step: 6.57 66%|██████▋ | 6625/10000 [10:26:17<5:08:25, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0012893956154584885, 'learning_rate': 1.080495121316934e-05, 'epoch': 6.62} 66%|██████▋ | 6625/10000 [10:26:17<5:08:25, 5.48s/it][2025-06-19 23:56:01,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:56:01,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.37 | bwd_microstep: 3321.72 | bwd_inner_microstep: 3320.82 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.82 [2025-06-19 23:56:01,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.37 | bwd: 3321.73 | bwd_inner: 3320.82 | bwd_allreduce: 0.87 | step: 6.83 66%|██████▋ | 6626/10000 [10:26:22<5:08:05, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.000592377211432904, 'learning_rate': 1.0799199360846538e-05, 'epoch': 6.63} 66%|██████▋ | 6626/10000 [10:26:22<5:08:05, 5.48s/it][2025-06-19 23:56:07,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:56:07,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.95 | bwd_microstep: 3327.06 | bwd_inner_microstep: 3326.28 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-19 23:56:07,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.95 | bwd: 3327.08 | bwd_inner: 3326.28 | bwd_allreduce: 0.76 | step: 6.65 66%|██████▋ | 6627/10000 [10:26:28<5:07:55, 5.48s/it] {'loss': 0.001, 'grad_norm': 0.1736319214105606, 'learning_rate': 1.0793448473645026e-05, 'epoch': 6.63} 66%|██████▋ | 6627/10000 [10:26:28<5:07:55, 5.48s/it][2025-06-19 23:56:12,691] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:56:12,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.11 | bwd_microstep: 3335.14 | bwd_inner_microstep: 3334.22 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.02 [2025-06-19 23:56:12,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.11 | bwd: 3335.15 | bwd_inner: 3334.22 | bwd_allreduce: 0.88 | step: 7.02 66%|██████▋ | 6628/10000 [10:26:33<5:07:59, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.01305609755218029, 'learning_rate': 1.0787698552168035e-05, 'epoch': 6.63} 66%|██████▋ | 6628/10000 [10:26:33<5:07:59, 5.48s/it][2025-06-19 23:56:18,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:56:18,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.63 | bwd_microstep: 3323.01 | bwd_inner_microstep: 3322.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-19 23:56:18,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.63 | bwd: 3323.03 | bwd_inner: 3322.22 | bwd_allreduce: 0.77 | step: 7.03 66%|██████▋ | 6629/10000 [10:26:38<5:07:48, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0013144094264134765, 'learning_rate': 1.0781949597018713e-05, 'epoch': 6.63} 66%|██████▋ | 6629/10000 [10:26:38<5:07:48, 5.48s/it][2025-06-19 23:56:23,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:56:23,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.55 | bwd_microstep: 3323.60 | bwd_inner_microstep: 3322.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:56:23,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.55 | bwd: 3323.62 | bwd_inner: 3322.82 | bwd_allreduce: 0.75 | step: 6.66 
66%|██████▋ | 6630/10000 [10:26:44<5:07:46, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.10475358366966248, 'learning_rate': 1.0776201608800099e-05, 'epoch': 6.63} 66%|██████▋ | 6630/10000 [10:26:44<5:07:46, 5.48s/it][2025-06-19 23:56:29,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:56:29,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.55 | bwd_microstep: 3333.06 | bwd_inner_microstep: 3332.23 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.11 [2025-06-19 23:56:29,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.55 | bwd: 3333.07 | bwd_inner: 3332.23 | bwd_allreduce: 0.80 | step: 7.11 66%|██████▋ | 6631/10000 [10:26:49<5:07:45, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.023417338728904724, 'learning_rate': 1.0770454588115122e-05, 'epoch': 6.63} 66%|██████▋ | 6631/10000 [10:26:49<5:07:45, 5.48s/it][2025-06-19 23:56:34,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:56:34,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.82 | bwd_microstep: 3327.73 | bwd_inner_microstep: 3326.85 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.02 [2025-06-19 23:56:34,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.82 | bwd: 3327.74 | bwd_inner: 3326.85 | bwd_allreduce: 0.85 | step: 7.02 66%|██████▋ | 6632/10000 [10:26:55<5:07:39, 5.48s/it] {'loss': 0.002, 'grad_norm': 0.6007204055786133, 'learning_rate': 1.0764708535566633e-05, 'epoch': 6.63} 66%|██████▋ | 6632/10000 [10:26:55<5:07:39, 5.48s/it][2025-06-19 23:56:40,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:56:40,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.35 | bwd_microstep: 3338.00 | 
bwd_inner_microstep: 3337.09 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.08 [2025-06-19 23:56:40,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.35 | bwd: 3338.01 | bwd_inner: 3337.09 | bwd_allreduce: 0.88 | step: 7.09 66%|██████▋ | 6633/10000 [10:27:00<5:07:53, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.016447782516479492, 'learning_rate': 1.0758963451757349e-05, 'epoch': 6.63} 66%|██████▋ | 6633/10000 [10:27:00<5:07:53, 5.49s/it][2025-06-19 23:56:45,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:56:45,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.27 | bwd_microstep: 3379.31 | bwd_inner_microstep: 3378.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-19 23:56:45,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.27 | bwd: 3379.32 | bwd_inner: 3378.53 | bwd_allreduce: 0.75 | step: 6.60 66%|██████▋ | 6634/10000 [10:27:06<5:09:10, 5.51s/it] {'loss': 0.0006, 'grad_norm': 0.12138153612613678, 'learning_rate': 1.0753219337289908e-05, 'epoch': 6.63} 66%|██████▋ | 6634/10000 [10:27:06<5:09:10, 5.51s/it][2025-06-19 23:56:51,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:56:51,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.28 | bwd_microstep: 3323.85 | bwd_inner_microstep: 3323.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 23:56:51,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.28 | bwd: 3323.87 | bwd_inner: 3323.07 | bwd_allreduce: 0.75 | step: 6.63 66%|██████▋ | 6635/10000 [10:27:11<5:08:34, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0007211255724541843, 'learning_rate': 1.0747476192766846e-05, 'epoch': 6.63} 66%|██████▋ | 6635/10000 [10:27:11<5:08:34, 5.50s/it][2025-06-19 23:56:56,716] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:56:56,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.35 | bwd_microstep: 3373.40 | bwd_inner_microstep: 3372.53 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.43 [2025-06-19 23:56:56,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.36 | bwd: 3373.42 | bwd_inner: 3372.53 | bwd_allreduce: 0.83 | step: 7.43 66%|██████▋ | 6636/10000 [10:27:17<5:09:25, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.02438449114561081, 'learning_rate': 1.0741734018790588e-05, 'epoch': 6.64} 66%|██████▋ | 6636/10000 [10:27:17<5:09:25, 5.52s/it][2025-06-19 23:57:02,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:57:02,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.23 | bwd_microstep: 3385.01 | bwd_inner_microstep: 3383.93 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.26 [2025-06-19 23:57:02,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.23 | bwd: 3385.03 | bwd_inner: 3383.93 | bwd_allreduce: 1.04 | step: 7.26 66%|██████▋ | 6637/10000 [10:27:23<5:10:07, 5.53s/it] {'loss': 0.0533, 'grad_norm': 8.097352981567383, 'learning_rate': 1.0735992815963464e-05, 'epoch': 6.64} 66%|██████▋ | 6637/10000 [10:27:23<5:10:07, 5.53s/it][2025-06-19 23:57:07,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-19 23:57:07,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.33 | bwd_microstep: 3384.49 | bwd_inner_microstep: 3383.57 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.99 [2025-06-19 23:57:07,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.34 | bwd: 3384.50 | bwd_inner: 3383.57 | bwd_allreduce: 0.89 | step: 6.99 
66%|██████▋ | 6638/10000 [10:27:28<5:10:30, 5.54s/it] {'loss': 0.0645, 'grad_norm': 5.652052879333496, 'learning_rate': 1.0730252584887704e-05, 'epoch': 6.64} 66%|██████▋ | 6638/10000 [10:27:28<5:10:30, 5.54s/it][2025-06-19 23:57:13,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-19 23:57:13,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.02 | bwd_microstep: 3340.46 | bwd_inner_microstep: 3339.44 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.84 [2025-06-19 23:57:13,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.02 | bwd: 3340.49 | bwd_inner: 3339.44 | bwd_allreduce: 0.99 | step: 7.84 66%|██████▋ | 6639/10000 [10:27:34<5:09:37, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.012115963734686375, 'learning_rate': 1.0724513326165416e-05, 'epoch': 6.64} 66%|██████▋ | 6639/10000 [10:27:34<5:09:37, 5.53s/it][2025-06-19 23:57:18,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:57:18,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.23 | bwd_microstep: 3331.69 | bwd_inner_microstep: 3330.78 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.91 [2025-06-19 23:57:18,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.23 | bwd: 3331.70 | bwd_inner: 3330.78 | bwd_allreduce: 0.88 | step: 6.91 66%|██████▋ | 6640/10000 [10:27:39<5:08:49, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0016748798079788685, 'learning_rate': 1.0718775040398635e-05, 'epoch': 6.64} 66%|██████▋ | 6640/10000 [10:27:39<5:08:49, 5.51s/it][2025-06-19 23:57:24,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:57:24,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.29 | bwd_microstep: 3381.19 | 
bwd_inner_microstep: 3380.37 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.69 [2025-06-19 23:57:24,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.29 | bwd: 3381.20 | bwd_inner: 3380.37 | bwd_allreduce: 0.78 | step: 7.70 66%|██████▋ | 6641/10000 [10:27:45<5:09:25, 5.53s/it] {'loss': 0.0004, 'grad_norm': 0.08780603110790253, 'learning_rate': 1.0713037728189273e-05, 'epoch': 6.64} 66%|██████▋ | 6641/10000 [10:27:45<5:09:25, 5.53s/it][2025-06-19 23:57:29,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:57:29,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.08 | bwd_microstep: 3328.03 | bwd_inner_microstep: 3327.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-19 23:57:29,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.08 | bwd: 3328.05 | bwd_inner: 3327.24 | bwd_allreduce: 0.76 | step: 6.91 66%|██████▋ | 6642/10000 [10:27:50<5:08:42, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.017974374815821648, 'learning_rate': 1.0707301390139153e-05, 'epoch': 6.64} 66%|██████▋ | 6642/10000 [10:27:50<5:08:42, 5.52s/it][2025-06-19 23:57:35,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:57:35,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.50 | bwd_microstep: 3321.89 | bwd_inner_microstep: 3321.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-19 23:57:35,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.50 | bwd: 3321.90 | bwd_inner: 3321.10 | bwd_allreduce: 0.76 | step: 6.79 66%|██████▋ | 6643/10000 [10:27:56<5:07:54, 5.50s/it] {'loss': 0.0032, 'grad_norm': 0.475849449634552, 'learning_rate': 1.0701566026849997e-05, 'epoch': 6.64} 66%|██████▋ | 6643/10000 [10:27:56<5:07:54, 5.50s/it][2025-06-19 23:57:40,885] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-19 23:57:40,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.45 | bwd_microstep: 3368.70 | bwd_inner_microstep: 3367.73 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.26 [2025-06-19 23:57:40,886] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.45 | bwd: 3368.72 | bwd_inner: 3367.73 | bwd_allreduce: 0.94 | step: 7.27 66%|██████▋ | 6644/10000 [10:28:01<5:08:25, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.008705394342541695, 'learning_rate': 1.06958316389234e-05, 'epoch': 6.64} 66%|██████▋ | 6644/10000 [10:28:01<5:08:25, 5.51s/it][2025-06-19 23:57:46,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.89 [2025-06-19 23:57:46,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.53 | bwd_microstep: 3327.04 | bwd_inner_microstep: 3326.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.82 [2025-06-19 23:57:46,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.53 | bwd: 3327.06 | bwd_inner: 3326.26 | bwd_allreduce: 0.76 | step: 6.83 66%|██████▋ | 6645/10000 [10:28:07<5:07:46, 5.50s/it] {'loss': 0.0434, 'grad_norm': 14.09348201751709, 'learning_rate': 1.0690098226960883e-05, 'epoch': 6.64} 66%|██████▋ | 6645/10000 [10:28:07<5:07:46, 5.50s/it][2025-06-19 23:57:51,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:57:51,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.40 | bwd_microstep: 3328.11 | bwd_inner_microstep: 3327.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-19 23:57:51,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.40 | bwd: 3328.13 | bwd_inner: 3327.31 | bwd_allreduce: 0.77 | step: 6.75 
66%|██████▋ | 6646/10000 [10:28:12<5:07:19, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.000308257614960894, 'learning_rate': 1.0684365791563853e-05, 'epoch': 6.65} 66%|██████▋ | 6646/10000 [10:28:12<5:07:19, 5.50s/it][2025-06-19 23:57:57,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:57:57,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.90 | bwd_microstep: 3333.90 | bwd_inner_microstep: 3332.98 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.19 [2025-06-19 23:57:57,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.90 | bwd: 3333.92 | bwd_inner: 3332.98 | bwd_allreduce: 0.89 | step: 7.19 66%|██████▋ | 6647/10000 [10:28:18<5:07:02, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.023200782015919685, 'learning_rate': 1.0678634333333617e-05, 'epoch': 6.65} 66%|██████▋ | 6647/10000 [10:28:18<5:07:02, 5.49s/it][2025-06-19 23:58:02,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:58:02,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.43 | bwd_microstep: 3328.89 | bwd_inner_microstep: 3328.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-19 23:58:02,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.44 | bwd: 3328.91 | bwd_inner: 3328.11 | bwd_allreduce: 0.75 | step: 6.55 66%|██████▋ | 6648/10000 [10:28:23<5:06:38, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.09106504172086716, 'learning_rate': 1.0672903852871377e-05, 'epoch': 6.65} 66%|██████▋ | 6648/10000 [10:28:23<5:06:38, 5.49s/it][2025-06-19 23:58:08,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-19 23:58:08,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.81 | bwd_microstep: 3318.45 | 
bwd_inner_microstep: 3317.45 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.25 [2025-06-19 23:58:08,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.81 | bwd: 3318.47 | bwd_inner: 3317.45 | bwd_allreduce: 0.97 | step: 7.26 66%|██████▋ | 6649/10000 [10:28:29<5:06:10, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.05442163348197937, 'learning_rate': 1.0667174350778233e-05, 'epoch': 6.65} 66%|██████▋ | 6649/10000 [10:28:29<5:06:10, 5.48s/it][2025-06-19 23:58:13,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-19 23:58:13,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.69 | bwd_microstep: 3321.06 | bwd_inner_microstep: 3320.10 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.00 [2025-06-19 23:58:13,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.69 | bwd: 3321.07 | bwd_inner: 3320.10 | bwd_allreduce: 0.93 | step: 7.01 66%|██████▋ | 6650/10000 [10:28:34<5:05:57, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.022579222917556763, 'learning_rate': 1.0661445827655187e-05, 'epoch': 6.65} 66%|██████▋ | 6650/10000 [10:28:34<5:05:57, 5.48s/it][2025-06-19 23:58:19,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-19 23:58:19,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.76 | bwd_microstep: 3318.41 | bwd_inner_microstep: 3317.39 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.82 [2025-06-19 23:58:19,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.76 | bwd: 3318.43 | bwd_inner: 3317.39 | bwd_allreduce: 0.99 | step: 7.83 67%|██████▋ | 6651/10000 [10:28:40<5:05:45, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.020282544195652008, 'learning_rate': 1.065571828410313e-05, 'epoch': 6.65} 67%|██████▋ | 6651/10000 [10:28:40<5:05:45, 5.48s/it][2025-06-19 23:58:24,698] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:58:24,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.40 | bwd_microstep: 3320.34 | bwd_inner_microstep: 3319.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-19 23:58:24,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.40 | bwd: 3320.36 | bwd_inner: 3319.55 | bwd_allreduce: 0.77 | step: 6.75 67%|██████▋ | 6652/10000 [10:28:45<5:05:33, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004963803570717573, 'learning_rate': 1.064999172072286e-05, 'epoch': 6.65} 67%|██████▋ | 6652/10000 [10:28:45<5:05:33, 5.48s/it][2025-06-19 23:58:30,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:58:30,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.44 | bwd_microstep: 3329.20 | bwd_inner_microstep: 3328.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-19 23:58:30,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.44 | bwd: 3329.21 | bwd_inner: 3328.41 | bwd_allreduce: 0.76 | step: 6.66 67%|██████▋ | 6653/10000 [10:28:50<5:05:29, 5.48s/it] {'loss': 0.0027, 'grad_norm': 0.3890514373779297, 'learning_rate': 1.064426613811507e-05, 'epoch': 6.65} 67%|██████▋ | 6653/10000 [10:28:50<5:05:29, 5.48s/it][2025-06-19 23:58:35,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-19 23:58:35,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.80 | bwd_microstep: 3331.46 | bwd_inner_microstep: 3330.44 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.81 [2025-06-19 23:58:35,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.80 | bwd: 3331.47 | bwd_inner: 3330.44 | bwd_allreduce: 0.99 | step: 7.81 
67%|██████▋ | 6654/10000 [10:28:56<5:05:29, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.08985850214958191, 'learning_rate': 1.0638541536880333e-05, 'epoch': 6.65} 67%|██████▋ | 6654/10000 [10:28:56<5:05:29, 5.48s/it][2025-06-19 23:58:41,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:58:41,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.88 | bwd_microstep: 3376.48 | bwd_inner_microstep: 3375.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-19 23:58:41,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.88 | bwd: 3376.49 | bwd_inner: 3375.69 | bwd_allreduce: 0.76 | step: 6.69 67%|██████▋ | 6655/10000 [10:29:02<5:06:34, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.010614098981022835, 'learning_rate': 1.0632817917619141e-05, 'epoch': 6.66} 67%|██████▋ | 6655/10000 [10:29:02<5:06:34, 5.50s/it][2025-06-19 23:58:46,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:58:46,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.34 | bwd_microstep: 3318.63 | bwd_inner_microstep: 3317.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-19 23:58:46,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.34 | bwd: 3318.64 | bwd_inner: 3317.83 | bwd_allreduce: 0.77 | step: 6.72 67%|██████▋ | 6656/10000 [10:29:07<5:05:51, 5.49s/it] {'loss': 0.0032, 'grad_norm': 0.9896411299705505, 'learning_rate': 1.0627095280931874e-05, 'epoch': 6.66} 67%|██████▋ | 6656/10000 [10:29:07<5:05:51, 5.49s/it][2025-06-19 23:58:52,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:58:52,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.11 | bwd_microstep: 3324.09 | 
bwd_inner_microstep: 3323.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-19 23:58:52,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.11 | bwd: 3324.11 | bwd_inner: 3323.29 | bwd_allreduce: 0.78 | step: 7.15 67%|██████▋ | 6657/10000 [10:29:12<5:05:35, 5.48s/it] {'loss': 0.0502, 'grad_norm': 7.090639114379883, 'learning_rate': 1.0621373627418811e-05, 'epoch': 6.66} 67%|██████▋ | 6657/10000 [10:29:12<5:05:35, 5.48s/it][2025-06-19 23:58:57,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:58:57,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.89 | bwd_microstep: 3372.09 | bwd_inner_microstep: 3371.04 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.79 [2025-06-19 23:58:57,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.89 | bwd: 3372.11 | bwd_inner: 3371.04 | bwd_allreduce: 1.01 | step: 7.80 67%|██████▋ | 6658/10000 [10:29:18<5:06:31, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00801277905702591, 'learning_rate': 1.0615652957680135e-05, 'epoch': 6.66} 67%|██████▋ | 6658/10000 [10:29:18<5:06:31, 5.50s/it][2025-06-19 23:59:03,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-19 23:59:03,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.23 | bwd_microstep: 3365.29 | bwd_inner_microstep: 3364.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-19 23:59:03,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.23 | bwd: 3365.31 | bwd_inner: 3364.50 | bwd_allreduce: 0.76 | step: 6.72 67%|██████▋ | 6659/10000 [10:29:24<5:07:03, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.10276122391223907, 'learning_rate': 1.0609933272315904e-05, 'epoch': 6.66} 67%|██████▋ | 6659/10000 [10:29:24<5:07:03, 5.51s/it][2025-06-19 23:59:08,809] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:59:08,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.31 | bwd_microstep: 3398.62 | bwd_inner_microstep: 3397.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-19 23:59:08,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.31 | bwd: 3398.64 | bwd_inner: 3397.82 | bwd_allreduce: 0.77 | step: 6.89 67%|██████▋ | 6660/10000 [10:29:29<5:08:02, 5.53s/it] {'loss': 0.0004, 'grad_norm': 0.06751509010791779, 'learning_rate': 1.060421457192609e-05, 'epoch': 6.66} 67%|██████▋ | 6660/10000 [10:29:29<5:08:02, 5.53s/it][2025-06-19 23:59:14,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:59:14,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.96 | bwd_microstep: 3370.97 | bwd_inner_microstep: 3370.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-19 23:59:14,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.96 | bwd: 3370.99 | bwd_inner: 3370.19 | bwd_allreduce: 0.76 | step: 6.95 67%|██████▋ | 6661/10000 [10:29:35<5:08:03, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0012481383746489882, 'learning_rate': 1.059849685711056e-05, 'epoch': 6.66} 67%|██████▋ | 6661/10000 [10:29:35<5:08:03, 5.54s/it][2025-06-19 23:59:19,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:59:19,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.29 | bwd_microstep: 3367.20 | bwd_inner_microstep: 3366.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-19 23:59:19,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.29 | bwd: 3367.22 | bwd_inner: 3366.41 | bwd_allreduce: 0.76 | step: 6.97 
67%|██████▋ | 6662/10000 [10:29:40<5:08:02, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.002598605817183852, 'learning_rate': 1.0592780128469077e-05, 'epoch': 6.66} 67%|██████▋ | 6662/10000 [10:29:40<5:08:02, 5.54s/it][2025-06-19 23:59:25,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-19 23:59:25,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.34 | bwd_microstep: 3315.56 | bwd_inner_microstep: 3314.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-19 23:59:25,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.34 | bwd: 3315.57 | bwd_inner: 3314.76 | bwd_allreduce: 0.77 | step: 6.92 67%|██████▋ | 6663/10000 [10:29:46<5:06:35, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.030608195811510086, 'learning_rate': 1.05870643866013e-05, 'epoch': 6.66} 67%|██████▋ | 6663/10000 [10:29:46<5:06:35, 5.51s/it][2025-06-19 23:59:30,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-19 23:59:30,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.80 | bwd_microstep: 3374.32 | bwd_inner_microstep: 3373.40 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.31 [2025-06-19 23:59:30,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.80 | bwd: 3374.34 | bwd_inner: 3373.40 | bwd_allreduce: 0.88 | step: 7.31 67%|██████▋ | 6664/10000 [10:29:51<5:06:56, 5.52s/it] {'loss': 0.0005, 'grad_norm': 0.11767738312482834, 'learning_rate': 1.0581349632106779e-05, 'epoch': 6.66} 67%|██████▋ | 6664/10000 [10:29:51<5:06:56, 5.52s/it][2025-06-19 23:59:36,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.74 [2025-06-19 23:59:36,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.25 | bwd_microstep: 3316.60 | 
bwd_inner_microstep: 3315.75 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.36 [2025-06-19 23:59:36,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.25 | bwd: 3316.62 | bwd_inner: 3315.75 | bwd_allreduce: 0.81 | step: 7.36 67%|██████▋ | 6665/10000 [10:29:57<5:06:08, 5.51s/it] {'loss': 0.0007, 'grad_norm': 0.15385666489601135, 'learning_rate': 1.057563586558497e-05, 'epoch': 6.67} 67%|██████▋ | 6665/10000 [10:29:57<5:06:08, 5.51s/it][2025-06-19 23:59:41,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-19 23:59:41,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.06 | bwd_microstep: 3371.94 | bwd_inner_microstep: 3371.10 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.01 [2025-06-19 23:59:41,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.06 | bwd: 3371.96 | bwd_inner: 3371.10 | bwd_allreduce: 0.81 | step: 7.01 67%|██████▋ | 6666/10000 [10:30:02<5:06:31, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0045304675586521626, 'learning_rate': 1.0569923087635216e-05, 'epoch': 6.67} 67%|██████▋ | 6666/10000 [10:30:02<5:06:31, 5.52s/it][2025-06-19 23:59:47,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-19 23:59:47,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.66 | bwd_microstep: 3370.26 | bwd_inner_microstep: 3369.42 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.77 [2025-06-19 23:59:47,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.66 | bwd: 3370.28 | bwd_inner: 3369.42 | bwd_allreduce: 0.80 | step: 6.77 67%|██████▋ | 6667/10000 [10:30:08<5:06:46, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0010535672772675753, 'learning_rate': 1.0564211298856768e-05, 'epoch': 6.67} 67%|██████▋ | 6667/10000 [10:30:08<5:06:46, 5.52s/it][2025-06-19 23:59:52,971] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-19 23:59:52,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.05 | bwd_microstep: 3372.47 | bwd_inner_microstep: 3371.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-19 23:59:52,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.05 | bwd: 3372.48 | bwd_inner: 3371.69 | bwd_allreduce: 0.75 | step: 6.63 67%|██████▋ | 6668/10000 [10:30:13<5:06:53, 5.53s/it] {'loss': 0.0245, 'grad_norm': 8.93543815612793, 'learning_rate': 1.0558500499848768e-05, 'epoch': 6.67} 67%|██████▋ | 6668/10000 [10:30:13<5:06:53, 5.53s/it][2025-06-19 23:59:58,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-19 23:59:58,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.41 | bwd_microstep: 3372.57 | bwd_inner_microstep: 3371.74 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.93 [2025-06-19 23:59:58,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.41 | bwd: 3372.59 | bwd_inner: 3371.74 | bwd_allreduce: 0.79 | step: 6.94 67%|██████▋ | 6669/10000 [10:30:19<5:07:03, 5.53s/it] {'loss': 0.0013, 'grad_norm': 0.33716997504234314, 'learning_rate': 1.0552790691210238e-05, 'epoch': 6.67} 67%|██████▋ | 6669/10000 [10:30:19<5:07:03, 5.53s/it][2025-06-20 00:00:03,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:00:03,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.51 | bwd_microstep: 3323.71 | bwd_inner_microstep: 3322.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-20 00:00:03,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.51 | bwd: 3323.73 | bwd_inner: 3322.91 | bwd_allreduce: 0.78 | step: 6.90 
67%|██████▋ | 6670/10000 [10:30:24<5:05:59, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00029148810426704586, 'learning_rate': 1.0547081873540117e-05, 'epoch': 6.67} 67%|██████▋ | 6670/10000 [10:30:24<5:05:59, 5.51s/it][2025-06-20 00:00:09,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:00:09,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.27 | bwd_microstep: 3314.48 | bwd_inner_microstep: 3313.65 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.16 [2025-06-20 00:00:09,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.27 | bwd: 3314.50 | bwd_inner: 3313.65 | bwd_allreduce: 0.80 | step: 7.16 67%|██████▋ | 6671/10000 [10:30:30<5:05:04, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.1014062911272049, 'learning_rate': 1.0541374047437236e-05, 'epoch': 6.67} 67%|██████▋ | 6671/10000 [10:30:30<5:05:04, 5.50s/it][2025-06-20 00:00:14,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:00:14,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.56 | bwd_microstep: 3357.24 | bwd_inner_microstep: 3356.41 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.80 [2025-06-20 00:00:14,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.56 | bwd: 3357.26 | bwd_inner: 3356.41 | bwd_allreduce: 0.80 | step: 6.81 67%|██████▋ | 6672/10000 [10:30:35<5:05:27, 5.51s/it] {'loss': 0.0532, 'grad_norm': 3.1057889461517334, 'learning_rate': 1.053566721350032e-05, 'epoch': 6.67} 67%|██████▋ | 6672/10000 [10:30:35<5:05:27, 5.51s/it][2025-06-20 00:00:20,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.76 [2025-06-20 00:00:20,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.07 | bwd_microstep: 3370.32 | 
bwd_inner_microstep: 3369.40 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.09 [2025-06-20 00:00:20,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.07 | bwd: 3370.34 | bwd_inner: 3369.40 | bwd_allreduce: 0.88 | step: 7.10 67%|██████▋ | 6673/10000 [10:30:41<5:05:55, 5.52s/it] {'loss': 0.004, 'grad_norm': 0.6693984270095825, 'learning_rate': 1.0529961372327995e-05, 'epoch': 6.67} 67%|██████▋ | 6673/10000 [10:30:41<5:05:55, 5.52s/it][2025-06-20 00:00:25,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:00:25,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.26 | bwd_microstep: 3317.30 | bwd_inner_microstep: 3316.49 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-20 00:00:25,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.26 | bwd: 3317.31 | bwd_inner: 3316.49 | bwd_allreduce: 0.78 | step: 6.93 67%|██████▋ | 6674/10000 [10:30:46<5:05:02, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.007231196388602257, 'learning_rate': 1.052425652451876e-05, 'epoch': 6.67} 67%|██████▋ | 6674/10000 [10:30:46<5:05:02, 5.50s/it][2025-06-20 00:00:31,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:00:31,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.00 | bwd_microstep: 3323.90 | bwd_inner_microstep: 3323.07 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.96 [2025-06-20 00:00:31,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.00 | bwd: 3323.91 | bwd_inner: 3323.07 | bwd_allreduce: 0.80 | step: 6.96 67%|██████▋ | 6675/10000 [10:30:52<5:04:26, 5.49s/it] {'loss': 0.005, 'grad_norm': 1.0914788246154785, 'learning_rate': 1.0518552670671043e-05, 'epoch': 6.67} 67%|██████▋ | 6675/10000 [10:30:52<5:04:26, 5.49s/it][2025-06-20 00:00:36,918] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:00:36,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.39 | bwd_microstep: 3315.47 | bwd_inner_microstep: 3314.33 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.52 [2025-06-20 00:00:36,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.39 | bwd: 3315.49 | bwd_inner: 3314.33 | bwd_allreduce: 1.11 | step: 7.53 67%|██████▋ | 6676/10000 [10:30:57<5:03:51, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0003223964886274189, 'learning_rate': 1.0512849811383141e-05, 'epoch': 6.68} 67%|██████▋ | 6676/10000 [10:30:57<5:03:51, 5.48s/it][2025-06-20 00:00:42,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:00:42,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.55 | bwd_microstep: 3318.29 | bwd_inner_microstep: 3317.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 00:00:42,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.55 | bwd: 3318.30 | bwd_inner: 3317.49 | bwd_allreduce: 0.77 | step: 6.77 67%|██████▋ | 6677/10000 [10:31:03<5:03:24, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.010692623443901539, 'learning_rate': 1.0507147947253267e-05, 'epoch': 6.68} 67%|██████▋ | 6677/10000 [10:31:03<5:03:24, 5.48s/it][2025-06-20 00:00:47,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:00:47,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.43 | bwd_microstep: 3320.12 | bwd_inner_microstep: 3319.19 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.99 [2025-06-20 00:00:47,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.43 | bwd: 3320.14 | bwd_inner: 3319.19 | bwd_allreduce: 0.90 | step: 7.00 
67%|██████▋ | 6678/10000 [10:31:08<5:03:08, 5.48s/it] {'loss': 0.008, 'grad_norm': 1.8346030712127686, 'learning_rate': 1.0501447078879519e-05, 'epoch': 6.68} 67%|██████▋ | 6678/10000 [10:31:08<5:03:08, 5.48s/it][2025-06-20 00:00:53,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 00:00:53,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.45 | bwd_microstep: 3317.72 | bwd_inner_microstep: 3316.75 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.18 [2025-06-20 00:00:53,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.45 | bwd: 3317.74 | bwd_inner: 3316.75 | bwd_allreduce: 0.93 | step: 7.18 67%|██████▋ | 6679/10000 [10:31:14<5:02:58, 5.47s/it] {'loss': 0.0008, 'grad_norm': 0.15504516661167145, 'learning_rate': 1.0495747206859883e-05, 'epoch': 6.68} 67%|██████▋ | 6679/10000 [10:31:14<5:02:58, 5.47s/it][2025-06-20 00:00:58,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:00:58,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.36 | bwd_microstep: 3360.80 | bwd_inner_microstep: 3360.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-20 00:00:58,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.36 | bwd: 3360.82 | bwd_inner: 3360.02 | bwd_allreduce: 0.75 | step: 6.52 67%|██████▋ | 6680/10000 [10:31:19<5:03:55, 5.49s/it] {'loss': 0.0006, 'grad_norm': 0.13813206553459167, 'learning_rate': 1.0490048331792252e-05, 'epoch': 6.68} 67%|██████▋ | 6680/10000 [10:31:19<5:03:55, 5.49s/it][2025-06-20 00:01:04,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:01:04,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.31 | bwd_microstep: 3372.69 | 
bwd_inner_microstep: 3371.83 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.00 [2025-06-20 00:01:04,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.31 | bwd: 3372.72 | bwd_inner: 3371.83 | bwd_allreduce: 0.83 | step: 7.00 67%|██████▋ | 6681/10000 [10:31:25<5:04:42, 5.51s/it] {'loss': 0.1501, 'grad_norm': 4.101569175720215, 'learning_rate': 1.0484350454274417e-05, 'epoch': 6.68} 67%|██████▋ | 6681/10000 [10:31:25<5:04:42, 5.51s/it][2025-06-20 00:01:09,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:01:09,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.59 | bwd_microstep: 3317.74 | bwd_inner_microstep: 3316.96 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.57 [2025-06-20 00:01:09,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.59 | bwd: 3317.75 | bwd_inner: 3316.96 | bwd_allreduce: 0.75 | step: 6.58 67%|██████▋ | 6682/10000 [10:31:30<5:03:54, 5.50s/it] {'loss': 0.0011, 'grad_norm': 0.138876810669899, 'learning_rate': 1.0478653574904057e-05, 'epoch': 6.68} 67%|██████▋ | 6682/10000 [10:31:30<5:03:54, 5.50s/it][2025-06-20 00:01:15,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:01:15,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.71 | bwd_microstep: 3374.17 | bwd_inner_microstep: 3373.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-20 00:01:15,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.71 | bwd: 3374.19 | bwd_inner: 3373.38 | bwd_allreduce: 0.77 | step: 6.88 67%|██████▋ | 6683/10000 [10:31:36<5:04:29, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.001559886266477406, 'learning_rate': 1.0472957694278743e-05, 'epoch': 6.68} 67%|██████▋ | 6683/10000 [10:31:36<5:04:29, 5.51s/it][2025-06-20 00:01:20,860] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:01:20,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.78 | bwd_microstep: 3314.01 | bwd_inner_microstep: 3313.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-20 00:01:20,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.78 | bwd: 3314.03 | bwd_inner: 3313.23 | bwd_allreduce: 0.75 | step: 6.70 67%|██████▋ | 6684/10000 [10:31:41<5:03:27, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.0698852688074112, 'learning_rate': 1.0467262812995953e-05, 'epoch': 6.68} 67%|██████▋ | 6684/10000 [10:31:41<5:03:27, 5.49s/it][2025-06-20 00:01:26,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:01:26,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.12 | bwd_microstep: 3360.35 | bwd_inner_microstep: 3359.40 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.02 [2025-06-20 00:01:26,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.12 | bwd: 3360.36 | bwd_inner: 3359.40 | bwd_allreduce: 0.92 | step: 7.02 67%|██████▋ | 6685/10000 [10:31:47<5:03:54, 5.50s/it] {'loss': 0.0054, 'grad_norm': 1.28031587600708, 'learning_rate': 1.0461568931653047e-05, 'epoch': 6.69} 67%|██████▋ | 6685/10000 [10:31:47<5:03:54, 5.50s/it][2025-06-20 00:01:31,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:01:31,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.49 | bwd_microstep: 3315.69 | bwd_inner_microstep: 3314.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 00:01:31,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.49 | bwd: 3315.71 | bwd_inner: 3314.90 | bwd_allreduce: 0.76 | step: 6.64 
67%|██████▋ | 6686/10000 [10:31:52<5:03:05, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.045258935540914536, 'learning_rate': 1.0455876050847295e-05, 'epoch': 6.69} 67%|██████▋ | 6686/10000 [10:31:52<5:03:05, 5.49s/it][2025-06-20 00:01:37,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:01:37,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.58 | bwd_microstep: 3374.54 | bwd_inner_microstep: 3373.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 00:01:37,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.58 | bwd: 3374.55 | bwd_inner: 3373.75 | bwd_allreduce: 0.76 | step: 6.67 67%|██████▋ | 6687/10000 [10:31:58<5:03:45, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.10643408447504044, 'learning_rate': 1.045018417117585e-05, 'epoch': 6.69} 67%|██████▋ | 6687/10000 [10:31:58<5:03:45, 5.50s/it][2025-06-20 00:01:42,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:01:42,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.92 | bwd_microstep: 3321.76 | bwd_inner_microstep: 3320.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 00:01:42,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.92 | bwd: 3321.78 | bwd_inner: 3320.98 | bwd_allreduce: 0.76 | step: 6.64 67%|██████▋ | 6688/10000 [10:32:03<5:03:01, 5.49s/it] {'loss': 0.0014, 'grad_norm': 0.2388516068458557, 'learning_rate': 1.0444493293235771e-05, 'epoch': 6.69} 67%|██████▋ | 6688/10000 [10:32:03<5:03:01, 5.49s/it][2025-06-20 00:01:48,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:01:48,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.70 | bwd_microstep: 3306.07 | 
bwd_inner_microstep: 3305.12 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.18 [2025-06-20 00:01:48,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.70 | bwd: 3306.09 | bwd_inner: 3305.12 | bwd_allreduce: 0.92 | step: 7.19 67%|██████▋ | 6689/10000 [10:32:09<5:02:11, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.04498732089996338, 'learning_rate': 1.0438803417623986e-05, 'epoch': 6.69} 67%|██████▋ | 6689/10000 [10:32:09<5:02:11, 5.48s/it][2025-06-20 00:01:53,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:01:53,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.17 | bwd_microstep: 3369.76 | bwd_inner_microstep: 3368.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-20 00:01:53,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.17 | bwd: 3369.78 | bwd_inner: 3368.97 | bwd_allreduce: 0.77 | step: 6.99 67%|██████▋ | 6690/10000 [10:32:14<5:03:05, 5.49s/it] {'loss': 0.012, 'grad_norm': 3.6883797645568848, 'learning_rate': 1.043311454493735e-05, 'epoch': 6.69} 67%|██████▋ | 6690/10000 [10:32:14<5:03:05, 5.49s/it][2025-06-20 00:01:59,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:01:59,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.23 | bwd_microstep: 3322.84 | bwd_inner_microstep: 3322.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 00:01:59,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.23 | bwd: 3322.85 | bwd_inner: 3322.06 | bwd_allreduce: 0.75 | step: 6.55 67%|██████▋ | 6691/10000 [10:32:20<5:02:22, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.006129464600235224, 'learning_rate': 1.0427426675772598e-05, 'epoch': 6.69} 67%|██████▋ | 6691/10000 [10:32:20<5:02:22, 5.48s/it][2025-06-20 00:02:04,798] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:02:04,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.42 | bwd_microstep: 3362.35 | bwd_inner_microstep: 3361.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-20 00:02:04,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.42 | bwd: 3362.36 | bwd_inner: 3361.56 | bwd_allreduce: 0.76 | step: 6.90 67%|██████▋ | 6692/10000 [10:32:25<5:02:59, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.2723095417022705, 'learning_rate': 1.0421739810726356e-05, 'epoch': 6.69} 67%|██████▋ | 6692/10000 [10:32:25<5:02:59, 5.50s/it][2025-06-20 00:02:10,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:02:10,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.00 | bwd_microstep: 3368.85 | bwd_inner_microstep: 3368.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-20 00:02:10,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.00 | bwd: 3368.86 | bwd_inner: 3368.06 | bwd_allreduce: 0.76 | step: 6.63 67%|██████▋ | 6693/10000 [10:32:31<5:03:33, 5.51s/it] {'loss': 0.0006, 'grad_norm': 0.14864258468151093, 'learning_rate': 1.0416053950395166e-05, 'epoch': 6.69} 67%|██████▋ | 6693/10000 [10:32:31<5:03:33, 5.51s/it][2025-06-20 00:02:15,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:02:15,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.61 | bwd_microstep: 3321.12 | bwd_inner_microstep: 3320.21 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.92 [2025-06-20 00:02:15,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.61 | bwd: 3321.15 | bwd_inner: 3320.21 | bwd_allreduce: 0.88 | step: 6.93 
67%|██████▋ | 6694/10000 [10:32:36<5:02:52, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00014900405949447304, 'learning_rate': 1.0410369095375428e-05, 'epoch': 6.69} 67%|██████▋ | 6694/10000 [10:32:36<5:02:52, 5.50s/it][2025-06-20 00:02:21,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:02:21,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.70 | bwd_microstep: 3309.82 | bwd_inner_microstep: 3309.02 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-20 00:02:21,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.70 | bwd: 3309.84 | bwd_inner: 3309.02 | bwd_allreduce: 0.78 | step: 6.88 67%|██████▋ | 6695/10000 [10:32:42<5:02:07, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.047801923006772995, 'learning_rate': 1.0404685246263462e-05, 'epoch': 6.7} 67%|██████▋ | 6695/10000 [10:32:42<5:02:07, 5.48s/it][2025-06-20 00:02:26,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:02:26,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.38 | bwd_microstep: 3303.71 | bwd_inner_microstep: 3302.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-20 00:02:26,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.38 | bwd: 3303.72 | bwd_inner: 3302.90 | bwd_allreduce: 0.78 | step: 7.26 67%|██████▋ | 6696/10000 [10:32:47<5:01:31, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.014166868291795254, 'learning_rate': 1.0399002403655483e-05, 'epoch': 6.7} 67%|██████▋ | 6696/10000 [10:32:47<5:01:31, 5.48s/it][2025-06-20 00:02:32,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:02:32,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.73 | bwd_microstep: 3357.87 | 
bwd_inner_microstep: 3357.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 00:02:32,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.73 | bwd: 3357.89 | bwd_inner: 3357.09 | bwd_allreduce: 0.75 | step: 6.62 67%|██████▋ | 6697/10000 [10:32:53<5:02:08, 5.49s/it] {'loss': 0.004, 'grad_norm': 1.0140643119812012, 'learning_rate': 1.039332056814759e-05, 'epoch': 6.7} 67%|██████▋ | 6697/10000 [10:32:53<5:02:08, 5.49s/it][2025-06-20 00:02:37,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:02:37,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.95 | bwd_microstep: 3390.85 | bwd_inner_microstep: 3390.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-20 00:02:37,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.95 | bwd: 3390.87 | bwd_inner: 3390.06 | bwd_allreduce: 0.76 | step: 6.67 67%|██████▋ | 6698/10000 [10:32:58<5:03:19, 5.51s/it] {'loss': 0.0012, 'grad_norm': 0.26669421792030334, 'learning_rate': 1.0387639740335787e-05, 'epoch': 6.7} 67%|██████▋ | 6698/10000 [10:32:58<5:03:19, 5.51s/it][2025-06-20 00:02:43,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:02:43,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.77 | bwd_microstep: 3363.58 | bwd_inner_microstep: 3362.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-20 00:02:43,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.77 | bwd: 3363.60 | bwd_inner: 3362.77 | bwd_allreduce: 0.78 | step: 7.06 67%|██████▋ | 6699/10000 [10:33:04<5:03:34, 5.52s/it] {'loss': 0.0004, 'grad_norm': 0.10708589106798172, 'learning_rate': 1.038195992081596e-05, 'epoch': 6.7} 67%|██████▋ | 6699/10000 [10:33:04<5:03:34, 5.52s/it][2025-06-20 00:02:48,798] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:02:48,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.95 | bwd_microstep: 3312.33 | bwd_inner_microstep: 3311.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-20 00:02:48,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.95 | bwd: 3312.35 | bwd_inner: 3311.54 | bwd_allreduce: 0.77 | step: 6.94 67%|██████▋ | 6700/10000 [10:33:09<5:02:36, 5.50s/it] {'loss': 0.0246, 'grad_norm': 7.287600040435791, 'learning_rate': 1.03762811101839e-05, 'epoch': 6.7} 67%|██████▋ | 6700/10000 [10:33:09<5:02:36, 5.50s/it][2025-06-20 00:02:54,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:02:54,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.86 | bwd_microstep: 3353.41 | bwd_inner_microstep: 3352.49 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.67 [2025-06-20 00:02:54,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.86 | bwd: 3353.43 | bwd_inner: 3352.49 | bwd_allreduce: 0.90 | step: 6.68 67%|██████▋ | 6701/10000 [10:33:15<5:02:49, 5.51s/it] {'loss': 0.1377, 'grad_norm': 18.821434020996094, 'learning_rate': 1.0370603309035285e-05, 'epoch': 6.7} 67%|██████▋ | 6701/10000 [10:33:15<5:02:49, 5.51s/it][2025-06-20 00:02:59,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:02:59,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.76 | bwd_microstep: 3359.41 | bwd_inner_microstep: 3358.40 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.33 [2025-06-20 00:02:59,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.76 | bwd: 3359.43 | bwd_inner: 3358.40 | bwd_allreduce: 0.97 | step: 7.33 
67%|██████▋ | 6702/10000 [10:33:20<5:02:56, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.002115631941705942, 'learning_rate': 1.0364926517965693e-05, 'epoch': 6.7} 67%|██████▋ | 6702/10000 [10:33:20<5:02:56, 5.51s/it][2025-06-20 00:03:05,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.84 [2025-06-20 00:03:05,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.48 | bwd_microstep: 3307.60 | bwd_inner_microstep: 3306.64 | bwd_allreduce_microstep: 0.91 | step_microstep: 8.08 [2025-06-20 00:03:05,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.48 | bwd: 3307.61 | bwd_inner: 3306.64 | bwd_allreduce: 0.93 | step: 8.09 67%|██████▋ | 6703/10000 [10:33:26<5:01:57, 5.50s/it] {'loss': 0.0006, 'grad_norm': 0.12116821110248566, 'learning_rate': 1.0359250737570589e-05, 'epoch': 6.7} 67%|██████▋ | 6703/10000 [10:33:26<5:01:57, 5.50s/it][2025-06-20 00:03:10,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:03:10,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.19 | bwd_microstep: 3312.44 | bwd_inner_microstep: 3311.60 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.16 [2025-06-20 00:03:10,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.19 | bwd: 3312.45 | bwd_inner: 3311.61 | bwd_allreduce: 0.81 | step: 7.16 67%|██████▋ | 6704/10000 [10:33:31<5:01:31, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.008182546123862267, 'learning_rate': 1.0353575968445349e-05, 'epoch': 6.7} 67%|██████▋ | 6704/10000 [10:33:31<5:01:31, 5.49s/it][2025-06-20 00:03:16,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 00:03:16,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.91 | bwd_microstep: 3360.56 | 
bwd_inner_microstep: 3359.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-20 00:03:16,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.91 | bwd: 3360.57 | bwd_inner: 3359.78 | bwd_allreduce: 0.75 | step: 6.56 67%|██████▋ | 6705/10000 [10:33:37<5:01:57, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0020932124461978674, 'learning_rate': 1.034790221118521e-05, 'epoch': 6.71} 67%|██████▋ | 6705/10000 [10:33:37<5:01:57, 5.50s/it][2025-06-20 00:03:21,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:03:21,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.75 | bwd_microstep: 3366.01 | bwd_inner_microstep: 3365.18 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.86 [2025-06-20 00:03:21,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.75 | bwd: 3366.03 | bwd_inner: 3365.18 | bwd_allreduce: 0.79 | step: 6.87 67%|██████▋ | 6706/10000 [10:33:42<5:02:25, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.08008888363838196, 'learning_rate': 1.0342229466385335e-05, 'epoch': 6.71} 67%|██████▋ | 6706/10000 [10:33:42<5:02:25, 5.51s/it][2025-06-20 00:03:27,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:03:27,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.21 | bwd_microstep: 3396.26 | bwd_inner_microstep: 3395.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 00:03:27,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.21 | bwd: 3396.28 | bwd_inner: 3395.47 | bwd_allreduce: 0.76 | step: 6.77 67%|██████▋ | 6707/10000 [10:33:48<5:03:26, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0013979177456349134, 'learning_rate': 1.0336557734640765e-05, 'epoch': 6.71} 67%|██████▋ | 6707/10000 [10:33:48<5:03:26, 5.53s/it][2025-06-20 00:03:32,871] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:03:32,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.24 | bwd_microstep: 3329.52 | bwd_inner_microstep: 3328.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.10 [2025-06-20 00:03:32,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.24 | bwd: 3329.54 | bwd_inner: 3328.72 | bwd_allreduce: 0.77 | step: 7.11 67%|██████▋ | 6708/10000 [10:33:53<5:02:25, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.01620183326303959, 'learning_rate': 1.0330887016546435e-05, 'epoch': 6.71} 67%|██████▋ | 6708/10000 [10:33:53<5:02:25, 5.51s/it][2025-06-20 00:03:38,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:03:38,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.06 | bwd_microstep: 3318.83 | bwd_inner_microstep: 3318.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-20 00:03:38,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.06 | bwd: 3318.85 | bwd_inner: 3318.03 | bwd_allreduce: 0.77 | step: 6.99 67%|██████▋ | 6709/10000 [10:33:59<5:01:26, 5.50s/it] {'loss': 0.2377, 'grad_norm': 6.519834995269775, 'learning_rate': 1.0325217312697194e-05, 'epoch': 6.71} 67%|██████▋ | 6709/10000 [10:33:59<5:01:26, 5.50s/it][2025-06-20 00:03:43,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:03:43,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.62 | bwd_microstep: 3370.65 | bwd_inner_microstep: 3369.46 | bwd_allreduce_microstep: 1.13 | step_microstep: 7.40 [2025-06-20 00:03:43,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.62 | bwd: 3370.66 | bwd_inner: 3369.46 | bwd_allreduce: 1.15 | step: 7.40 
67%|██████▋ | 6710/10000 [10:34:04<5:02:04, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.009827234782278538, 'learning_rate': 1.0319548623687745e-05, 'epoch': 6.71} 67%|██████▋ | 6710/10000 [10:34:04<5:02:04, 5.51s/it][2025-06-20 00:03:49,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:03:49,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.08 | bwd_microstep: 3324.19 | bwd_inner_microstep: 3323.38 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.71 [2025-06-20 00:03:49,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.08 | bwd: 3324.20 | bwd_inner: 3323.38 | bwd_allreduce: 0.78 | step: 6.72 67%|██████▋ | 6711/10000 [10:34:10<5:01:17, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.004148208070546389, 'learning_rate': 1.0313880950112717e-05, 'epoch': 6.71} 67%|██████▋ | 6711/10000 [10:34:10<5:01:17, 5.50s/it][2025-06-20 00:03:54,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:03:54,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.47 | bwd_microstep: 3313.31 | bwd_inner_microstep: 3312.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 00:03:54,796] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.47 | bwd: 3313.32 | bwd_inner: 3312.52 | bwd_allreduce: 0.76 | step: 6.70 67%|██████▋ | 6712/10000 [10:34:15<5:00:33, 5.48s/it] {'loss': 0.0336, 'grad_norm': 8.943272590637207, 'learning_rate': 1.0308214292566622e-05, 'epoch': 6.71} 67%|██████▋ | 6712/10000 [10:34:15<5:00:33, 5.48s/it][2025-06-20 00:04:00,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:04:00,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.27 | bwd_microstep: 3314.76 | 
bwd_inner_microstep: 3313.94 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-20 00:04:00,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.27 | bwd: 3314.78 | bwd_inner: 3313.94 | bwd_allreduce: 0.79 | step: 7.23 67%|██████▋ | 6713/10000 [10:34:21<5:00:01, 5.48s/it] {'loss': 0.0007, 'grad_norm': 0.1213618814945221, 'learning_rate': 1.0302548651643867e-05, 'epoch': 6.71} 67%|██████▋ | 6713/10000 [10:34:21<5:00:01, 5.48s/it][2025-06-20 00:04:05,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:04:05,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.64 | bwd_microstep: 3321.65 | bwd_inner_microstep: 3320.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 00:04:05,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.64 | bwd: 3321.66 | bwd_inner: 3320.86 | bwd_allreduce: 0.76 | step: 6.65 67%|██████▋ | 6714/10000 [10:34:26<4:59:46, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.07736346870660782, 'learning_rate': 1.0296884027938761e-05, 'epoch': 6.71} 67%|██████▋ | 6714/10000 [10:34:26<4:59:46, 5.47s/it][2025-06-20 00:04:11,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:04:11,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.00 | bwd_microstep: 3316.98 | bwd_inner_microstep: 3316.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 00:04:11,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.01 | bwd: 3317.00 | bwd_inner: 3316.20 | bwd_allreduce: 0.75 | step: 6.60 67%|██████▋ | 6715/10000 [10:34:31<4:59:23, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.014807394705712795, 'learning_rate': 1.0291220422045476e-05, 'epoch': 6.71} 67%|██████▋ | 6715/10000 [10:34:31<4:59:23, 5.47s/it][2025-06-20 00:04:16,628] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:04:16,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.80 | bwd_microstep: 3315.16 | bwd_inner_microstep: 3314.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 00:04:16,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.80 | bwd: 3315.18 | bwd_inner: 3314.37 | bwd_allreduce: 0.76 | step: 6.71 67%|██████▋ | 6716/10000 [10:34:37<4:59:02, 5.46s/it] {'loss': 0.0587, 'grad_norm': 6.559431552886963, 'learning_rate': 1.0285557834558113e-05, 'epoch': 6.72} 67%|██████▋ | 6716/10000 [10:34:37<4:59:02, 5.46s/it][2025-06-20 00:04:22,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:04:22,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.35 | bwd_microstep: 3375.75 | bwd_inner_microstep: 3374.76 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.65 [2025-06-20 00:04:22,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.35 | bwd: 3375.76 | bwd_inner: 3374.76 | bwd_allreduce: 0.96 | step: 7.65 67%|██████▋ | 6717/10000 [10:34:42<5:00:14, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0025634721387177706, 'learning_rate': 1.0279896266070644e-05, 'epoch': 6.72} 67%|██████▋ | 6717/10000 [10:34:42<5:00:14, 5.49s/it][2025-06-20 00:04:27,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:04:27,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.28 | bwd_microstep: 3323.52 | bwd_inner_microstep: 3322.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 00:04:27,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.28 | bwd: 3323.53 | bwd_inner: 3322.73 | bwd_allreduce: 0.76 | step: 6.68 
67%|██████▋ | 6718/10000 [10:34:48<4:59:53, 5.48s/it] {'loss': 0.0097, 'grad_norm': 1.3240575790405273, 'learning_rate': 1.0274235717176943e-05, 'epoch': 6.72} 67%|██████▋ | 6718/10000 [10:34:48<4:59:53, 5.48s/it][2025-06-20 00:04:33,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:04:33,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.90 | bwd_microstep: 3321.14 | bwd_inner_microstep: 3320.30 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.89 [2025-06-20 00:04:33,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.90 | bwd: 3321.16 | bwd_inner: 3320.30 | bwd_allreduce: 0.80 | step: 6.89 67%|██████▋ | 6719/10000 [10:34:53<4:59:40, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.04316670820116997, 'learning_rate': 1.026857618847078e-05, 'epoch': 6.72} 67%|██████▋ | 6719/10000 [10:34:53<4:59:40, 5.48s/it][2025-06-20 00:04:38,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:04:38,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.63 | bwd_microstep: 3331.13 | bwd_inner_microstep: 3330.09 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.33 [2025-06-20 00:04:38,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.63 | bwd: 3331.15 | bwd_inner: 3330.09 | bwd_allreduce: 1.01 | step: 7.34 67%|██████▋ | 6720/10000 [10:34:59<4:59:41, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.20565056800842285, 'learning_rate': 1.026291768054581e-05, 'epoch': 6.72} 67%|██████▋ | 6720/10000 [10:34:59<4:59:41, 5.48s/it][2025-06-20 00:04:44,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:04:44,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.13 | bwd_microstep: 3337.74 | 
bwd_inner_microstep: 3336.89 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.25 [2025-06-20 00:04:44,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.13 | bwd: 3337.76 | bwd_inner: 3336.89 | bwd_allreduce: 0.83 | step: 7.25 67%|██████▋ | 6721/10000 [10:35:04<4:59:43, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0017374135786667466, 'learning_rate': 1.0257260193995581e-05, 'epoch': 6.72} 67%|██████▋ | 6721/10000 [10:35:04<4:59:43, 5.48s/it][2025-06-20 00:04:49,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:04:49,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.84 | bwd_microstep: 3334.38 | bwd_inner_microstep: 3333.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 00:04:49,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.84 | bwd: 3334.39 | bwd_inner: 3333.59 | bwd_allreduce: 0.76 | step: 6.68 67%|██████▋ | 6722/10000 [10:35:10<4:59:50, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0019166164565831423, 'learning_rate': 1.0251603729413542e-05, 'epoch': 6.72} 67%|██████▋ | 6722/10000 [10:35:10<4:59:50, 5.49s/it][2025-06-20 00:04:55,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:04:55,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.86 | bwd_microstep: 3336.33 | bwd_inner_microstep: 3335.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 00:04:55,072] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.87 | bwd: 3336.35 | bwd_inner: 3335.55 | bwd_allreduce: 0.75 | step: 6.67 67%|██████▋ | 6723/10000 [10:35:15<4:59:37, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0033028586767613888, 'learning_rate': 1.0245948287393027e-05, 'epoch': 6.72} 67%|██████▋ | 6723/10000 [10:35:15<4:59:37, 5.49s/it][2025-06-20 00:05:00,621] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:05:00,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.26 | bwd_microstep: 3374.93 | bwd_inner_microstep: 3374.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-20 00:05:00,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.26 | bwd: 3374.95 | bwd_inner: 3374.13 | bwd_allreduce: 0.77 | step: 6.84 67%|██████▋ | 6724/10000 [10:35:21<5:00:34, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0039279647171497345, 'learning_rate': 1.0240293868527276e-05, 'epoch': 6.72} 67%|██████▋ | 6724/10000 [10:35:21<5:00:34, 5.51s/it][2025-06-20 00:05:06,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:05:06,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.82 | bwd_microstep: 3321.14 | bwd_inner_microstep: 3320.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-20 00:05:06,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.82 | bwd: 3321.15 | bwd_inner: 3320.34 | bwd_allreduce: 0.77 | step: 6.95 67%|██████▋ | 6725/10000 [10:35:26<4:59:54, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0023894375190138817, 'learning_rate': 1.0234640473409394e-05, 'epoch': 6.72} 67%|██████▋ | 6725/10000 [10:35:26<4:59:54, 5.49s/it][2025-06-20 00:05:11,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:05:11,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.39 | bwd_microstep: 3325.98 | bwd_inner_microstep: 3325.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-20 00:05:11,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.39 | bwd: 3325.99 | bwd_inner: 3325.18 | bwd_allreduce: 0.77 | step: 6.87 
67%|██████▋ | 6726/10000 [10:35:32<4:59:32, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.02491798996925354, 'learning_rate': 1.0228988102632402e-05, 'epoch': 6.73} 67%|██████▋ | 6726/10000 [10:35:32<4:59:32, 5.49s/it][h264 @ 0x3cf73280] Reference 5 >= 5 [h264 @ 0x3cf73280] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x3ce079c0] left block unavailable for requested intra mode [h264 @ 0x3ce079c0] error while decoding MB 0 25, bytestream 45493 [h264 @ 0x3ce0e040] Reference 5 >= 5 [h264 @ 0x3ce0e040] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x3ce0e040] left block unavailable for requested intra mode [h264 @ 0x3ce0e040] error while decoding MB 0 25, bytestream 45493 [2025-06-20 00:05:17,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:05:17,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.38 | bwd_microstep: 3317.79 | bwd_inner_microstep: 3316.96 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.21 [2025-06-20 00:05:17,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.38 | bwd: 3317.80 | bwd_inner: 3316.96 | bwd_allreduce: 0.79 | step: 7.21 67%|██████▋ | 6727/10000 [10:35:37<4:58:57, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.0683336928486824, 'learning_rate': 1.0223336756789211e-05, 'epoch': 6.73} 67%|██████▋ | 6727/10000 [10:35:37<4:58:57, 5.48s/it][2025-06-20 00:05:22,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:05:22,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.11 | bwd_microstep: 3369.36 | bwd_inner_microstep: 3368.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 00:05:22,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.11 | bwd: 3369.38 | bwd_inner: 3368.56 | bwd_allreduce: 0.77 | step: 6.86 67%|██████▋ | 6728/10000 
[10:35:43<4:59:55, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.2926847040653229, 'learning_rate': 1.021768643647262e-05, 'epoch': 6.73} 67%|██████▋ | 6728/10000 [10:35:43<4:59:55, 5.50s/it][2025-06-20 00:05:28,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:05:28,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.43 | bwd_microstep: 3316.90 | bwd_inner_microstep: 3315.87 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.45 [2025-06-20 00:05:28,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.43 | bwd: 3316.92 | bwd_inner: 3315.87 | bwd_allreduce: 1.00 | step: 7.45 67%|██████▋ | 6729/10000 [10:35:48<4:59:20, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.06146936118602753, 'learning_rate': 1.0212037142275326e-05, 'epoch': 6.73} 67%|██████▋ | 6729/10000 [10:35:48<4:59:20, 5.49s/it][2025-06-20 00:05:33,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:05:33,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.77 | bwd_microstep: 3324.25 | bwd_inner_microstep: 3323.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-20 00:05:33,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.77 | bwd: 3324.27 | bwd_inner: 3323.44 | bwd_allreduce: 0.78 | step: 7.00 67%|██████▋ | 6730/10000 [10:35:54<4:58:58, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.01675056293606758, 'learning_rate': 1.0206388874789903e-05, 'epoch': 6.73} 67%|██████▋ | 6730/10000 [10:35:54<4:58:58, 5.49s/it][2025-06-20 00:05:39,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:05:39,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.73 | bwd_microstep: 3368.43 | bwd_inner_microstep: 3367.63 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 00:05:39,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.73 | bwd: 3368.44 | bwd_inner: 3367.63 | bwd_allreduce: 0.77 | step: 7.02 67%|██████▋ | 6731/10000 [10:35:59<4:59:41, 5.50s/it] {'loss': 0.001, 'grad_norm': 0.440994530916214, 'learning_rate': 1.0200741634608833e-05, 'epoch': 6.73} 67%|██████▋ | 6731/10000 [10:35:59<4:59:41, 5.50s/it][2025-06-20 00:05:44,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:05:44,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.54 | bwd_microstep: 3382.86 | bwd_inner_microstep: 3381.93 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.99 [2025-06-20 00:05:44,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.55 | bwd: 3382.88 | bwd_inner: 3381.93 | bwd_allreduce: 0.90 | step: 6.99 67%|██████▋ | 6732/10000 [10:36:05<5:00:31, 5.52s/it] {'loss': 0.002, 'grad_norm': 0.8930106163024902, 'learning_rate': 1.0195095422324486e-05, 'epoch': 6.73} 67%|██████▋ | 6732/10000 [10:36:05<5:00:31, 5.52s/it][2025-06-20 00:05:50,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:05:50,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.54 | bwd_microstep: 3331.70 | bwd_inner_microstep: 3330.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-20 00:05:50,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.54 | bwd: 3331.72 | bwd_inner: 3330.90 | bwd_allreduce: 0.77 | step: 7.14 67%|██████▋ | 6733/10000 [10:36:10<4:59:53, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.020013602450489998, 'learning_rate': 1.0189450238529121e-05, 'epoch': 6.73} 67%|██████▋ | 6733/10000 [10:36:10<4:59:53, 5.51s/it][2025-06-20 00:05:55,575] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:05:55,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.47 | bwd_microstep: 3329.98 | bwd_inner_microstep: 3329.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.86 [2025-06-20 00:05:55,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.47 | bwd: 3330.19 | bwd_inner: 3329.20 | bwd_allreduce: 0.76 | step: 6.86 67%|██████▋ | 6734/10000 [10:36:16<4:59:22, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.6596065163612366, 'learning_rate': 1.0183806083814898e-05, 'epoch': 6.73} 67%|██████▋ | 6734/10000 [10:36:16<4:59:22, 5.50s/it][2025-06-20 00:06:01,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:06:01,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.79 | bwd_microstep: 3339.18 | bwd_inner_microstep: 3338.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-20 00:06:01,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.80 | bwd: 3339.19 | bwd_inner: 3338.38 | bwd_allreduce: 0.77 | step: 7.06 67%|██████▋ | 6735/10000 [10:36:21<4:59:09, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.02134796977043152, 'learning_rate': 1.0178162958773857e-05, 'epoch': 6.74} 67%|██████▋ | 6735/10000 [10:36:21<4:59:09, 5.50s/it][2025-06-20 00:06:06,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 00:06:06,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.79 | bwd_microstep: 3380.32 | bwd_inner_microstep: 3379.37 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.07 [2025-06-20 00:06:06,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.79 | bwd: 3380.34 | bwd_inner: 3379.37 | bwd_allreduce: 0.93 | step: 7.08 67%|██████▋ | 6736/10000 
[10:36:27<4:59:54, 5.51s/it] {'loss': 0.0021, 'grad_norm': 0.3735518157482147, 'learning_rate': 1.0172520863997934e-05, 'epoch': 6.74} 67%|██████▋ | 6736/10000 [10:36:27<4:59:54, 5.51s/it][2025-06-20 00:06:12,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:06:12,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.18 | bwd_microstep: 3328.25 | bwd_inner_microstep: 3327.40 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.93 [2025-06-20 00:06:12,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.18 | bwd: 3328.27 | bwd_inner: 3327.40 | bwd_allreduce: 0.81 | step: 6.93 67%|██████▋ | 6737/10000 [10:36:32<4:59:25, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00044614297803491354, 'learning_rate': 1.0166879800078963e-05, 'epoch': 6.74} 67%|██████▋ | 6737/10000 [10:36:32<4:59:25, 5.51s/it][2025-06-20 00:06:17,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:06:17,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.91 | bwd_microstep: 3340.00 | bwd_inner_microstep: 3338.99 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.67 [2025-06-20 00:06:17,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.91 | bwd: 3340.01 | bwd_inner: 3338.99 | bwd_allreduce: 0.98 | step: 7.67 67%|██████▋ | 6738/10000 [10:36:38<4:59:17, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.08750089257955551, 'learning_rate': 1.0161239767608665e-05, 'epoch': 6.74} 67%|██████▋ | 6738/10000 [10:36:38<4:59:17, 5.51s/it][2025-06-20 00:06:23,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.89 [2025-06-20 00:06:23,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.05 | bwd_microstep: 3392.01 | bwd_inner_microstep: 
3391.17 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.18 [2025-06-20 00:06:23,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.05 | bwd: 3392.03 | bwd_inner: 3391.17 | bwd_allreduce: 0.80 | step: 7.18 67%|██████▋ | 6739/10000 [10:36:43<5:00:20, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0006548857782036066, 'learning_rate': 1.0155600767178662e-05, 'epoch': 6.74} 67%|██████▋ | 6739/10000 [10:36:43<5:00:20, 5.53s/it][2025-06-20 00:06:28,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:06:28,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.87 | bwd_microstep: 3330.45 | bwd_inner_microstep: 3329.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 00:06:28,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.87 | bwd: 3330.46 | bwd_inner: 3329.65 | bwd_allreduce: 0.76 | step: 6.69 67%|██████▋ | 6740/10000 [10:36:49<4:59:27, 5.51s/it] {'loss': 0.0016, 'grad_norm': 0.2692733407020569, 'learning_rate': 1.0149962799380438e-05, 'epoch': 6.74} 67%|██████▋ | 6740/10000 [10:36:49<4:59:27, 5.51s/it][2025-06-20 00:06:34,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:06:34,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.67 | bwd_microstep: 3322.25 | bwd_inner_microstep: 3321.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 00:06:34,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.67 | bwd: 3322.26 | bwd_inner: 3321.46 | bwd_allreduce: 0.76 | step: 6.64 67%|██████▋ | 6741/10000 [10:36:54<4:58:41, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.0333842933177948, 'learning_rate': 1.0144325864805402e-05, 'epoch': 6.74} 67%|██████▋ | 6741/10000 [10:36:54<4:58:41, 5.50s/it][2025-06-20 00:06:39,607] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:06:39,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.48 | bwd_microstep: 3329.36 | bwd_inner_microstep: 3328.39 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.70 [2025-06-20 00:06:39,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.48 | bwd: 3329.38 | bwd_inner: 3328.39 | bwd_allreduce: 0.93 | step: 7.70 67%|██████▋ | 6742/10000 [10:37:00<4:58:13, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.0679110661149025, 'learning_rate': 1.0138689964044842e-05, 'epoch': 6.74} 67%|██████▋ | 6742/10000 [10:37:00<4:58:13, 5.49s/it][2025-06-20 00:06:45,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:06:45,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.30 | bwd_microstep: 3332.90 | bwd_inner_microstep: 3332.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-20 00:06:45,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.30 | bwd: 3332.91 | bwd_inner: 3332.12 | bwd_allreduce: 0.75 | step: 6.70 67%|██████▋ | 6743/10000 [10:37:05<4:58:04, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0007985418196767569, 'learning_rate': 1.0133055097689936e-05, 'epoch': 6.74} 67%|██████▋ | 6743/10000 [10:37:05<4:58:04, 5.49s/it][2025-06-20 00:06:50,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:06:50,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.90 | bwd_microstep: 3329.18 | bwd_inner_microstep: 3328.25 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.25 [2025-06-20 00:06:50,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.90 | bwd: 3329.20 | bwd_inner: 3328.25 | bwd_allreduce: 0.90 | step: 7.26 67%|██████▋ | 6744/10000 
[10:37:11<4:57:44, 5.49s/it] {'loss': 0.0044, 'grad_norm': 1.055418610572815, 'learning_rate': 1.0127421266331763e-05, 'epoch': 6.74} 67%|██████▋ | 6744/10000 [10:37:11<4:57:44, 5.49s/it][2025-06-20 00:06:56,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:06:56,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.68 | bwd_microstep: 3372.07 | bwd_inner_microstep: 3370.99 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.59 [2025-06-20 00:06:56,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.67 | bwd: 3372.09 | bwd_inner: 3370.99 | bwd_allreduce: 1.05 | step: 7.60 67%|██████▋ | 6745/10000 [10:37:16<4:58:41, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.023101193830370903, 'learning_rate': 1.0121788470561274e-05, 'epoch': 6.75} 67%|██████▋ | 6745/10000 [10:37:16<4:58:41, 5.51s/it][2025-06-20 00:07:01,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:07:01,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.69 | bwd_microstep: 3325.50 | bwd_inner_microstep: 3324.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 00:07:01,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.69 | bwd: 3325.51 | bwd_inner: 3324.72 | bwd_allreduce: 0.75 | step: 6.64 67%|██████▋ | 6746/10000 [10:37:22<4:58:13, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.07418723404407501, 'learning_rate': 1.0116156710969326e-05, 'epoch': 6.75} 67%|██████▋ | 6746/10000 [10:37:22<4:58:13, 5.50s/it][2025-06-20 00:07:07,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 00:07:07,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.61 | bwd_microstep: 3381.30 | bwd_inner_microstep: 
3380.44 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.44 [2025-06-20 00:07:07,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.61 | bwd: 3381.32 | bwd_inner: 3380.44 | bwd_allreduce: 0.82 | step: 7.44 67%|██████▋ | 6747/10000 [10:37:27<4:59:09, 5.52s/it] {'loss': 0.0004, 'grad_norm': 0.0744340792298317, 'learning_rate': 1.0110525988146669e-05, 'epoch': 6.75} 67%|██████▋ | 6747/10000 [10:37:27<4:59:09, 5.52s/it][2025-06-20 00:07:12,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:07:12,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.18 | bwd_microstep: 3327.08 | bwd_inner_microstep: 3326.28 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 00:07:12,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.18 | bwd: 3327.09 | bwd_inner: 3326.28 | bwd_allreduce: 0.76 | step: 6.68 67%|██████▋ | 6748/10000 [10:37:33<4:58:28, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0018730968004092574, 'learning_rate': 1.0104896302683935e-05, 'epoch': 6.75} 67%|██████▋ | 6748/10000 [10:37:33<4:58:28, 5.51s/it][2025-06-20 00:07:18,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:07:18,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.66 | bwd_microstep: 3340.43 | bwd_inner_microstep: 3339.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-20 00:07:18,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.66 | bwd: 3340.45 | bwd_inner: 3339.63 | bwd_allreduce: 0.77 | step: 6.84 67%|██████▋ | 6749/10000 [10:37:38<4:58:13, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.02059563249349594, 'learning_rate': 1.0099267655171663e-05, 'epoch': 6.75} 67%|██████▋ | 6749/10000 [10:37:38<4:58:13, 5.50s/it][2025-06-20 00:07:23,696] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:07:23,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.59 | bwd_microstep: 3373.12 | bwd_inner_microstep: 3372.32 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-20 00:07:23,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.58 | bwd: 3373.13 | bwd_inner: 3372.32 | bwd_allreduce: 0.77 | step: 6.87 68%|██████▊ | 6750/10000 [10:37:44<4:58:52, 5.52s/it] {'loss': 0.0176, 'grad_norm': 4.283713340759277, 'learning_rate': 1.0093640046200257e-05, 'epoch': 6.75} 68%|██████▊ | 6750/10000 [10:37:44<4:58:52, 5.52s/it][2025-06-20 00:07:29,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:07:29,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2155.43 | bwd_microstep: 3389.40 | bwd_inner_microstep: 3388.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 00:07:29,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2155.43 | bwd: 3389.41 | bwd_inner: 3388.60 | bwd_allreduce: 0.77 | step: 6.73 68%|██████▊ | 6751/10000 [10:37:50<4:59:49, 5.54s/it] {'loss': 0.0119, 'grad_norm': 2.4039087295532227, 'learning_rate': 1.0088013476360033e-05, 'epoch': 6.75} 68%|██████▊ | 6751/10000 [10:37:50<4:59:49, 5.54s/it][2025-06-20 00:07:34,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:07:34,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.08 | bwd_microstep: 3321.54 | bwd_inner_microstep: 3320.63 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.89 [2025-06-20 00:07:34,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.08 | bwd: 3321.56 | bwd_inner: 3320.63 | bwd_allreduce: 0.88 | step: 6.90 68%|██████▊ | 6752/10000 
[10:37:55<4:58:36, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00018904352327808738, 'learning_rate': 1.008238794624119e-05, 'epoch': 6.75} 68%|██████▊ | 6752/10000 [10:37:55<4:58:36, 5.52s/it][2025-06-20 00:07:40,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:07:40,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.93 | bwd_microstep: 3404.83 | bwd_inner_microstep: 3404.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-20 00:07:40,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.93 | bwd: 3404.85 | bwd_inner: 3404.03 | bwd_allreduce: 0.77 | step: 6.93 68%|██████▊ | 6753/10000 [10:38:01<4:59:42, 5.54s/it] {'loss': 0.0002, 'grad_norm': 0.045129258185625076, 'learning_rate': 1.0076763456433824e-05, 'epoch': 6.75} 68%|██████▊ | 6753/10000 [10:38:01<4:59:42, 5.54s/it][2025-06-20 00:07:45,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:07:45,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.82 | bwd_microstep: 3371.12 | bwd_inner_microstep: 3370.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-20 00:07:45,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.82 | bwd: 3371.13 | bwd_inner: 3370.31 | bwd_allreduce: 0.78 | step: 6.85 68%|██████▊ | 6754/10000 [10:38:06<4:59:44, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.00021846391609869897, 'learning_rate': 1.0071140007527916e-05, 'epoch': 6.75} 68%|██████▊ | 6754/10000 [10:38:06<4:59:44, 5.54s/it][2025-06-20 00:07:51,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:07:51,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.36 | bwd_microstep: 3399.68 | bwd_inner_microstep: 
3398.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 00:07:51,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.36 | bwd: 3399.69 | bwd_inner: 3398.90 | bwd_allreduce: 0.75 | step: 6.67 68%|██████▊ | 6755/10000 [10:38:12<5:00:15, 5.55s/it] {'loss': 0.002, 'grad_norm': 0.4108569920063019, 'learning_rate': 1.006551760011334e-05, 'epoch': 6.75} 68%|██████▊ | 6755/10000 [10:38:12<5:00:15, 5.55s/it][2025-06-20 00:07:57,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:07:57,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.75 | bwd_microstep: 3371.14 | bwd_inner_microstep: 3370.33 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-20 00:07:57,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.75 | bwd: 3371.15 | bwd_inner: 3370.33 | bwd_allreduce: 0.78 | step: 6.92 68%|██████▊ | 6756/10000 [10:38:17<5:00:04, 5.55s/it] {'loss': 0.001, 'grad_norm': 0.2934087812900543, 'learning_rate': 1.005989623477986e-05, 'epoch': 6.76} 68%|██████▊ | 6756/10000 [10:38:17<5:00:04, 5.55s/it][2025-06-20 00:08:02,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:08:02,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.95 | bwd_microstep: 3370.04 | bwd_inner_microstep: 3369.23 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.26 [2025-06-20 00:08:02,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.95 | bwd: 3370.05 | bwd_inner: 3369.23 | bwd_allreduce: 0.78 | step: 7.26 68%|██████▊ | 6757/10000 [10:38:23<5:00:04, 5.55s/it] {'loss': 0.0002, 'grad_norm': 0.0293746218085289, 'learning_rate': 1.005427591211713e-05, 'epoch': 6.76} 68%|██████▊ | 6757/10000 [10:38:23<5:00:04, 5.55s/it][2025-06-20 00:08:08,047] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:08:08,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.65 | bwd_microstep: 3328.15 | bwd_inner_microstep: 3327.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 00:08:08,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.65 | bwd: 3328.16 | bwd_inner: 3327.35 | bwd_allreduce: 0.76 | step: 6.76 68%|██████▊ | 6758/10000 [10:38:28<4:58:54, 5.53s/it] {'loss': 0.0426, 'grad_norm': 3.567838430404663, 'learning_rate': 1.00486566327147e-05, 'epoch': 6.76} 68%|██████▊ | 6758/10000 [10:38:28<4:58:54, 5.53s/it][2025-06-20 00:08:13,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:08:13,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.68 | bwd_microstep: 3370.61 | bwd_inner_microstep: 3369.76 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.70 [2025-06-20 00:08:13,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.68 | bwd: 3370.63 | bwd_inner: 3369.76 | bwd_allreduce: 0.81 | step: 7.70 68%|██████▊ | 6759/10000 [10:38:34<4:58:53, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.002437059534713626, 'learning_rate': 1.0043038397162003e-05, 'epoch': 6.76} 68%|██████▊ | 6759/10000 [10:38:34<4:58:53, 5.53s/it][2025-06-20 00:08:19,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:08:19,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.92 | bwd_microstep: 3308.29 | bwd_inner_microstep: 3307.23 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.40 [2025-06-20 00:08:19,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.92 | bwd: 3308.31 | bwd_inner: 3307.23 | bwd_allreduce: 1.03 | step: 7.41 68%|██████▊ | 6760/10000 
[10:38:39<4:57:29, 5.51s/it] {'loss': 0.0007, 'grad_norm': 0.07958737760782242, 'learning_rate': 1.0037421206048375e-05, 'epoch': 6.76} 68%|██████▊ | 6760/10000 [10:38:39<4:57:29, 5.51s/it][2025-06-20 00:08:24,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:08:24,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.85 | bwd_microstep: 3369.20 | bwd_inner_microstep: 3368.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.88 [2025-06-20 00:08:24,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.85 | bwd: 3369.21 | bwd_inner: 3368.41 | bwd_allreduce: 0.75 | step: 6.89 68%|██████▊ | 6761/10000 [10:38:45<4:57:50, 5.52s/it] {'loss': 0.0032, 'grad_norm': 0.7634561657905579, 'learning_rate': 1.0031805059963016e-05, 'epoch': 6.76} 68%|██████▊ | 6761/10000 [10:38:45<4:57:50, 5.52s/it][2025-06-20 00:08:30,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:08:30,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.79 | bwd_microstep: 3373.32 | bwd_inner_microstep: 3372.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-20 00:08:30,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.79 | bwd: 3373.33 | bwd_inner: 3372.51 | bwd_allreduce: 0.78 | step: 7.19 68%|██████▊ | 6762/10000 [10:38:50<4:58:06, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0002802973031066358, 'learning_rate': 1.0026189959495044e-05, 'epoch': 6.76} 68%|██████▊ | 6762/10000 [10:38:50<4:58:06, 5.52s/it][2025-06-20 00:08:35,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:08:35,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.79 | bwd_microstep: 3317.81 | bwd_inner_microstep: 
3317.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 00:08:35,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.79 | bwd: 3317.83 | bwd_inner: 3317.01 | bwd_allreduce: 0.77 | step: 6.82 68%|██████▊ | 6763/10000 [10:38:56<4:56:58, 5.50s/it] {'loss': 0.0006, 'grad_norm': 0.1295544058084488, 'learning_rate': 1.0020575905233452e-05, 'epoch': 6.76} 68%|██████▊ | 6763/10000 [10:38:56<4:56:58, 5.50s/it][2025-06-20 00:08:41,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:08:41,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.74 | bwd_microstep: 3370.06 | bwd_inner_microstep: 3369.05 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.13 [2025-06-20 00:08:41,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.74 | bwd: 3370.07 | bwd_inner: 3369.05 | bwd_allreduce: 0.97 | step: 7.14 68%|██████▊ | 6764/10000 [10:39:01<4:57:26, 5.52s/it] {'loss': 0.0051, 'grad_norm': 1.3021610975265503, 'learning_rate': 1.0014962897767133e-05, 'epoch': 6.76} 68%|██████▊ | 6764/10000 [10:39:01<4:57:26, 5.52s/it][2025-06-20 00:08:46,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:08:46,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.14 | bwd_microstep: 3321.75 | bwd_inner_microstep: 3320.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-20 00:08:46,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.14 | bwd: 3321.77 | bwd_inner: 3320.95 | bwd_allreduce: 0.77 | step: 6.96 68%|██████▊ | 6765/10000 [10:39:07<4:56:35, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0019995435141026974, 'learning_rate': 1.0009350937684872e-05, 'epoch': 6.76} 68%|██████▊ | 6765/10000 [10:39:07<4:56:35, 5.50s/it][2025-06-20 00:08:52,044] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:08:52,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.21 | bwd_microstep: 3320.34 | bwd_inner_microstep: 3319.53 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.02 [2025-06-20 00:08:52,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.21 | bwd: 3320.36 | bwd_inner: 3319.53 | bwd_allreduce: 0.78 | step: 7.02 68%|██████▊ | 6766/10000 [10:39:12<4:55:56, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.010402272455394268, 'learning_rate': 1.0003740025575321e-05, 'epoch': 6.77} 68%|██████▊ | 6766/10000 [10:39:12<4:55:56, 5.49s/it][2025-06-20 00:08:57,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:08:57,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.49 | bwd_microstep: 3325.58 | bwd_inner_microstep: 3324.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 00:08:57,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.49 | bwd: 3325.60 | bwd_inner: 3324.80 | bwd_allreduce: 0.76 | step: 6.67 68%|██████▊ | 6767/10000 [10:39:18<4:55:28, 5.48s/it] {'loss': 0.0032, 'grad_norm': 0.899261474609375, 'learning_rate': 9.998130162027049e-06, 'epoch': 6.77} 68%|██████▊ | 6767/10000 [10:39:18<4:55:28, 5.48s/it][2025-06-20 00:09:02,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:09:02,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.92 | bwd_microstep: 3310.55 | bwd_inner_microstep: 3309.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.47 [2025-06-20 00:09:02,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.92 | bwd: 3310.56 | bwd_inner: 3309.73 | bwd_allreduce: 0.78 | step: 7.47 68%|██████▊ | 6768/10000 
[10:39:23<4:54:59, 5.48s/it] {'loss': 0.0066, 'grad_norm': 1.0983633995056152, 'learning_rate': 9.9925213476285e-06, 'epoch': 6.77} 68%|██████▊ | 6768/10000 [10:39:23<4:54:59, 5.48s/it][2025-06-20 00:09:08,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:09:08,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.87 | bwd_microstep: 3320.70 | bwd_inner_microstep: 3319.88 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.87 [2025-06-20 00:09:08,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.87 | bwd: 3320.72 | bwd_inner: 3319.88 | bwd_allreduce: 0.79 | step: 6.87 68%|██████▊ | 6769/10000 [10:39:29<4:54:39, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.08772572129964828, 'learning_rate': 9.986913582968018e-06, 'epoch': 6.77} 68%|██████▊ | 6769/10000 [10:39:29<4:54:39, 5.47s/it][2025-06-20 00:09:13,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:09:13,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.72 | bwd_microstep: 3372.24 | bwd_inner_microstep: 3371.39 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.91 [2025-06-20 00:09:13,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.72 | bwd: 3372.26 | bwd_inner: 3371.39 | bwd_allreduce: 0.81 | step: 6.92 68%|██████▊ | 6770/10000 [10:39:34<4:55:46, 5.49s/it] {'loss': 0.0027, 'grad_norm': 0.42972585558891296, 'learning_rate': 9.98130686863383e-06, 'epoch': 6.77} 68%|██████▊ | 6770/10000 [10:39:34<4:55:46, 5.49s/it][2025-06-20 00:09:19,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:09:19,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.34 | bwd_microstep: 3323.75 | bwd_inner_microstep: 3322.65 | 
bwd_allreduce_microstep: 1.03 | step_microstep: 7.48 [2025-06-20 00:09:19,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.34 | bwd: 3323.77 | bwd_inner: 3322.65 | bwd_allreduce: 1.06 | step: 7.49 68%|██████▊ | 6771/10000 [10:39:40<4:55:25, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.04904753342270851, 'learning_rate': 9.975701205214053e-06, 'epoch': 6.77} 68%|██████▊ | 6771/10000 [10:39:40<4:55:25, 5.49s/it][2025-06-20 00:09:24,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-20 00:09:24,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.60 | bwd_microstep: 3321.08 | bwd_inner_microstep: 3320.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-20 00:09:24,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.60 | bwd: 3321.09 | bwd_inner: 3320.28 | bwd_allreduce: 0.77 | step: 6.91 68%|██████▊ | 6772/10000 [10:39:45<4:54:58, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004689484369009733, 'learning_rate': 9.970096593296691e-06, 'epoch': 6.77} 68%|██████▊ | 6772/10000 [10:39:45<4:54:58, 5.48s/it][2025-06-20 00:09:30,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:09:30,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.45 | bwd_microstep: 3315.94 | bwd_inner_microstep: 3315.00 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.05 [2025-06-20 00:09:30,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.45 | bwd: 3315.96 | bwd_inner: 3315.00 | bwd_allreduce: 0.91 | step: 7.06 68%|██████▊ | 6773/10000 [10:39:51<4:54:29, 5.48s/it] {'loss': 0.0011, 'grad_norm': 0.13332338631153107, 'learning_rate': 9.964493033469651e-06, 'epoch': 6.77} 68%|██████▊ | 6773/10000 [10:39:51<4:54:29, 5.48s/it][2025-06-20 00:09:35,850] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.72 [2025-06-20 00:09:35,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.96 | bwd_microstep: 3320.17 | bwd_inner_microstep: 3319.05 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.65 [2025-06-20 00:09:35,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.96 | bwd: 3320.18 | bwd_inner: 3319.05 | bwd_allreduce: 1.09 | step: 7.66 68%|██████▊ | 6774/10000 [10:39:56<4:54:18, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.025410175323486328, 'learning_rate': 9.958890526320715e-06, 'epoch': 6.77} 68%|██████▊ | 6774/10000 [10:39:56<4:54:18, 5.47s/it][2025-06-20 00:09:41,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:09:41,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.47 | bwd_microstep: 3321.75 | bwd_inner_microstep: 3320.97 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 00:09:41,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.47 | bwd: 3321.76 | bwd_inner: 3320.97 | bwd_allreduce: 0.75 | step: 6.62 68%|██████▊ | 6775/10000 [10:40:02<4:54:12, 5.47s/it] {'loss': 0.0004, 'grad_norm': 0.06989806145429611, 'learning_rate': 9.953289072437566e-06, 'epoch': 6.78} 68%|██████▊ | 6775/10000 [10:40:02<4:54:12, 5.47s/it][2025-06-20 00:09:46,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:09:46,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.29 | bwd_microstep: 3314.22 | bwd_inner_microstep: 3313.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 00:09:46,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.29 | bwd: 3314.24 | bwd_inner: 3313.43 | bwd_allreduce: 0.76 | step: 6.70 68%|██████▊ | 6776/10000 
[10:40:07<4:53:47, 5.47s/it] {'loss': 0.0008, 'grad_norm': 0.21188536286354065, 'learning_rate': 9.947688672407757e-06, 'epoch': 6.78} 68%|██████▊ | 6776/10000 [10:40:07<4:53:47, 5.47s/it][2025-06-20 00:09:52,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:09:52,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.86 | bwd_microstep: 3315.20 | bwd_inner_microstep: 3314.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-20 00:09:52,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.86 | bwd: 3315.21 | bwd_inner: 3314.41 | bwd_allreduce: 0.76 | step: 6.65 68%|██████▊ | 6777/10000 [10:40:13<4:53:32, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.006657855585217476, 'learning_rate': 9.942089326818756e-06, 'epoch': 6.78} 68%|██████▊ | 6777/10000 [10:40:13<4:53:32, 5.46s/it][2025-06-20 00:09:57,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:09:57,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.71 | bwd_microstep: 3320.80 | bwd_inner_microstep: 3319.83 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.71 [2025-06-20 00:09:57,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.72 | bwd: 3320.82 | bwd_inner: 3319.83 | bwd_allreduce: 0.94 | step: 7.71 68%|██████▊ | 6778/10000 [10:40:18<4:53:23, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.011933719739317894, 'learning_rate': 9.9364910362579e-06, 'epoch': 6.78} 68%|██████▊ | 6778/10000 [10:40:18<4:53:23, 5.46s/it][2025-06-20 00:10:03,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:10:03,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.70 | bwd_microstep: 3369.59 | bwd_inner_microstep: 3368.80 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-20 00:10:03,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.70 | bwd: 3369.60 | bwd_inner: 3368.80 | bwd_allreduce: 0.76 | step: 6.66 68%|██████▊ | 6779/10000 [10:40:24<4:54:28, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0018070867517963052, 'learning_rate': 9.93089380131243e-06, 'epoch': 6.78} 68%|██████▊ | 6779/10000 [10:40:24<4:54:28, 5.49s/it][2025-06-20 00:10:08,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:10:08,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.30 | bwd_microstep: 3366.26 | bwd_inner_microstep: 3365.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 00:10:08,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.30 | bwd: 3366.28 | bwd_inner: 3365.49 | bwd_allreduce: 0.75 | step: 6.61 68%|██████▊ | 6780/10000 [10:40:29<4:55:04, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.24758678674697876, 'learning_rate': 9.925297622569477e-06, 'epoch': 6.78} 68%|██████▊ | 6780/10000 [10:40:29<4:55:04, 5.50s/it][2025-06-20 00:10:14,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:10:14,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.15 | bwd_microstep: 3308.72 | bwd_inner_microstep: 3307.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-20 00:10:14,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.15 | bwd: 3308.73 | bwd_inner: 3307.91 | bwd_allreduce: 0.77 | step: 7.13 68%|██████▊ | 6781/10000 [10:40:35<4:54:14, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.018400151282548904, 'learning_rate': 9.919702500616038e-06, 'epoch': 6.78} 68%|██████▊ | 6781/10000 [10:40:35<4:54:14, 5.48s/it][2025-06-20 00:10:19,741] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:10:19,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.05 | bwd_microstep: 3362.39 | bwd_inner_microstep: 3361.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 00:10:19,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.05 | bwd: 3362.40 | bwd_inner: 3361.59 | bwd_allreduce: 0.76 | step: 6.71 68%|██████▊ | 6782/10000 [10:40:40<4:54:47, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.04852110520005226, 'learning_rate': 9.914108436039021e-06, 'epoch': 6.78} 68%|██████▊ | 6782/10000 [10:40:40<4:54:47, 5.50s/it][2025-06-20 00:10:25,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:10:25,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.03 | bwd_microstep: 3321.40 | bwd_inner_microstep: 3320.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 00:10:25,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.03 | bwd: 3321.41 | bwd_inner: 3320.61 | bwd_allreduce: 0.76 | step: 6.66 68%|██████▊ | 6783/10000 [10:40:46<4:54:11, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005754494108259678, 'learning_rate': 9.908515429425219e-06, 'epoch': 6.78} 68%|██████▊ | 6783/10000 [10:40:46<4:54:11, 5.49s/it][2025-06-20 00:10:30,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:10:30,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.35 | bwd_microstep: 3320.94 | bwd_inner_microstep: 3319.98 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.05 [2025-06-20 00:10:30,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.35 | bwd: 3320.96 | bwd_inner: 3319.98 | bwd_allreduce: 0.93 | step: 7.06 68%|██████▊ | 6784/10000 [10:40:51<4:53:43, 
5.48s/it] {'loss': 0.0, 'grad_norm': 0.004545385949313641, 'learning_rate': 9.90292348136131e-06, 'epoch': 6.78} 68%|██████▊ | 6784/10000 [10:40:51<4:53:43, 5.48s/it][2025-06-20 00:10:36,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-20 00:10:36,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.71 | bwd_microstep: 3314.00 | bwd_inner_microstep: 3313.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 00:10:36,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.71 | bwd: 3314.02 | bwd_inner: 3313.21 | bwd_allreduce: 0.76 | step: 6.82 68%|██████▊ | 6785/10000 [10:40:56<4:53:15, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.03681040182709694, 'learning_rate': 9.897332592433876e-06, 'epoch': 6.79} 68%|██████▊ | 6785/10000 [10:40:56<4:53:15, 5.47s/it][2025-06-20 00:10:41,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 00:10:41,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.40 | bwd_microstep: 3310.58 | bwd_inner_microstep: 3309.58 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.37 [2025-06-20 00:10:41,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.40 | bwd: 3310.60 | bwd_inner: 3309.58 | bwd_allreduce: 0.97 | step: 7.37 68%|██████▊ | 6786/10000 [10:41:02<4:52:47, 5.47s/it] {'loss': 0.005, 'grad_norm': 1.4511027336120605, 'learning_rate': 9.891742763229358e-06, 'epoch': 6.79} 68%|██████▊ | 6786/10000 [10:41:02<4:52:47, 5.47s/it][2025-06-20 00:10:47,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.84 [2025-06-20 00:10:47,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.55 | bwd_microstep: 3365.39 | bwd_inner_microstep: 3364.56 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 7.16 [2025-06-20 00:10:47,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.55 | bwd: 3365.40 | bwd_inner: 3364.56 | bwd_allreduce: 0.80 | step: 7.16 68%|██████▊ | 6787/10000 [10:41:07<4:53:48, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.01506365742534399, 'learning_rate': 9.88615399433411e-06, 'epoch': 6.79} 68%|██████▊ | 6787/10000 [10:41:07<4:53:48, 5.49s/it][2025-06-20 00:10:52,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:10:52,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.84 | bwd_microstep: 3367.20 | bwd_inner_microstep: 3366.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 00:10:52,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.84 | bwd: 3367.22 | bwd_inner: 3366.41 | bwd_allreduce: 0.77 | step: 6.69 68%|██████▊ | 6788/10000 [10:41:13<4:54:23, 5.50s/it] {'loss': 0.0247, 'grad_norm': 3.0468060970306396, 'learning_rate': 9.880566286334367e-06, 'epoch': 6.79} 68%|██████▊ | 6788/10000 [10:41:13<4:54:23, 5.50s/it][2025-06-20 00:10:58,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:10:58,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.36 | bwd_microstep: 3362.77 | bwd_inner_microstep: 3361.77 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.09 [2025-06-20 00:10:58,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.36 | bwd: 3362.78 | bwd_inner: 3361.77 | bwd_allreduce: 0.96 | step: 7.09 68%|██████▊ | 6789/10000 [10:41:18<4:54:47, 5.51s/it] {'loss': 0.0079, 'grad_norm': 3.3103861808776855, 'learning_rate': 9.874979639816249e-06, 'epoch': 6.79} 68%|██████▊ | 6789/10000 [10:41:18<4:54:47, 5.51s/it][2025-06-20 00:11:03,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:11:03,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.86 | bwd_microstep: 3304.86 | bwd_inner_microstep: 3304.05 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.80 [2025-06-20 00:11:03,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.86 | bwd: 3304.88 | bwd_inner: 3304.05 | bwd_allreduce: 0.79 | step: 6.81 68%|██████▊ | 6790/10000 [10:41:24<4:53:41, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.07354489713907242, 'learning_rate': 9.869394055365788e-06, 'epoch': 6.79} 68%|██████▊ | 6790/10000 [10:41:24<4:53:41, 5.49s/it][2025-06-20 00:11:09,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:11:09,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.03 | bwd_microstep: 3368.54 | bwd_inner_microstep: 3367.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-20 00:11:09,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.03 | bwd: 3368.56 | bwd_inner: 3367.75 | bwd_allreduce: 0.76 | step: 6.67 68%|██████▊ | 6791/10000 [10:41:29<4:54:20, 5.50s/it] {'loss': 0.0017, 'grad_norm': 0.3742513358592987, 'learning_rate': 9.863809533568867e-06, 'epoch': 6.79} 68%|██████▊ | 6791/10000 [10:41:29<4:54:20, 5.50s/it][2025-06-20 00:11:14,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:11:14,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.90 | bwd_microstep: 3361.83 | bwd_inner_microstep: 3360.80 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.76 [2025-06-20 00:11:14,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.90 | bwd: 3361.85 | bwd_inner: 3360.80 | bwd_allreduce: 1.00 | step: 7.76 68%|██████▊ | 6792/10000 [10:41:35<4:54:36, 
5.51s/it] {'loss': 0.0, 'grad_norm': 0.006429447792470455, 'learning_rate': 9.858226075011284e-06, 'epoch': 6.79} 68%|██████▊ | 6792/10000 [10:41:35<4:54:36, 5.51s/it][2025-06-20 00:11:20,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:11:20,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.56 | bwd_microstep: 3306.37 | bwd_inner_microstep: 3305.48 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.09 [2025-06-20 00:11:20,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.56 | bwd: 3306.39 | bwd_inner: 3305.48 | bwd_allreduce: 0.86 | step: 7.09 68%|██████▊ | 6793/10000 [10:41:40<4:53:38, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00194788898807019, 'learning_rate': 9.852643680278713e-06, 'epoch': 6.79} 68%|██████▊ | 6793/10000 [10:41:40<4:53:38, 5.49s/it][2025-06-20 00:11:25,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 00:11:25,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.00 | bwd_microstep: 3314.85 | bwd_inner_microstep: 3313.76 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.29 [2025-06-20 00:11:25,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.00 | bwd: 3314.87 | bwd_inner: 3313.76 | bwd_allreduce: 1.04 | step: 7.30 68%|██████▊ | 6794/10000 [10:41:46<4:53:21, 5.49s/it] {'loss': 0.0012, 'grad_norm': 0.3402246832847595, 'learning_rate': 9.847062349956726e-06, 'epoch': 6.79} 68%|██████▊ | 6794/10000 [10:41:46<4:53:21, 5.49s/it][2025-06-20 00:11:31,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:11:31,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.56 | bwd_microstep: 3360.63 | bwd_inner_microstep: 3359.68 | 
bwd_allreduce_microstep: 0.90 | step_microstep: 7.14 [2025-06-20 00:11:31,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.56 | bwd: 3360.65 | bwd_inner: 3359.68 | bwd_allreduce: 0.92 | step: 7.14 68%|██████▊ | 6795/10000 [10:41:51<4:53:49, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.018437061458826065, 'learning_rate': 9.841482084630785e-06, 'epoch': 6.79} 68%|██████▊ | 6795/10000 [10:41:51<4:53:49, 5.50s/it][2025-06-20 00:11:36,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 00:11:36,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.55 | bwd_microstep: 3314.83 | bwd_inner_microstep: 3313.96 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.42 [2025-06-20 00:11:36,604] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.55 | bwd: 3314.85 | bwd_inner: 3313.96 | bwd_allreduce: 0.84 | step: 7.42 68%|██████▊ | 6796/10000 [10:41:57<4:53:10, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0005327769322320819, 'learning_rate': 9.83590288488622e-06, 'epoch': 6.8} 68%|██████▊ | 6796/10000 [10:41:57<4:53:10, 5.49s/it][2025-06-20 00:11:42,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:11:42,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.04 | bwd_microstep: 3366.41 | bwd_inner_microstep: 3365.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-20 00:11:42,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.04 | bwd: 3366.42 | bwd_inner: 3365.62 | bwd_allreduce: 0.76 | step: 6.75 68%|██████▊ | 6797/10000 [10:42:02<4:53:48, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.04256409779191017, 'learning_rate': 9.830324751308267e-06, 'epoch': 6.8} 68%|██████▊ | 6797/10000 [10:42:02<4:53:48, 5.50s/it][2025-06-20 00:11:47,683] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:11:47,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.26 | bwd_microstep: 3368.60 | bwd_inner_microstep: 3367.66 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.13 [2025-06-20 00:11:47,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.26 | bwd: 3368.62 | bwd_inner: 3367.66 | bwd_allreduce: 0.91 | step: 7.14 68%|██████▊ | 6798/10000 [10:42:08<4:54:21, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.004045203793793917, 'learning_rate': 9.82474768448205e-06, 'epoch': 6.8} 68%|██████▊ | 6798/10000 [10:42:08<4:54:21, 5.52s/it][2025-06-20 00:11:53,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 00:11:53,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.64 | bwd_microstep: 3364.63 | bwd_inner_microstep: 3363.67 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.30 [2025-06-20 00:11:53,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.63 | bwd: 3364.64 | bwd_inner: 3363.67 | bwd_allreduce: 0.93 | step: 7.31 68%|██████▊ | 6799/10000 [10:42:14<4:54:28, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.04134511575102806, 'learning_rate': 9.819171684992575e-06, 'epoch': 6.8} 68%|██████▊ | 6799/10000 [10:42:14<4:54:28, 5.52s/it][2025-06-20 00:11:58,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:11:58,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.06 | bwd_microstep: 3363.90 | bwd_inner_microstep: 3363.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 00:11:58,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.06 | bwd: 3363.91 | bwd_inner: 3363.11 | bwd_allreduce: 0.76 | step: 6.64 68%|██████▊ | 6800/10000 [10:42:19<4:54:36, 
5.52s/it] {'loss': 0.0003, 'grad_norm': 0.06114114075899124, 'learning_rate': 9.813596753424747e-06, 'epoch': 6.8} 68%|██████▊ | 6800/10000 [10:42:19<4:54:36, 5.52s/it][2025-06-20 00:12:04,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:12:04,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.81 | bwd_microstep: 3363.01 | bwd_inner_microstep: 3362.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-20 00:12:04,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.81 | bwd: 3363.03 | bwd_inner: 3362.21 | bwd_allreduce: 0.77 | step: 6.99 68%|██████▊ | 6801/10000 [10:42:25<4:54:41, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.00043146865209564567, 'learning_rate': 9.808022890363336e-06, 'epoch': 6.8} 68%|██████▊ | 6801/10000 [10:42:25<4:54:41, 5.53s/it][2025-06-20 00:12:09,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:12:09,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.00 | bwd_microstep: 3308.27 | bwd_inner_microstep: 3307.35 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.01 [2025-06-20 00:12:09,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.00 | bwd: 3308.29 | bwd_inner: 3307.35 | bwd_allreduce: 0.89 | step: 7.01 68%|██████▊ | 6802/10000 [10:42:30<4:53:27, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00025795004330575466, 'learning_rate': 9.802450096393022e-06, 'epoch': 6.8} 68%|██████▊ | 6802/10000 [10:42:30<4:53:27, 5.51s/it][2025-06-20 00:12:15,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:12:15,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.85 | bwd_microstep: 3316.60 | bwd_inner_microstep: 3315.79 | 
bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-20 00:12:15,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.85 | bwd: 3316.61 | bwd_inner: 3315.79 | bwd_allreduce: 0.78 | step: 7.11 68%|██████▊ | 6803/10000 [10:42:36<4:52:44, 5.49s/it] {'loss': 0.0009, 'grad_norm': 0.33725497126579285, 'learning_rate': 9.796878372098364e-06, 'epoch': 6.8} 68%|██████▊ | 6803/10000 [10:42:36<4:52:44, 5.49s/it][2025-06-20 00:12:20,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:12:20,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.98 | bwd_microstep: 3316.20 | bwd_inner_microstep: 3315.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-20 00:12:20,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.98 | bwd: 3316.22 | bwd_inner: 3315.41 | bwd_allreduce: 0.76 | step: 6.88 68%|██████▊ | 6804/10000 [10:42:41<4:52:05, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.023014573380351067, 'learning_rate': 9.791307718063812e-06, 'epoch': 6.8} 68%|██████▊ | 6804/10000 [10:42:41<4:52:05, 5.48s/it][2025-06-20 00:12:26,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:12:26,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.25 | bwd_microstep: 3364.06 | bwd_inner_microstep: 3363.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 00:12:26,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.25 | bwd: 3364.07 | bwd_inner: 3363.27 | bwd_allreduce: 0.75 | step: 6.58 68%|██████▊ | 6805/10000 [10:42:46<4:52:36, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0016891881823539734, 'learning_rate': 9.785738134873699e-06, 'epoch': 6.8} 68%|██████▊ | 6805/10000 [10:42:46<4:52:36, 5.50s/it][2025-06-20 00:12:31,749] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:12:31,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.48 | bwd_microstep: 3396.64 | bwd_inner_microstep: 3395.82 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.79 [2025-06-20 00:12:31,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.48 | bwd: 3396.66 | bwd_inner: 3395.82 | bwd_allreduce: 0.79 | step: 6.80 68%|██████▊ | 6806/10000 [10:42:52<4:53:37, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.000464881508378312, 'learning_rate': 9.78016962311225e-06, 'epoch': 6.81} 68%|██████▊ | 6806/10000 [10:42:52<4:53:37, 5.52s/it][2025-06-20 00:12:37,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:12:37,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.59 | bwd_microstep: 3364.26 | bwd_inner_microstep: 3363.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 00:12:37,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.59 | bwd: 3364.27 | bwd_inner: 3363.47 | bwd_allreduce: 0.76 | step: 6.64 68%|██████▊ | 6807/10000 [10:42:58<4:53:38, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.005782420746982098, 'learning_rate': 9.774602183363575e-06, 'epoch': 6.81} 68%|██████▊ | 6807/10000 [10:42:58<4:53:38, 5.52s/it][2025-06-20 00:12:42,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:12:42,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.16 | bwd_microstep: 3314.28 | bwd_inner_microstep: 3313.46 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-20 00:12:42,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.16 | bwd: 3314.29 | bwd_inner: 3313.46 | bwd_allreduce: 0.78 | step: 7.31 68%|██████▊ | 6808/10000 [10:43:03<4:52:32, 
5.50s/it] {'loss': 0.0001, 'grad_norm': 0.011070086620748043, 'learning_rate': 9.769035816211675e-06, 'epoch': 6.81} 68%|██████▊ | 6808/10000 [10:43:03<4:52:32, 5.50s/it][2025-06-20 00:12:48,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:12:48,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.55 | bwd_microstep: 3314.43 | bwd_inner_microstep: 3313.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 00:12:48,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.55 | bwd: 3314.44 | bwd_inner: 3313.64 | bwd_allreduce: 0.76 | step: 6.71 68%|██████▊ | 6809/10000 [10:43:08<4:51:46, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.02887541614472866, 'learning_rate': 9.763470522240432e-06, 'epoch': 6.81} 68%|██████▊ | 6809/10000 [10:43:08<4:51:46, 5.49s/it][2025-06-20 00:12:53,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.86 [2025-06-20 00:12:53,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.75 | bwd_microstep: 3315.30 | bwd_inner_microstep: 3314.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-20 00:12:53,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.75 | bwd: 3315.31 | bwd_inner: 3314.51 | bwd_allreduce: 0.76 | step: 6.75 68%|██████▊ | 6810/10000 [10:43:14<4:51:07, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.006915396079421043, 'learning_rate': 9.757906302033633e-06, 'epoch': 6.81} 68%|██████▊ | 6810/10000 [10:43:14<4:51:07, 5.48s/it][2025-06-20 00:12:59,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:12:59,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.19 | bwd_microstep: 3360.77 | bwd_inner_microstep: 3359.98 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 00:12:59,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.19 | bwd: 3360.78 | bwd_inner: 3359.98 | bwd_allreduce: 0.76 | step: 6.68 68%|██████▊ | 6811/10000 [10:43:19<4:51:41, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0016134926117956638, 'learning_rate': 9.752343156174917e-06, 'epoch': 6.81} 68%|██████▊ | 6811/10000 [10:43:19<4:51:41, 5.49s/it][2025-06-20 00:13:04,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:13:04,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.06 | bwd_microstep: 3362.58 | bwd_inner_microstep: 3361.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 00:13:04,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.06 | bwd: 3362.60 | bwd_inner: 3361.80 | bwd_allreduce: 0.75 | step: 6.58 68%|██████▊ | 6812/10000 [10:43:25<4:52:13, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.04461769014596939, 'learning_rate': 9.746781085247845e-06, 'epoch': 6.81} 68%|██████▊ | 6812/10000 [10:43:25<4:52:13, 5.50s/it][2025-06-20 00:13:10,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:13:10,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.05 | bwd_microstep: 3308.60 | bwd_inner_microstep: 3307.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 00:13:10,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.05 | bwd: 3308.61 | bwd_inner: 3307.82 | bwd_allreduce: 0.75 | step: 6.66 68%|██████▊ | 6813/10000 [10:43:30<4:51:17, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.02588399313390255, 'learning_rate': 9.74122008983585e-06, 'epoch': 6.81} 68%|██████▊ | 6813/10000 [10:43:30<4:51:17, 5.48s/it][2025-06-20 00:13:15,580] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:13:15,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.81 | bwd_microstep: 3316.95 | bwd_inner_microstep: 3316.12 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.43 [2025-06-20 00:13:15,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.81 | bwd: 3316.96 | bwd_inner: 3316.12 | bwd_allreduce: 0.80 | step: 7.43 68%|██████▊ | 6814/10000 [10:43:36<4:50:45, 5.48s/it] {'loss': 0.0013, 'grad_norm': 0.340143620967865, 'learning_rate': 9.735660170522257e-06, 'epoch': 6.81} 68%|██████▊ | 6814/10000 [10:43:36<4:50:45, 5.48s/it][2025-06-20 00:13:21,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:13:21,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.59 | bwd_microstep: 3310.65 | bwd_inner_microstep: 3309.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 00:13:21,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.59 | bwd: 3310.67 | bwd_inner: 3309.86 | bwd_allreduce: 0.76 | step: 6.72 68%|██████▊ | 6815/10000 [10:43:41<4:50:18, 5.47s/it] {'loss': 0.0329, 'grad_norm': 6.938263893127441, 'learning_rate': 9.730101327890277e-06, 'epoch': 6.81} 68%|██████▊ | 6815/10000 [10:43:41<4:50:18, 5.47s/it][2025-06-20 00:13:26,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:13:26,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.14 | bwd_microstep: 3316.33 | bwd_inner_microstep: 3315.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 00:13:26,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.14 | bwd: 3316.35 | bwd_inner: 3315.54 | bwd_allreduce: 0.76 | step: 6.73 68%|██████▊ | 6816/10000 [10:43:47<4:50:00, 
5.47s/it] {'loss': 0.0119, 'grad_norm': 2.857743501663208, 'learning_rate': 9.724543562523e-06, 'epoch': 6.82} 68%|██████▊ | 6816/10000 [10:43:47<4:50:00, 5.47s/it][2025-06-20 00:13:31,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:13:31,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.45 | bwd_microstep: 3309.97 | bwd_inner_microstep: 3309.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-20 00:13:31,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.45 | bwd: 3309.99 | bwd_inner: 3309.16 | bwd_allreduce: 0.78 | step: 7.25 68%|██████▊ | 6817/10000 [10:43:52<4:49:37, 5.46s/it] {'loss': 0.1377, 'grad_norm': 5.550398826599121, 'learning_rate': 9.718986875003413e-06, 'epoch': 6.82} 68%|██████▊ | 6817/10000 [10:43:52<4:49:37, 5.46s/it][2025-06-20 00:13:37,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:13:37,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.86 | bwd_microstep: 3369.16 | bwd_inner_microstep: 3368.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 00:13:37,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.86 | bwd: 3369.18 | bwd_inner: 3368.37 | bwd_allreduce: 0.76 | step: 6.68 68%|██████▊ | 6818/10000 [10:43:58<4:50:39, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.022197775542736053, 'learning_rate': 9.713431265914387e-06, 'epoch': 6.82} 68%|██████▊ | 6818/10000 [10:43:58<4:50:39, 5.48s/it][2025-06-20 00:13:42,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:13:42,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.86 | bwd_microstep: 3317.73 | bwd_inner_microstep: 3316.95 | 
bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 00:13:42,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.86 | bwd: 3317.75 | bwd_inner: 3316.95 | bwd_allreduce: 0.76 | step: 6.65 68%|██████▊ | 6819/10000 [10:44:03<4:50:11, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00027998030418530107, 'learning_rate': 9.707876735838677e-06, 'epoch': 6.82} 68%|██████▊ | 6819/10000 [10:44:03<4:50:11, 5.47s/it][2025-06-20 00:13:48,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:13:48,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.93 | bwd_microstep: 3318.69 | bwd_inner_microstep: 3317.88 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.30 [2025-06-20 00:13:48,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.93 | bwd: 3318.71 | bwd_inner: 3317.88 | bwd_allreduce: 0.79 | step: 7.30 68%|██████▊ | 6820/10000 [10:44:09<4:49:57, 5.47s/it] {'loss': 0.0645, 'grad_norm': 10.718414306640625, 'learning_rate': 9.702323285358932e-06, 'epoch': 6.82} 68%|██████▊ | 6820/10000 [10:44:09<4:49:57, 5.47s/it][2025-06-20 00:13:53,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:13:53,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.39 | bwd_microstep: 3362.36 | bwd_inner_microstep: 3361.56 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-20 00:13:53,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.39 | bwd: 3362.38 | bwd_inner: 3361.56 | bwd_allreduce: 0.78 | step: 6.80 68%|██████▊ | 6821/10000 [10:44:14<4:50:46, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.03490709885954857, 'learning_rate': 9.69677091505769e-06, 'epoch': 6.82} 68%|██████▊ | 6821/10000 [10:44:14<4:50:46, 5.49s/it][2025-06-20 00:13:59,448] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:13:59,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.95 | bwd_microstep: 3364.36 | bwd_inner_microstep: 3363.54 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.75 [2025-06-20 00:13:59,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.95 | bwd: 3364.37 | bwd_inner: 3363.54 | bwd_allreduce: 0.79 | step: 6.76 68%|██████▊ | 6822/10000 [10:44:20<4:51:22, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.177580326795578, 'learning_rate': 9.691219625517352e-06, 'epoch': 6.82} 68%|██████▊ | 6822/10000 [10:44:20<4:51:22, 5.50s/it][2025-06-20 00:14:04,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:14:04,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.01 | bwd_microstep: 3316.91 | bwd_inner_microstep: 3316.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-20 00:14:04,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.01 | bwd: 3316.93 | bwd_inner: 3316.11 | bwd_allreduce: 0.77 | step: 6.80 68%|██████▊ | 6823/10000 [10:44:25<4:50:33, 5.49s/it] {'loss': 0.0173, 'grad_norm': 2.641921043395996, 'learning_rate': 9.68566941732023e-06, 'epoch': 6.82} 68%|██████▊ | 6823/10000 [10:44:25<4:50:33, 5.49s/it][2025-06-20 00:14:10,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:14:10,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.25 | bwd_microstep: 3373.69 | bwd_inner_microstep: 3372.79 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.43 [2025-06-20 00:14:10,445] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.25 | bwd: 3373.71 | bwd_inner: 3372.79 | bwd_allreduce: 0.87 | step: 7.43 68%|██████▊ | 6824/10000 [10:44:31<4:51:20, 
5.50s/it] {'loss': 0.0, 'grad_norm': 0.0021661019418388605, 'learning_rate': 9.680120291048509e-06, 'epoch': 6.82} 68%|██████▊ | 6824/10000 [10:44:31<4:51:20, 5.50s/it][2025-06-20 00:14:15,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:14:15,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.89 | bwd_microstep: 3330.58 | bwd_inner_microstep: 3329.59 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.34 [2025-06-20 00:14:15,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.89 | bwd: 3330.59 | bwd_inner: 3329.59 | bwd_allreduce: 0.96 | step: 7.34 68%|██████▊ | 6825/10000 [10:44:36<4:51:00, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.03598949685692787, 'learning_rate': 9.674572247284282e-06, 'epoch': 6.83} 68%|██████▊ | 6825/10000 [10:44:36<4:51:00, 5.50s/it][2025-06-20 00:14:21,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.90 [2025-06-20 00:14:21,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.44 | bwd_microstep: 3322.43 | bwd_inner_microstep: 3321.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.19 [2025-06-20 00:14:21,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.44 | bwd: 3322.45 | bwd_inner: 3321.63 | bwd_allreduce: 0.77 | step: 7.19 68%|██████▊ | 6826/10000 [10:44:42<4:50:34, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.012586378492414951, 'learning_rate': 9.66902528660951e-06, 'epoch': 6.83} 68%|██████▊ | 6826/10000 [10:44:42<4:50:34, 5.49s/it][2025-06-20 00:14:26,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:14:26,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.48 | bwd_microstep: 3321.98 | bwd_inner_microstep: 3321.17 | 
bwd_allreduce_microstep: 0.77 | step_microstep: 6.89 [2025-06-20 00:14:26,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.48 | bwd: 3322.00 | bwd_inner: 3321.17 | bwd_allreduce: 0.78 | step: 6.89 68%|██████▊ | 6827/10000 [10:44:47<4:50:12, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.010775000788271427, 'learning_rate': 9.663479409606032e-06, 'epoch': 6.83} 68%|██████▊ | 6827/10000 [10:44:47<4:50:12, 5.49s/it][2025-06-20 00:14:32,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:14:32,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.80 | bwd_microstep: 3366.43 | bwd_inner_microstep: 3365.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-20 00:14:32,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.80 | bwd: 3366.45 | bwd_inner: 3365.65 | bwd_allreduce: 0.76 | step: 6.76 68%|██████▊ | 6828/10000 [10:44:53<4:50:51, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.08210103213787079, 'learning_rate': 9.65793461685559e-06, 'epoch': 6.83} 68%|██████▊ | 6828/10000 [10:44:53<4:50:51, 5.50s/it][2025-06-20 00:14:37,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:14:37,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.21 | bwd_microstep: 3322.51 | bwd_inner_microstep: 3321.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 00:14:37,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.21 | bwd: 3322.53 | bwd_inner: 3321.72 | bwd_allreduce: 0.76 | step: 6.69 68%|██████▊ | 6829/10000 [10:44:58<4:50:15, 5.49s/it] {'loss': 0.1071, 'grad_norm': 5.715510845184326, 'learning_rate': 9.652390908939809e-06, 'epoch': 6.83} 68%|██████▊ | 6829/10000 [10:44:58<4:50:15, 5.49s/it][2025-06-20 00:14:43,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:14:43,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.07 | bwd_microstep: 3374.80 | bwd_inner_microstep: 3374.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-20 00:14:43,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.07 | bwd: 3374.82 | bwd_inner: 3374.00 | bwd_allreduce: 0.77 | step: 6.87 68%|██████▊ | 6830/10000 [10:45:04<4:51:05, 5.51s/it] {'loss': 0.0013, 'grad_norm': 0.5186745524406433, 'learning_rate': 9.646848286440193e-06, 'epoch': 6.83} 68%|██████▊ | 6830/10000 [10:45:04<4:51:05, 5.51s/it][2025-06-20 00:14:48,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:14:48,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.78 | bwd_microstep: 3324.11 | bwd_inner_microstep: 3323.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 00:14:48,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.78 | bwd: 3324.12 | bwd_inner: 3323.31 | bwd_allreduce: 0.77 | step: 7.02 68%|██████▊ | 6831/10000 [10:45:09<4:50:29, 5.50s/it] {'loss': 0.0034, 'grad_norm': 0.6228176355361938, 'learning_rate': 9.641306749938154e-06, 'epoch': 6.83} 68%|██████▊ | 6831/10000 [10:45:09<4:50:29, 5.50s/it][2025-06-20 00:14:54,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:14:54,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.18 | bwd_microstep: 3375.61 | bwd_inner_microstep: 3374.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 00:14:54,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.18 | bwd: 3375.62 | bwd_inner: 3374.82 | bwd_allreduce: 0.76 | step: 6.71 68%|██████▊ | 6832/10000 [10:45:15<4:51:06, 
5.51s/it] {'loss': 0.0001, 'grad_norm': 0.014295249246060848, 'learning_rate': 9.635766300014949e-06, 'epoch': 6.83} 68%|██████▊ | 6832/10000 [10:45:15<4:51:06, 5.51s/it][2025-06-20 00:14:59,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:14:59,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.97 | bwd_microstep: 3317.91 | bwd_inner_microstep: 3317.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 00:14:59,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.97 | bwd: 3317.92 | bwd_inner: 3317.11 | bwd_allreduce: 0.77 | step: 6.75 68%|██████▊ | 6833/10000 [10:45:20<4:50:18, 5.50s/it] {'loss': 0.0056, 'grad_norm': 1.5662097930908203, 'learning_rate': 9.630226937251759e-06, 'epoch': 6.83} 68%|██████▊ | 6833/10000 [10:45:20<4:50:18, 5.50s/it][2025-06-20 00:15:05,427] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:15:05,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.49 | bwd_microstep: 3329.35 | bwd_inner_microstep: 3328.46 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.36 [2025-06-20 00:15:05,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.49 | bwd: 3329.37 | bwd_inner: 3328.46 | bwd_allreduce: 0.85 | step: 7.36 68%|██████▊ | 6834/10000 [10:45:26<4:50:08, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0014553108485415578, 'learning_rate': 9.624688662229635e-06, 'epoch': 6.83} 68%|██████▊ | 6834/10000 [10:45:26<4:50:08, 5.50s/it][2025-06-20 00:15:10,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:15:10,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.03 | bwd_microstep: 3367.38 | bwd_inner_microstep: 3366.55 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 6.99 [2025-06-20 00:15:10,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.03 | bwd: 3367.40 | bwd_inner: 3366.55 | bwd_allreduce: 0.80 | step: 7.00 68%|██████▊ | 6835/10000 [10:45:31<4:50:46, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.10785014182329178, 'learning_rate': 9.619151475529518e-06, 'epoch': 6.83} 68%|██████▊ | 6835/10000 [10:45:31<4:50:46, 5.51s/it][2025-06-20 00:15:16,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-20 00:15:16,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.38 | bwd_microstep: 3330.39 | bwd_inner_microstep: 3329.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 00:15:16,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.38 | bwd: 3330.40 | bwd_inner: 3329.60 | bwd_allreduce: 0.76 | step: 6.69 68%|██████▊ | 6836/10000 [10:45:37<4:50:09, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.05472665652632713, 'learning_rate': 9.613615377732239e-06, 'epoch': 6.84} 68%|██████▊ | 6836/10000 [10:45:37<4:50:09, 5.50s/it][2025-06-20 00:15:21,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:15:21,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.76 | bwd_microstep: 3371.95 | bwd_inner_microstep: 3371.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-20 00:15:21,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.76 | bwd: 3371.96 | bwd_inner: 3371.14 | bwd_allreduce: 0.78 | step: 6.97 68%|██████▊ | 6837/10000 [10:45:42<4:50:37, 5.51s/it] {'loss': 0.3064, 'grad_norm': 5.7324748039245605, 'learning_rate': 9.608080369418496e-06, 'epoch': 6.84} 68%|██████▊ | 6837/10000 [10:45:42<4:50:37, 5.51s/it][2025-06-20 00:15:27,492] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:15:27,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.31 | bwd_microstep: 3344.30 | bwd_inner_microstep: 3343.44 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.44 [2025-06-20 00:15:27,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.31 | bwd: 3344.32 | bwd_inner: 3343.44 | bwd_allreduce: 0.82 | step: 7.44 68%|██████▊ | 6838/10000 [10:45:48<4:50:25, 5.51s/it] {'loss': 0.0012, 'grad_norm': 0.16384391486644745, 'learning_rate': 9.60254645116889e-06, 'epoch': 6.84} 68%|██████▊ | 6838/10000 [10:45:48<4:50:25, 5.51s/it][2025-06-20 00:15:32,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:15:32,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.77 | bwd_microstep: 3324.73 | bwd_inner_microstep: 3323.95 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 00:15:32,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.77 | bwd: 3324.74 | bwd_inner: 3323.95 | bwd_allreduce: 0.75 | step: 6.63 68%|██████▊ | 6839/10000 [10:45:53<4:49:45, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.025329548865556717, 'learning_rate': 9.597013623563906e-06, 'epoch': 6.84} 68%|██████▊ | 6839/10000 [10:45:53<4:49:45, 5.50s/it][2025-06-20 00:15:38,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:15:38,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.27 | bwd_microstep: 3329.95 | bwd_inner_microstep: 3329.12 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.94 [2025-06-20 00:15:38,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.27 | bwd: 3329.97 | bwd_inner: 3329.12 | bwd_allreduce: 0.80 | step: 6.94 68%|██████▊ | 6840/10000 
[10:45:59<4:49:23, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0041204532608389854, 'learning_rate': 9.59148188718391e-06, 'epoch': 6.84} 68%|██████▊ | 6840/10000 [10:45:59<4:49:23, 5.49s/it][2025-06-20 00:15:43,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:15:43,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.66 | bwd_microstep: 3334.27 | bwd_inner_microstep: 3333.44 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.20 [2025-06-20 00:15:43,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.66 | bwd: 3334.28 | bwd_inner: 3333.44 | bwd_allreduce: 0.79 | step: 7.21 68%|██████▊ | 6841/10000 [10:46:04<4:49:12, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.007389604579657316, 'learning_rate': 9.585951242609159e-06, 'epoch': 6.84} 68%|██████▊ | 6841/10000 [10:46:04<4:49:12, 5.49s/it][2025-06-20 00:15:49,412] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:15:49,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.59 | bwd_microstep: 3322.75 | bwd_inner_microstep: 3321.92 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.02 [2025-06-20 00:15:49,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.59 | bwd: 3322.77 | bwd_inner: 3321.92 | bwd_allreduce: 0.79 | step: 7.03 68%|██████▊ | 6842/10000 [10:46:10<4:48:46, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.012832307256758213, 'learning_rate': 9.580421690419787e-06, 'epoch': 6.84} 68%|██████▊ | 6842/10000 [10:46:10<4:48:46, 5.49s/it][2025-06-20 00:15:55,009] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:15:55,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.53 | bwd_microstep: 3411.43 | bwd_inner_microstep: 3410.61 
| bwd_allreduce_microstep: 0.77 | step_microstep: 6.89 [2025-06-20 00:15:55,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.53 | bwd: 3411.45 | bwd_inner: 3410.61 | bwd_allreduce: 0.79 | step: 6.89 68%|██████▊ | 6843/10000 [10:46:15<4:50:25, 5.52s/it] {'loss': 0.0032, 'grad_norm': 1.8087743520736694, 'learning_rate': 9.574893231195823e-06, 'epoch': 6.84} 68%|██████▊ | 6843/10000 [10:46:15<4:50:25, 5.52s/it][2025-06-20 00:16:00,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:16:00,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.53 | bwd_microstep: 3414.31 | bwd_inner_microstep: 3413.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-20 00:16:00,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.53 | bwd: 3414.33 | bwd_inner: 3413.52 | bwd_allreduce: 0.77 | step: 7.16 68%|██████▊ | 6844/10000 [10:46:21<4:51:33, 5.54s/it] {'loss': 0.0063, 'grad_norm': 1.065834641456604, 'learning_rate': 9.569365865517176e-06, 'epoch': 6.84} 68%|██████▊ | 6844/10000 [10:46:21<4:51:33, 5.54s/it][2025-06-20 00:16:06,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:16:06,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.91 | bwd_microstep: 3331.96 | bwd_inner_microstep: 3331.03 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.10 [2025-06-20 00:16:06,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.91 | bwd: 3331.98 | bwd_inner: 3331.03 | bwd_allreduce: 0.90 | step: 7.10 68%|██████▊ | 6845/10000 [10:46:26<4:50:30, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.026707034558057785, 'learning_rate': 9.563839593963636e-06, 'epoch': 6.84} 68%|██████▊ | 6845/10000 [10:46:26<4:50:30, 5.52s/it][2025-06-20 00:16:11,580] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:16:11,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.44 | bwd_microstep: 3330.30 | bwd_inner_microstep: 3329.45 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.99 [2025-06-20 00:16:11,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.44 | bwd: 3330.32 | bwd_inner: 3329.45 | bwd_allreduce: 0.82 | step: 6.99 68%|██████▊ | 6846/10000 [10:46:32<4:49:51, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.008181643672287464, 'learning_rate': 9.558314417114896e-06, 'epoch': 6.85} 68%|██████▊ | 6846/10000 [10:46:32<4:49:51, 5.51s/it][2025-06-20 00:16:17,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:16:17,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.80 | bwd_microstep: 3379.55 | bwd_inner_microstep: 3378.74 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.98 [2025-06-20 00:16:17,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.80 | bwd: 3379.56 | bwd_inner: 3378.74 | bwd_allreduce: 0.78 | step: 6.99 68%|██████▊ | 6847/10000 [10:46:37<4:50:25, 5.53s/it] {'loss': 0.0008, 'grad_norm': 0.10060396045446396, 'learning_rate': 9.552790335550503e-06, 'epoch': 6.85} 68%|██████▊ | 6847/10000 [10:46:37<4:50:25, 5.53s/it][2025-06-20 00:16:22,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:16:22,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.68 | bwd_microstep: 3329.18 | bwd_inner_microstep: 3328.35 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.89 [2025-06-20 00:16:22,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.68 | bwd: 3329.20 | bwd_inner: 3328.35 | bwd_allreduce: 0.80 | step: 6.89 68%|██████▊ | 6848/10000 [10:46:43<4:49:37, 
5.51s/it] {'loss': 0.0, 'grad_norm': 0.006609903182834387, 'learning_rate': 9.54726734984992e-06, 'epoch': 6.85} 68%|██████▊ | 6848/10000 [10:46:43<4:49:37, 5.51s/it][2025-06-20 00:16:28,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:16:28,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.61 | bwd_microstep: 3327.84 | bwd_inner_microstep: 3327.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 00:16:28,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.61 | bwd: 3327.86 | bwd_inner: 3327.06 | bwd_allreduce: 0.75 | step: 6.56 68%|██████▊ | 6849/10000 [10:46:48<4:48:55, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.011261950246989727, 'learning_rate': 9.541745460592478e-06, 'epoch': 6.85} 68%|██████▊ | 6849/10000 [10:46:48<4:48:55, 5.50s/it][2025-06-20 00:16:33,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:16:33,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.14 | bwd_microstep: 3375.30 | bwd_inner_microstep: 3374.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-20 00:16:33,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.14 | bwd: 3375.31 | bwd_inner: 3374.51 | bwd_allreduce: 0.76 | step: 6.69 68%|██████▊ | 6850/10000 [10:46:54<4:49:26, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.006979774218052626, 'learning_rate': 9.5362246683574e-06, 'epoch': 6.85} 68%|██████▊ | 6850/10000 [10:46:54<4:49:26, 5.51s/it][2025-06-20 00:16:39,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:16:39,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.19 | bwd_microstep: 3379.16 | bwd_inner_microstep: 3378.30 | bwd_allreduce_microstep: 
0.81 | step_microstep: 7.05 [2025-06-20 00:16:39,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.19 | bwd: 3379.17 | bwd_inner: 3378.30 | bwd_allreduce: 0.83 | step: 7.06 69%|██████▊ | 6851/10000 [10:46:59<4:49:52, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.04686760529875755, 'learning_rate': 9.530704973723795e-06, 'epoch': 6.85} 69%|██████▊ | 6851/10000 [10:46:59<4:49:52, 5.52s/it][2025-06-20 00:16:44,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:16:44,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.44 | bwd_microstep: 3378.18 | bwd_inner_microstep: 3377.36 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.96 [2025-06-20 00:16:44,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.44 | bwd: 3378.19 | bwd_inner: 3377.36 | bwd_allreduce: 0.79 | step: 6.97 69%|██████▊ | 6852/10000 [10:47:05<4:50:14, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.004059839062392712, 'learning_rate': 9.525186377270643e-06, 'epoch': 6.85} 69%|██████▊ | 6852/10000 [10:47:05<4:50:14, 5.53s/it][2025-06-20 00:16:50,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:16:50,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.65 | bwd_microstep: 3327.92 | bwd_inner_microstep: 3327.00 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.21 [2025-06-20 00:16:50,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.65 | bwd: 3327.93 | bwd_inner: 3327.00 | bwd_allreduce: 0.89 | step: 7.21 69%|██████▊ | 6853/10000 [10:47:11<4:49:26, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.0523650199174881, 'learning_rate': 9.519668879576824e-06, 'epoch': 6.85} 69%|██████▊ | 6853/10000 [10:47:11<4:49:26, 5.52s/it][2025-06-20 00:16:55,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:16:55,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.10 | bwd_microstep: 3320.53 | bwd_inner_microstep: 3319.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-20 00:16:55,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.10 | bwd: 3320.54 | bwd_inner: 3319.74 | bwd_allreduce: 0.76 | step: 6.68 69%|██████▊ | 6854/10000 [10:47:16<4:48:38, 5.51s/it] {'loss': 0.002, 'grad_norm': 0.7522677779197693, 'learning_rate': 9.5141524812211e-06, 'epoch': 6.85} 69%|██████▊ | 6854/10000 [10:47:16<4:48:38, 5.51s/it][2025-06-20 00:17:01,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:17:01,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.90 | bwd_microstep: 3375.23 | bwd_inner_microstep: 3374.36 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.37 [2025-06-20 00:17:01,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.90 | bwd: 3375.25 | bwd_inner: 3374.36 | bwd_allreduce: 0.83 | step: 7.37 69%|██████▊ | 6855/10000 [10:47:22<4:49:13, 5.52s/it] {'loss': 0.0246, 'grad_norm': 10.332379341125488, 'learning_rate': 9.508637182782114e-06, 'epoch': 6.86} 69%|██████▊ | 6855/10000 [10:47:22<4:49:13, 5.52s/it][2025-06-20 00:17:06,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:17:06,723] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.72 | bwd_microstep: 3326.74 | bwd_inner_microstep: 3325.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-20 00:17:06,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.72 | bwd: 3326.75 | bwd_inner: 3325.96 | bwd_allreduce: 0.76 | step: 6.82 69%|██████▊ | 6856/10000 [10:47:27<4:48:34, 5.51s/it] {'loss': 0.0001, 
'grad_norm': 0.012274246662855148, 'learning_rate': 9.503122984838405e-06, 'epoch': 6.86} 69%|██████▊ | 6856/10000 [10:47:27<4:48:34, 5.51s/it][2025-06-20 00:17:12,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:17:12,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.55 | bwd_microstep: 3324.03 | bwd_inner_microstep: 3323.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.22 [2025-06-20 00:17:12,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.55 | bwd: 3324.04 | bwd_inner: 3323.24 | bwd_allreduce: 0.76 | step: 7.22 69%|██████▊ | 6857/10000 [10:47:32<4:47:54, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.07704237103462219, 'learning_rate': 9.497609887968366e-06, 'epoch': 6.86} 69%|██████▊ | 6857/10000 [10:47:32<4:47:54, 5.50s/it][2025-06-20 00:17:17,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:17:17,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.56 | bwd_microstep: 3320.03 | bwd_inner_microstep: 3319.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-20 00:17:17,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.56 | bwd: 3320.04 | bwd_inner: 3319.22 | bwd_allreduce: 0.77 | step: 6.80 69%|██████▊ | 6858/10000 [10:47:38<4:47:23, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.11542046815156937, 'learning_rate': 9.4920978927503e-06, 'epoch': 6.86} 69%|██████▊ | 6858/10000 [10:47:38<4:47:23, 5.49s/it][2025-06-20 00:17:23,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:17:23,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.68 | bwd_microstep: 3316.68 | bwd_inner_microstep: 3315.89 | bwd_allreduce_microstep: 0.74 | 
step_microstep: 6.62 [2025-06-20 00:17:23,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.68 | bwd: 3316.70 | bwd_inner: 3315.89 | bwd_allreduce: 0.76 | step: 6.62 69%|██████▊ | 6859/10000 [10:47:43<4:46:50, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.1049719899892807, 'learning_rate': 9.486586999762407e-06, 'epoch': 6.86} 69%|██████▊ | 6859/10000 [10:47:43<4:46:50, 5.48s/it][2025-06-20 00:17:28,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:17:28,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.02 | bwd_microstep: 3396.93 | bwd_inner_microstep: 3396.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-20 00:17:28,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.02 | bwd: 3396.95 | bwd_inner: 3396.13 | bwd_allreduce: 0.77 | step: 6.89 69%|██████▊ | 6860/10000 [10:47:49<4:48:12, 5.51s/it] {'loss': 0.0078, 'grad_norm': 1.5814337730407715, 'learning_rate': 9.481077209582741e-06, 'epoch': 6.86} 69%|██████▊ | 6860/10000 [10:47:49<4:48:12, 5.51s/it][2025-06-20 00:17:34,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:17:34,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.51 | bwd_microstep: 3324.61 | bwd_inner_microstep: 3323.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-20 00:17:34,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.51 | bwd: 3324.63 | bwd_inner: 3323.82 | bwd_allreduce: 0.76 | step: 6.98 69%|██████▊ | 6861/10000 [10:47:54<4:47:29, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.013089793734252453, 'learning_rate': 9.475568522789264e-06, 'epoch': 6.86} 69%|██████▊ | 6861/10000 [10:47:54<4:47:29, 5.50s/it][2025-06-20 00:17:39,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:17:39,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.22 | bwd_microstep: 3375.44 | bwd_inner_microstep: 3374.63 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-20 00:17:39,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.23 | bwd: 3375.45 | bwd_inner: 3374.63 | bwd_allreduce: 0.78 | step: 6.97 69%|██████▊ | 6862/10000 [10:48:00<4:48:07, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.08881830424070358, 'learning_rate': 9.470060939959795e-06, 'epoch': 6.86} 69%|██████▊ | 6862/10000 [10:48:00<4:48:07, 5.51s/it][2025-06-20 00:17:45,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:17:45,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.73 | bwd_microstep: 3370.47 | bwd_inner_microstep: 3369.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-20 00:17:45,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.73 | bwd: 3370.48 | bwd_inner: 3369.67 | bwd_allreduce: 0.76 | step: 6.89 69%|██████▊ | 6863/10000 [10:48:06<4:48:31, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.010742145590484142, 'learning_rate': 9.464554461672064e-06, 'epoch': 6.86} 69%|██████▊ | 6863/10000 [10:48:06<4:48:31, 5.52s/it][2025-06-20 00:17:50,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:17:50,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.14 | bwd_microstep: 3323.38 | bwd_inner_microstep: 3322.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 00:17:50,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.14 | bwd: 3323.39 | bwd_inner: 3322.60 | bwd_allreduce: 0.75 | step: 6.68 69%|██████▊ | 6864/10000 [10:48:11<4:47:38, 5.50s/it] {'loss': 0.0002, 
'grad_norm': 0.027485329657793045, 'learning_rate': 9.459049088503673e-06, 'epoch': 6.86} 69%|██████▊ | 6864/10000 [10:48:11<4:47:38, 5.50s/it][2025-06-20 00:17:56,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 00:17:56,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.31 | bwd_microstep: 3318.91 | bwd_inner_microstep: 3318.14 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.55 [2025-06-20 00:17:56,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.31 | bwd: 3318.93 | bwd_inner: 3318.14 | bwd_allreduce: 0.75 | step: 6.55 69%|██████▊ | 6865/10000 [10:48:16<4:46:50, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0032114190980792046, 'learning_rate': 9.45354482103211e-06, 'epoch': 6.87} 69%|██████▊ | 6865/10000 [10:48:16<4:46:50, 5.49s/it][2025-06-20 00:18:01,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:18:01,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.46 | bwd_microstep: 3318.48 | bwd_inner_microstep: 3317.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 00:18:01,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.46 | bwd: 3318.49 | bwd_inner: 3317.69 | bwd_allreduce: 0.76 | step: 6.70 69%|██████▊ | 6866/10000 [10:48:22<4:46:14, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0007728943019174039, 'learning_rate': 9.448041659834757e-06, 'epoch': 6.87} 69%|██████▊ | 6866/10000 [10:48:22<4:46:14, 5.48s/it][2025-06-20 00:18:07,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:18:07,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.75 | bwd_microstep: 3373.11 | bwd_inner_microstep: 3372.32 | bwd_allreduce_microstep: 0.75 | 
step_microstep: 6.71 [2025-06-20 00:18:07,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.75 | bwd: 3373.12 | bwd_inner: 3372.32 | bwd_allreduce: 0.76 | step: 6.71 69%|██████▊ | 6867/10000 [10:48:27<4:47:08, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00033903997973538935, 'learning_rate': 9.44253960548885e-06, 'epoch': 6.87} 69%|██████▊ | 6867/10000 [10:48:27<4:47:08, 5.50s/it][2025-06-20 00:18:12,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:18:12,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.01 | bwd_microstep: 3372.69 | bwd_inner_microstep: 3371.84 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.21 [2025-06-20 00:18:12,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.01 | bwd: 3372.71 | bwd_inner: 3371.84 | bwd_allreduce: 0.82 | step: 7.21 69%|██████▊ | 6868/10000 [10:48:33<4:47:40, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0035214771050959826, 'learning_rate': 9.437038658571542e-06, 'epoch': 6.87} 69%|██████▊ | 6868/10000 [10:48:33<4:47:40, 5.51s/it][2025-06-20 00:18:18,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:18:18,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.72 | bwd_microstep: 3312.91 | bwd_inner_microstep: 3311.93 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.47 [2025-06-20 00:18:18,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.72 | bwd: 3312.93 | bwd_inner: 3311.93 | bwd_allreduce: 0.95 | step: 7.47 69%|██████▊ | 6869/10000 [10:48:38<4:46:45, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.014914319850504398, 'learning_rate': 9.431538819659853e-06, 'epoch': 6.87} 69%|██████▊ | 6869/10000 [10:48:38<4:46:45, 5.50s/it][2025-06-20 00:18:23,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:18:23,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.56 | bwd_microstep: 3382.50 | bwd_inner_microstep: 3381.54 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.07 [2025-06-20 00:18:23,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.56 | bwd: 3382.51 | bwd_inner: 3381.54 | bwd_allreduce: 0.93 | step: 7.07 69%|██████▊ | 6870/10000 [10:48:44<4:47:40, 5.51s/it] {'loss': 0.0173, 'grad_norm': 4.110080242156982, 'learning_rate': 9.426040089330695e-06, 'epoch': 6.87} 69%|██████▊ | 6870/10000 [10:48:44<4:47:40, 5.51s/it][2025-06-20 00:18:29,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:18:29,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.36 | bwd_microstep: 3322.06 | bwd_inner_microstep: 3321.06 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.43 [2025-06-20 00:18:29,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.36 | bwd: 3322.08 | bwd_inner: 3321.06 | bwd_allreduce: 0.97 | step: 7.43 69%|██████▊ | 6871/10000 [10:48:49<4:46:56, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.024048052728176117, 'learning_rate': 9.42054246816086e-06, 'epoch': 6.87} 69%|██████▊ | 6871/10000 [10:48:50<4:46:56, 5.50s/it][2025-06-20 00:18:34,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.88 [2025-06-20 00:18:34,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.30 | bwd_microstep: 3315.41 | bwd_inner_microstep: 3314.61 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-20 00:18:34,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.30 | bwd: 3315.43 | bwd_inner: 3314.61 | bwd_allreduce: 0.77 | step: 7.08 69%|██████▊ | 6872/10000 [10:48:55<4:46:11, 5.49s/it] {'loss': 0.0041, 
'grad_norm': 1.2767151594161987, 'learning_rate': 9.415045956727016e-06, 'epoch': 6.87} 69%|██████▊ | 6872/10000 [10:48:55<4:46:11, 5.49s/it][2025-06-20 00:18:40,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:18:40,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.31 | bwd_microstep: 3327.73 | bwd_inner_microstep: 3326.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 00:18:40,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.31 | bwd: 3327.74 | bwd_inner: 3326.94 | bwd_allreduce: 0.76 | step: 6.72 69%|██████▊ | 6873/10000 [10:49:00<4:45:47, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002857757965102792, 'learning_rate': 9.409550555605723e-06, 'epoch': 6.87} 69%|██████▊ | 6873/10000 [10:49:00<4:45:47, 5.48s/it][2025-06-20 00:18:45,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:18:45,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.42 | bwd_microstep: 3411.03 | bwd_inner_microstep: 3410.22 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.77 [2025-06-20 00:18:45,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.42 | bwd: 3411.05 | bwd_inner: 3410.22 | bwd_allreduce: 0.79 | step: 6.78 69%|██████▊ | 6874/10000 [10:49:06<4:47:18, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.01920820213854313, 'learning_rate': 9.404056265373422e-06, 'epoch': 6.87} 69%|██████▊ | 6874/10000 [10:49:06<4:47:18, 5.51s/it][2025-06-20 00:18:51,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:18:51,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.27 | bwd_microstep: 3377.71 | bwd_inner_microstep: 3376.78 | bwd_allreduce_microstep: 0.87 | 
step_microstep: 6.99 [2025-06-20 00:18:51,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.27 | bwd: 3377.72 | bwd_inner: 3376.78 | bwd_allreduce: 0.89 | step: 6.99 69%|██████▉ | 6875/10000 [10:49:12<4:47:48, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.004836085252463818, 'learning_rate': 9.398563086606444e-06, 'epoch': 6.88} 69%|██████▉ | 6875/10000 [10:49:12<4:47:48, 5.53s/it][2025-06-20 00:18:56,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:18:56,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.54 | bwd_microstep: 3331.92 | bwd_inner_microstep: 3330.93 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.38 [2025-06-20 00:18:56,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.54 | bwd: 3331.94 | bwd_inner: 3330.93 | bwd_allreduce: 0.96 | step: 7.38 69%|██████▉ | 6876/10000 [10:49:17<4:47:02, 5.51s/it] {'loss': 0.0207, 'grad_norm': 2.9868392944335938, 'learning_rate': 9.393071019880997e-06, 'epoch': 6.88} 69%|██████▉ | 6876/10000 [10:49:17<4:47:02, 5.51s/it][2025-06-20 00:19:02,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:19:02,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.21 | bwd_microstep: 3324.90 | bwd_inner_microstep: 3323.85 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.52 [2025-06-20 00:19:02,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.21 | bwd: 3324.92 | bwd_inner: 3323.85 | bwd_allreduce: 1.01 | step: 7.52 69%|██████▉ | 6877/10000 [10:49:23<4:46:22, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.06061401590704918, 'learning_rate': 9.38758006577317e-06, 'epoch': 6.88} 69%|██████▉ | 6877/10000 [10:49:23<4:46:22, 5.50s/it][2025-06-20 00:19:07,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:19:07,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.73 | bwd_microstep: 3371.21 | bwd_inner_microstep: 3370.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 00:19:07,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.73 | bwd: 3371.23 | bwd_inner: 3370.43 | bwd_allreduce: 0.76 | step: 6.67 69%|██████▉ | 6878/10000 [10:49:28<4:46:53, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.019426556304097176, 'learning_rate': 9.38209022485894e-06, 'epoch': 6.88} 69%|██████▉ | 6878/10000 [10:49:28<4:46:53, 5.51s/it][2025-06-20 00:19:13,232] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:19:13,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.63 | bwd_microstep: 3322.10 | bwd_inner_microstep: 3321.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 00:19:13,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.63 | bwd: 3322.12 | bwd_inner: 3321.32 | bwd_allreduce: 0.76 | step: 6.64 69%|██████▉ | 6879/10000 [10:49:34<4:46:00, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0018246748950332403, 'learning_rate': 9.376601497714166e-06, 'epoch': 6.88} 69%|██████▉ | 6879/10000 [10:49:34<4:46:00, 5.50s/it][2025-06-20 00:19:18,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:19:18,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.26 | bwd_microstep: 3387.16 | bwd_inner_microstep: 3386.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 00:19:18,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.26 | bwd: 3387.18 | bwd_inner: 3386.37 | bwd_allreduce: 0.76 | step: 6.71 69%|██████▉ | 6880/10000 [10:49:39<4:46:43, 5.51s/it] {'loss': 0.0027, 
'grad_norm': 0.5339252948760986, 'learning_rate': 9.371113884914591e-06, 'epoch': 6.88} 69%|██████▉ | 6880/10000 [10:49:39<4:46:43, 5.51s/it][2025-06-20 00:19:24,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:19:24,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.64 | bwd_microstep: 3324.08 | bwd_inner_microstep: 3323.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-20 00:19:24,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.64 | bwd: 3324.09 | bwd_inner: 3323.29 | bwd_allreduce: 0.76 | step: 6.67 69%|██████▉ | 6881/10000 [10:49:45<4:45:54, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.030625339597463608, 'learning_rate': 9.36562738703584e-06, 'epoch': 6.88} 69%|██████▉ | 6881/10000 [10:49:45<4:45:54, 5.50s/it][2025-06-20 00:19:29,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:19:29,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.97 | bwd_microstep: 3319.20 | bwd_inner_microstep: 3318.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 00:19:29,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.97 | bwd: 3319.22 | bwd_inner: 3318.41 | bwd_allreduce: 0.76 | step: 6.64 69%|██████▉ | 6882/10000 [10:49:50<4:45:07, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.05651507526636124, 'learning_rate': 9.360142004653428e-06, 'epoch': 6.88} 69%|██████▉ | 6882/10000 [10:49:50<4:45:07, 5.49s/it][2025-06-20 00:19:35,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.91 [2025-06-20 00:19:35,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.89 | bwd_microstep: 3367.33 | bwd_inner_microstep: 3366.53 | bwd_allreduce_microstep: 0.76 | 
step_microstep: 7.20 [2025-06-20 00:19:35,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.89 | bwd: 3367.35 | bwd_inner: 3366.53 | bwd_allreduce: 0.77 | step: 7.20 69%|██████▉ | 6883/10000 [10:49:56<4:45:42, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0022606374695897102, 'learning_rate': 9.35465773834273e-06, 'epoch': 6.88} 69%|██████▉ | 6883/10000 [10:49:56<4:45:42, 5.50s/it][2025-06-20 00:19:40,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:19:40,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.95 | bwd_microstep: 3317.77 | bwd_inner_microstep: 3316.93 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.08 [2025-06-20 00:19:40,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.95 | bwd: 3317.79 | bwd_inner: 3316.93 | bwd_allreduce: 0.80 | step: 7.08 69%|██████▉ | 6884/10000 [10:50:01<4:45:00, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.030846698209643364, 'learning_rate': 9.349174588679026e-06, 'epoch': 6.88} 69%|██████▉ | 6884/10000 [10:50:01<4:45:00, 5.49s/it][2025-06-20 00:19:46,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:19:46,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.90 | bwd_microstep: 3310.40 | bwd_inner_microstep: 3309.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 00:19:46,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.90 | bwd: 3310.41 | bwd_inner: 3309.61 | bwd_allreduce: 0.76 | step: 6.76 69%|██████▉ | 6885/10000 [10:50:06<4:44:19, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0008532831561751664, 'learning_rate': 9.34369255623748e-06, 'epoch': 6.88} 69%|██████▉ | 6885/10000 [10:50:06<4:44:19, 5.48s/it][2025-06-20 00:19:51,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:19:51,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.28 | bwd_microstep: 3307.56 | bwd_inner_microstep: 3306.78 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.55 [2025-06-20 00:19:51,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.28 | bwd: 3307.57 | bwd_inner: 3306.78 | bwd_allreduce: 0.75 | step: 6.55 69%|██████▉ | 6886/10000 [10:50:12<4:43:46, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0001907648256747052, 'learning_rate': 9.338211641593127e-06, 'epoch': 6.89} 69%|██████▉ | 6886/10000 [10:50:12<4:43:46, 5.47s/it][2025-06-20 00:19:57,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-20 00:19:57,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.73 | bwd_microstep: 3305.33 | bwd_inner_microstep: 3304.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.89 [2025-06-20 00:19:57,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.73 | bwd: 3305.34 | bwd_inner: 3304.54 | bwd_allreduce: 0.76 | step: 6.90 69%|██████▉ | 6887/10000 [10:50:17<4:43:17, 5.46s/it] {'loss': 0.0003, 'grad_norm': 0.10973343253135681, 'learning_rate': 9.332731845320895e-06, 'epoch': 6.89} 69%|██████▉ | 6887/10000 [10:50:17<4:43:17, 5.46s/it][2025-06-20 00:20:02,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:20:02,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.50 | bwd_microstep: 3385.32 | bwd_inner_microstep: 3384.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-20 00:20:02,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.50 | bwd: 3385.33 | bwd_inner: 3384.52 | bwd_allreduce: 0.77 | step: 6.85 69%|██████▉ | 6888/10000 [10:50:23<4:44:37, 5.49s/it] {'loss': 0.0001, 
'grad_norm': 0.025496963411569595, 'learning_rate': 9.327253167995578e-06, 'epoch': 6.89} 69%|██████▉ | 6888/10000 [10:50:23<4:44:37, 5.49s/it][2025-06-20 00:20:08,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:20:08,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.97 | bwd_microstep: 3315.38 | bwd_inner_microstep: 3314.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 00:20:08,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.97 | bwd: 3315.40 | bwd_inner: 3314.59 | bwd_allreduce: 0.77 | step: 6.70 69%|██████▉ | 6889/10000 [10:50:28<4:44:20, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001638062414713204, 'learning_rate': 9.32177561019187e-06, 'epoch': 6.89} 69%|██████▉ | 6889/10000 [10:50:28<4:44:20, 5.48s/it][2025-06-20 00:20:13,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:20:13,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.13 | bwd_microstep: 3314.38 | bwd_inner_microstep: 3313.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 00:20:13,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.13 | bwd: 3314.39 | bwd_inner: 3313.60 | bwd_allreduce: 0.75 | step: 6.55 69%|██████▉ | 6890/10000 [10:50:34<4:43:58, 5.48s/it] {'loss': 0.0137, 'grad_norm': 4.639645099639893, 'learning_rate': 9.31629917248434e-06, 'epoch': 6.89} 69%|██████▉ | 6890/10000 [10:50:34<4:43:58, 5.48s/it][2025-06-20 00:20:19,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-20 00:20:19,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.79 | bwd_microstep: 3360.16 | bwd_inner_microstep: 3359.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 
6.55 [2025-06-20 00:20:19,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.79 | bwd: 3360.18 | bwd_inner: 3359.38 | bwd_allreduce: 0.75 | step: 6.55 69%|██████▉ | 6891/10000 [10:50:39<4:44:34, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.03996957838535309, 'learning_rate': 9.310823855447444e-06, 'epoch': 6.89} 69%|██████▉ | 6891/10000 [10:50:39<4:44:34, 5.49s/it][2025-06-20 00:20:24,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:20:24,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.28 | bwd_microstep: 3365.90 | bwd_inner_microstep: 3365.09 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-20 00:20:24,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.28 | bwd: 3365.91 | bwd_inner: 3365.09 | bwd_allreduce: 0.77 | step: 7.07 69%|██████▉ | 6892/10000 [10:50:45<4:45:11, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00025220905081368983, 'learning_rate': 9.305349659655523e-06, 'epoch': 6.89} 69%|██████▉ | 6892/10000 [10:50:45<4:45:11, 5.51s/it][2025-06-20 00:20:30,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:20:30,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.20 | bwd_microstep: 3309.83 | bwd_inner_microstep: 3309.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 00:20:30,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.20 | bwd: 3309.85 | bwd_inner: 3309.05 | bwd_allreduce: 0.76 | step: 6.69 69%|██████▉ | 6893/10000 [10:50:50<4:44:11, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0021940418519079685, 'learning_rate': 9.299876585682773e-06, 'epoch': 6.89} 69%|██████▉ | 6893/10000 [10:50:50<4:44:11, 5.49s/it][2025-06-20 00:20:35,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:20:35,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.96 | bwd_microstep: 3305.38 | bwd_inner_microstep: 3304.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 00:20:35,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.96 | bwd: 3305.39 | bwd_inner: 3304.59 | bwd_allreduce: 0.76 | step: 6.72 69%|██████▉ | 6894/10000 [10:50:56<4:43:39, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.05230218917131424, 'learning_rate': 9.294404634103313e-06, 'epoch': 6.89} 69%|██████▉ | 6894/10000 [10:50:56<4:43:39, 5.48s/it][2025-06-20 00:20:40,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:20:40,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.35 | bwd_microstep: 3321.44 | bwd_inner_microstep: 3320.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 00:20:40,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.35 | bwd: 3321.45 | bwd_inner: 3320.66 | bwd_allreduce: 0.75 | step: 6.66 69%|██████▉ | 6895/10000 [10:51:01<4:43:27, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.044306956231594086, 'learning_rate': 9.288933805491123e-06, 'epoch': 6.89} 69%|██████▉ | 6895/10000 [10:51:01<4:43:27, 5.48s/it][2025-06-20 00:20:46,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:20:46,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.72 | bwd_microstep: 3321.91 | bwd_inner_microstep: 3320.85 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.08 [2025-06-20 00:20:46,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.72 | bwd: 3321.93 | bwd_inner: 3320.85 | bwd_allreduce: 1.03 | step: 7.08 69%|██████▉ | 6896/10000 [10:51:07<4:43:11, 5.47s/it] {'loss': 0.002, 
'grad_norm': 0.4360817074775696, 'learning_rate': 9.283464100420064e-06, 'epoch': 6.9} 69%|██████▉ | 6896/10000 [10:51:07<4:43:11, 5.47s/it][2025-06-20 00:20:51,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:20:51,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.25 | bwd_microstep: 3313.71 | bwd_inner_microstep: 3312.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 00:20:51,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.25 | bwd: 3313.72 | bwd_inner: 3312.92 | bwd_allreduce: 0.76 | step: 6.60 69%|██████▉ | 6897/10000 [10:51:12<4:42:52, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004172802437096834, 'learning_rate': 9.277995519463891e-06, 'epoch': 6.9} 69%|██████▉ | 6897/10000 [10:51:12<4:42:52, 5.47s/it][2025-06-20 00:20:57,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:20:57,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.58 | bwd_microstep: 3323.60 | bwd_inner_microstep: 3322.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 00:20:57,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.58 | bwd: 3323.61 | bwd_inner: 3322.81 | bwd_allreduce: 0.76 | step: 6.61 69%|██████▉ | 6898/10000 [10:51:18<4:42:45, 5.47s/it] {'loss': 0.0008, 'grad_norm': 0.13311724364757538, 'learning_rate': 9.27252806319622e-06, 'epoch': 6.9} 69%|██████▉ | 6898/10000 [10:51:18<4:42:45, 5.47s/it][2025-06-20 00:21:02,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:21:02,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.58 | bwd_microstep: 3313.89 | bwd_inner_microstep: 3312.97 | bwd_allreduce_microstep: 0.87 | step_microstep: 
6.92 [2025-06-20 00:21:02,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.58 | bwd: 3313.90 | bwd_inner: 3312.97 | bwd_allreduce: 0.88 | step: 6.92 69%|██████▉ | 6899/10000 [10:51:23<4:42:32, 5.47s/it] {'loss': 0.0004, 'grad_norm': 0.12648189067840576, 'learning_rate': 9.267061732190565e-06, 'epoch': 6.9} 69%|██████▉ | 6899/10000 [10:51:23<4:42:32, 5.47s/it][2025-06-20 00:21:08,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:21:08,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.91 | bwd_microstep: 3317.80 | bwd_inner_microstep: 3317.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 00:21:08,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.91 | bwd: 3317.81 | bwd_inner: 3317.02 | bwd_allreduce: 0.76 | step: 6.65 69%|██████▉ | 6900/10000 [10:51:29<4:42:18, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0008119624108076096, 'learning_rate': 9.261596527020324e-06, 'epoch': 6.9} 69%|██████▉ | 6900/10000 [10:51:29<4:42:18, 5.46s/it][2025-06-20 00:21:13,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:21:13,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.59 | bwd_microstep: 3369.31 | bwd_inner_microstep: 3368.46 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.31 [2025-06-20 00:21:13,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.59 | bwd: 3369.33 | bwd_inner: 3368.46 | bwd_allreduce: 0.82 | step: 7.32 69%|██████▉ | 6901/10000 [10:51:34<4:43:25, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.009566927328705788, 'learning_rate': 9.256132448258767e-06, 'epoch': 6.9} 69%|██████▉ | 6901/10000 [10:51:34<4:43:25, 5.49s/it][2025-06-20 00:21:19,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-20 00:21:19,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.82 | bwd_microstep: 3315.50 | bwd_inner_microstep: 3314.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-20 00:21:19,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.82 | bwd: 3315.51 | bwd_inner: 3314.70 | bwd_allreduce: 0.77 | step: 6.77 69%|██████▉ | 6902/10000 [10:51:40<4:42:48, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.03487398847937584, 'learning_rate': 9.250669496479061e-06, 'epoch': 6.9} 69%|██████▉ | 6902/10000 [10:51:40<4:42:48, 5.48s/it][2025-06-20 00:21:24,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:21:24,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.43 | bwd_microstep: 3324.40 | bwd_inner_microstep: 3323.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-20 00:21:24,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.43 | bwd: 3324.42 | bwd_inner: 3323.61 | bwd_allreduce: 0.77 | step: 6.65 69%|██████▉ | 6903/10000 [10:51:45<4:42:31, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00017371737339999527, 'learning_rate': 9.245207672254226e-06, 'epoch': 6.9} 69%|██████▉ | 6903/10000 [10:51:45<4:42:31, 5.47s/it][2025-06-20 00:21:30,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:21:30,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.57 | bwd_microstep: 3318.55 | bwd_inner_microstep: 3317.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-20 00:21:30,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.57 | bwd: 3318.57 | bwd_inner: 3317.75 | bwd_allreduce: 0.77 | step: 6.91 69%|██████▉ | 6904/10000 [10:51:50<4:42:09, 5.47s/it] {'loss': 0.0051, 
'grad_norm': 1.3318501710891724, 'learning_rate': 9.239746976157194e-06, 'epoch': 6.9} 69%|██████▉ | 6904/10000 [10:51:50<4:42:09, 5.47s/it][2025-06-20 00:21:35,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:21:35,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.26 | bwd_microstep: 3368.36 | bwd_inner_microstep: 3367.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 00:21:35,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.26 | bwd: 3368.37 | bwd_inner: 3367.57 | bwd_allreduce: 0.77 | step: 6.64 69%|██████▉ | 6905/10000 [10:51:56<4:42:59, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0066913822665810585, 'learning_rate': 9.234287408760763e-06, 'epoch': 6.91} 69%|██████▉ | 6905/10000 [10:51:56<4:42:59, 5.49s/it][2025-06-20 00:21:41,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 00:21:41,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.08 | bwd_microstep: 3369.11 | bwd_inner_microstep: 3368.11 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.31 [2025-06-20 00:21:41,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.08 | bwd: 3369.12 | bwd_inner: 3368.11 | bwd_allreduce: 0.97 | step: 7.31 69%|██████▉ | 6906/10000 [10:52:02<4:43:36, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005454658064991236, 'learning_rate': 9.228828970637618e-06, 'epoch': 6.91} 69%|██████▉ | 6906/10000 [10:52:02<4:43:36, 5.50s/it][2025-06-20 00:21:46,710] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:21:46,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.85 | bwd_microstep: 3312.93 | bwd_inner_microstep: 3312.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 
6.69 [2025-06-20 00:21:46,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.85 | bwd: 3312.95 | bwd_inner: 3312.14 | bwd_allreduce: 0.77 | step: 6.69 69%|██████▉ | 6907/10000 [10:52:07<4:42:47, 5.49s/it] {'loss': 0.0078, 'grad_norm': 1.5424470901489258, 'learning_rate': 9.223371662360332e-06, 'epoch': 6.91} 69%|██████▉ | 6907/10000 [10:52:07<4:42:47, 5.49s/it][2025-06-20 00:21:52,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:21:52,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.30 | bwd_microstep: 3361.55 | bwd_inner_microstep: 3360.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 00:21:52,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.30 | bwd: 3361.56 | bwd_inner: 3360.75 | bwd_allreduce: 0.77 | step: 6.72 69%|██████▉ | 6908/10000 [10:52:13<4:43:20, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.019866012036800385, 'learning_rate': 9.217915484501334e-06, 'epoch': 6.91} 69%|██████▉ | 6908/10000 [10:52:13<4:43:20, 5.50s/it][2025-06-20 00:21:57,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:21:57,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.46 | bwd_microstep: 3392.46 | bwd_inner_microstep: 3391.61 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.90 [2025-06-20 00:21:57,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.46 | bwd: 3392.48 | bwd_inner: 3391.61 | bwd_allreduce: 0.83 | step: 6.90 69%|██████▉ | 6909/10000 [10:52:18<4:44:14, 5.52s/it] {'loss': 0.0005, 'grad_norm': 0.09634585678577423, 'learning_rate': 9.21246043763296e-06, 'epoch': 6.91} 69%|██████▉ | 6909/10000 [10:52:18<4:44:14, 5.52s/it][2025-06-20 00:22:03,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:22:03,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.11 | bwd_microstep: 3312.57 | bwd_inner_microstep: 3311.76 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-20 00:22:03,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.11 | bwd: 3312.58 | bwd_inner: 3311.76 | bwd_allreduce: 0.78 | step: 7.12 69%|██████▉ | 6910/10000 [10:52:24<4:43:07, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.012681116349995136, 'learning_rate': 9.20700652232742e-06, 'epoch': 6.91} 69%|██████▉ | 6910/10000 [10:52:24<4:43:07, 5.50s/it][2025-06-20 00:22:08,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:22:08,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.47 | bwd_microstep: 3360.48 | bwd_inner_microstep: 3359.51 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.29 [2025-06-20 00:22:08,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.47 | bwd: 3360.50 | bwd_inner: 3359.51 | bwd_allreduce: 0.94 | step: 7.30 69%|██████▉ | 6911/10000 [10:52:29<4:43:31, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.003413987113162875, 'learning_rate': 9.201553739156806e-06, 'epoch': 6.91} 69%|██████▉ | 6911/10000 [10:52:29<4:43:31, 5.51s/it][2025-06-20 00:22:14,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:22:14,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.37 | bwd_microstep: 3308.18 | bwd_inner_microstep: 3307.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 00:22:14,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.37 | bwd: 3308.20 | bwd_inner: 3307.39 | bwd_allreduce: 0.76 | step: 6.63 69%|██████▉ | 6912/10000 [10:52:35<4:42:28, 5.49s/it] {'loss': 0.0, 'grad_norm': 
0.004031851422041655, 'learning_rate': 9.196102088693085e-06, 'epoch': 6.91} 69%|██████▉ | 6912/10000 [10:52:35<4:42:28, 5.49s/it][2025-06-20 00:22:19,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:22:19,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.12 | bwd_microstep: 3362.09 | bwd_inner_microstep: 3361.24 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.98 [2025-06-20 00:22:19,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.12 | bwd: 3362.11 | bwd_inner: 3361.24 | bwd_allreduce: 0.81 | step: 6.98 69%|██████▉ | 6913/10000 [10:52:40<4:43:02, 5.50s/it] {'loss': 0.0078, 'grad_norm': 2.9535605907440186, 'learning_rate': 9.190651571508113e-06, 'epoch': 6.91} 69%|██████▉ | 6913/10000 [10:52:40<4:43:02, 5.50s/it][2025-06-20 00:22:25,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:22:25,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.95 | bwd_microstep: 3362.33 | bwd_inner_microstep: 3361.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 00:22:25,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.95 | bwd: 3362.35 | bwd_inner: 3361.54 | bwd_allreduce: 0.76 | step: 6.63 69%|██████▉ | 6914/10000 [10:52:46<4:43:18, 5.51s/it] {'loss': 0.0023, 'grad_norm': 0.47075679898262024, 'learning_rate': 9.185202188173623e-06, 'epoch': 6.91} 69%|██████▉ | 6914/10000 [10:52:46<4:43:18, 5.51s/it][2025-06-20 00:22:30,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 00:22:30,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.31 | bwd_microstep: 3315.06 | bwd_inner_microstep: 3314.19 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.53 
[2025-06-20 00:22:30,736] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.31 | bwd: 3315.08 | bwd_inner: 3314.19 | bwd_allreduce: 0.84 | step: 7.53 69%|██████▉ | 6915/10000 [10:52:51<4:42:27, 5.49s/it] {'loss': 0.0032, 'grad_norm': 0.7473167777061462, 'learning_rate': 9.179753939261231e-06, 'epoch': 6.92} 69%|██████▉ | 6915/10000 [10:52:51<4:42:27, 5.49s/it][2025-06-20 00:22:36,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:22:36,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.47 | bwd_microstep: 3368.86 | bwd_inner_microstep: 3367.91 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.31 [2025-06-20 00:22:36,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.47 | bwd: 3368.88 | bwd_inner: 3367.91 | bwd_allreduce: 0.91 | step: 7.31 69%|██████▉ | 6916/10000 [10:52:57<4:43:08, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.027394985780119896, 'learning_rate': 9.174306825342433e-06, 'epoch': 6.92} 69%|██████▉ | 6916/10000 [10:52:57<4:43:08, 5.51s/it][2025-06-20 00:22:41,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:22:41,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.96 | bwd_microstep: 3317.79 | bwd_inner_microstep: 3317.01 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 00:22:41,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.96 | bwd: 3317.81 | bwd_inner: 3317.01 | bwd_allreduce: 0.76 | step: 6.63 69%|██████▉ | 6917/10000 [10:53:02<4:42:26, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0014107608003541827, 'learning_rate': 9.168860846988612e-06, 'epoch': 6.92} 69%|██████▉ | 6917/10000 [10:53:02<4:42:26, 5.50s/it][2025-06-20 00:22:47,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:22:47,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.62 | bwd_microstep: 3320.49 | bwd_inner_microstep: 3319.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 00:22:47,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.62 | bwd: 3320.51 | bwd_inner: 3319.71 | bwd_allreduce: 0.76 | step: 6.64 69%|██████▉ | 6918/10000 [10:53:08<4:41:55, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.006981586571782827, 'learning_rate': 9.163416004771013e-06, 'epoch': 6.92} 69%|██████▉ | 6918/10000 [10:53:08<4:41:55, 5.49s/it][2025-06-20 00:22:52,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:22:52,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.15 | bwd_microstep: 3357.69 | bwd_inner_microstep: 3356.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-20 00:22:52,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.15 | bwd: 3357.71 | bwd_inner: 3356.89 | bwd_allreduce: 0.78 | step: 7.06 69%|██████▉ | 6919/10000 [10:53:13<4:42:20, 5.50s/it] {'loss': 0.0244, 'grad_norm': 3.521381139755249, 'learning_rate': 9.15797229926078e-06, 'epoch': 6.92} 69%|██████▉ | 6919/10000 [10:53:13<4:42:20, 5.50s/it][2025-06-20 00:22:58,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:22:58,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.08 | bwd_microstep: 3322.12 | bwd_inner_microstep: 3321.24 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.18 [2025-06-20 00:22:58,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.08 | bwd: 3322.14 | bwd_inner: 3321.24 | bwd_allreduce: 0.83 | step: 7.18 69%|██████▉ | 6920/10000 [10:53:19<4:41:50, 5.49s/it] {'loss': 0.0003, 
'grad_norm': 0.10247565805912018, 'learning_rate': 9.152529731028934e-06, 'epoch': 6.92} 69%|██████▉ | 6920/10000 [10:53:19<4:41:50, 5.49s/it][2025-06-20 00:23:03,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:23:03,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.70 | bwd_microstep: 3368.44 | bwd_inner_microstep: 3367.62 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.81 [2025-06-20 00:23:03,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.70 | bwd: 3368.45 | bwd_inner: 3367.62 | bwd_allreduce: 0.79 | step: 6.81 69%|██████▉ | 6921/10000 [10:53:24<4:42:29, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.03510313108563423, 'learning_rate': 9.147088300646374e-06, 'epoch': 6.92} 69%|██████▉ | 6921/10000 [10:53:24<4:42:29, 5.51s/it][2025-06-20 00:23:09,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.84 [2025-06-20 00:23:09,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.81 | bwd_microstep: 3315.05 | bwd_inner_microstep: 3314.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.94 [2025-06-20 00:23:09,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.81 | bwd: 3315.06 | bwd_inner: 3314.26 | bwd_allreduce: 0.76 | step: 6.95 69%|██████▉ | 6922/10000 [10:53:30<4:41:41, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.01628255844116211, 'learning_rate': 9.14164800868389e-06, 'epoch': 6.92} 69%|██████▉ | 6922/10000 [10:53:30<4:41:41, 5.49s/it][2025-06-20 00:23:14,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:23:14,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.87 | bwd_microstep: 3310.86 | bwd_inner_microstep: 3310.06 | bwd_allreduce_microstep: 0.75 | 
step_microstep: 6.77 [2025-06-20 00:23:14,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.87 | bwd: 3310.87 | bwd_inner: 3310.06 | bwd_allreduce: 0.77 | step: 6.78 69%|██████▉ | 6923/10000 [10:53:35<4:41:06, 5.48s/it] {'loss': 0.005, 'grad_norm': 1.9589003324508667, 'learning_rate': 9.136208855712125e-06, 'epoch': 6.92} 69%|██████▉ | 6923/10000 [10:53:35<4:41:06, 5.48s/it][2025-06-20 00:23:20,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:23:20,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.18 | bwd_microstep: 3368.92 | bwd_inner_microstep: 3368.02 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.23 [2025-06-20 00:23:20,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.18 | bwd: 3368.94 | bwd_inner: 3368.02 | bwd_allreduce: 0.88 | step: 7.24 69%|██████▉ | 6924/10000 [10:53:41<4:41:49, 5.50s/it] {'loss': 0.14, 'grad_norm': 9.923138618469238, 'learning_rate': 9.130770842301633e-06, 'epoch': 6.92} 69%|██████▉ | 6924/10000 [10:53:41<4:41:49, 5.50s/it][2025-06-20 00:23:25,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:23:25,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.37 | bwd_microstep: 3362.89 | bwd_inner_microstep: 3362.00 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.85 [2025-06-20 00:23:25,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.37 | bwd: 3362.90 | bwd_inner: 3362.00 | bwd_allreduce: 0.86 | step: 6.85 69%|██████▉ | 6925/10000 [10:53:46<4:42:14, 5.51s/it] {'loss': 0.0013, 'grad_norm': 0.346637099981308, 'learning_rate': 9.125333969022831e-06, 'epoch': 6.92} 69%|██████▉ | 6925/10000 [10:53:46<4:42:14, 5.51s/it][2025-06-20 00:23:31,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:23:31,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.36 | bwd_microstep: 3376.57 | bwd_inner_microstep: 3375.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 00:23:31,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.36 | bwd: 3376.58 | bwd_inner: 3375.78 | bwd_allreduce: 0.76 | step: 6.57 69%|██████▉ | 6926/10000 [10:53:52<4:42:41, 5.52s/it] {'loss': 0.0012, 'grad_norm': 0.21121345460414886, 'learning_rate': 9.119898236446027e-06, 'epoch': 6.93} 69%|██████▉ | 6926/10000 [10:53:52<4:42:41, 5.52s/it][2025-06-20 00:23:36,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:23:36,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.86 | bwd_microstep: 3317.96 | bwd_inner_microstep: 3317.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-20 00:23:36,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.86 | bwd: 3317.97 | bwd_inner: 3317.18 | bwd_allreduce: 0.75 | step: 6.56 69%|██████▉ | 6927/10000 [10:53:57<4:41:46, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002582743065431714, 'learning_rate': 9.114463645141409e-06, 'epoch': 6.93} 69%|██████▉ | 6927/10000 [10:53:57<4:41:46, 5.50s/it][2025-06-20 00:23:42,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:23:42,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.64 | bwd_microstep: 3377.87 | bwd_inner_microstep: 3377.02 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.95 [2025-06-20 00:23:42,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.64 | bwd: 3377.89 | bwd_inner: 3377.02 | bwd_allreduce: 0.81 | step: 6.95 69%|██████▉ | 6928/10000 [10:54:03<4:42:22, 5.52s/it] {'loss': 0.0001, 
'grad_norm': 0.008405916392803192, 'learning_rate': 9.109030195679021e-06, 'epoch': 6.93} 69%|██████▉ | 6928/10000 [10:54:03<4:42:22, 5.52s/it][2025-06-20 00:23:47,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 00:23:47,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.46 | bwd_microstep: 3330.51 | bwd_inner_microstep: 3329.55 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.66 [2025-06-20 00:23:47,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.46 | bwd: 3330.52 | bwd_inner: 3329.55 | bwd_allreduce: 0.93 | step: 7.67 69%|██████▉ | 6929/10000 [10:54:08<4:41:44, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.20396912097930908, 'learning_rate': 9.103597888628823e-06, 'epoch': 6.93} 69%|██████▉ | 6929/10000 [10:54:08<4:41:44, 5.50s/it][2025-06-20 00:23:53,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:23:53,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.99 | bwd_microstep: 3404.45 | bwd_inner_microstep: 3403.64 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.78 [2025-06-20 00:23:53,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.99 | bwd: 3404.46 | bwd_inner: 3403.64 | bwd_allreduce: 0.78 | step: 6.79 69%|██████▉ | 6930/10000 [10:54:14<4:42:59, 5.53s/it] {'loss': 0.0533, 'grad_norm': 9.014814376831055, 'learning_rate': 9.098166724560635e-06, 'epoch': 6.93} 69%|██████▉ | 6930/10000 [10:54:14<4:42:59, 5.53s/it][2025-06-20 00:23:58,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:23:58,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.08 | bwd_microstep: 3321.45 | bwd_inner_microstep: 3320.51 | bwd_allreduce_microstep: 0.88 | 
step_microstep: 7.06 [2025-06-20 00:23:58,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.08 | bwd: 3321.46 | bwd_inner: 3320.51 | bwd_allreduce: 0.90 | step: 7.07 69%|██████▉ | 6931/10000 [10:54:19<4:41:59, 5.51s/it] {'loss': 0.0329, 'grad_norm': 5.371335983276367, 'learning_rate': 9.09273670404416e-06, 'epoch': 6.93} 69%|██████▉ | 6931/10000 [10:54:19<4:41:59, 5.51s/it][2025-06-20 00:24:04,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:24:04,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.23 | bwd_microstep: 3377.01 | bwd_inner_microstep: 3376.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.52 [2025-06-20 00:24:04,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.23 | bwd: 3377.03 | bwd_inner: 3376.21 | bwd_allreduce: 0.77 | step: 6.52 69%|██████▉ | 6932/10000 [10:54:25<4:42:29, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.008260366506874561, 'learning_rate': 9.087307827648992e-06, 'epoch': 6.93} 69%|██████▉ | 6932/10000 [10:54:25<4:42:29, 5.52s/it][2025-06-20 00:24:09,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:24:09,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.89 | bwd_microstep: 3377.84 | bwd_inner_microstep: 3377.03 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.66 [2025-06-20 00:24:09,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.89 | bwd: 3377.85 | bwd_inner: 3377.03 | bwd_allreduce: 0.78 | step: 6.66 69%|██████▉ | 6933/10000 [10:54:30<4:42:53, 5.53s/it] {'loss': 0.0008, 'grad_norm': 0.25516992807388306, 'learning_rate': 9.081880095944578e-06, 'epoch': 6.93} 69%|██████▉ | 6933/10000 [10:54:30<4:42:53, 5.53s/it][2025-06-20 00:24:15,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:24:15,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.96 | bwd_microstep: 3332.40 | bwd_inner_microstep: 3331.59 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.70 [2025-06-20 00:24:15,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.96 | bwd: 3332.41 | bwd_inner: 3331.59 | bwd_allreduce: 0.78 | step: 6.71 69%|██████▉ | 6934/10000 [10:54:36<4:42:06, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.023236725479364395, 'learning_rate': 9.076453509500267e-06, 'epoch': 6.93} 69%|██████▉ | 6934/10000 [10:54:36<4:42:06, 5.52s/it][2025-06-20 00:24:20,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:24:20,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.83 | bwd_microstep: 3331.77 | bwd_inner_microstep: 3330.92 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.10 [2025-06-20 00:24:20,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.83 | bwd: 3331.79 | bwd_inner: 3330.92 | bwd_allreduce: 0.80 | step: 7.11 69%|██████▉ | 6935/10000 [10:54:41<4:41:30, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.015018617734313011, 'learning_rate': 9.071028068885288e-06, 'epoch': 6.94} 69%|██████▉ | 6935/10000 [10:54:41<4:41:30, 5.51s/it][2025-06-20 00:24:26,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:24:26,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.52 | bwd_microstep: 3378.06 | bwd_inner_microstep: 3377.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 00:24:26,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.52 | bwd: 3378.07 | bwd_inner: 3377.26 | bwd_allreduce: 0.76 | step: 6.69 69%|██████▉ | 6936/10000 [10:54:47<4:42:14, 5.53s/it] {'loss': 0.0001, 
'grad_norm': 0.05200159549713135, 'learning_rate': 9.06560377466874e-06, 'epoch': 6.94} 69%|██████▉ | 6936/10000 [10:54:47<4:42:14, 5.53s/it][2025-06-20 00:24:32,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:24:32,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.64 | bwd_microstep: 3376.66 | bwd_inner_microstep: 3375.77 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.65 [2025-06-20 00:24:32,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.64 | bwd: 3376.68 | bwd_inner: 3375.77 | bwd_allreduce: 0.84 | step: 7.66 69%|██████▉ | 6937/10000 [10:54:52<4:42:32, 5.53s/it] {'loss': 0.002, 'grad_norm': 0.3522014021873474, 'learning_rate': 9.060180627419615e-06, 'epoch': 6.94} 69%|██████▉ | 6937/10000 [10:54:52<4:42:32, 5.53s/it][2025-06-20 00:24:37,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:24:37,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.72 | bwd_microstep: 3337.15 | bwd_inner_microstep: 3336.24 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.89 [2025-06-20 00:24:37,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.72 | bwd: 3337.18 | bwd_inner: 3336.24 | bwd_allreduce: 0.86 | step: 7.90 69%|██████▉ | 6938/10000 [10:54:58<4:42:23, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0012396653182804585, 'learning_rate': 9.054758627706763e-06, 'epoch': 6.94} 69%|██████▉ | 6938/10000 [10:54:58<4:42:23, 5.53s/it][2025-06-20 00:24:43,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:24:43,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.67 | bwd_microstep: 3326.22 | bwd_inner_microstep: 3325.40 | bwd_allreduce_microstep: 0.77 | step_microstep: 
7.23 [2025-06-20 00:24:43,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.67 | bwd: 3326.23 | bwd_inner: 3325.40 | bwd_allreduce: 0.79 | step: 7.23 69%|██████▉ | 6939/10000 [10:55:03<4:41:55, 5.53s/it] {'loss': 0.0008, 'grad_norm': 0.23276861011981964, 'learning_rate': 9.049337776098931e-06, 'epoch': 6.94} 69%|██████▉ | 6939/10000 [10:55:03<4:41:55, 5.53s/it][2025-06-20 00:24:48,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:24:48,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.10 | bwd_microstep: 3377.76 | bwd_inner_microstep: 3376.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-20 00:24:48,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.10 | bwd: 3377.78 | bwd_inner: 3376.96 | bwd_allreduce: 0.77 | step: 6.93 69%|██████▉ | 6940/10000 [10:55:09<4:42:06, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.001169400056824088, 'learning_rate': 9.043918073164743e-06, 'epoch': 6.94} 69%|██████▉ | 6940/10000 [10:55:09<4:42:06, 5.53s/it][2025-06-20 00:24:54,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:24:54,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.59 | bwd_microstep: 3373.62 | bwd_inner_microstep: 3372.80 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.92 [2025-06-20 00:24:54,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.59 | bwd: 3373.64 | bwd_inner: 3372.80 | bwd_allreduce: 0.79 | step: 6.92 69%|██████▉ | 6941/10000 [10:55:14<4:42:09, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.016154112294316292, 'learning_rate': 9.038499519472702e-06, 'epoch': 6.94} 69%|██████▉ | 6941/10000 [10:55:14<4:42:09, 5.53s/it][2025-06-20 00:24:59,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:24:59,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.75 | bwd_microstep: 3319.45 | bwd_inner_microstep: 3318.49 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.04 [2025-06-20 00:24:59,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.75 | bwd: 3319.47 | bwd_inner: 3318.49 | bwd_allreduce: 0.93 | step: 7.04 69%|██████▉ | 6942/10000 [10:55:20<4:41:01, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.014026890508830547, 'learning_rate': 9.033082115591185e-06, 'epoch': 6.94} 69%|██████▉ | 6942/10000 [10:55:20<4:41:01, 5.51s/it][2025-06-20 00:25:05,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:25:05,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.41 | bwd_microstep: 3330.17 | bwd_inner_microstep: 3329.36 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.08 [2025-06-20 00:25:05,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.41 | bwd: 3330.19 | bwd_inner: 3329.36 | bwd_allreduce: 0.78 | step: 7.09 69%|██████▉ | 6943/10000 [10:55:25<4:41:00, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0003035260597243905, 'learning_rate': 9.027665862088464e-06, 'epoch': 6.94} 69%|██████▉ | 6943/10000 [10:55:25<4:41:00, 5.52s/it][2025-06-20 00:25:10,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:25:10,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.90 | bwd_microstep: 3377.39 | bwd_inner_microstep: 3376.54 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.47 [2025-06-20 00:25:10,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.90 | bwd: 3377.42 | bwd_inner: 3376.54 | bwd_allreduce: 0.81 | step: 7.47 69%|██████▉ | 6944/10000 [10:55:31<4:41:29, 5.53s/it] {'loss': 0.0, 
'grad_norm': 0.0013938583433628082, 'learning_rate': 9.022250759532663e-06, 'epoch': 6.94} 69%|██████▉ | 6944/10000 [10:55:31<4:41:29, 5.53s/it][2025-06-20 00:25:16,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:25:16,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2161.79 | bwd_microstep: 3374.83 | bwd_inner_microstep: 3373.96 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.96 [2025-06-20 00:25:16,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2161.79 | bwd: 3374.85 | bwd_inner: 3373.96 | bwd_allreduce: 0.84 | step: 6.98 69%|██████▉ | 6945/10000 [10:55:37<4:42:08, 5.54s/it] {'loss': 0.0001, 'grad_norm': 0.031827833503484726, 'learning_rate': 9.016836808491806e-06, 'epoch': 6.95} 69%|██████▉ | 6945/10000 [10:55:37<4:42:08, 5.54s/it][2025-06-20 00:25:21,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:25:21,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.26 | bwd_microstep: 3344.70 | bwd_inner_microstep: 3343.78 | bwd_allreduce_microstep: 0.83 | step_microstep: 8.01 [2025-06-20 00:25:21,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.26 | bwd: 3344.73 | bwd_inner: 3343.78 | bwd_allreduce: 0.87 | step: 8.02 69%|██████▉ | 6946/10000 [10:55:42<4:41:23, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0005236832075752318, 'learning_rate': 9.011424009533795e-06, 'epoch': 6.95} 69%|██████▉ | 6946/10000 [10:55:42<4:41:23, 5.53s/it][2025-06-20 00:25:27,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:25:27,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.67 | bwd_microstep: 3335.42 | bwd_inner_microstep: 3334.52 | bwd_allreduce_microstep: 0.83 | 
step_microstep: 7.85 [2025-06-20 00:25:27,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.67 | bwd: 3335.45 | bwd_inner: 3334.52 | bwd_allreduce: 0.85 | step: 7.86 69%|██████▉ | 6947/10000 [10:55:48<4:41:21, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0021420661360025406, 'learning_rate': 9.006012363226406e-06, 'epoch': 6.95} 69%|██████▉ | 6947/10000 [10:55:48<4:41:21, 5.53s/it][2025-06-20 00:25:32,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:25:32,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2168.27 | bwd_microstep: 3381.96 | bwd_inner_microstep: 3381.05 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.53 [2025-06-20 00:25:32,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2168.27 | bwd: 3381.99 | bwd_inner: 3381.05 | bwd_allreduce: 0.87 | step: 7.53 69%|██████▉ | 6948/10000 [10:55:53<4:42:17, 5.55s/it] {'loss': 0.0, 'grad_norm': 0.0027065568137913942, 'learning_rate': 9.000601870137295e-06, 'epoch': 6.95} 69%|██████▉ | 6948/10000 [10:55:53<4:42:17, 5.55s/it][2025-06-20 00:25:38,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:25:38,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2171.51 | bwd_microstep: 3387.25 | bwd_inner_microstep: 3386.37 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.40 [2025-06-20 00:25:38,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2171.51 | bwd: 3387.28 | bwd_inner: 3386.37 | bwd_allreduce: 0.84 | step: 7.40 69%|██████▉ | 6949/10000 [10:55:59<4:43:03, 5.57s/it] {'loss': 0.0003, 'grad_norm': 0.04637465253472328, 'learning_rate': 8.995192530833998e-06, 'epoch': 6.95} 69%|██████▉ | 6949/10000 [10:55:59<4:43:03, 5.57s/it][2025-06-20 00:25:44,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:25:44,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2167.90 | bwd_microstep: 3383.21 | bwd_inner_microstep: 3382.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 00:25:44,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2167.89 | bwd: 3383.23 | bwd_inner: 3382.39 | bwd_allreduce: 0.77 | step: 6.77 70%|██████▉ | 6950/10000 [10:56:04<4:43:20, 5.57s/it] {'loss': 0.0078, 'grad_norm': 1.2753795385360718, 'learning_rate': 8.98978434588393e-06, 'epoch': 6.95} 70%|██████▉ | 6950/10000 [10:56:04<4:43:20, 5.57s/it][2025-06-20 00:25:49,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:25:49,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.03 | bwd_microstep: 3317.63 | bwd_inner_microstep: 3316.67 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.09 [2025-06-20 00:25:49,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.03 | bwd: 3317.65 | bwd_inner: 3316.67 | bwd_allreduce: 0.93 | step: 7.09 70%|██████▉ | 6951/10000 [10:56:10<4:41:53, 5.55s/it] {'loss': 0.0007, 'grad_norm': 0.16362908482551575, 'learning_rate': 8.984377315854382e-06, 'epoch': 6.95} 70%|██████▉ | 6951/10000 [10:56:10<4:41:53, 5.55s/it][2025-06-20 00:25:55,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:25:55,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.45 | bwd_microstep: 3384.16 | bwd_inner_microstep: 3383.25 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.50 [2025-06-20 00:25:55,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.45 | bwd: 3384.18 | bwd_inner: 3383.25 | bwd_allreduce: 0.87 | step: 7.50 70%|██████▉ | 6952/10000 [10:56:15<4:42:14, 5.56s/it] {'loss': 0.0, 
'grad_norm': 0.0019568302668631077, 'learning_rate': 8.978971441312531e-06, 'epoch': 6.95} 70%|██████▉ | 6952/10000 [10:56:15<4:42:14, 5.56s/it][2025-06-20 00:26:00,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:26:00,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.18 | bwd_microstep: 3335.85 | bwd_inner_microstep: 3335.03 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.35 [2025-06-20 00:26:00,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.18 | bwd: 3335.86 | bwd_inner: 3335.03 | bwd_allreduce: 0.78 | step: 7.35 70%|██████▉ | 6953/10000 [10:56:21<4:41:35, 5.54s/it] {'loss': 0.0001, 'grad_norm': 0.019895615056157112, 'learning_rate': 8.973566722825434e-06, 'epoch': 6.95} 70%|██████▉ | 6953/10000 [10:56:21<4:41:35, 5.54s/it][2025-06-20 00:26:06,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.75 [2025-06-20 00:26:06,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.67 | bwd_microstep: 3338.41 | bwd_inner_microstep: 3337.42 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.17 [2025-06-20 00:26:06,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.67 | bwd: 3338.42 | bwd_inner: 3337.42 | bwd_allreduce: 0.96 | step: 7.17 70%|██████▉ | 6954/10000 [10:56:27<4:41:23, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0051797013729810715, 'learning_rate': 8.968163160960001e-06, 'epoch': 6.95} 70%|██████▉ | 6954/10000 [10:56:27<4:41:23, 5.54s/it][2025-06-20 00:26:11,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:26:11,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.32 | bwd_microstep: 3332.51 | bwd_inner_microstep: 3331.71 | bwd_allreduce_microstep: 0.75 | 
step_microstep: 6.86 [2025-06-20 00:26:11,706] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.32 | bwd: 3332.52 | bwd_inner: 3331.71 | bwd_allreduce: 0.76 | step: 6.87 70%|██████▉ | 6955/10000 [10:56:32<4:40:32, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.004151683766394854, 'learning_rate': 8.962760756283056e-06, 'epoch': 6.96} 70%|██████▉ | 6955/10000 [10:56:32<4:40:32, 5.53s/it][2025-06-20 00:26:17,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:26:17,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.45 | bwd_microstep: 3367.02 | bwd_inner_microstep: 3366.15 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.18 [2025-06-20 00:26:17,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.45 | bwd: 3367.04 | bwd_inner: 3366.15 | bwd_allreduce: 0.82 | step: 7.19 70%|██████▉ | 6956/10000 [10:56:38<4:40:48, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.004556699655950069, 'learning_rate': 8.957359509361279e-06, 'epoch': 6.96} 70%|██████▉ | 6956/10000 [10:56:38<4:40:48, 5.54s/it][2025-06-20 00:26:22,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:26:22,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.68 | bwd_microstep: 3324.13 | bwd_inner_microstep: 3323.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-20 00:26:22,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.68 | bwd: 3324.15 | bwd_inner: 3323.34 | bwd_allreduce: 0.76 | step: 7.08 70%|██████▉ | 6957/10000 [10:56:43<4:39:58, 5.52s/it] {'loss': 0.001, 'grad_norm': 0.13331252336502075, 'learning_rate': 8.951959420761241e-06, 'epoch': 6.96} 70%|██████▉ | 6957/10000 [10:56:43<4:39:58, 5.52s/it][2025-06-20 00:26:28,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:26:28,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.80 | bwd_microstep: 3373.00 | bwd_inner_microstep: 3372.18 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.05 [2025-06-20 00:26:28,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.80 | bwd: 3373.02 | bwd_inner: 3372.18 | bwd_allreduce: 0.79 | step: 7.06 70%|██████▉ | 6958/10000 [10:56:49<4:40:18, 5.53s/it] {'loss': 0.0008, 'grad_norm': 0.3365253806114197, 'learning_rate': 8.946560491049391e-06, 'epoch': 6.96} 70%|██████▉ | 6958/10000 [10:56:49<4:40:18, 5.53s/it][2025-06-20 00:26:33,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 00:26:33,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.56 | bwd_microstep: 3326.05 | bwd_inner_microstep: 3325.11 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.91 [2025-06-20 00:26:33,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.56 | bwd: 3326.06 | bwd_inner: 3325.11 | bwd_allreduce: 0.91 | step: 7.93 70%|██████▉ | 6959/10000 [10:56:54<4:39:34, 5.52s/it] {'loss': 0.011, 'grad_norm': 2.6500818729400635, 'learning_rate': 8.941162720792039e-06, 'epoch': 6.96} 70%|██████▉ | 6959/10000 [10:56:54<4:39:34, 5.52s/it][2025-06-20 00:26:39,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:26:39,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.90 | bwd_microstep: 3325.37 | bwd_inner_microstep: 3324.58 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 00:26:39,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.90 | bwd: 3325.38 | bwd_inner: 3324.58 | bwd_allreduce: 0.76 | step: 6.68 70%|██████▉ | 6960/10000 [10:57:00<4:38:46, 5.50s/it] {'loss': 0.0, 
'grad_norm': 0.001230773632414639, 'learning_rate': 8.935766110555391e-06, 'epoch': 6.96} 70%|██████▉ | 6960/10000 [10:57:00<4:38:46, 5.50s/it][2025-06-20 00:26:44,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:26:44,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.73 | bwd_microstep: 3372.46 | bwd_inner_microstep: 3371.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 00:26:44,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.73 | bwd: 3372.48 | bwd_inner: 3371.67 | bwd_allreduce: 0.76 | step: 6.71 70%|██████▉ | 6961/10000 [10:57:05<4:39:11, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.03124365210533142, 'learning_rate': 8.930370660905526e-06, 'epoch': 6.96} 70%|██████▉ | 6961/10000 [10:57:05<4:39:11, 5.51s/it][2025-06-20 00:26:50,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:26:50,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.37 | bwd_microstep: 3331.38 | bwd_inner_microstep: 3330.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-20 00:26:50,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.37 | bwd: 3331.39 | bwd_inner: 3330.57 | bwd_allreduce: 0.77 | step: 6.96 70%|██████▉ | 6962/10000 [10:57:11<4:38:47, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.0407809317111969, 'learning_rate': 8.924976372408402e-06, 'epoch': 6.96} 70%|██████▉ | 6962/10000 [10:57:11<4:38:47, 5.51s/it][2025-06-20 00:26:55,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.73 [2025-06-20 00:26:55,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.01 | bwd_microstep: 3368.90 | bwd_inner_microstep: 3367.69 | bwd_allreduce_microstep: 1.11 | 
step_microstep: 8.91 [2025-06-20 00:26:55,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.01 | bwd: 3368.93 | bwd_inner: 3367.69 | bwd_allreduce: 1.15 | step: 8.92 70%|██████▉ | 6963/10000 [10:57:16<4:39:19, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.007552244234830141, 'learning_rate': 8.91958324562986e-06, 'epoch': 6.96} 70%|██████▉ | 6963/10000 [10:57:16<4:39:19, 5.52s/it][2025-06-20 00:27:01,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:27:01,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.82 | bwd_microstep: 3322.98 | bwd_inner_microstep: 3322.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.10 [2025-06-20 00:27:01,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.82 | bwd: 3322.99 | bwd_inner: 3322.18 | bwd_allreduce: 0.77 | step: 7.10 70%|██████▉ | 6964/10000 [10:57:22<4:38:47, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0015755374915897846, 'learning_rate': 8.914191281135608e-06, 'epoch': 6.96} 70%|██████▉ | 6964/10000 [10:57:22<4:38:47, 5.51s/it][2025-06-20 00:27:06,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:27:06,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.31 | bwd_microstep: 3323.49 | bwd_inner_microstep: 3322.57 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.83 [2025-06-20 00:27:06,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.31 | bwd: 3323.50 | bwd_inner: 3322.57 | bwd_allreduce: 0.88 | step: 6.83 70%|██████▉ | 6965/10000 [10:57:27<4:38:06, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0006638559862039983, 'learning_rate': 8.908800479491238e-06, 'epoch': 6.96} 70%|██████▉ | 6965/10000 [10:57:27<4:38:06, 5.50s/it][2025-06-20 00:27:12,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:27:12,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.60 | bwd_microstep: 3377.97 | bwd_inner_microstep: 3377.13 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.95 [2025-06-20 00:27:12,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.60 | bwd: 3377.98 | bwd_inner: 3377.13 | bwd_allreduce: 0.80 | step: 6.95 70%|██████▉ | 6966/10000 [10:57:33<4:38:48, 5.51s/it] {'loss': 0.0032, 'grad_norm': 0.7350932359695435, 'learning_rate': 8.903410841262223e-06, 'epoch': 6.97} 70%|██████▉ | 6966/10000 [10:57:33<4:38:48, 5.51s/it][2025-06-20 00:27:17,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:27:17,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.19 | bwd_microstep: 3322.22 | bwd_inner_microstep: 3321.42 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 00:27:17,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.19 | bwd: 3322.23 | bwd_inner: 3321.41 | bwd_allreduce: 0.77 | step: 6.85 70%|██████▉ | 6967/10000 [10:57:38<4:38:07, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.006755711045116186, 'learning_rate': 8.89802236701391e-06, 'epoch': 6.97} 70%|██████▉ | 6967/10000 [10:57:38<4:38:07, 5.50s/it][2025-06-20 00:27:23,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:27:23,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.33 | bwd_microstep: 3324.87 | bwd_inner_microstep: 3324.06 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-20 00:27:23,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.33 | bwd: 3324.89 | bwd_inner: 3324.06 | bwd_allreduce: 0.78 | step: 7.03 70%|██████▉ | 6968/10000 [10:57:44<4:37:35, 5.49s/it] {'loss': 0.0244, 
'grad_norm': 6.256631851196289, 'learning_rate': 8.892635057311531e-06, 'epoch': 6.97} 70%|██████▉ | 6968/10000 [10:57:44<4:37:35, 5.49s/it][2025-06-20 00:27:28,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:27:28,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.60 | bwd_microstep: 3318.02 | bwd_inner_microstep: 3317.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 00:27:28,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.60 | bwd: 3318.04 | bwd_inner: 3317.24 | bwd_allreduce: 0.76 | step: 6.63 70%|██████▉ | 6969/10000 [10:57:49<4:36:59, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.003954943735152483, 'learning_rate': 8.88724891272018e-06, 'epoch': 6.97} 70%|██████▉ | 6969/10000 [10:57:49<4:36:59, 5.48s/it][2025-06-20 00:27:34,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:27:34,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.16 | bwd_microstep: 3319.04 | bwd_inner_microstep: 3318.25 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 00:27:34,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.16 | bwd: 3319.06 | bwd_inner: 3318.25 | bwd_allreduce: 0.76 | step: 6.71 70%|██████▉ | 6970/10000 [10:57:55<4:36:31, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0031216184142977, 'learning_rate': 8.881863933804838e-06, 'epoch': 6.97} 70%|██████▉ | 6970/10000 [10:57:55<4:36:31, 5.48s/it][2025-06-20 00:27:39,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:27:39,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.01 | bwd_microstep: 3363.28 | bwd_inner_microstep: 3362.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 
[2025-06-20 00:27:39,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.01 | bwd: 3363.30 | bwd_inner: 3362.49 | bwd_allreduce: 0.76 | step: 6.66 70%|██████▉ | 6971/10000 [10:58:00<4:37:11, 5.49s/it] {'loss': 0.0016, 'grad_norm': 0.268243670463562, 'learning_rate': 8.87648012113037e-06, 'epoch': 6.97} 70%|██████▉ | 6971/10000 [10:58:00<4:37:11, 5.49s/it][2025-06-20 00:27:45,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:27:45,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.94 | bwd_microstep: 3362.98 | bwd_inner_microstep: 3362.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-20 00:27:45,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.94 | bwd: 3362.99 | bwd_inner: 3362.19 | bwd_allreduce: 0.76 | step: 6.75 70%|██████▉ | 6972/10000 [10:58:06<4:37:42, 5.50s/it] {'loss': 0.0126, 'grad_norm': 2.291672468185425, 'learning_rate': 8.871097475261512e-06, 'epoch': 6.97} 70%|██████▉ | 6972/10000 [10:58:06<4:37:42, 5.50s/it][2025-06-20 00:27:50,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:27:50,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.22 | bwd_microstep: 3316.29 | bwd_inner_microstep: 3315.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-20 00:27:50,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.22 | bwd: 3316.31 | bwd_inner: 3315.50 | bwd_allreduce: 0.76 | step: 6.67 70%|██████▉ | 6973/10000 [10:58:11<4:36:58, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0015531040262430906, 'learning_rate': 8.865715996762885e-06, 'epoch': 6.97} 70%|██████▉ | 6973/10000 [10:58:11<4:36:58, 5.49s/it][2025-06-20 00:27:56,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.50 | optimizer_step: 2.73 [2025-06-20 00:27:56,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.67 | bwd_microstep: 3317.01 | bwd_inner_microstep: 3316.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-20 00:27:56,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.67 | bwd: 3317.03 | bwd_inner: 3316.22 | bwd_allreduce: 0.76 | step: 6.67 70%|██████▉ | 6974/10000 [10:58:16<4:36:25, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.08808214962482452, 'learning_rate': 8.860335686198966e-06, 'epoch': 6.97} 70%|██████▉ | 6974/10000 [10:58:16<4:36:25, 5.48s/it][2025-06-20 00:28:01,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:28:01,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.14 | bwd_microstep: 3374.51 | bwd_inner_microstep: 3373.69 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.35 [2025-06-20 00:28:01,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.14 | bwd: 3374.53 | bwd_inner: 3373.69 | bwd_allreduce: 0.78 | step: 7.35 70%|██████▉ | 6975/10000 [10:58:22<4:37:21, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002981384051963687, 'learning_rate': 8.854956544134132e-06, 'epoch': 6.97} 70%|██████▉ | 6975/10000 [10:58:22<4:37:21, 5.50s/it][2025-06-20 00:28:07,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:28:07,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.80 | bwd_microstep: 3394.02 | bwd_inner_microstep: 3393.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 00:28:07,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.80 | bwd: 3394.04 | bwd_inner: 3393.23 | bwd_allreduce: 0.76 | step: 6.70 70%|██████▉ | 6976/10000 [10:58:28<4:38:19, 5.52s/it] {'loss': 0.0012, 'grad_norm': 
0.33121833205223083, 'learning_rate': 8.849578571132633e-06, 'epoch': 6.98} 70%|██████▉ | 6976/10000 [10:58:28<4:38:19, 5.52s/it][2025-06-20 00:28:12,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:28:12,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.34 | bwd_microstep: 3307.98 | bwd_inner_microstep: 3307.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-20 00:28:12,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.34 | bwd: 3307.99 | bwd_inner: 3307.20 | bwd_allreduce: 0.75 | step: 6.54 70%|██████▉ | 6977/10000 [10:58:33<4:37:07, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0008753775618970394, 'learning_rate': 8.844201767758592e-06, 'epoch': 6.98} 70%|██████▉ | 6977/10000 [10:58:33<4:37:07, 5.50s/it][2025-06-20 00:28:18,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:28:18,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.93 | bwd_microstep: 3321.86 | bwd_inner_microstep: 3320.93 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.08 [2025-06-20 00:28:18,212] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.93 | bwd: 3321.87 | bwd_inner: 3320.93 | bwd_allreduce: 0.90 | step: 7.08 70%|██████▉ | 6978/10000 [10:58:39<4:36:29, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0014866350684314966, 'learning_rate': 8.838826134576012e-06, 'epoch': 6.98} 70%|██████▉ | 6978/10000 [10:58:39<4:36:29, 5.49s/it][2025-06-20 00:28:23,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 00:28:23,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.07 | bwd_microstep: 3368.49 | bwd_inner_microstep: 3367.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 
[2025-06-20 00:28:23,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.07 | bwd: 3368.50 | bwd_inner: 3367.70 | bwd_allreduce: 0.76 | step: 6.56 70%|██████▉ | 6979/10000 [10:58:44<4:37:05, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.014775925315916538, 'learning_rate': 8.833451672148767e-06, 'epoch': 6.98} 70%|██████▉ | 6979/10000 [10:58:44<4:37:05, 5.50s/it][2025-06-20 00:28:29,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:28:29,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.55 | bwd_microstep: 3320.56 | bwd_inner_microstep: 3319.66 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.49 [2025-06-20 00:28:29,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.55 | bwd: 3320.57 | bwd_inner: 3319.66 | bwd_allreduce: 0.87 | step: 7.49 70%|██████▉ | 6980/10000 [10:58:50<4:36:30, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.021675432100892067, 'learning_rate': 8.828078381040617e-06, 'epoch': 6.98} 70%|██████▉ | 6980/10000 [10:58:50<4:36:30, 5.49s/it][2025-06-20 00:28:34,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:28:34,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.79 | bwd_microstep: 3325.01 | bwd_inner_microstep: 3324.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 00:28:34,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.79 | bwd: 3325.02 | bwd_inner: 3324.20 | bwd_allreduce: 0.78 | step: 6.81 70%|██████▉ | 6981/10000 [10:58:55<4:36:02, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.04776173457503319, 'learning_rate': 8.82270626181519e-06, 'epoch': 6.98} 70%|██████▉ | 6981/10000 [10:58:55<4:36:02, 5.49s/it][2025-06-20 00:28:40,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:28:40,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.39 | bwd_microstep: 3313.35 | bwd_inner_microstep: 3312.42 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.97 [2025-06-20 00:28:40,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.39 | bwd: 3313.37 | bwd_inner: 3312.42 | bwd_allreduce: 0.90 | step: 6.97 70%|██████▉ | 6982/10000 [10:59:00<4:35:32, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.014387618750333786, 'learning_rate': 8.817335315036004e-06, 'epoch': 6.98} 70%|██████▉ | 6982/10000 [10:59:00<4:35:32, 5.48s/it][2025-06-20 00:28:45,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.85 [2025-06-20 00:28:45,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.98 | bwd_microstep: 3364.78 | bwd_inner_microstep: 3364.00 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 00:28:45,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.98 | bwd: 3364.80 | bwd_inner: 3364.00 | bwd_allreduce: 0.75 | step: 6.68 70%|██████▉ | 6983/10000 [10:59:06<4:36:10, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0005314312875270844, 'learning_rate': 8.811965541266443e-06, 'epoch': 6.98} 70%|██████▉ | 6983/10000 [10:59:06<4:36:10, 5.49s/it][2025-06-20 00:28:51,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:28:51,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.65 | bwd_microstep: 3311.63 | bwd_inner_microstep: 3310.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-20 00:28:51,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.65 | bwd: 3311.65 | bwd_inner: 3310.83 | bwd_allreduce: 0.78 | step: 7.17 70%|██████▉ | 6984/10000 [10:59:11<4:35:30, 5.48s/it] {'loss': 0.0002, 
'grad_norm': 0.045340847223997116, 'learning_rate': 8.80659694106977e-06, 'epoch': 6.98} 70%|██████▉ | 6984/10000 [10:59:11<4:35:30, 5.48s/it][2025-06-20 00:28:56,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 00:28:56,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.47 | bwd_microstep: 3370.30 | bwd_inner_microstep: 3369.19 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.87 [2025-06-20 00:28:56,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.47 | bwd: 3370.32 | bwd_inner: 3369.19 | bwd_allreduce: 1.07 | step: 7.88 70%|██████▉ | 6985/10000 [10:59:17<4:36:14, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.010780548676848412, 'learning_rate': 8.80122951500913e-06, 'epoch': 6.99} 70%|██████▉ | 6985/10000 [10:59:17<4:36:14, 5.50s/it][2025-06-20 00:29:02,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:29:02,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.13 | bwd_microstep: 3305.46 | bwd_inner_microstep: 3304.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-20 00:29:02,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.13 | bwd: 3305.48 | bwd_inner: 3304.65 | bwd_allreduce: 0.78 | step: 7.00 70%|██████▉ | 6986/10000 [10:59:22<4:35:28, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0024974101688712835, 'learning_rate': 8.795863263647535e-06, 'epoch': 6.99} 70%|██████▉ | 6986/10000 [10:59:22<4:35:28, 5.48s/it][2025-06-20 00:29:07,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:29:07,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.78 | bwd_microstep: 3316.63 | bwd_inner_microstep: 3315.82 | bwd_allreduce_microstep: 0.76 | 
step_microstep: 6.86 [2025-06-20 00:29:07,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.78 | bwd: 3316.64 | bwd_inner: 3315.82 | bwd_allreduce: 0.78 | step: 6.86 70%|██████▉ | 6987/10000 [10:59:28<4:34:57, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.048185236752033234, 'learning_rate': 8.790498187547887e-06, 'epoch': 6.99} 70%|██████▉ | 6987/10000 [10:59:28<4:34:57, 5.48s/it][2025-06-20 00:29:13,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:29:13,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.06 | bwd_microstep: 3315.79 | bwd_inner_microstep: 3314.99 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-20 00:29:13,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.06 | bwd: 3315.81 | bwd_inner: 3314.99 | bwd_allreduce: 0.77 | step: 6.80 70%|██████▉ | 6988/10000 [10:59:33<4:34:33, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.12493179738521576, 'learning_rate': 8.78513428727296e-06, 'epoch': 6.99} 70%|██████▉ | 6988/10000 [10:59:33<4:34:33, 5.47s/it][2025-06-20 00:29:18,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:29:18,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.28 | bwd_microstep: 3324.15 | bwd_inner_microstep: 3323.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-20 00:29:18,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.28 | bwd: 3324.16 | bwd_inner: 3323.34 | bwd_allreduce: 0.77 | step: 6.86 70%|██████▉ | 6989/10000 [10:59:39<4:34:22, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.06765453517436981, 'learning_rate': 8.77977156338539e-06, 'epoch': 6.99} 70%|██████▉ | 6989/10000 [10:59:39<4:34:22, 5.47s/it][2025-06-20 00:29:24,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:29:24,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.79 | bwd_microstep: 3365.03 | bwd_inner_microstep: 3364.03 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.55 [2025-06-20 00:29:24,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.79 | bwd: 3365.04 | bwd_inner: 3364.03 | bwd_allreduce: 0.96 | step: 7.55 70%|██████▉ | 6990/10000 [10:59:44<4:35:12, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002412942470982671, 'learning_rate': 8.774410016447707e-06, 'epoch': 6.99} 70%|██████▉ | 6990/10000 [10:59:44<4:35:12, 5.49s/it][2025-06-20 00:29:29,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:29:29,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.95 | bwd_microstep: 3309.05 | bwd_inner_microstep: 3308.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-20 00:29:29,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.95 | bwd: 3309.06 | bwd_inner: 3308.26 | bwd_allreduce: 0.76 | step: 6.77 70%|██████▉ | 6991/10000 [10:59:50<4:34:42, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.022520514205098152, 'learning_rate': 8.769049647022315e-06, 'epoch': 6.99} 70%|██████▉ | 6991/10000 [10:59:50<4:34:42, 5.48s/it][2025-06-20 00:29:34,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:29:34,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.31 | bwd_microstep: 3309.05 | bwd_inner_microstep: 3308.23 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.06 [2025-06-20 00:29:34,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.31 | bwd: 3309.06 | bwd_inner: 3308.23 | bwd_allreduce: 0.78 | step: 7.06 70%|██████▉ | 6992/10000 [10:59:55<4:34:11, 5.47s/it] {'loss': 0.0001, 
'grad_norm': 0.0159156396985054, 'learning_rate': 8.763690455671489e-06, 'epoch': 6.99} 70%|██████▉ | 6992/10000 [10:59:55<4:34:11, 5.47s/it][2025-06-20 00:29:40,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:29:40,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.86 | bwd_microstep: 3365.60 | bwd_inner_microstep: 3364.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-20 00:29:40,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.86 | bwd: 3365.61 | bwd_inner: 3364.81 | bwd_allreduce: 0.76 | step: 6.75 70%|██████▉ | 6993/10000 [11:00:01<4:34:56, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.027587225660681725, 'learning_rate': 8.758332442957394e-06, 'epoch': 6.99} 70%|██████▉ | 6993/10000 [11:00:01<4:34:56, 5.49s/it][2025-06-20 00:29:46,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:29:46,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.30 | bwd_microstep: 3394.92 | bwd_inner_microstep: 3394.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-20 00:29:46,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.30 | bwd: 3394.93 | bwd_inner: 3394.13 | bwd_allreduce: 0.76 | step: 6.77 70%|██████▉ | 6994/10000 [11:00:06<4:36:00, 5.51s/it] {'loss': 0.0078, 'grad_norm': 2.3918654918670654, 'learning_rate': 8.752975609442044e-06, 'epoch': 6.99} 70%|██████▉ | 6994/10000 [11:00:06<4:36:00, 5.51s/it][2025-06-20 00:29:51,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 00:29:51,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.05 | bwd_microstep: 3317.91 | bwd_inner_microstep: 3317.08 | bwd_allreduce_microstep: 0.78 | 
step_microstep: 7.27 [2025-06-20 00:29:51,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.05 | bwd: 3317.93 | bwd_inner: 3317.08 | bwd_allreduce: 0.80 | step: 7.27 70%|██████▉ | 6995/10000 [11:00:12<4:35:07, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006932755466550589, 'learning_rate': 8.747619955687352e-06, 'epoch': 7.0} 70%|██████▉ | 6995/10000 [11:00:12<4:35:07, 5.49s/it][2025-06-20 00:29:56,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-20 00:29:57,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.32 | bwd_microstep: 3364.99 | bwd_inner_microstep: 3364.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 00:29:57,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.32 | bwd: 3365.00 | bwd_inner: 3364.21 | bwd_allreduce: 0.75 | step: 6.65 70%|██████▉ | 6996/10000 [11:00:17<4:35:34, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.02402251586318016, 'learning_rate': 8.742265482255105e-06, 'epoch': 7.0} 70%|██████▉ | 6996/10000 [11:00:17<4:35:34, 5.50s/it][2025-06-20 00:30:02,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:30:02,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.53 | bwd_microstep: 3393.68 | bwd_inner_microstep: 3392.79 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.90 [2025-06-20 00:30:02,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.53 | bwd: 3393.70 | bwd_inner: 3392.79 | bwd_allreduce: 0.87 | step: 6.91 70%|██████▉ | 6997/10000 [11:00:23<4:36:21, 5.52s/it] {'loss': 0.0007, 'grad_norm': 0.2018931806087494, 'learning_rate': 8.73691218970696e-06, 'epoch': 7.0} 70%|██████▉ | 6997/10000 [11:00:23<4:36:21, 5.52s/it][2025-06-20 00:30:08,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:30:08,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.39 | bwd_microstep: 3308.54 | bwd_inner_microstep: 3307.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 00:30:08,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.39 | bwd: 3308.55 | bwd_inner: 3307.73 | bwd_allreduce: 0.77 | step: 6.81 70%|██████▉ | 6998/10000 [11:00:28<4:35:19, 5.50s/it] {'loss': 0.0119, 'grad_norm': 2.619196891784668, 'learning_rate': 8.731560078604451e-06, 'epoch': 7.0} 70%|██████▉ | 6998/10000 [11:00:28<4:35:19, 5.50s/it][2025-06-20 00:30:13,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:30:13,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.24 | bwd_microstep: 3309.50 | bwd_inner_microstep: 3308.57 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.83 [2025-06-20 00:30:13,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.24 | bwd: 3309.52 | bwd_inner: 3308.57 | bwd_allreduce: 0.89 | step: 7.84 70%|██████▉ | 6999/10000 [11:00:34<4:34:32, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.08439495414495468, 'learning_rate': 8.72620914950899e-06, 'epoch': 7.0} 70%|██████▉ | 6999/10000 [11:00:34<4:34:32, 5.49s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. 
Replace train sampler!! petrel_client is not installed. Using PIL to load images. [2025-06-20 00:30:21,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:30:21,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.75 | bwd_microstep: 3347.95 | bwd_inner_microstep: 3346.88 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.71 [2025-06-20 00:30:21,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.75 | bwd: 3347.96 | bwd_inner: 3346.87 | bwd_allreduce: 1.04 | step: 7.73 70%|███████ | 7000/10000 [11:00:41<5:06:44, 6.13s/it] {'loss': 0.0, 'grad_norm': 0.00012888565834145993, 'learning_rate': 8.72085940298187e-06, 'epoch': 7.0} 70%|███████ | 7000/10000 [11:00:41<5:06:44, 6.13s/it]evaluate! [INFO|trainer.py:3910] 2025-06-20 00:30:31,060 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-20 00:30:31,064 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-20 00:30:31,064 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-20 00:31:26,746 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-20 00:31:26,751 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-20 00:31:26,758 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-20 00:31:26,759 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json evaluate! 
[INFO|trainer.py:3910] 2025-06-20 00:31:46,638 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-20 00:31:46,645 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-20 00:31:46,646 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-20 00:32:47,582 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-20 00:32:47,586 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-20 00:32:47,586 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-20 00:32:47,586 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json [2025-06-20 00:32:52,447] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 00:32:58,444] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 00:33:04,283] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto 
detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 00:33:10,248] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 00:33:28,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-20 00:33:28,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.14 | bwd_microstep: 3315.60 | bwd_inner_microstep: 3314.45 | bwd_allreduce_microstep: 1.07 | step_microstep: 9.63 [2025-06-20 00:33:28,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.09 | bwd: 3315.63 | bwd_inner: 3314.45 | bwd_allreduce: 1.11 | step: 9.64 70%|███████ | 7001/10000 [11:03:49<50:25:24, 60.53s/it] {'loss': 0.0, 'grad_norm': 0.0074532534927129745, 'learning_rate': 8.715510839584247e-06, 'epoch': 7.0} 70%|███████ | 7001/10000 [11:03:49<50:25:24, 60.53s/it][2025-06-20 00:33:34,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:33:34,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.56 | bwd_microstep: 3276.11 | bwd_inner_microstep: 3275.25 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.43 [2025-06-20 00:33:34,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.56 | bwd: 3276.14 | bwd_inner: 3275.25 | bwd_allreduce: 0.82 | 
step: 7.43 70%|███████ | 7002/10000 [11:03:54<36:38:47, 44.01s/it] {'loss': 0.0001, 'grad_norm': 0.02069706656038761, 'learning_rate': 8.710163459877166e-06, 'epoch': 7.0} 70%|███████ | 7002/10000 [11:03:54<36:38:47, 44.01s/it][2025-06-20 00:33:39,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:33:39,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.57 | bwd_microstep: 3289.10 | bwd_inner_microstep: 3288.28 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.15 [2025-06-20 00:33:39,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.57 | bwd: 3289.11 | bwd_inner: 3288.28 | bwd_allreduce: 0.79 | step: 7.16 70%|███████ | 7003/10000 [11:04:00<27:00:29, 32.44s/it] {'loss': 0.0244, 'grad_norm': 4.336240291595459, 'learning_rate': 8.70481726442154e-06, 'epoch': 7.0} 70%|███████ | 7003/10000 [11:04:00<27:00:29, 32.44s/it][2025-06-20 00:33:45,010] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 00:33:45,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.13 | bwd_microstep: 3347.94 | bwd_inner_microstep: 3346.94 | bwd_allreduce_microstep: 0.91 | step_microstep: 8.30 [2025-06-20 00:33:45,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.13 | bwd: 3347.97 | bwd_inner: 3346.94 | bwd_allreduce: 0.95 | step: 8.31 70%|███████ | 7004/10000 [11:04:05<20:16:55, 24.37s/it] {'loss': 0.0002, 'grad_norm': 0.0398373082280159, 'learning_rate': 8.699472253778168e-06, 'epoch': 7.0} 70%|███████ | 7004/10000 [11:04:05<20:16:55, 24.37s/it][2025-06-20 00:33:50,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-20 00:33:50,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.12 | 
bwd_microstep: 3300.56 | bwd_inner_microstep: 3299.49 | bwd_allreduce_microstep: 0.96 | step_microstep: 9.04 [2025-06-20 00:33:50,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.12 | bwd: 3300.58 | bwd_inner: 3299.49 | bwd_allreduce: 1.02 | step: 9.05 70%|███████ | 7005/10000 [11:04:11<15:33:44, 18.71s/it] {'loss': 0.0036, 'grad_norm': 0.815729558467865, 'learning_rate': 8.6941284285077e-06, 'epoch': 7.0} 70%|███████ | 7005/10000 [11:04:11<15:33:44, 18.71s/it][2025-06-20 00:33:55,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:33:55,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.68 | bwd_microstep: 3301.87 | bwd_inner_microstep: 3301.06 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-20 00:33:55,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.68 | bwd: 3301.88 | bwd_inner: 3301.06 | bwd_allreduce: 0.78 | step: 6.86 70%|███████ | 7006/10000 [11:04:16<12:14:58, 14.73s/it] {'loss': 0.0003, 'grad_norm': 0.0959167554974556, 'learning_rate': 8.688785789170688e-06, 'epoch': 7.01} 70%|███████ | 7006/10000 [11:04:16<12:14:58, 14.73s/it][2025-06-20 00:34:01,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:34:01,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.27 | bwd_microstep: 3346.09 | bwd_inner_microstep: 3345.27 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.10 [2025-06-20 00:34:01,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.27 | bwd: 3346.11 | bwd_inner: 3345.27 | bwd_allreduce: 0.79 | step: 7.10 70%|███████ | 7007/10000 [11:04:22<9:56:44, 11.96s/it] {'loss': 0.0, 'grad_norm': 0.001276076422072947, 'learning_rate': 8.683444336327552e-06, 'epoch': 7.01} 70%|███████ | 7007/10000 [11:04:22<9:56:44, 11.96s/it][2025-06-20 
00:34:06,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:34:06,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.37 | bwd_microstep: 3335.98 | bwd_inner_microstep: 3335.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-20 00:34:06,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.37 | bwd: 3335.99 | bwd_inner: 3335.18 | bwd_allreduce: 0.77 | step: 6.96 70%|███████ | 7008/10000 [11:04:27<8:19:49, 10.02s/it] {'loss': 0.0011, 'grad_norm': 0.16536487638950348, 'learning_rate': 8.678104070538581e-06, 'epoch': 7.01} 70%|███████ | 7008/10000 [11:04:27<8:19:49, 10.02s/it][2025-06-20 00:34:12,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-20 00:34:12,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.00 | bwd_microstep: 3290.91 | bwd_inner_microstep: 3289.97 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.69 [2025-06-20 00:34:12,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.00 | bwd: 3290.94 | bwd_inner: 3289.97 | bwd_allreduce: 0.89 | step: 7.68 70%|███████ | 7009/10000 [11:04:33<7:10:54, 8.64s/it] {'loss': 0.0002, 'grad_norm': 0.0507112517952919, 'learning_rate': 8.672764992363956e-06, 'epoch': 7.01} 70%|███████ | 7009/10000 [11:04:33<7:10:54, 8.64s/it][2025-06-20 00:34:17,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:34:17,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.59 | bwd_microstep: 3290.14 | bwd_inner_microstep: 3289.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 00:34:17,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.59 | bwd: 3290.16 | bwd_inner: 3289.35 | 
bwd_allreduce: 0.76 | step: 6.82 70%|███████ | 7010/10000 [11:04:38<6:22:44, 7.68s/it] {'loss': 0.0001, 'grad_norm': 0.037925321608781815, 'learning_rate': 8.667427102363703e-06, 'epoch': 7.01} 70%|███████ | 7010/10000 [11:04:38<6:22:44, 7.68s/it][2025-06-20 00:34:23,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:34:23,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.93 | bwd_microstep: 3291.70 | bwd_inner_microstep: 3290.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-20 00:34:23,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.93 | bwd: 3291.71 | bwd_inner: 3290.89 | bwd_allreduce: 0.78 | step: 7.15 70%|███████ | 7011/10000 [11:04:44<5:49:09, 7.01s/it] {'loss': 0.0328, 'grad_norm': 8.657411575317383, 'learning_rate': 8.662090401097752e-06, 'epoch': 7.01} 70%|███████ | 7011/10000 [11:04:44<5:49:09, 7.01s/it][2025-06-20 00:34:28,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:34:28,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2090.72 | bwd_microstep: 3302.32 | bwd_inner_microstep: 3301.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-20 00:34:28,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.72 | bwd: 3302.33 | bwd_inner: 3301.51 | bwd_allreduce: 0.78 | step: 6.87 70%|███████ | 7012/10000 [11:04:49<5:25:28, 6.54s/it] {'loss': 0.0001, 'grad_norm': 0.01300783734768629, 'learning_rate': 8.656754889125896e-06, 'epoch': 7.01} 70%|███████ | 7012/10000 [11:04:49<5:25:28, 6.54s/it][2025-06-20 00:34:34,126] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:34:34,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2090.06 | bwd_microstep: 3307.00 | bwd_inner_microstep: 3306.18 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.28 [2025-06-20 00:34:34,127] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.06 | bwd: 3307.01 | bwd_inner: 3306.18 | bwd_allreduce: 0.79 | step: 7.28 70%|███████ | 7013/10000 [11:04:54<5:08:56, 6.21s/it] {'loss': 0.0053, 'grad_norm': 2.157119035720825, 'learning_rate': 8.651420567007807e-06, 'epoch': 7.01} 70%|███████ | 7013/10000 [11:04:54<5:08:56, 6.21s/it][2025-06-20 00:34:39,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:34:39,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.18 | bwd_microstep: 3371.17 | bwd_inner_microstep: 3370.11 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.53 [2025-06-20 00:34:39,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.18 | bwd: 3371.19 | bwd_inner: 3370.11 | bwd_allreduce: 1.03 | step: 7.53 70%|███████ | 7014/10000 [11:05:00<4:58:49, 6.00s/it] {'loss': 0.0001, 'grad_norm': 0.019500114023685455, 'learning_rate': 8.646087435303037e-06, 'epoch': 7.01} 70%|███████ | 7014/10000 [11:05:00<4:58:49, 6.00s/it][2025-06-20 00:34:45,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:34:45,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.91 | bwd_microstep: 3313.66 | bwd_inner_microstep: 3312.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.33 [2025-06-20 00:34:45,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.91 | bwd: 3313.67 | bwd_inner: 3312.86 | bwd_allreduce: 0.77 | step: 7.33 70%|███████ | 7015/10000 [11:05:05<4:50:44, 5.84s/it] {'loss': 0.0174, 'grad_norm': 7.7938618659973145, 'learning_rate': 8.640755494570993e-06, 'epoch': 7.01} 70%|███████ | 7015/10000 [11:05:05<4:50:44, 
5.84s/it][2025-06-20 00:34:50,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:34:50,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.06 | bwd_microstep: 3316.32 | bwd_inner_microstep: 3315.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.30 [2025-06-20 00:34:50,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.06 | bwd: 3316.34 | bwd_inner: 3315.52 | bwd_allreduce: 0.78 | step: 7.31 70%|███████ | 7016/10000 [11:05:11<4:44:50, 5.73s/it] {'loss': 0.0, 'grad_norm': 0.003733376506716013, 'learning_rate': 8.63542474537098e-06, 'epoch': 7.02} 70%|███████ | 7016/10000 [11:05:11<4:44:50, 5.73s/it][2025-06-20 00:34:56,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:34:56,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.28 | bwd_microstep: 3314.76 | bwd_inner_microstep: 3313.95 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.04 [2025-06-20 00:34:56,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.28 | bwd: 3314.78 | bwd_inner: 3313.95 | bwd_allreduce: 0.79 | step: 7.04 70%|███████ | 7017/10000 [11:05:16<4:40:38, 5.64s/it] {'loss': 0.0001, 'grad_norm': 0.024788055568933487, 'learning_rate': 8.630095188262161e-06, 'epoch': 7.02} 70%|███████ | 7017/10000 [11:05:16<4:40:38, 5.64s/it][2025-06-20 00:35:01,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-20 00:35:01,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.48 | bwd_microstep: 3336.07 | bwd_inner_microstep: 3335.00 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.98 [2025-06-20 00:35:01,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.48 | bwd: 3336.10 | bwd_inner: 
3335.00 | bwd_allreduce: 1.02 | step: 8.99 70%|███████ | 7018/10000 [11:05:22<4:38:49, 5.61s/it] {'loss': 0.0001, 'grad_norm': 0.00970323383808136, 'learning_rate': 8.624766823803592e-06, 'epoch': 7.02} 70%|███████ | 7018/10000 [11:05:22<4:38:49, 5.61s/it][2025-06-20 00:35:07,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 00:35:07,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2176.13 | bwd_microstep: 3375.29 | bwd_inner_microstep: 3374.40 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.53 [2025-06-20 00:35:07,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2176.13 | bwd: 3375.32 | bwd_inner: 3374.40 | bwd_allreduce: 0.85 | step: 7.53 70%|███████ | 7019/10000 [11:05:27<4:38:46, 5.61s/it] {'loss': 0.0, 'grad_norm': 0.003297766437754035, 'learning_rate': 8.619439652554189e-06, 'epoch': 7.02} 70%|███████ | 7019/10000 [11:05:27<4:38:46, 5.61s/it][2025-06-20 00:35:12,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:35:12,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2169.80 | bwd_microstep: 3390.85 | bwd_inner_microstep: 3389.98 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.90 [2025-06-20 00:35:12,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2169.80 | bwd: 3390.87 | bwd_inner: 3389.98 | bwd_allreduce: 0.82 | step: 7.90 70%|███████ | 7020/10000 [11:05:33<4:38:36, 5.61s/it] {'loss': 0.0, 'grad_norm': 0.0004927774425595999, 'learning_rate': 8.614113675072747e-06, 'epoch': 7.02} 70%|███████ | 7020/10000 [11:05:33<4:38:36, 5.61s/it][2025-06-20 00:35:18,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:35:18,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2133.15 | bwd_microstep: 3347.84 | bwd_inner_microstep: 3346.96 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.72 [2025-06-20 00:35:18,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.15 | bwd: 3347.86 | bwd_inner: 3346.96 | bwd_allreduce: 0.84 | step: 7.73 70%|███████ | 7021/10000 [11:05:39<4:37:16, 5.58s/it] {'loss': 0.0005, 'grad_norm': 0.19297878444194794, 'learning_rate': 8.608788891917935e-06, 'epoch': 7.02} 70%|███████ | 7021/10000 [11:05:39<4:37:16, 5.58s/it][2025-06-20 00:35:23,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 00:35:23,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.77 | bwd_microstep: 3339.75 | bwd_inner_microstep: 3338.83 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.61 [2025-06-20 00:35:23,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.77 | bwd: 3339.79 | bwd_inner: 3338.83 | bwd_allreduce: 0.88 | step: 8.62 70%|███████ | 7022/10000 [11:05:44<4:35:53, 5.56s/it] {'loss': 0.0004, 'grad_norm': 0.06647031754255295, 'learning_rate': 8.603465303648302e-06, 'epoch': 7.02} 70%|███████ | 7022/10000 [11:05:44<4:35:53, 5.56s/it][2025-06-20 00:35:29,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 00:35:29,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.93 | bwd_microstep: 3321.73 | bwd_inner_microstep: 3320.80 | bwd_allreduce_microstep: 0.85 | step_microstep: 8.28 [2025-06-20 00:35:29,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.93 | bwd: 3321.76 | bwd_inner: 3320.80 | bwd_allreduce: 0.88 | step: 8.29 70%|███████ | 7023/10000 [11:05:50<4:35:03, 5.54s/it] {'loss': 0.0002, 'grad_norm': 0.0893547460436821, 'learning_rate': 8.598142910822265e-06, 'epoch': 7.02} 70%|███████ | 7023/10000 [11:05:50<4:35:03, 
5.54s/it][2025-06-20 00:35:34,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:35:34,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.64 | bwd_microstep: 3322.46 | bwd_inner_microstep: 3321.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-20 00:35:34,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.64 | bwd: 3322.48 | bwd_inner: 3321.65 | bwd_allreduce: 0.78 | step: 7.16 70%|███████ | 7024/10000 [11:05:55<4:34:42, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0022749321069568396, 'learning_rate': 8.592821713998125e-06, 'epoch': 7.02} 70%|███████ | 7024/10000 [11:05:55<4:34:42, 5.54s/it][2025-06-20 00:35:40,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:35:40,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.38 | bwd_microstep: 3320.18 | bwd_inner_microstep: 3319.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 00:35:40,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.38 | bwd: 3320.19 | bwd_inner: 3319.38 | bwd_allreduce: 0.77 | step: 6.78 70%|███████ | 7025/10000 [11:06:01<4:34:03, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.010973533615469933, 'learning_rate': 8.587501713734039e-06, 'epoch': 7.03} 70%|███████ | 7025/10000 [11:06:01<4:34:03, 5.53s/it][2025-06-20 00:35:45,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:35:45,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.40 | bwd_microstep: 3370.11 | bwd_inner_microstep: 3369.24 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.92 [2025-06-20 00:35:45,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.40 | bwd: 3370.14 | bwd_inner: 
3369.24 | bwd_allreduce: 0.83 | step: 6.91 70%|███████ | 7026/10000 [11:06:06<4:34:09, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0025365883484482765, 'learning_rate': 8.582182910588057e-06, 'epoch': 7.03} 70%|███████ | 7026/10000 [11:06:06<4:34:09, 5.53s/it][2025-06-20 00:35:51,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:35:51,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.91 | bwd_microstep: 3395.69 | bwd_inner_microstep: 3394.84 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.84 [2025-06-20 00:35:51,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.91 | bwd: 3395.71 | bwd_inner: 3394.84 | bwd_allreduce: 0.82 | step: 6.86 70%|███████ | 7027/10000 [11:06:12<4:35:00, 5.55s/it] {'loss': 0.0004, 'grad_norm': 0.10046295076608658, 'learning_rate': 8.576865305118093e-06, 'epoch': 7.03} 70%|███████ | 7027/10000 [11:06:12<4:35:00, 5.55s/it][2025-06-20 00:35:57,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:35:57,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.50 | bwd_microstep: 3409.50 | bwd_inner_microstep: 3408.68 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-20 00:35:57,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.50 | bwd: 3409.51 | bwd_inner: 3408.68 | bwd_allreduce: 0.78 | step: 7.14 70%|███████ | 7028/10000 [11:06:17<4:35:31, 5.56s/it] {'loss': 0.0001, 'grad_norm': 0.012435722164809704, 'learning_rate': 8.571548897881945e-06, 'epoch': 7.03} 70%|███████ | 7028/10000 [11:06:17<4:35:31, 5.56s/it][2025-06-20 00:36:02,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:36:02,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2139.56 | bwd_microstep: 3374.60 | bwd_inner_microstep: 3373.79 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 00:36:02,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.56 | bwd: 3374.61 | bwd_inner: 3373.79 | bwd_allreduce: 0.78 | step: 6.81 70%|███████ | 7029/10000 [11:06:23<4:35:18, 5.56s/it] {'loss': 0.0, 'grad_norm': 0.006887391675263643, 'learning_rate': 8.566233689437285e-06, 'epoch': 7.03} 70%|███████ | 7029/10000 [11:06:23<4:35:18, 5.56s/it][2025-06-20 00:36:08,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:36:08,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.90 | bwd_microstep: 3368.94 | bwd_inner_microstep: 3368.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-20 00:36:08,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.90 | bwd: 3368.96 | bwd_inner: 3368.15 | bwd_allreduce: 0.76 | step: 6.71 70%|███████ | 7030/10000 [11:06:28<4:34:55, 5.55s/it] {'loss': 0.0006, 'grad_norm': 0.282272070646286, 'learning_rate': 8.560919680341635e-06, 'epoch': 7.03} 70%|███████ | 7030/10000 [11:06:28<4:34:55, 5.55s/it][2025-06-20 00:36:13,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:36:13,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.52 | bwd_microstep: 3369.17 | bwd_inner_microstep: 3368.31 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.62 [2025-06-20 00:36:13,735] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.52 | bwd: 3369.19 | bwd_inner: 3368.31 | bwd_allreduce: 0.84 | step: 7.62 70%|███████ | 7031/10000 [11:06:34<4:35:00, 5.56s/it] {'loss': 0.0001, 'grad_norm': 0.009131918661296368, 'learning_rate': 8.555606871152422e-06, 'epoch': 7.03} 70%|███████ | 7031/10000 [11:06:34<4:35:00, 
5.56s/it][2025-06-20 00:36:19,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:36:19,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.02 | bwd_microstep: 3317.09 | bwd_inner_microstep: 3316.26 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.88 [2025-06-20 00:36:19,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.02 | bwd: 3317.10 | bwd_inner: 3316.26 | bwd_allreduce: 0.80 | step: 6.88 70%|███████ | 7032/10000 [11:06:40<4:33:36, 5.53s/it] {'loss': 0.0002, 'grad_norm': 0.030289601534605026, 'learning_rate': 8.550295262426931e-06, 'epoch': 7.03} 70%|███████ | 7032/10000 [11:06:40<4:33:36, 5.53s/it][2025-06-20 00:36:24,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 00:36:24,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.54 | bwd_microstep: 3315.79 | bwd_inner_microstep: 3314.79 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.23 [2025-06-20 00:36:24,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.54 | bwd: 3315.81 | bwd_inner: 3314.79 | bwd_allreduce: 0.97 | step: 7.23 70%|███████ | 7033/10000 [11:06:45<4:32:33, 5.51s/it] {'loss': 0.0008, 'grad_norm': 0.2720889151096344, 'learning_rate': 8.54498485472233e-06, 'epoch': 7.03} 70%|███████ | 7033/10000 [11:06:45<4:32:33, 5.51s/it][2025-06-20 00:36:30,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:36:30,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.28 | bwd_microstep: 3312.67 | bwd_inner_microstep: 3311.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 00:36:30,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.28 | bwd: 3312.69 | bwd_inner: 
3311.88 | bwd_allreduce: 0.76 | step: 6.68 70%|███████ | 7034/10000 [11:06:50<4:31:49, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.04381943866610527, 'learning_rate': 8.539675648595651e-06, 'epoch': 7.03} 70%|███████ | 7034/10000 [11:06:50<4:31:49, 5.50s/it][2025-06-20 00:36:35,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:36:35,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.12 | bwd_microstep: 3369.83 | bwd_inner_microstep: 3368.97 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.22 [2025-06-20 00:36:35,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.12 | bwd: 3369.86 | bwd_inner: 3368.97 | bwd_allreduce: 0.83 | step: 7.23 70%|███████ | 7035/10000 [11:06:56<4:32:29, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.003957975655794144, 'learning_rate': 8.534367644603807e-06, 'epoch': 7.04} 70%|███████ | 7035/10000 [11:06:56<4:32:29, 5.51s/it][2025-06-20 00:36:41,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:36:41,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.92 | bwd_microstep: 3366.29 | bwd_inner_microstep: 3365.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 00:36:41,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.92 | bwd: 3366.31 | bwd_inner: 3365.50 | bwd_allreduce: 0.76 | step: 7.02 70%|███████ | 7036/10000 [11:07:02<4:32:45, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.008238689042627811, 'learning_rate': 8.529060843303585e-06, 'epoch': 7.04} 70%|███████ | 7036/10000 [11:07:02<4:32:45, 5.52s/it][2025-06-20 00:36:46,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 00:36:46,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2128.33 | bwd_microstep: 3379.57 | bwd_inner_microstep: 3378.70 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.41 [2025-06-20 00:36:46,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.33 | bwd: 3379.59 | bwd_inner: 3378.70 | bwd_allreduce: 0.82 | step: 7.41 70%|███████ | 7037/10000 [11:07:07<4:33:03, 5.53s/it] {'loss': 0.001, 'grad_norm': 0.23960182070732117, 'learning_rate': 8.523755245251642e-06, 'epoch': 7.04} 70%|███████ | 7037/10000 [11:07:07<4:33:03, 5.53s/it][2025-06-20 00:36:52,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:36:52,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2162.54 | bwd_microstep: 3369.88 | bwd_inner_microstep: 3369.00 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.94 [2025-06-20 00:36:52,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2162.54 | bwd: 3369.91 | bwd_inner: 3369.00 | bwd_allreduce: 0.84 | step: 6.95 70%|███████ | 7038/10000 [11:07:13<4:33:34, 5.54s/it] {'loss': 0.0001, 'grad_norm': 0.023860551416873932, 'learning_rate': 8.518450851004512e-06, 'epoch': 7.04} 70%|███████ | 7038/10000 [11:07:13<4:33:34, 5.54s/it][2025-06-20 00:36:57,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:36:57,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2194.02 | bwd_microstep: 3394.32 | bwd_inner_microstep: 3393.36 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.07 [2025-06-20 00:36:57,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2194.02 | bwd: 3394.34 | bwd_inner: 3393.36 | bwd_allreduce: 0.93 | step: 7.07 70%|███████ | 7039/10000 [11:07:18<4:34:46, 5.57s/it] {'loss': 0.0, 'grad_norm': 0.0038587432354688644, 'learning_rate': 8.513147661118606e-06, 'epoch': 7.04} 70%|███████ | 7039/10000 [11:07:18<4:34:46, 
5.57s/it][2025-06-20 00:37:03,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:37:03,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.01 | bwd_microstep: 3315.20 | bwd_inner_microstep: 3314.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-20 00:37:03,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.01 | bwd: 3315.22 | bwd_inner: 3314.41 | bwd_allreduce: 0.76 | step: 6.69 70%|███████ | 7040/10000 [11:07:24<4:33:17, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0044428566470742226, 'learning_rate': 8.507845676150191e-06, 'epoch': 7.04} 70%|███████ | 7040/10000 [11:07:24<4:33:17, 5.54s/it][2025-06-20 00:37:08,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:37:08,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.96 | bwd_microstep: 3360.05 | bwd_inner_microstep: 3359.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-20 00:37:08,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.96 | bwd: 3360.07 | bwd_inner: 3359.26 | bwd_allreduce: 0.77 | step: 6.90 70%|███████ | 7041/10000 [11:07:29<4:32:55, 5.53s/it] {'loss': 0.005, 'grad_norm': 0.9796313643455505, 'learning_rate': 8.502544896655426e-06, 'epoch': 7.04} 70%|███████ | 7041/10000 [11:07:29<4:32:55, 5.53s/it][2025-06-20 00:37:14,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:37:14,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.32 | bwd_microstep: 3394.06 | bwd_inner_microstep: 3393.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-20 00:37:14,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.32 | bwd: 3394.07 | bwd_inner: 
3393.26 | bwd_allreduce: 0.77 | step: 6.96 70%|███████ | 7042/10000 [11:07:35<4:33:22, 5.55s/it] {'loss': 0.0, 'grad_norm': 0.0014503184938803315, 'learning_rate': 8.497245323190344e-06, 'epoch': 7.04} 70%|███████ | 7042/10000 [11:07:35<4:33:22, 5.55s/it][2025-06-20 00:37:20,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:37:20,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.11 | bwd_microstep: 3319.48 | bwd_inner_microstep: 3318.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-20 00:37:20,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.11 | bwd: 3319.50 | bwd_inner: 3318.70 | bwd_allreduce: 0.76 | step: 6.67 70%|███████ | 7043/10000 [11:07:40<4:32:06, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.006328748073428869, 'learning_rate': 8.49194695631084e-06, 'epoch': 7.04} 70%|███████ | 7043/10000 [11:07:40<4:32:06, 5.52s/it][2025-06-20 00:37:25,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:37:25,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.06 | bwd_microstep: 3360.57 | bwd_inner_microstep: 3359.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 00:37:25,529] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.06 | bwd: 3360.59 | bwd_inner: 3359.79 | bwd_allreduce: 0.75 | step: 6.59 70%|███████ | 7044/10000 [11:07:46<4:32:01, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00620092311874032, 'learning_rate': 8.486649796572698e-06, 'epoch': 7.04} 70%|███████ | 7044/10000 [11:07:46<4:32:01, 5.52s/it][2025-06-20 00:37:30,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:37:30,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2101.69 | bwd_microstep: 3313.59 | bwd_inner_microstep: 3312.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 00:37:30,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.69 | bwd: 3313.61 | bwd_inner: 3312.81 | bwd_allreduce: 0.76 | step: 6.65 70%|███████ | 7045/10000 [11:07:51<4:30:53, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.26924654841423035, 'learning_rate': 8.481353844531548e-06, 'epoch': 7.04} 70%|███████ | 7045/10000 [11:07:51<4:30:53, 5.50s/it][2025-06-20 00:37:36,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:37:36,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.14 | bwd_microstep: 3323.92 | bwd_inner_microstep: 3323.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-20 00:37:36,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.14 | bwd: 3323.93 | bwd_inner: 3323.11 | bwd_allreduce: 0.78 | step: 7.15 70%|███████ | 7046/10000 [11:07:57<4:30:23, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.026638580486178398, 'learning_rate': 8.476059100742919e-06, 'epoch': 7.05} 70%|███████ | 7046/10000 [11:07:57<4:30:23, 5.49s/it][2025-06-20 00:37:41,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:37:41,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.36 | bwd_microstep: 3314.32 | bwd_inner_microstep: 3313.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.93 [2025-06-20 00:37:41,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.36 | bwd: 3314.33 | bwd_inner: 3313.53 | bwd_allreduce: 0.76 | step: 6.94 70%|███████ | 7047/10000 [11:08:02<4:29:52, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00665684137493372, 'learning_rate': 8.470765565762209e-06, 'epoch': 7.05} 70%|███████ | 7047/10000 [11:08:02<4:29:52, 
5.48s/it][2025-06-20 00:37:47,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:37:47,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.45 | bwd_microstep: 3312.14 | bwd_inner_microstep: 3311.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.87 [2025-06-20 00:37:47,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.45 | bwd: 3312.16 | bwd_inner: 3311.35 | bwd_allreduce: 0.76 | step: 6.87 70%|███████ | 7048/10000 [11:08:08<4:29:23, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.3023098111152649, 'learning_rate': 8.465473240144681e-06, 'epoch': 7.05} 70%|███████ | 7048/10000 [11:08:08<4:29:23, 5.48s/it][2025-06-20 00:37:52,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:37:52,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.54 | bwd_microstep: 3318.07 | bwd_inner_microstep: 3317.28 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-20 00:37:52,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.54 | bwd: 3318.08 | bwd_inner: 3317.28 | bwd_allreduce: 0.76 | step: 6.82 70%|███████ | 7049/10000 [11:08:13<4:29:03, 5.47s/it] {'loss': 0.0006, 'grad_norm': 0.21011637151241302, 'learning_rate': 8.460182124445484e-06, 'epoch': 7.05} 70%|███████ | 7049/10000 [11:08:13<4:29:03, 5.47s/it][2025-06-20 00:37:58,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:37:58,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.32 | bwd_microstep: 3355.87 | bwd_inner_microstep: 3355.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 00:37:58,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.32 | bwd: 3355.88 | bwd_inner: 
3355.08 | bwd_allreduce: 0.76 | step: 6.63 70%|███████ | 7050/10000 [11:08:19<4:29:43, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.032115042209625244, 'learning_rate': 8.454892219219617e-06, 'epoch': 7.05} 70%|███████ | 7050/10000 [11:08:19<4:29:43, 5.49s/it][2025-06-20 00:38:03,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:38:03,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.63 | bwd_microstep: 3319.76 | bwd_inner_microstep: 3318.94 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.36 [2025-06-20 00:38:03,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.63 | bwd: 3319.77 | bwd_inner: 3318.94 | bwd_allreduce: 0.78 | step: 7.37 71%|███████ | 7051/10000 [11:08:24<4:29:20, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0030894859228283167, 'learning_rate': 8.449603525021974e-06, 'epoch': 7.05} 71%|███████ | 7051/10000 [11:08:24<4:29:20, 5.48s/it][2025-06-20 00:38:09,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:38:09,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.09 | bwd_microstep: 3313.76 | bwd_inner_microstep: 3312.95 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.86 [2025-06-20 00:38:09,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.09 | bwd: 3313.78 | bwd_inner: 3312.95 | bwd_allreduce: 0.78 | step: 6.86 71%|███████ | 7052/10000 [11:08:30<4:28:55, 5.47s/it] {'loss': 0.0005, 'grad_norm': 0.26169249415397644, 'learning_rate': 8.444316042407315e-06, 'epoch': 7.05} 71%|███████ | 7052/10000 [11:08:30<4:28:55, 5.47s/it][2025-06-20 00:38:14,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:38:14,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2108.99 | bwd_microstep: 3315.09 | bwd_inner_microstep: 3314.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.26 [2025-06-20 00:38:14,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.99 | bwd: 3315.10 | bwd_inner: 3314.28 | bwd_allreduce: 0.78 | step: 7.26 71%|███████ | 7053/10000 [11:08:35<4:28:41, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.021459121257066727, 'learning_rate': 8.439029771930269e-06, 'epoch': 7.05} 71%|███████ | 7053/10000 [11:08:35<4:28:41, 5.47s/it][2025-06-20 00:38:20,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:38:20,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.49 | bwd_microstep: 3374.25 | bwd_inner_microstep: 3373.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 00:38:20,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.49 | bwd: 3374.27 | bwd_inner: 3373.45 | bwd_allreduce: 0.77 | step: 6.76 71%|███████ | 7054/10000 [11:08:41<4:29:37, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.014453739859163761, 'learning_rate': 8.433744714145347e-06, 'epoch': 7.05} 71%|███████ | 7054/10000 [11:08:41<4:29:37, 5.49s/it][2025-06-20 00:38:25,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.73 [2025-06-20 00:38:25,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.13 | bwd_microstep: 3369.07 | bwd_inner_microstep: 3368.27 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.37 [2025-06-20 00:38:25,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.13 | bwd: 3369.08 | bwd_inner: 3368.27 | bwd_allreduce: 0.78 | step: 7.37 71%|███████ | 7055/10000 [11:08:46<4:30:07, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.07520093768835068, 'learning_rate': 8.42846086960692e-06, 'epoch': 7.05} 71%|███████ | 7055/10000 [11:08:46<4:30:07, 
5.50s/it][2025-06-20 00:38:31,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:38:31,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.75 | bwd_microstep: 3366.36 | bwd_inner_microstep: 3365.43 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.04 [2025-06-20 00:38:31,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.75 | bwd: 3366.38 | bwd_inner: 3365.43 | bwd_allreduce: 0.89 | step: 7.04 71%|███████ | 7056/10000 [11:08:52<4:30:24, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.010998811572790146, 'learning_rate': 8.423178238869245e-06, 'epoch': 7.06} 71%|███████ | 7056/10000 [11:08:52<4:30:24, 5.51s/it][2025-06-20 00:38:36,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 00:38:36,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.54 | bwd_microstep: 3317.74 | bwd_inner_microstep: 3316.64 | bwd_allreduce_microstep: 1.03 | step_microstep: 8.22 [2025-06-20 00:38:36,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.54 | bwd: 3317.76 | bwd_inner: 3316.64 | bwd_allreduce: 1.06 | step: 8.22 71%|███████ | 7057/10000 [11:08:57<4:29:42, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.3737443685531616, 'learning_rate': 8.417896822486444e-06, 'epoch': 7.06} 71%|███████ | 7057/10000 [11:08:57<4:29:42, 5.50s/it][2025-06-20 00:38:42,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.92 [2025-06-20 00:38:42,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.58 | bwd_microstep: 3319.58 | bwd_inner_microstep: 3318.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.96 [2025-06-20 00:38:42,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.58 | bwd: 3319.60 | bwd_inner: 
3318.80 | bwd_allreduce: 0.75 | step: 6.96 71%|███████ | 7058/10000 [11:09:03<4:29:03, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004683992359787226, 'learning_rate': 8.41261662101251e-06, 'epoch': 7.06} 71%|███████ | 7058/10000 [11:09:03<4:29:03, 5.49s/it][2025-06-20 00:38:47,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:38:47,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.29 | bwd_microstep: 3359.96 | bwd_inner_microstep: 3359.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-20 00:38:47,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.29 | bwd: 3359.97 | bwd_inner: 3359.16 | bwd_allreduce: 0.77 | step: 7.07 71%|███████ | 7059/10000 [11:09:08<4:29:33, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.009393488056957722, 'learning_rate': 8.407337635001323e-06, 'epoch': 7.06} 71%|███████ | 7059/10000 [11:09:08<4:29:33, 5.50s/it][2025-06-20 00:38:53,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:38:53,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.24 | bwd_microstep: 3372.06 | bwd_inner_microstep: 3371.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 00:38:53,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.24 | bwd: 3372.07 | bwd_inner: 3371.27 | bwd_allreduce: 0.76 | step: 6.76 71%|███████ | 7060/10000 [11:09:14<4:30:07, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.000152943393914029, 'learning_rate': 8.402059865006607e-06, 'epoch': 7.06} 71%|███████ | 7060/10000 [11:09:14<4:30:07, 5.51s/it][2025-06-20 00:38:58,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:38:58,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2115.91 | bwd_microstep: 3321.30 | bwd_inner_microstep: 3320.49 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-20 00:38:58,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.91 | bwd: 3321.31 | bwd_inner: 3320.49 | bwd_allreduce: 0.78 | step: 7.17 71%|███████ | 7061/10000 [11:09:19<4:29:31, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0002803495153784752, 'learning_rate': 8.396783311581982e-06, 'epoch': 7.06} 71%|███████ | 7061/10000 [11:09:19<4:29:31, 5.50s/it][2025-06-20 00:39:04,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:39:04,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.14 | bwd_microstep: 3321.25 | bwd_inner_microstep: 3320.44 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-20 00:39:04,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.14 | bwd: 3321.27 | bwd_inner: 3320.44 | bwd_allreduce: 0.78 | step: 7.32 71%|███████ | 7062/10000 [11:09:25<4:28:52, 5.49s/it] {'loss': 0.0014, 'grad_norm': 0.4946097731590271, 'learning_rate': 8.391507975280937e-06, 'epoch': 7.06} 71%|███████ | 7062/10000 [11:09:25<4:28:52, 5.49s/it][2025-06-20 00:39:09,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:39:09,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.34 | bwd_microstep: 3310.48 | bwd_inner_microstep: 3309.33 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.46 [2025-06-20 00:39:09,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.34 | bwd: 3310.50 | bwd_inner: 3309.33 | bwd_allreduce: 1.11 | step: 7.46 71%|███████ | 7063/10000 [11:09:30<4:28:15, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0024408940225839615, 'learning_rate': 8.386233856656827e-06, 'epoch': 7.06} 71%|███████ | 7063/10000 [11:09:30<4:28:15, 
5.48s/it][2025-06-20 00:39:15,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:39:15,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.94 | bwd_microstep: 3322.11 | bwd_inner_microstep: 3321.29 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.32 [2025-06-20 00:39:15,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.94 | bwd: 3322.12 | bwd_inner: 3321.29 | bwd_allreduce: 0.78 | step: 7.32 71%|███████ | 7064/10000 [11:09:36<4:28:08, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.009259790182113647, 'learning_rate': 8.380960956262886e-06, 'epoch': 7.06} 71%|███████ | 7064/10000 [11:09:36<4:28:08, 5.48s/it][2025-06-20 00:39:20,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:39:20,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.16 | bwd_microstep: 3318.09 | bwd_inner_microstep: 3317.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-20 00:39:20,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.16 | bwd: 3318.10 | bwd_inner: 3317.28 | bwd_allreduce: 0.78 | step: 6.75 71%|███████ | 7065/10000 [11:09:41<4:27:43, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0001823641505325213, 'learning_rate': 8.37568927465222e-06, 'epoch': 7.07} 71%|███████ | 7065/10000 [11:09:41<4:27:43, 5.47s/it][2025-06-20 00:39:26,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:39:26,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.28 | bwd_microstep: 3373.92 | bwd_inner_microstep: 3373.01 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.09 [2025-06-20 00:39:26,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.28 | bwd: 3373.94 | bwd_inner: 
3373.01 | bwd_allreduce: 0.88 | step: 7.10 71%|███████ | 7066/10000 [11:09:47<4:28:37, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0003345521108713001, 'learning_rate': 8.370418812377797e-06, 'epoch': 7.07} 71%|███████ | 7066/10000 [11:09:47<4:28:37, 5.49s/it][2025-06-20 00:39:31,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:39:31,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.85 | bwd_microstep: 3311.93 | bwd_inner_microstep: 3310.96 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.58 [2025-06-20 00:39:31,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.85 | bwd: 3311.94 | bwd_inner: 3310.96 | bwd_allreduce: 0.94 | step: 7.58 71%|███████ | 7067/10000 [11:09:52<4:28:07, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.025965776294469833, 'learning_rate': 8.365149569992461e-06, 'epoch': 7.07} 71%|███████ | 7067/10000 [11:09:52<4:28:07, 5.48s/it][2025-06-20 00:39:37,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:39:37,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.12 | bwd_microstep: 3363.84 | bwd_inner_microstep: 3363.03 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-20 00:39:37,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.12 | bwd: 3363.85 | bwd_inner: 3363.03 | bwd_allreduce: 0.78 | step: 7.01 71%|███████ | 7068/10000 [11:09:58<4:28:49, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0034356480464339256, 'learning_rate': 8.359881548048939e-06, 'epoch': 7.07} 71%|███████ | 7068/10000 [11:09:58<4:28:49, 5.50s/it][2025-06-20 00:39:42,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:39:42,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2128.52 | bwd_microstep: 3366.22 | bwd_inner_microstep: 3365.08 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.59 [2025-06-20 00:39:42,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.52 | bwd: 3366.24 | bwd_inner: 3365.08 | bwd_allreduce: 1.10 | step: 7.58 71%|███████ | 7069/10000 [11:10:03<4:29:17, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0044209640473127365, 'learning_rate': 8.354614747099817e-06, 'epoch': 7.07} 71%|███████ | 7069/10000 [11:10:03<4:29:17, 5.51s/it][2025-06-20 00:39:48,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:39:48,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.35 | bwd_microstep: 3313.44 | bwd_inner_microstep: 3312.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-20 00:39:48,226] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.35 | bwd: 3313.45 | bwd_inner: 3312.65 | bwd_allreduce: 0.76 | step: 6.67 71%|███████ | 7070/10000 [11:10:09<4:28:28, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.026471305638551712, 'learning_rate': 8.34934916769756e-06, 'epoch': 7.07} 71%|███████ | 7070/10000 [11:10:09<4:28:28, 5.50s/it][2025-06-20 00:39:53,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:39:53,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.30 | bwd_microstep: 3370.81 | bwd_inner_microstep: 3369.79 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.79 [2025-06-20 00:39:53,762] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.30 | bwd: 3370.83 | bwd_inner: 3369.79 | bwd_allreduce: 0.98 | step: 7.80 71%|███████ | 7071/10000 [11:10:14<4:29:00, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.003276603063568473, 'learning_rate': 8.344084810394503e-06, 'epoch': 7.07} 71%|███████ | 7071/10000 [11:10:14<4:29:00, 
5.51s/it][2025-06-20 00:39:59,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:39:59,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.40 | bwd_microstep: 3317.28 | bwd_inner_microstep: 3316.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 00:39:59,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.40 | bwd: 3317.30 | bwd_inner: 3316.49 | bwd_allreduce: 0.77 | step: 6.80 71%|███████ | 7072/10000 [11:10:20<4:28:20, 5.50s/it] {'loss': 0.0119, 'grad_norm': 3.1597440242767334, 'learning_rate': 8.338821675742854e-06, 'epoch': 7.07} 71%|███████ | 7072/10000 [11:10:20<4:28:20, 5.50s/it][2025-06-20 00:40:04,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:40:04,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.91 | bwd_microstep: 3371.24 | bwd_inner_microstep: 3370.20 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.99 [2025-06-20 00:40:04,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.91 | bwd: 3371.27 | bwd_inner: 3370.20 | bwd_allreduce: 1.00 | step: 8.00 71%|███████ | 7073/10000 [11:10:25<4:29:05, 5.52s/it] {'loss': 0.0017, 'grad_norm': 0.35724130272865295, 'learning_rate': 8.33355976429469e-06, 'epoch': 7.07} 71%|███████ | 7073/10000 [11:10:25<4:29:05, 5.52s/it][2025-06-20 00:40:10,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:40:10,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.71 | bwd_microstep: 3363.19 | bwd_inner_microstep: 3362.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 00:40:10,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.71 | bwd: 3363.20 | bwd_inner: 
3362.40 | bwd_allreduce: 0.76 | step: 6.69 71%|███████ | 7074/10000 [11:10:31<4:29:13, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0009058011346496642, 'learning_rate': 8.32829907660196e-06, 'epoch': 7.07} 71%|███████ | 7074/10000 [11:10:31<4:29:13, 5.52s/it][2025-06-20 00:40:15,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:40:15,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.79 | bwd_microstep: 3324.05 | bwd_inner_microstep: 3323.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 00:40:15,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.79 | bwd: 3324.06 | bwd_inner: 3323.26 | bwd_allreduce: 0.76 | step: 6.67 71%|███████ | 7075/10000 [11:10:36<4:28:29, 5.51s/it] {'loss': 0.0004, 'grad_norm': 0.07062561810016632, 'learning_rate': 8.323039613216495e-06, 'epoch': 7.08} 71%|███████ | 7075/10000 [11:10:36<4:28:29, 5.51s/it][2025-06-20 00:40:21,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:40:21,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.23 | bwd_microstep: 3313.13 | bwd_inner_microstep: 3312.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-20 00:40:21,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.23 | bwd: 3313.15 | bwd_inner: 3312.34 | bwd_allreduce: 0.76 | step: 7.00 71%|███████ | 7076/10000 [11:10:42<4:27:43, 5.49s/it] {'loss': 0.001, 'grad_norm': 0.22227799892425537, 'learning_rate': 8.317781374689972e-06, 'epoch': 7.08} 71%|███████ | 7076/10000 [11:10:42<4:27:43, 5.49s/it][2025-06-20 00:40:26,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:40:26,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2112.91 | bwd_microstep: 3323.89 | bwd_inner_microstep: 3322.99 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.14 [2025-06-20 00:40:26,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.91 | bwd: 3323.90 | bwd_inner: 3322.99 | bwd_allreduce: 0.87 | step: 7.14 71%|███████ | 7077/10000 [11:10:47<4:27:21, 5.49s/it] {'loss': 0.0, 'grad_norm': 8.732264541322365e-05, 'learning_rate': 8.312524361573967e-06, 'epoch': 7.08} 71%|███████ | 7077/10000 [11:10:47<4:27:21, 5.49s/it][2025-06-20 00:40:32,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:40:32,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.09 | bwd_microstep: 3314.94 | bwd_inner_microstep: 3314.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 00:40:32,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.09 | bwd: 3314.95 | bwd_inner: 3314.15 | bwd_allreduce: 0.76 | step: 6.61 71%|███████ | 7078/10000 [11:10:52<4:26:47, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.000899044971447438, 'learning_rate': 8.30726857441991e-06, 'epoch': 7.08} 71%|███████ | 7078/10000 [11:10:52<4:26:47, 5.48s/it][2025-06-20 00:40:37,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:40:37,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.23 | bwd_microstep: 3319.04 | bwd_inner_microstep: 3318.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 00:40:37,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.23 | bwd: 3319.05 | bwd_inner: 3318.23 | bwd_allreduce: 0.77 | step: 7.02 71%|███████ | 7079/10000 [11:10:58<4:26:24, 5.47s/it] {'loss': 0.0006, 'grad_norm': 0.11559666693210602, 'learning_rate': 8.302014013779111e-06, 'epoch': 7.08} 71%|███████ | 7079/10000 [11:10:58<4:26:24, 
5.47s/it][2025-06-20 00:40:43,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:40:43,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.65 | bwd_microstep: 3320.13 | bwd_inner_microstep: 3319.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 00:40:43,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.66 | bwd: 3320.14 | bwd_inner: 3319.35 | bwd_allreduce: 0.76 | step: 6.72 71%|███████ | 7080/10000 [11:11:03<4:26:11, 5.47s/it] {'loss': 0.0, 'grad_norm': 9.933300316333771e-05, 'learning_rate': 8.29676068020276e-06, 'epoch': 7.08} 71%|███████ | 7080/10000 [11:11:03<4:26:11, 5.47s/it][2025-06-20 00:40:48,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 00:40:48,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.17 | bwd_microstep: 3315.04 | bwd_inner_microstep: 3314.05 | bwd_allreduce_microstep: 0.91 | step_microstep: 8.07 [2025-06-20 00:40:48,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.17 | bwd: 3315.07 | bwd_inner: 3314.05 | bwd_allreduce: 0.94 | step: 8.07 71%|███████ | 7081/10000 [11:11:09<4:25:59, 5.47s/it] {'loss': 0.0532, 'grad_norm': 5.165735244750977, 'learning_rate': 8.291508574241889e-06, 'epoch': 7.08} 71%|███████ | 7081/10000 [11:11:09<4:25:59, 5.47s/it][2025-06-20 00:40:54,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:40:54,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.87 | bwd_microstep: 3308.91 | bwd_inner_microstep: 3308.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 00:40:54,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.87 | bwd: 3308.92 | bwd_inner: 
3308.12 | bwd_allreduce: 0.76 | step: 6.72 71%|███████ | 7082/10000 [11:11:14<4:25:47, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0027815757784992456, 'learning_rate': 8.286257696447428e-06, 'epoch': 7.08} 71%|███████ | 7082/10000 [11:11:14<4:25:47, 5.47s/it][2025-06-20 00:40:59,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:40:59,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.46 | bwd_microstep: 3363.58 | bwd_inner_microstep: 3362.76 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.30 [2025-06-20 00:40:59,575] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.46 | bwd: 3363.60 | bwd_inner: 3362.76 | bwd_allreduce: 0.79 | step: 7.30 71%|███████ | 7083/10000 [11:11:20<4:26:46, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0040192739106714725, 'learning_rate': 8.28100804737017e-06, 'epoch': 7.08} 71%|███████ | 7083/10000 [11:11:20<4:26:46, 5.49s/it][2025-06-20 00:41:05,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:41:05,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.02 | bwd_microstep: 3316.35 | bwd_inner_microstep: 3315.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 00:41:05,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.02 | bwd: 3316.37 | bwd_inner: 3315.56 | bwd_allreduce: 0.77 | step: 6.79 71%|███████ | 7084/10000 [11:11:25<4:26:27, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0044912309385836124, 'learning_rate': 8.275759627560775e-06, 'epoch': 7.08} 71%|███████ | 7084/10000 [11:11:25<4:26:27, 5.48s/it][2025-06-20 00:41:10,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:41:10,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2107.46 | bwd_microstep: 3316.24 | bwd_inner_microstep: 3315.23 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.27 [2025-06-20 00:41:10,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.46 | bwd: 3316.26 | bwd_inner: 3315.23 | bwd_allreduce: 0.98 | step: 7.28 71%|███████ | 7085/10000 [11:11:31<4:26:03, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.021406440064311028, 'learning_rate': 8.270512437569791e-06, 'epoch': 7.08} 71%|███████ | 7085/10000 [11:11:31<4:26:03, 5.48s/it][2025-06-20 00:41:15,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:41:15,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.21 | bwd_microstep: 3319.65 | bwd_inner_microstep: 3318.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-20 00:41:15,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.21 | bwd: 3319.66 | bwd_inner: 3318.85 | bwd_allreduce: 0.77 | step: 7.17 71%|███████ | 7086/10000 [11:11:36<4:25:53, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.015856334939599037, 'learning_rate': 8.265266477947606e-06, 'epoch': 7.09} 71%|███████ | 7086/10000 [11:11:36<4:25:53, 5.47s/it][2025-06-20 00:41:21,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:41:21,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.62 | bwd_microstep: 3310.89 | bwd_inner_microstep: 3309.87 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.88 [2025-06-20 00:41:21,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.62 | bwd: 3310.91 | bwd_inner: 3309.87 | bwd_allreduce: 0.98 | step: 7.88 71%|███████ | 7087/10000 [11:11:42<4:25:32, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00030742361559532583, 'learning_rate': 8.260021749244506e-06, 'epoch': 7.09} 71%|███████ | 7087/10000 [11:11:42<4:25:32, 
5.47s/it][2025-06-20 00:41:26,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:41:26,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.82 | bwd_microstep: 3360.42 | bwd_inner_microstep: 3359.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-20 00:41:26,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.82 | bwd: 3360.44 | bwd_inner: 3359.63 | bwd_allreduce: 0.76 | step: 6.87 71%|███████ | 7088/10000 [11:11:47<4:26:15, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.14581772685050964, 'learning_rate': 8.254778252010637e-06, 'epoch': 7.09} 71%|███████ | 7088/10000 [11:11:47<4:26:15, 5.49s/it][2025-06-20 00:41:32,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:41:32,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.40 | bwd_microstep: 3316.11 | bwd_inner_microstep: 3315.29 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.33 [2025-06-20 00:41:32,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.40 | bwd: 3316.13 | bwd_inner: 3315.29 | bwd_allreduce: 0.79 | step: 7.34 71%|███████ | 7089/10000 [11:11:53<4:25:46, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.006439704913645983, 'learning_rate': 8.249535986796018e-06, 'epoch': 7.09} 71%|███████ | 7089/10000 [11:11:53<4:25:46, 5.48s/it][2025-06-20 00:41:37,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:41:37,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.00 | bwd_microstep: 3322.43 | bwd_inner_microstep: 3321.55 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.30 [2025-06-20 00:41:37,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.00 | bwd: 3322.45 | bwd_inner: 
3321.55 | bwd_allreduce: 0.85 | step: 7.30 71%|███████ | 7090/10000 [11:11:58<4:25:38, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0016385632334277034, 'learning_rate': 8.24429495415054e-06, 'epoch': 7.09} 71%|███████ | 7090/10000 [11:11:58<4:25:38, 5.48s/it][2025-06-20 00:41:43,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:41:43,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.89 | bwd_microstep: 3370.37 | bwd_inner_microstep: 3369.34 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.35 [2025-06-20 00:41:43,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.89 | bwd: 3370.39 | bwd_inner: 3369.34 | bwd_allreduce: 1.00 | step: 7.35 71%|███████ | 7091/10000 [11:12:04<4:26:24, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.017876558005809784, 'learning_rate': 8.239055154623961e-06, 'epoch': 7.09} 71%|███████ | 7091/10000 [11:12:04<4:26:24, 5.49s/it][2025-06-20 00:41:48,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:41:48,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.54 | bwd_microstep: 3325.53 | bwd_inner_microstep: 3324.43 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.69 [2025-06-20 00:41:48,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.54 | bwd: 3325.55 | bwd_inner: 3324.43 | bwd_allreduce: 1.05 | step: 7.69 71%|███████ | 7092/10000 [11:12:09<4:26:07, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.09189103543758392, 'learning_rate': 8.233816588765911e-06, 'epoch': 7.09} 71%|███████ | 7092/10000 [11:12:09<4:26:07, 5.49s/it][2025-06-20 00:41:54,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:41:54,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2133.41 | bwd_microstep: 3375.73 | bwd_inner_microstep: 3374.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 00:41:54,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.41 | bwd: 3375.74 | bwd_inner: 3374.94 | bwd_allreduce: 0.76 | step: 6.81 71%|███████ | 7093/10000 [11:12:15<4:26:49, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.010894514620304108, 'learning_rate': 8.228579257125891e-06, 'epoch': 7.09} 71%|███████ | 7093/10000 [11:12:15<4:26:49, 5.51s/it][2025-06-20 00:41:59,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:41:59,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.50 | bwd_microstep: 3332.00 | bwd_inner_microstep: 3331.13 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.36 [2025-06-20 00:41:59,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.50 | bwd: 3332.02 | bwd_inner: 3331.13 | bwd_allreduce: 0.83 | step: 7.37 71%|███████ | 7094/10000 [11:12:20<4:26:17, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0016789326909929514, 'learning_rate': 8.223343160253274e-06, 'epoch': 7.09} 71%|███████ | 7094/10000 [11:12:20<4:26:17, 5.50s/it][2025-06-20 00:42:05,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:42:05,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.63 | bwd_microstep: 3331.48 | bwd_inner_microstep: 3330.61 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.26 [2025-06-20 00:42:05,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.63 | bwd: 3331.51 | bwd_inner: 3330.61 | bwd_allreduce: 0.83 | step: 7.26 71%|███████ | 7095/10000 [11:12:26<4:26:29, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.0433773398399353, 'learning_rate': 8.218108298697311e-06, 'epoch': 7.09} 71%|███████ | 7095/10000 [11:12:26<4:26:29, 
5.50s/it][2025-06-20 00:42:10,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:42:10,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.76 | bwd_microstep: 3325.69 | bwd_inner_microstep: 3324.80 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.49 [2025-06-20 00:42:10,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.76 | bwd: 3325.71 | bwd_inner: 3324.80 | bwd_allreduce: 0.86 | step: 7.50 71%|███████ | 7096/10000 [11:12:31<4:26:27, 5.51s/it] {'loss': 0.0032, 'grad_norm': 1.1992192268371582, 'learning_rate': 8.2128746730071e-06, 'epoch': 7.1} 71%|███████ | 7096/10000 [11:12:31<4:26:27, 5.51s/it][2025-06-20 00:42:16,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:42:16,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.97 | bwd_microstep: 3375.71 | bwd_inner_microstep: 3374.83 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.47 [2025-06-20 00:42:16,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.97 | bwd: 3375.77 | bwd_inner: 3374.83 | bwd_allreduce: 0.84 | step: 7.47 71%|███████ | 7097/10000 [11:12:37<4:27:05, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.001082031987607479, 'learning_rate': 8.20764228373163e-06, 'epoch': 7.1} 71%|███████ | 7097/10000 [11:12:37<4:27:05, 5.52s/it][2025-06-20 00:42:22,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:42:22,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.99 | bwd_microstep: 3320.72 | bwd_inner_microstep: 3319.82 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.41 [2025-06-20 00:42:22,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.00 | bwd: 3320.75 | bwd_inner: 3319.82 | 
bwd_allreduce: 0.86 | step: 7.41 71%|███████ | 7098/10000 [11:12:42<4:26:42, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0014544147998094559, 'learning_rate': 8.202411131419752e-06, 'epoch': 7.1} 71%|███████ | 7098/10000 [11:12:42<4:26:42, 5.51s/it][2025-06-20 00:42:27,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:42:27,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.00 | bwd_microstep: 3314.93 | bwd_inner_microstep: 3314.07 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.15 [2025-06-20 00:42:27,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.00 | bwd: 3314.96 | bwd_inner: 3314.07 | bwd_allreduce: 0.82 | step: 7.16 71%|███████ | 7099/10000 [11:12:48<4:26:14, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.010366740636527538, 'learning_rate': 8.197181216620194e-06, 'epoch': 7.1} 71%|███████ | 7099/10000 [11:12:48<4:26:14, 5.51s/it][2025-06-20 00:42:33,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:42:33,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2154.48 | bwd_microstep: 3325.94 | bwd_inner_microstep: 3325.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.09 [2025-06-20 00:42:33,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2154.48 | bwd: 3325.96 | bwd_inner: 3325.13 | bwd_allreduce: 0.77 | step: 7.09 71%|███████ | 7100/10000 [11:12:53<4:26:23, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0018791808979585767, 'learning_rate': 8.191952539881554e-06, 'epoch': 7.1} 71%|███████ | 7100/10000 [11:12:53<4:26:23, 5.51s/it][2025-06-20 00:42:38,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:42:38,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2166.77 | 
bwd_microstep: 3368.60 | bwd_inner_microstep: 3367.74 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.73 [2025-06-20 00:42:38,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2166.77 | bwd: 3368.63 | bwd_inner: 3367.74 | bwd_allreduce: 0.82 | step: 7.73 71%|███████ | 7101/10000 [11:12:59<4:27:16, 5.53s/it] {'loss': 0.1626, 'grad_norm': 9.095704078674316, 'learning_rate': 8.186725101752282e-06, 'epoch': 7.1} 71%|███████ | 7101/10000 [11:12:59<4:27:16, 5.53s/it][2025-06-20 00:42:44,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:42:44,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.68 | bwd_microstep: 3326.04 | bwd_inner_microstep: 3325.25 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-20 00:42:44,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.68 | bwd: 3326.24 | bwd_inner: 3325.25 | bwd_allreduce: 0.76 | step: 6.62 71%|███████ | 7102/10000 [11:13:04<4:26:12, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.005375297740101814, 'learning_rate': 8.18149890278072e-06, 'epoch': 7.1} 71%|███████ | 7102/10000 [11:13:04<4:26:12, 5.51s/it][2025-06-20 00:42:49,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:42:49,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.28 | bwd_microstep: 3376.48 | bwd_inner_microstep: 3375.62 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.09 [2025-06-20 00:42:49,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.28 | bwd: 3376.50 | bwd_inner: 3375.62 | bwd_allreduce: 0.82 | step: 7.10 71%|███████ | 7103/10000 [11:13:10<4:26:33, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.03028487227857113, 'learning_rate': 8.176273943515075e-06, 'epoch': 7.1} 71%|███████ | 7103/10000 [11:13:10<4:26:33, 5.52s/it][2025-06-20 00:42:55,180] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:42:55,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.78 | bwd_microstep: 3371.63 | bwd_inner_microstep: 3370.74 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.94 [2025-06-20 00:42:55,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.78 | bwd: 3371.65 | bwd_inner: 3370.74 | bwd_allreduce: 0.86 | step: 6.96 71%|███████ | 7104/10000 [11:13:15<4:27:08, 5.53s/it] {'loss': 0.0007, 'grad_norm': 0.5088681578636169, 'learning_rate': 8.171050224503411e-06, 'epoch': 7.1} 71%|███████ | 7104/10000 [11:13:15<4:27:08, 5.53s/it][2025-06-20 00:43:00,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:43:00,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.59 | bwd_microstep: 3370.06 | bwd_inner_microstep: 3368.94 | bwd_allreduce_microstep: 1.07 | step_microstep: 6.96 [2025-06-20 00:43:00,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.59 | bwd: 3370.07 | bwd_inner: 3368.94 | bwd_allreduce: 1.08 | step: 6.96 71%|███████ | 7105/10000 [11:13:21<4:27:05, 5.54s/it] {'loss': 0.0001, 'grad_norm': 0.007493308745324612, 'learning_rate': 8.165827746293684e-06, 'epoch': 7.11} 71%|███████ | 7105/10000 [11:13:21<4:27:05, 5.54s/it][2025-06-20 00:43:06,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:43:06,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.50 | bwd_microstep: 3372.95 | bwd_inner_microstep: 3372.11 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.18 [2025-06-20 00:43:06,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.50 | bwd: 3372.97 | bwd_inner: 3372.11 | bwd_allreduce: 0.80 | step: 
7.18 71%|███████ | 7106/10000 [11:13:27<4:27:15, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0005716591840609908, 'learning_rate': 8.160606509433701e-06, 'epoch': 7.11} 71%|███████ | 7106/10000 [11:13:27<4:27:15, 5.54s/it][2025-06-20 00:43:11,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:43:11,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2150.81 | bwd_microstep: 3368.64 | bwd_inner_microstep: 3367.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.15 [2025-06-20 00:43:11,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2150.82 | bwd: 3368.66 | bwd_inner: 3367.85 | bwd_allreduce: 0.76 | step: 7.16 71%|███████ | 7107/10000 [11:13:32<4:27:42, 5.55s/it] {'loss': 0.004, 'grad_norm': 0.7076423168182373, 'learning_rate': 8.155386514471146e-06, 'epoch': 7.11} 71%|███████ | 7107/10000 [11:13:32<4:27:42, 5.55s/it][2025-06-20 00:43:17,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:43:17,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.07 | bwd_microstep: 3331.56 | bwd_inner_microstep: 3330.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 00:43:17,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.07 | bwd: 3331.58 | bwd_inner: 3330.77 | bwd_allreduce: 0.76 | step: 6.70 71%|███████ | 7108/10000 [11:13:38<4:26:43, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.009184999391436577, 'learning_rate': 8.150167761953575e-06, 'epoch': 7.11} 71%|███████ | 7108/10000 [11:13:38<4:26:43, 5.53s/it][2025-06-20 00:43:22,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:43:22,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.67 | bwd_microstep: 3337.78 | 
bwd_inner_microstep: 3336.85 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.96 [2025-06-20 00:43:22,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.67 | bwd: 3337.79 | bwd_inner: 3336.85 | bwd_allreduce: 0.90 | step: 6.97 71%|███████ | 7109/10000 [11:13:43<4:26:08, 5.52s/it] {'loss': 0.0, 'grad_norm': 4.477725815377198e-05, 'learning_rate': 8.144950252428408e-06, 'epoch': 7.11} 71%|███████ | 7109/10000 [11:13:43<4:26:08, 5.52s/it][2025-06-20 00:43:28,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:43:28,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.37 | bwd_microstep: 3334.93 | bwd_inner_microstep: 3334.08 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.98 [2025-06-20 00:43:28,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.37 | bwd: 3334.96 | bwd_inner: 3334.08 | bwd_allreduce: 0.81 | step: 6.98 71%|███████ | 7110/10000 [11:13:49<4:25:42, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.006666323635727167, 'learning_rate': 8.139733986442945e-06, 'epoch': 7.11} 71%|███████ | 7110/10000 [11:13:49<4:25:42, 5.52s/it][2025-06-20 00:43:33,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 00:43:33,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.65 | bwd_microstep: 3322.79 | bwd_inner_microstep: 3321.69 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.61 [2025-06-20 00:43:33,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.65 | bwd: 3322.81 | bwd_inner: 3321.69 | bwd_allreduce: 1.05 | step: 7.62 71%|███████ | 7111/10000 [11:13:54<4:25:02, 5.50s/it] {'loss': 0.0, 'grad_norm': 7.597733929287642e-05, 'learning_rate': 8.134518964544336e-06, 'epoch': 7.11} 71%|███████ | 7111/10000 [11:13:54<4:25:02, 5.50s/it][2025-06-20 00:43:39,329] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:43:39,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.98 | bwd_microstep: 3327.49 | bwd_inner_microstep: 3326.69 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-20 00:43:39,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.98 | bwd: 3327.50 | bwd_inner: 3326.69 | bwd_allreduce: 0.77 | step: 6.99 71%|███████ | 7112/10000 [11:14:00<4:25:01, 5.51s/it] {'loss': 0.0035, 'grad_norm': 0.5208913087844849, 'learning_rate': 8.129305187279617e-06, 'epoch': 7.11} 71%|███████ | 7112/10000 [11:14:00<4:25:01, 5.51s/it][2025-06-20 00:43:44,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-20 00:43:44,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.74 | bwd_microstep: 3321.38 | bwd_inner_microstep: 3320.20 | bwd_allreduce_microstep: 1.08 | step_microstep: 8.39 [2025-06-20 00:43:44,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.74 | bwd: 3321.41 | bwd_inner: 3320.20 | bwd_allreduce: 1.13 | step: 8.40 71%|███████ | 7113/10000 [11:14:05<4:24:31, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.11430727690458298, 'learning_rate': 8.12409265519569e-06, 'epoch': 7.11} 71%|███████ | 7113/10000 [11:14:05<4:24:31, 5.50s/it][2025-06-20 00:43:50,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:43:50,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.64 | bwd_microstep: 3327.44 | bwd_inner_microstep: 3326.34 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.37 [2025-06-20 00:43:50,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.64 | bwd: 3327.46 | bwd_inner: 3326.34 | bwd_allreduce: 1.06 | step: 7.38 
71%|███████ | 7114/10000 [11:14:11<4:24:37, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.038099199533462524, 'learning_rate': 8.118881368839327e-06, 'epoch': 7.11} 71%|███████ | 7114/10000 [11:14:11<4:24:37, 5.50s/it][2025-06-20 00:43:55,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:43:55,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.05 | bwd_microstep: 3371.99 | bwd_inner_microstep: 3371.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-20 00:43:55,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.05 | bwd: 3372.01 | bwd_inner: 3371.20 | bwd_allreduce: 0.76 | step: 6.55 71%|███████ | 7115/10000 [11:14:16<4:25:16, 5.52s/it] {'loss': 0.0005, 'grad_norm': 0.2631571888923645, 'learning_rate': 8.11367132875717e-06, 'epoch': 7.12} 71%|███████ | 7115/10000 [11:14:16<4:25:16, 5.52s/it][2025-06-20 00:44:01,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:44:01,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.54 | bwd_microstep: 3379.31 | bwd_inner_microstep: 3378.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-20 00:44:01,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.55 | bwd: 3379.32 | bwd_inner: 3378.52 | bwd_allreduce: 0.75 | step: 6.55 71%|███████ | 7116/10000 [11:14:22<4:25:44, 5.53s/it] {'loss': 0.0002, 'grad_norm': 0.047167662531137466, 'learning_rate': 8.108462535495723e-06, 'epoch': 7.12} 71%|███████ | 7116/10000 [11:14:22<4:25:44, 5.53s/it][2025-06-20 00:44:06,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:44:06,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.60 | bwd_microstep: 3318.20 | 
bwd_inner_microstep: 3317.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 00:44:06,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.60 | bwd: 3318.21 | bwd_inner: 3317.40 | bwd_allreduce: 0.77 | step: 6.78 71%|███████ | 7117/10000 [11:14:27<4:24:42, 5.51s/it] {'loss': 0.0004, 'grad_norm': 0.11004730314016342, 'learning_rate': 8.10325498960136e-06, 'epoch': 7.12} 71%|███████ | 7117/10000 [11:14:27<4:24:42, 5.51s/it][2025-06-20 00:44:12,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-20 00:44:12,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.84 | bwd_microstep: 3375.84 | bwd_inner_microstep: 3374.69 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.93 [2025-06-20 00:44:12,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.84 | bwd: 3375.87 | bwd_inner: 3374.69 | bwd_allreduce: 1.12 | step: 7.93 71%|███████ | 7118/10000 [11:14:33<4:25:17, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.007117925677448511, 'learning_rate': 8.098048691620337e-06, 'epoch': 7.12} 71%|███████ | 7118/10000 [11:14:33<4:25:17, 5.52s/it][2025-06-20 00:44:18,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:44:18,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.96 | bwd_microstep: 3384.29 | bwd_inner_microstep: 3383.12 | bwd_allreduce_microstep: 1.12 | step_microstep: 8.09 [2025-06-20 00:44:18,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.96 | bwd: 3384.31 | bwd_inner: 3383.12 | bwd_allreduce: 1.14 | step: 8.11 71%|███████ | 7119/10000 [11:14:38<4:25:50, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.00019208712910767645, 'learning_rate': 8.09284364209877e-06, 'epoch': 7.12} 71%|███████ | 7119/10000 [11:14:38<4:25:50, 5.54s/it][2025-06-20 00:44:23,594] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:44:23,594] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.17 | bwd_microstep: 3388.18 | bwd_inner_microstep: 3387.32 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.99 [2025-06-20 00:44:23,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.17 | bwd: 3388.20 | bwd_inner: 3387.32 | bwd_allreduce: 0.83 | step: 7.00 71%|███████ | 7120/10000 [11:14:44<4:26:24, 5.55s/it] {'loss': 0.0001, 'grad_norm': 0.015322760678827763, 'learning_rate': 8.087639841582645e-06, 'epoch': 7.12} 71%|███████ | 7120/10000 [11:14:44<4:26:24, 5.55s/it][2025-06-20 00:44:29,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:44:29,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.47 | bwd_microstep: 3388.68 | bwd_inner_microstep: 3387.76 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.02 [2025-06-20 00:44:29,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.47 | bwd: 3388.70 | bwd_inner: 3387.76 | bwd_allreduce: 0.89 | step: 7.02 71%|███████ | 7121/10000 [11:14:49<4:26:39, 5.56s/it] {'loss': 0.0, 'grad_norm': 0.0004592002078425139, 'learning_rate': 8.08243729061781e-06, 'epoch': 7.12} 71%|███████ | 7121/10000 [11:14:49<4:26:39, 5.56s/it][2025-06-20 00:44:34,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:44:34,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.09 | bwd_microstep: 3327.67 | bwd_inner_microstep: 3326.71 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.14 [2025-06-20 00:44:34,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.09 | bwd: 3327.68 | bwd_inner: 3326.71 | bwd_allreduce: 0.93 | step: 7.14 
71%|███████ | 7122/10000 [11:14:55<4:25:26, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.000980463926680386, 'learning_rate': 8.077235989749988e-06, 'epoch': 7.12} 71%|███████ | 7122/10000 [11:14:55<4:25:26, 5.53s/it][2025-06-20 00:44:40,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:44:40,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.31 | bwd_microstep: 3323.49 | bwd_inner_microstep: 3322.52 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.14 [2025-06-20 00:44:40,119] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.31 | bwd: 3323.50 | bwd_inner: 3322.52 | bwd_allreduce: 0.94 | step: 7.14 71%|███████ | 7123/10000 [11:15:00<4:24:25, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.015957210212945938, 'learning_rate': 8.072035939524778e-06, 'epoch': 7.12} 71%|███████ | 7123/10000 [11:15:00<4:24:25, 5.51s/it][2025-06-20 00:44:45,680] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 00:44:45,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.61 | bwd_microstep: 3381.13 | bwd_inner_microstep: 3380.15 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.23 [2025-06-20 00:44:45,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.61 | bwd: 3381.15 | bwd_inner: 3380.15 | bwd_allreduce: 0.95 | step: 7.23 71%|███████ | 7124/10000 [11:15:06<4:25:00, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0010784621117636561, 'learning_rate': 8.066837140487638e-06, 'epoch': 7.12} 71%|███████ | 7124/10000 [11:15:06<4:25:00, 5.53s/it][2025-06-20 00:44:51,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:44:51,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.98 | bwd_microstep: 3373.78 | 
bwd_inner_microstep: 3372.53 | bwd_allreduce_microstep: 1.20 | step_microstep: 8.42 [2025-06-20 00:44:51,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.98 | bwd: 3373.80 | bwd_inner: 3372.53 | bwd_allreduce: 1.22 | step: 8.42 71%|███████▏ | 7125/10000 [11:15:12<4:25:20, 5.54s/it] {'loss': 0.0005, 'grad_norm': 0.101103775203228, 'learning_rate': 8.0616395931839e-06, 'epoch': 7.12} 71%|███████▏ | 7125/10000 [11:15:12<4:25:20, 5.54s/it][2025-06-20 00:44:56,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 00:44:56,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.65 | bwd_microstep: 3371.33 | bwd_inner_microstep: 3370.38 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.24 [2025-06-20 00:44:56,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.65 | bwd: 3371.35 | bwd_inner: 3370.38 | bwd_allreduce: 0.93 | step: 7.24 71%|███████▏ | 7126/10000 [11:15:17<4:25:28, 5.54s/it] {'loss': 0.0001, 'grad_norm': 0.011778670363128185, 'learning_rate': 8.056443298158757e-06, 'epoch': 7.13} 71%|███████▏ | 7126/10000 [11:15:17<4:25:28, 5.54s/it][2025-06-20 00:45:02,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:45:02,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.33 | bwd_microstep: 3321.84 | bwd_inner_microstep: 3320.79 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.99 [2025-06-20 00:45:02,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.33 | bwd: 3321.86 | bwd_inner: 3320.79 | bwd_allreduce: 1.01 | step: 8.00 71%|███████▏ | 7127/10000 [11:15:23<4:24:33, 5.53s/it] {'loss': 0.0007, 'grad_norm': 0.26175716519355774, 'learning_rate': 8.051248255957285e-06, 'epoch': 7.13} 71%|███████▏ | 7127/10000 [11:15:23<4:24:33, 5.53s/it][2025-06-20 00:45:07,766] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:45:07,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.97 | bwd_microstep: 3323.25 | bwd_inner_microstep: 3322.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 00:45:07,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.97 | bwd: 3323.27 | bwd_inner: 3322.46 | bwd_allreduce: 0.77 | step: 6.80 71%|███████▏ | 7128/10000 [11:15:28<4:23:57, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.001720871776342392, 'learning_rate': 8.046054467124413e-06, 'epoch': 7.13} 71%|███████▏ | 7128/10000 [11:15:28<4:23:57, 5.51s/it][2025-06-20 00:45:13,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-20 00:45:13,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.37 | bwd_microstep: 3368.60 | bwd_inner_microstep: 3367.35 | bwd_allreduce_microstep: 1.18 | step_microstep: 8.06 [2025-06-20 00:45:13,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.37 | bwd: 3368.62 | bwd_inner: 3367.35 | bwd_allreduce: 1.21 | step: 8.07 71%|███████▏ | 7129/10000 [11:15:34<4:24:24, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.006275811232626438, 'learning_rate': 8.040861932204947e-06, 'epoch': 7.13} 71%|███████▏ | 7129/10000 [11:15:34<4:24:24, 5.53s/it][2025-06-20 00:45:18,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 00:45:18,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.76 | bwd_microstep: 3323.69 | bwd_inner_microstep: 3322.80 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.21 [2025-06-20 00:45:18,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.76 | bwd: 3323.71 | bwd_inner: 3322.80 | bwd_allreduce: 0.86 | step: 8.22 
71%|███████▏ | 7130/10000 [11:15:39<4:23:53, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.001040659612044692, 'learning_rate': 8.035670651743561e-06, 'epoch': 7.13} 71%|███████▏ | 7130/10000 [11:15:39<4:23:53, 5.52s/it][2025-06-20 00:45:24,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:45:24,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.60 | bwd_microstep: 3325.19 | bwd_inner_microstep: 3324.39 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-20 00:45:24,298] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.60 | bwd: 3325.21 | bwd_inner: 3324.39 | bwd_allreduce: 0.77 | step: 6.89 71%|███████▏ | 7131/10000 [11:15:45<4:23:18, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00036119503783993423, 'learning_rate': 8.030480626284802e-06, 'epoch': 7.13} 71%|███████▏ | 7131/10000 [11:15:45<4:23:18, 5.51s/it][2025-06-20 00:45:29,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:45:29,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.44 | bwd_microstep: 3373.17 | bwd_inner_microstep: 3372.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-20 00:45:29,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.44 | bwd: 3373.18 | bwd_inner: 3372.36 | bwd_allreduce: 0.77 | step: 7.13 71%|███████▏ | 7132/10000 [11:15:50<4:23:47, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0010418740566819906, 'learning_rate': 8.025291856373066e-06, 'epoch': 7.13} 71%|███████▏ | 7132/10000 [11:15:50<4:23:47, 5.52s/it][2025-06-20 00:45:35,394] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:45:35,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.89 | bwd_microstep: 3375.26 | 
bwd_inner_microstep: 3374.39 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.94 [2025-06-20 00:45:35,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.89 | bwd: 3375.28 | bwd_inner: 3374.39 | bwd_allreduce: 0.84 | step: 6.96 71%|███████▏ | 7133/10000 [11:15:56<4:24:08, 5.53s/it] {'loss': 0.0005, 'grad_norm': 0.1773696392774582, 'learning_rate': 8.020104342552639e-06, 'epoch': 7.13} 71%|███████▏ | 7133/10000 [11:15:56<4:24:08, 5.53s/it][2025-06-20 00:45:40,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:45:40,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.76 | bwd_microstep: 3376.59 | bwd_inner_microstep: 3375.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-20 00:45:40,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.76 | bwd: 3376.60 | bwd_inner: 3375.78 | bwd_allreduce: 0.78 | step: 7.19 71%|███████▏ | 7134/10000 [11:16:01<4:24:19, 5.53s/it] {'loss': 0.0246, 'grad_norm': 5.366697311401367, 'learning_rate': 8.014918085367664e-06, 'epoch': 7.13} 71%|███████▏ | 7134/10000 [11:16:01<4:24:19, 5.53s/it][2025-06-20 00:45:46,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:45:46,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.79 | bwd_microstep: 3323.13 | bwd_inner_microstep: 3322.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-20 00:45:46,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.79 | bwd: 3323.15 | bwd_inner: 3322.34 | bwd_allreduce: 0.76 | step: 6.84 71%|███████▏ | 7135/10000 [11:16:07<4:23:15, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.18543961644172668, 'learning_rate': 8.009733085362158e-06, 'epoch': 7.13} 71%|███████▏ | 7135/10000 [11:16:07<4:23:15, 5.51s/it][2025-06-20 00:45:51,880] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:45:51,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.40 | bwd_microstep: 3329.68 | bwd_inner_microstep: 3328.85 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.82 [2025-06-20 00:45:51,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.40 | bwd: 3329.70 | bwd_inner: 3328.85 | bwd_allreduce: 0.80 | step: 6.82 71%|███████▏ | 7136/10000 [11:16:12<4:22:35, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.08579923957586288, 'learning_rate': 8.004549343080013e-06, 'epoch': 7.14} 71%|███████▏ | 7136/10000 [11:16:12<4:22:35, 5.50s/it][2025-06-20 00:45:57,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:45:57,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.63 | bwd_microstep: 3321.08 | bwd_inner_microstep: 3320.17 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.59 [2025-06-20 00:45:57,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.63 | bwd: 3321.09 | bwd_inner: 3320.17 | bwd_allreduce: 0.87 | step: 7.60 71%|███████▏ | 7137/10000 [11:16:18<4:22:04, 5.49s/it] {'loss': 0.0, 'grad_norm': 8.555116801289842e-05, 'learning_rate': 7.999366859064959e-06, 'epoch': 7.14} 71%|███████▏ | 7137/10000 [11:16:18<4:22:04, 5.49s/it][2025-06-20 00:46:02,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:46:02,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.85 | bwd_microstep: 3373.16 | bwd_inner_microstep: 3372.32 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.08 [2025-06-20 00:46:02,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.85 | bwd: 3373.18 | bwd_inner: 3372.32 | bwd_allreduce: 0.81 | step: 
7.08 71%|███████▏ | 7138/10000 [11:16:23<4:22:38, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.000168070851941593, 'learning_rate': 7.994185633860626e-06, 'epoch': 7.14} 71%|███████▏ | 7138/10000 [11:16:23<4:22:38, 5.51s/it][2025-06-20 00:46:08,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:46:08,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.23 | bwd_microstep: 3314.67 | bwd_inner_microstep: 3313.88 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 00:46:08,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.23 | bwd: 3314.69 | bwd_inner: 3313.88 | bwd_allreduce: 0.76 | step: 6.79 71%|███████▏ | 7139/10000 [11:16:29<4:21:52, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005327869206666946, 'learning_rate': 7.989005668010497e-06, 'epoch': 7.14} 71%|███████▏ | 7139/10000 [11:16:29<4:21:52, 5.49s/it][2025-06-20 00:46:13,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 00:46:13,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.38 | bwd_microstep: 3317.41 | bwd_inner_microstep: 3316.41 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.15 [2025-06-20 00:46:13,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.38 | bwd: 3317.43 | bwd_inner: 3316.41 | bwd_allreduce: 0.96 | step: 7.16 71%|███████▏ | 7140/10000 [11:16:34<4:21:25, 5.48s/it] {'loss': 0.0119, 'grad_norm': 2.4673519134521484, 'learning_rate': 7.98382696205793e-06, 'epoch': 7.14} 71%|███████▏ | 7140/10000 [11:16:34<4:21:25, 5.48s/it][2025-06-20 00:46:19,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.90 [2025-06-20 00:46:19,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.58 | bwd_microstep: 3324.42 
| bwd_inner_microstep: 3323.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-20 00:46:19,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.58 | bwd: 3324.44 | bwd_inner: 3323.63 | bwd_allreduce: 0.76 | step: 7.06 71%|███████▏ | 7141/10000 [11:16:40<4:21:11, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.013702874071896076, 'learning_rate': 7.978649516546147e-06, 'epoch': 7.14} 71%|███████▏ | 7141/10000 [11:16:40<4:21:11, 5.48s/it][2025-06-20 00:46:24,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:46:24,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.10 | bwd_microstep: 3377.83 | bwd_inner_microstep: 3377.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.87 [2025-06-20 00:46:24,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.10 | bwd: 3377.84 | bwd_inner: 3377.04 | bwd_allreduce: 0.76 | step: 6.88 71%|███████▏ | 7142/10000 [11:16:45<4:22:15, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.01696654036641121, 'learning_rate': 7.973473332018235e-06, 'epoch': 7.14} 71%|███████▏ | 7142/10000 [11:16:45<4:22:15, 5.51s/it][2025-06-20 00:46:30,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:46:30,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.49 | bwd_microstep: 3374.97 | bwd_inner_microstep: 3374.13 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.10 [2025-06-20 00:46:30,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.49 | bwd: 3374.99 | bwd_inner: 3374.13 | bwd_allreduce: 0.80 | step: 7.10 71%|███████▏ | 7143/10000 [11:16:51<4:22:39, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0029474294278770685, 'learning_rate': 7.968298409017156e-06, 'epoch': 7.14} 71%|███████▏ | 7143/10000 [11:16:51<4:22:39, 5.52s/it][2025-06-20 00:46:35,853] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:46:35,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.51 | bwd_microstep: 3320.02 | bwd_inner_microstep: 3319.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-20 00:46:35,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.51 | bwd: 3320.03 | bwd_inner: 3319.23 | bwd_allreduce: 0.76 | step: 6.95 71%|███████▏ | 7144/10000 [11:16:56<4:21:46, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.04580630362033844, 'learning_rate': 7.963124748085734e-06, 'epoch': 7.14} 71%|███████▏ | 7144/10000 [11:16:56<4:21:46, 5.50s/it][2025-06-20 00:46:41,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:46:41,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.64 | bwd_microstep: 3324.18 | bwd_inner_microstep: 3323.36 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.15 [2025-06-20 00:46:41,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.64 | bwd: 3324.19 | bwd_inner: 3323.36 | bwd_allreduce: 0.78 | step: 7.16 71%|███████▏ | 7145/10000 [11:17:02<4:21:12, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.012798304669559002, 'learning_rate': 7.957952349766658e-06, 'epoch': 7.14} 71%|███████▏ | 7145/10000 [11:17:02<4:21:12, 5.49s/it][2025-06-20 00:46:46,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:46:46,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.30 | bwd_microstep: 3369.46 | bwd_inner_microstep: 3368.64 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.33 [2025-06-20 00:46:46,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.30 | bwd: 3369.48 | bwd_inner: 3368.64 | bwd_allreduce: 0.79 | step: 
7.33 71%|███████▏ | 7146/10000 [11:17:07<4:21:59, 5.51s/it] {'loss': 0.0007, 'grad_norm': 0.1766922026872635, 'learning_rate': 7.952781214602498e-06, 'epoch': 7.15} 71%|███████▏ | 7146/10000 [11:17:07<4:21:59, 5.51s/it][2025-06-20 00:46:52,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:46:52,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.33 | bwd_microstep: 3325.56 | bwd_inner_microstep: 3324.76 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-20 00:46:52,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.33 | bwd: 3325.57 | bwd_inner: 3324.75 | bwd_allreduce: 0.77 | step: 6.96 71%|███████▏ | 7147/10000 [11:17:13<4:21:20, 5.50s/it] {'loss': 0.0032, 'grad_norm': 1.5094462633132935, 'learning_rate': 7.947611343135671e-06, 'epoch': 7.15} 71%|███████▏ | 7147/10000 [11:17:13<4:21:20, 5.50s/it][2025-06-20 00:46:57,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:46:57,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.59 | bwd_microstep: 3366.59 | bwd_inner_microstep: 3365.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 00:46:57,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.59 | bwd: 3366.60 | bwd_inner: 3365.79 | bwd_allreduce: 0.76 | step: 6.76 71%|███████▏ | 7148/10000 [11:17:18<4:21:44, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0004894590238109231, 'learning_rate': 7.942442735908478e-06, 'epoch': 7.15} 71%|███████▏ | 7148/10000 [11:17:18<4:21:44, 5.51s/it][2025-06-20 00:47:03,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:47:03,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.91 | bwd_microstep: 
3366.72 | bwd_inner_microstep: 3365.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-20 00:47:03,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.91 | bwd: 3366.73 | bwd_inner: 3365.91 | bwd_allreduce: 0.78 | step: 7.00 71%|███████▏ | 7149/10000 [11:17:24<4:22:00, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.006195294205099344, 'learning_rate': 7.93727539346308e-06, 'epoch': 7.15} 71%|███████▏ | 7149/10000 [11:17:24<4:22:00, 5.51s/it][2025-06-20 00:47:08,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:47:08,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.99 | bwd_microstep: 3323.31 | bwd_inner_microstep: 3322.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 00:47:08,869] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.99 | bwd: 3323.32 | bwd_inner: 3322.53 | bwd_allreduce: 0.76 | step: 6.68 72%|███████▏ | 7150/10000 [11:17:29<4:21:14, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.010182134807109833, 'learning_rate': 7.932109316341508e-06, 'epoch': 7.15} 72%|███████▏ | 7150/10000 [11:17:29<4:21:14, 5.50s/it][2025-06-20 00:47:14,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:47:14,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.82 | bwd_microstep: 3315.10 | bwd_inner_microstep: 3314.27 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.83 [2025-06-20 00:47:14,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.82 | bwd: 3315.11 | bwd_inner: 3314.27 | bwd_allreduce: 0.79 | step: 6.84 72%|███████▏ | 7151/10000 [11:17:35<4:20:37, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0034139419440180063, 'learning_rate': 7.926944505085666e-06, 'epoch': 7.15} 72%|███████▏ | 7151/10000 [11:17:35<4:20:37, 5.49s/it][2025-06-20 00:47:19,805] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.77 [2025-06-20 00:47:19,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.90 | bwd_microstep: 3322.26 | bwd_inner_microstep: 3321.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 00:47:19,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.90 | bwd: 3322.28 | bwd_inner: 3321.46 | bwd_allreduce: 0.77 | step: 6.82 72%|███████▏ | 7152/10000 [11:17:40<4:20:19, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.013505510054528713, 'learning_rate': 7.921780960237306e-06, 'epoch': 7.15} 72%|███████▏ | 7152/10000 [11:17:40<4:20:19, 5.48s/it][2025-06-20 00:47:25,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:47:25,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.37 | bwd_microstep: 3360.47 | bwd_inner_microstep: 3359.55 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.35 [2025-06-20 00:47:25,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.38 | bwd: 3360.49 | bwd_inner: 3359.55 | bwd_allreduce: 0.89 | step: 7.36 72%|███████▏ | 7153/10000 [11:17:46<4:20:54, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.018450045958161354, 'learning_rate': 7.916618682338069e-06, 'epoch': 7.15} 72%|███████▏ | 7153/10000 [11:17:46<4:20:54, 5.50s/it][2025-06-20 00:47:30,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:47:30,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.07 | bwd_microstep: 3362.79 | bwd_inner_microstep: 3362.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 00:47:30,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.07 | bwd: 3362.80 | bwd_inner: 3362.00 | bwd_allreduce: 0.76 | 
step: 6.77 72%|███████▏ | 7154/10000 [11:17:51<4:21:21, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.03172501549124718, 'learning_rate': 7.911457671929447e-06, 'epoch': 7.15} 72%|███████▏ | 7154/10000 [11:17:51<4:21:21, 5.51s/it][2025-06-20 00:47:36,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:47:36,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.96 | bwd_microstep: 3316.40 | bwd_inner_microstep: 3315.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.42 [2025-06-20 00:47:36,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.96 | bwd: 3316.42 | bwd_inner: 3315.60 | bwd_allreduce: 0.78 | step: 7.42 72%|███████▏ | 7155/10000 [11:17:57<4:20:39, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.016876067966222763, 'learning_rate': 7.906297929552816e-06, 'epoch': 7.16} 72%|███████▏ | 7155/10000 [11:17:57<4:20:39, 5.50s/it][2025-06-20 00:47:41,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 00:47:41,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.94 | bwd_microstep: 3320.34 | bwd_inner_microstep: 3319.44 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.03 [2025-06-20 00:47:41,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.94 | bwd: 3320.36 | bwd_inner: 3319.44 | bwd_allreduce: 0.87 | step: 7.03 72%|███████▏ | 7156/10000 [11:18:02<4:20:06, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0009212601580657065, 'learning_rate': 7.901139455749407e-06, 'epoch': 7.16} 72%|███████▏ | 7156/10000 [11:18:02<4:20:06, 5.49s/it][2025-06-20 00:47:47,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:47:47,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.74 | 
bwd_microstep: 3314.68 | bwd_inner_microstep: 3313.79 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.55 [2025-06-20 00:47:47,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.74 | bwd: 3314.70 | bwd_inner: 3313.79 | bwd_allreduce: 0.86 | step: 7.56 72%|███████▏ | 7157/10000 [11:18:08<4:19:51, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0107249915599823, 'learning_rate': 7.89598225106031e-06, 'epoch': 7.16} 72%|███████▏ | 7157/10000 [11:18:08<4:19:51, 5.48s/it][2025-06-20 00:47:52,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:47:52,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.80 | bwd_microstep: 3317.76 | bwd_inner_microstep: 3316.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 00:47:52,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.80 | bwd: 3317.77 | bwd_inner: 3316.96 | bwd_allreduce: 0.77 | step: 6.81 72%|███████▏ | 7158/10000 [11:18:13<4:19:29, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001050471211783588, 'learning_rate': 7.890826316026499e-06, 'epoch': 7.16} 72%|███████▏ | 7158/10000 [11:18:13<4:19:29, 5.48s/it][2025-06-20 00:47:58,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:47:58,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.29 | bwd_microstep: 3365.51 | bwd_inner_microstep: 3364.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.09 [2025-06-20 00:47:58,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.29 | bwd: 3365.53 | bwd_inner: 3364.71 | bwd_allreduce: 0.77 | step: 7.10 72%|███████▏ | 7159/10000 [11:18:19<4:20:11, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001805935287848115, 'learning_rate': 7.885671651188807e-06, 'epoch': 7.16} 72%|███████▏ | 7159/10000 [11:18:19<4:20:11, 5.50s/it][2025-06-20 
00:48:03,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-20 00:48:03,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.94 | bwd_microstep: 3361.80 | bwd_inner_microstep: 3360.75 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.48 [2025-06-20 00:48:03,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.94 | bwd: 3361.81 | bwd_inner: 3360.75 | bwd_allreduce: 1.02 | step: 8.49 72%|███████▏ | 7160/10000 [11:18:24<4:20:36, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00941791944205761, 'learning_rate': 7.880518257087935e-06, 'epoch': 7.16} 72%|███████▏ | 7160/10000 [11:18:24<4:20:36, 5.51s/it][2025-06-20 00:48:09,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:48:09,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.51 | bwd_microstep: 3364.20 | bwd_inner_microstep: 3363.37 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.93 [2025-06-20 00:48:09,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.51 | bwd: 3364.22 | bwd_inner: 3363.37 | bwd_allreduce: 0.79 | step: 6.93 72%|███████▏ | 7161/10000 [11:18:30<4:21:05, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.008475775830447674, 'learning_rate': 7.875366134264449e-06, 'epoch': 7.16} 72%|███████▏ | 7161/10000 [11:18:30<4:21:05, 5.52s/it][2025-06-20 00:48:14,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:48:14,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.43 | bwd_microstep: 3309.34 | bwd_inner_microstep: 3308.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.22 [2025-06-20 00:48:14,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.43 | bwd: 3309.36 | bwd_inner: 3308.54 | 
bwd_allreduce: 0.77 | step: 7.22 72%|███████▏ | 7162/10000 [11:18:35<4:20:02, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005292993504554033, 'learning_rate': 7.870215283258782e-06, 'epoch': 7.16} 72%|███████▏ | 7162/10000 [11:18:35<4:20:02, 5.50s/it][2025-06-20 00:48:20,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:48:20,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.74 | bwd_microstep: 3400.47 | bwd_inner_microstep: 3399.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-20 00:48:20,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.74 | bwd: 3400.48 | bwd_inner: 3399.67 | bwd_allreduce: 0.77 | step: 6.84 72%|███████▏ | 7163/10000 [11:18:41<4:20:58, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.02164449356496334, 'learning_rate': 7.865065704611236e-06, 'epoch': 7.16} 72%|███████▏ | 7163/10000 [11:18:41<4:20:58, 5.52s/it][2025-06-20 00:48:25,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:48:25,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.72 | bwd_microstep: 3323.12 | bwd_inner_microstep: 3322.28 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.30 [2025-06-20 00:48:25,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.72 | bwd: 3323.14 | bwd_inner: 3322.28 | bwd_allreduce: 0.81 | step: 7.30 72%|███████▏ | 7164/10000 [11:18:46<4:20:06, 5.50s/it] {'loss': 0.0021, 'grad_norm': 0.28627297282218933, 'learning_rate': 7.859917398861981e-06, 'epoch': 7.16} 72%|███████▏ | 7164/10000 [11:18:46<4:20:06, 5.50s/it][2025-06-20 00:48:31,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:48:31,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2103.88 | bwd_microstep: 3310.57 | bwd_inner_microstep: 3309.67 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.08 [2025-06-20 00:48:31,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.88 | bwd: 3310.59 | bwd_inner: 3309.67 | bwd_allreduce: 0.88 | step: 7.09 72%|███████▏ | 7165/10000 [11:18:52<4:19:21, 5.49s/it] {'loss': 0.0119, 'grad_norm': 2.413389205932617, 'learning_rate': 7.854770366551044e-06, 'epoch': 7.17} 72%|███████▏ | 7165/10000 [11:18:52<4:19:21, 5.49s/it][2025-06-20 00:48:36,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:48:36,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.23 | bwd_microstep: 3376.45 | bwd_inner_microstep: 3375.64 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-20 00:48:36,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.23 | bwd: 3376.46 | bwd_inner: 3375.64 | bwd_allreduce: 0.78 | step: 7.08 72%|███████▏ | 7166/10000 [11:18:57<4:20:06, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.019744111225008965, 'learning_rate': 7.849624608218334e-06, 'epoch': 7.17} 72%|███████▏ | 7166/10000 [11:18:57<4:20:06, 5.51s/it][2025-06-20 00:48:42,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:48:42,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.63 | bwd_microstep: 3361.96 | bwd_inner_microstep: 3360.95 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.11 [2025-06-20 00:48:42,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.63 | bwd: 3361.99 | bwd_inner: 3360.95 | bwd_allreduce: 0.98 | step: 7.11 72%|███████▏ | 7167/10000 [11:19:03<4:20:23, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.024534497410058975, 'learning_rate': 7.844480124403606e-06, 'epoch': 7.17} 72%|███████▏ | 7167/10000 [11:19:03<4:20:23, 
5.51s/it][2025-06-20 00:48:47,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:48:47,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.79 | bwd_microstep: 3306.24 | bwd_inner_microstep: 3305.38 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.52 [2025-06-20 00:48:47,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.79 | bwd: 3306.27 | bwd_inner: 3305.38 | bwd_allreduce: 0.83 | step: 7.53 72%|███████▏ | 7168/10000 [11:19:08<4:19:27, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.003763757413253188, 'learning_rate': 7.839336915646495e-06, 'epoch': 7.17} 72%|███████▏ | 7168/10000 [11:19:08<4:19:27, 5.50s/it][2025-06-20 00:48:53,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:48:53,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.18 | bwd_microstep: 3302.65 | bwd_inner_microstep: 3301.81 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.93 [2025-06-20 00:48:53,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.18 | bwd: 3302.67 | bwd_inner: 3301.81 | bwd_allreduce: 0.80 | step: 6.93 72%|███████▏ | 7169/10000 [11:19:14<4:18:39, 5.48s/it] {'loss': 0.0006, 'grad_norm': 0.07351859658956528, 'learning_rate': 7.834194982486504e-06, 'epoch': 7.17} 72%|███████▏ | 7169/10000 [11:19:14<4:18:39, 5.48s/it][2025-06-20 00:48:58,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:48:58,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.47 | bwd_microstep: 3361.46 | bwd_inner_microstep: 3360.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-20 00:48:58,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.47 | bwd: 3361.48 | bwd_inner: 
3360.65 | bwd_allreduce: 0.78 | step: 7.24 72%|███████▏ | 7170/10000 [11:19:19<4:19:10, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.010911783203482628, 'learning_rate': 7.829054325462996e-06, 'epoch': 7.17} 72%|███████▏ | 7170/10000 [11:19:19<4:19:10, 5.49s/it][2025-06-20 00:49:04,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:49:04,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.33 | bwd_microstep: 3309.91 | bwd_inner_microstep: 3309.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 00:49:04,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.33 | bwd: 3309.93 | bwd_inner: 3309.13 | bwd_allreduce: 0.76 | step: 6.73 72%|███████▏ | 7171/10000 [11:19:25<4:18:22, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.012037958018481731, 'learning_rate': 7.82391494511521e-06, 'epoch': 7.17} 72%|███████▏ | 7171/10000 [11:19:25<4:18:22, 5.48s/it][2025-06-20 00:49:09,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:49:09,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.35 | bwd_microstep: 3312.62 | bwd_inner_microstep: 3311.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 00:49:09,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.35 | bwd: 3312.63 | bwd_inner: 3311.84 | bwd_allreduce: 0.75 | step: 6.62 72%|███████▏ | 7172/10000 [11:19:30<4:17:50, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004880075808614492, 'learning_rate': 7.818776841982227e-06, 'epoch': 7.17} 72%|███████▏ | 7172/10000 [11:19:30<4:17:50, 5.47s/it][2025-06-20 00:49:15,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:49:15,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2100.79 | bwd_microstep: 3311.96 | bwd_inner_microstep: 3311.11 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.42 [2025-06-20 00:49:15,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.79 | bwd: 3311.98 | bwd_inner: 3311.11 | bwd_allreduce: 0.82 | step: 7.42 72%|███████▏ | 7173/10000 [11:19:35<4:17:29, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.02979694865643978, 'learning_rate': 7.813640016603018e-06, 'epoch': 7.17} 72%|███████▏ | 7173/10000 [11:19:35<4:17:29, 5.47s/it][2025-06-20 00:49:20,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:49:20,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.82 | bwd_microstep: 3365.07 | bwd_inner_microstep: 3364.26 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.24 [2025-06-20 00:49:20,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.82 | bwd: 3365.09 | bwd_inner: 3364.26 | bwd_allreduce: 0.79 | step: 7.24 72%|███████▏ | 7174/10000 [11:19:41<4:18:29, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.3761793375015259, 'learning_rate': 7.808504469516416e-06, 'epoch': 7.17} 72%|███████▏ | 7174/10000 [11:19:41<4:18:29, 5.49s/it][2025-06-20 00:49:26,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:49:26,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.91 | bwd_microstep: 3312.25 | bwd_inner_microstep: 3311.18 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.33 [2025-06-20 00:49:26,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.91 | bwd: 3312.26 | bwd_inner: 3311.18 | bwd_allreduce: 1.04 | step: 7.34 72%|███████▏ | 7175/10000 [11:19:46<4:17:54, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.005333802197128534, 'learning_rate': 7.803370201261108e-06, 'epoch': 7.17} 72%|███████▏ | 7175/10000 
[11:19:46<4:17:54, 5.48s/it][2025-06-20 00:49:31,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:49:31,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.45 | bwd_microstep: 3311.10 | bwd_inner_microstep: 3310.30 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-20 00:49:31,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.45 | bwd: 3311.11 | bwd_inner: 3310.30 | bwd_allreduce: 0.77 | step: 6.92 72%|███████▏ | 7176/10000 [11:19:52<4:17:35, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00037616799818351865, 'learning_rate': 7.798237212375663e-06, 'epoch': 7.18} 72%|███████▏ | 7176/10000 [11:19:52<4:17:35, 5.47s/it][2025-06-20 00:49:37,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:49:37,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.46 | bwd_microstep: 3314.50 | bwd_inner_microstep: 3313.68 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.32 [2025-06-20 00:49:37,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.46 | bwd: 3314.52 | bwd_inner: 3313.68 | bwd_allreduce: 0.79 | step: 7.32 72%|███████▏ | 7177/10000 [11:19:57<4:17:28, 5.47s/it] {'loss': 0.0099, 'grad_norm': 1.8787339925765991, 'learning_rate': 7.793105503398503e-06, 'epoch': 7.18} 72%|███████▏ | 7177/10000 [11:19:57<4:17:28, 5.47s/it][2025-06-20 00:49:42,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:49:42,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.84 | bwd_microstep: 3315.74 | bwd_inner_microstep: 3314.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 00:49:42,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.84 | bwd: 
3315.75 | bwd_inner: 3314.94 | bwd_allreduce: 0.77 | step: 6.76 72%|███████▏ | 7178/10000 [11:20:03<4:17:09, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0006872628582641482, 'learning_rate': 7.787975074867921e-06, 'epoch': 7.18} 72%|███████▏ | 7178/10000 [11:20:03<4:17:09, 5.47s/it][2025-06-20 00:49:48,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:49:48,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.78 | bwd_microstep: 3372.12 | bwd_inner_microstep: 3371.32 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-20 00:49:48,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.79 | bwd: 3372.14 | bwd_inner: 3371.32 | bwd_allreduce: 0.77 | step: 7.15 72%|███████▏ | 7179/10000 [11:20:08<4:18:09, 5.49s/it] {'loss': 0.0007, 'grad_norm': 0.10853131115436554, 'learning_rate': 7.782845927322078e-06, 'epoch': 7.18} 72%|███████▏ | 7179/10000 [11:20:08<4:18:09, 5.49s/it][2025-06-20 00:49:53,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:49:53,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.52 | bwd_microstep: 3309.29 | bwd_inner_microstep: 3308.48 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-20 00:49:53,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.52 | bwd: 3309.31 | bwd_inner: 3308.48 | bwd_allreduce: 0.78 | step: 7.23 72%|███████▏ | 7180/10000 [11:20:14<4:17:33, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.12803871929645538, 'learning_rate': 7.777718061298996e-06, 'epoch': 7.18} 72%|███████▏ | 7180/10000 [11:20:14<4:17:33, 5.48s/it][2025-06-20 00:49:59,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:49:59,076] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2129.96 | bwd_microstep: 3366.77 | bwd_inner_microstep: 3365.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-20 00:49:59,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.96 | bwd: 3366.79 | bwd_inner: 3365.98 | bwd_allreduce: 0.76 | step: 6.96 72%|███████▏ | 7181/10000 [11:20:19<4:18:16, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0024184195790439844, 'learning_rate': 7.772591477336562e-06, 'epoch': 7.18} 72%|███████▏ | 7181/10000 [11:20:19<4:18:16, 5.50s/it][2025-06-20 00:50:04,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:50:04,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.71 | bwd_microstep: 3315.62 | bwd_inner_microstep: 3314.78 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.90 [2025-06-20 00:50:04,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.71 | bwd: 3315.64 | bwd_inner: 3314.78 | bwd_allreduce: 0.81 | step: 6.90 72%|███████▏ | 7182/10000 [11:20:25<4:17:38, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0009997799061238766, 'learning_rate': 7.76746617597254e-06, 'epoch': 7.18} 72%|███████▏ | 7182/10000 [11:20:25<4:17:38, 5.49s/it][2025-06-20 00:50:10,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:50:10,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.19 | bwd_microstep: 3371.26 | bwd_inner_microstep: 3370.46 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-20 00:50:10,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.19 | bwd: 3371.28 | bwd_inner: 3370.46 | bwd_allreduce: 0.77 | step: 7.13 72%|███████▏ | 7183/10000 [11:20:30<4:18:23, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.031211627647280693, 'learning_rate': 7.762342157744535e-06, 'epoch': 7.18} 72%|███████▏ | 7183/10000 
[11:20:30<4:18:23, 5.50s/it][2025-06-20 00:50:15,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:50:15,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.30 | bwd_microstep: 3317.24 | bwd_inner_microstep: 3316.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-20 00:50:15,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.30 | bwd: 3317.25 | bwd_inner: 3316.45 | bwd_allreduce: 0.76 | step: 6.85 72%|███████▏ | 7184/10000 [11:20:36<4:17:43, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.01829741708934307, 'learning_rate': 7.757219423190044e-06, 'epoch': 7.18} 72%|███████▏ | 7184/10000 [11:20:36<4:17:43, 5.49s/it][2025-06-20 00:50:21,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:50:21,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.71 | bwd_microstep: 3319.05 | bwd_inner_microstep: 3318.06 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.51 [2025-06-20 00:50:21,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.71 | bwd: 3319.07 | bwd_inner: 3318.06 | bwd_allreduce: 0.96 | step: 7.52 72%|███████▏ | 7185/10000 [11:20:41<4:17:16, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.036622416228055954, 'learning_rate': 7.752097972846413e-06, 'epoch': 7.18} 72%|███████▏ | 7185/10000 [11:20:41<4:17:16, 5.48s/it][2025-06-20 00:50:26,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:50:26,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.95 | bwd_microstep: 3393.42 | bwd_inner_microstep: 3392.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-20 00:50:26,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.95 | bwd: 
3393.44 | bwd_inner: 3392.63 | bwd_allreduce: 0.76 | step: 7.02 72%|███████▏ | 7186/10000 [11:20:47<4:18:21, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.02965448796749115, 'learning_rate': 7.74697780725086e-06, 'epoch': 7.19} 72%|███████▏ | 7186/10000 [11:20:47<4:18:21, 5.51s/it][2025-06-20 00:50:32,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:50:32,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.25 | bwd_microstep: 3307.52 | bwd_inner_microstep: 3306.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 00:50:32,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.25 | bwd: 3307.54 | bwd_inner: 3306.74 | bwd_allreduce: 0.75 | step: 6.63 72%|███████▏ | 7187/10000 [11:20:52<4:17:21, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006730962544679642, 'learning_rate': 7.741858926940475e-06, 'epoch': 7.19} 72%|███████▏ | 7187/10000 [11:20:52<4:17:21, 5.49s/it][2025-06-20 00:50:37,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:50:37,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.99 | bwd_microstep: 3307.61 | bwd_inner_microstep: 3306.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-20 00:50:37,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.99 | bwd: 3307.63 | bwd_inner: 3306.82 | bwd_allreduce: 0.77 | step: 6.94 72%|███████▏ | 7188/10000 [11:20:58<4:16:39, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0009069928200915456, 'learning_rate': 7.736741332452189e-06, 'epoch': 7.19} 72%|███████▏ | 7188/10000 [11:20:58<4:16:39, 5.48s/it][2025-06-20 00:50:42,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:50:42,991] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2124.82 | bwd_microstep: 3362.09 | bwd_inner_microstep: 3361.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-20 00:50:42,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.82 | bwd: 3362.10 | bwd_inner: 3361.29 | bwd_allreduce: 0.77 | step: 6.96 72%|███████▏ | 7189/10000 [11:21:03<4:17:14, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002666456624865532, 'learning_rate': 7.731625024322821e-06, 'epoch': 7.19} 72%|███████▏ | 7189/10000 [11:21:03<4:17:14, 5.49s/it][2025-06-20 00:50:48,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:50:48,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.83 | bwd_microstep: 3356.75 | bwd_inner_microstep: 3355.85 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.05 [2025-06-20 00:50:48,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.83 | bwd: 3356.76 | bwd_inner: 3355.85 | bwd_allreduce: 0.87 | step: 7.06 72%|███████▏ | 7190/10000 [11:21:09<4:17:31, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0030681451316922903, 'learning_rate': 7.726510003089052e-06, 'epoch': 7.19} 72%|███████▏ | 7190/10000 [11:21:09<4:17:31, 5.50s/it][2025-06-20 00:50:53,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:50:53,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.10 | bwd_microstep: 3325.01 | bwd_inner_microstep: 3324.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-20 00:50:53,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.10 | bwd: 3325.03 | bwd_inner: 3324.22 | bwd_allreduce: 0.77 | step: 7.06 72%|███████▏ | 7191/10000 [11:21:14<4:17:04, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.011960262432694435, 'learning_rate': 7.721396269287418e-06, 'epoch': 7.19} 72%|███████▏ | 7191/10000 
[11:21:14<4:17:04, 5.49s/it][2025-06-20 00:50:59,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:50:59,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.05 | bwd_microstep: 3311.92 | bwd_inner_microstep: 3311.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 00:50:59,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.05 | bwd: 3311.94 | bwd_inner: 3311.13 | bwd_allreduce: 0.76 | step: 6.82 72%|███████▏ | 7192/10000 [11:21:20<4:16:28, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.011627400293946266, 'learning_rate': 7.716283823454336e-06, 'epoch': 7.19} 72%|███████▏ | 7192/10000 [11:21:20<4:16:28, 5.48s/it][2025-06-20 00:51:04,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:51:04,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.92 | bwd_microstep: 3307.36 | bwd_inner_microstep: 3306.51 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.81 [2025-06-20 00:51:04,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.92 | bwd: 3307.37 | bwd_inner: 3306.51 | bwd_allreduce: 0.80 | step: 6.81 72%|███████▏ | 7193/10000 [11:21:25<4:16:02, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0003533967537805438, 'learning_rate': 7.711172666126063e-06, 'epoch': 7.19} 72%|███████▏ | 7193/10000 [11:21:25<4:16:02, 5.47s/it][2025-06-20 00:51:10,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:51:10,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.36 | bwd_microstep: 3313.09 | bwd_inner_microstep: 3312.12 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.93 [2025-06-20 00:51:10,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.36 | bwd: 
3313.10 | bwd_inner: 3312.12 | bwd_allreduce: 0.94 | step: 6.94 72%|███████▏ | 7194/10000 [11:21:31<4:15:44, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.009016177617013454, 'learning_rate': 7.706062797838744e-06, 'epoch': 7.19} 72%|███████▏ | 7194/10000 [11:21:31<4:15:44, 5.47s/it][2025-06-20 00:51:15,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:51:15,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.40 | bwd_microstep: 3317.55 | bwd_inner_microstep: 3316.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-20 00:51:15,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.40 | bwd: 3317.56 | bwd_inner: 3316.76 | bwd_allreduce: 0.76 | step: 6.95 72%|███████▏ | 7195/10000 [11:21:36<4:15:32, 5.47s/it] {'loss': 0.0119, 'grad_norm': 1.2125126123428345, 'learning_rate': 7.700954219128381e-06, 'epoch': 7.2} 72%|███████▏ | 7195/10000 [11:21:36<4:15:32, 5.47s/it][2025-06-20 00:51:21,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:51:21,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.60 | bwd_microstep: 3306.66 | bwd_inner_microstep: 3305.82 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.06 [2025-06-20 00:51:21,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.60 | bwd: 3306.68 | bwd_inner: 3305.82 | bwd_allreduce: 0.81 | step: 7.07 72%|███████▏ | 7196/10000 [11:21:42<4:15:18, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0015701752854511142, 'learning_rate': 7.695846930530835e-06, 'epoch': 7.2} 72%|███████▏ | 7196/10000 [11:21:42<4:15:18, 5.46s/it][2025-06-20 00:51:26,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:51:26,789] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2118.87 | bwd_microstep: 3360.14 | bwd_inner_microstep: 3359.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.56 [2025-06-20 00:51:26,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.87 | bwd: 3360.15 | bwd_inner: 3359.35 | bwd_allreduce: 0.76 | step: 6.56 72%|███████▏ | 7197/10000 [11:21:47<4:16:01, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.005991185549646616, 'learning_rate': 7.690740932581844e-06, 'epoch': 7.2} 72%|███████▏ | 7197/10000 [11:21:47<4:16:01, 5.48s/it][2025-06-20 00:51:32,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:51:32,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.84 | bwd_microstep: 3310.38 | bwd_inner_microstep: 3309.45 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.73 [2025-06-20 00:51:32,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.84 | bwd: 3310.40 | bwd_inner: 3309.45 | bwd_allreduce: 0.89 | step: 6.73 72%|███████▏ | 7198/10000 [11:21:53<4:15:36, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0002017429651459679, 'learning_rate': 7.685636225816996e-06, 'epoch': 7.2} 72%|███████▏ | 7198/10000 [11:21:53<4:15:36, 5.47s/it][2025-06-20 00:51:37,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:51:37,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.44 | bwd_microstep: 3326.37 | bwd_inner_microstep: 3325.59 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 00:51:37,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.44 | bwd: 3326.39 | bwd_inner: 3325.59 | bwd_allreduce: 0.75 | step: 6.59 72%|███████▏ | 7199/10000 [11:21:58<4:15:29, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.09585736691951752, 'learning_rate': 7.68053281077176e-06, 'epoch': 7.2} 72%|███████▏ | 7199/10000 
[11:21:58<4:15:29, 5.47s/it][2025-06-20 00:51:43,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:51:43,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.75 | bwd_microstep: 3363.62 | bwd_inner_microstep: 3362.78 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.31 [2025-06-20 00:51:43,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.75 | bwd: 3363.64 | bwd_inner: 3362.78 | bwd_allreduce: 0.81 | step: 7.32 72%|███████▏ | 7200/10000 [11:22:04<4:16:18, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0045660496689379215, 'learning_rate': 7.675430687981454e-06, 'epoch': 7.2} 72%|███████▏ | 7200/10000 [11:22:04<4:16:18, 5.49s/it][2025-06-20 00:51:48,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:51:48,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.15 | bwd_microstep: 3319.78 | bwd_inner_microstep: 3318.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-20 00:51:48,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.15 | bwd: 3319.79 | bwd_inner: 3318.99 | bwd_allreduce: 0.76 | step: 6.71 72%|███████▏ | 7201/10000 [11:22:09<4:15:43, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.2503054738044739, 'learning_rate': 7.67032985798127e-06, 'epoch': 7.2} 72%|███████▏ | 7201/10000 [11:22:09<4:15:43, 5.48s/it][2025-06-20 00:51:54,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:51:54,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.86 | bwd_microstep: 3360.20 | bwd_inner_microstep: 3359.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-20 00:51:54,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.86 | bwd: 
3360.22 | bwd_inner: 3359.42 | bwd_allreduce: 0.76 | step: 6.53 72%|███████▏ | 7202/10000 [11:22:15<4:16:14, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.017235124483704567, 'learning_rate': 7.665230321306263e-06, 'epoch': 7.2} 72%|███████▏ | 7202/10000 [11:22:15<4:16:14, 5.49s/it][2025-06-20 00:51:59,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:51:59,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.04 | bwd_microstep: 3322.40 | bwd_inner_microstep: 3321.39 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.19 [2025-06-20 00:51:59,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.04 | bwd: 3322.42 | bwd_inner: 3321.39 | bwd_allreduce: 0.98 | step: 7.19 72%|███████▏ | 7203/10000 [11:22:20<4:15:45, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0012030994985252619, 'learning_rate': 7.660132078491345e-06, 'epoch': 7.2} 72%|███████▏ | 7203/10000 [11:22:20<4:15:45, 5.49s/it][2025-06-20 00:52:05,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:52:05,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.63 | bwd_microstep: 3369.18 | bwd_inner_microstep: 3368.35 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.29 [2025-06-20 00:52:05,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.63 | bwd: 3369.19 | bwd_inner: 3368.35 | bwd_allreduce: 0.79 | step: 7.29 72%|███████▏ | 7204/10000 [11:22:26<4:16:29, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0002449015446472913, 'learning_rate': 7.655035130071297e-06, 'epoch': 7.2} 72%|███████▏ | 7204/10000 [11:22:26<4:16:29, 5.50s/it][2025-06-20 00:52:10,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:52:10,709] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2108.39 | bwd_microstep: 3313.32 | bwd_inner_microstep: 3312.45 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.38 [2025-06-20 00:52:10,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.40 | bwd: 3313.34 | bwd_inner: 3312.45 | bwd_allreduce: 0.85 | step: 7.39 72%|███████▏ | 7205/10000 [11:22:31<4:15:49, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0017738486640155315, 'learning_rate': 7.649939476580771e-06, 'epoch': 7.21} 72%|███████▏ | 7205/10000 [11:22:31<4:15:49, 5.49s/it][2025-06-20 00:52:16,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:52:16,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.58 | bwd_microstep: 3320.44 | bwd_inner_microstep: 3319.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 00:52:16,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.58 | bwd: 3320.46 | bwd_inner: 3319.65 | bwd_allreduce: 0.77 | step: 6.77 72%|███████▏ | 7206/10000 [11:22:36<4:15:31, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00045381352538242936, 'learning_rate': 7.644845118554278e-06, 'epoch': 7.21} 72%|███████▏ | 7206/10000 [11:22:36<4:15:31, 5.49s/it][2025-06-20 00:52:21,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:52:21,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.49 | bwd_microstep: 3319.46 | bwd_inner_microstep: 3318.52 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.60 [2025-06-20 00:52:21,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.49 | bwd: 3319.48 | bwd_inner: 3318.52 | bwd_allreduce: 0.90 | step: 7.60 72%|███████▏ | 7207/10000 [11:22:42<4:15:08, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.09284531325101852, 'learning_rate': 7.639752056526194e-06, 'epoch': 7.21} 72%|███████▏ | 7207/10000 
[11:22:42<4:15:08, 5.48s/it][2025-06-20 00:52:27,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:52:27,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.98 | bwd_microstep: 3321.71 | bwd_inner_microstep: 3320.78 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.28 [2025-06-20 00:52:27,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.98 | bwd: 3321.73 | bwd_inner: 3320.78 | bwd_allreduce: 0.90 | step: 7.28 72%|███████▏ | 7208/10000 [11:22:47<4:14:54, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002664658473804593, 'learning_rate': 7.634660291030748e-06, 'epoch': 7.21} 72%|███████▏ | 7208/10000 [11:22:47<4:14:54, 5.48s/it][2025-06-20 00:52:32,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:52:32,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.62 | bwd_microstep: 3380.42 | bwd_inner_microstep: 3379.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-20 00:52:32,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.62 | bwd: 3380.43 | bwd_inner: 3379.63 | bwd_allreduce: 0.76 | step: 6.66 72%|███████▏ | 7209/10000 [11:22:53<4:15:52, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.02615220658481121, 'learning_rate': 7.629569822602048e-06, 'epoch': 7.21} 72%|███████▏ | 7209/10000 [11:22:53<4:15:52, 5.50s/it][2025-06-20 00:52:38,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:52:38,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.59 | bwd_microstep: 3366.50 | bwd_inner_microstep: 3365.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-20 00:52:38,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.59 | bwd: 
3366.52 | bwd_inner: 3365.70 | bwd_allreduce: 0.78 | step: 6.73 72%|███████▏ | 7210/10000 [11:22:59<4:16:10, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.040343672037124634, 'learning_rate': 7.62448065177406e-06, 'epoch': 7.21} 72%|███████▏ | 7210/10000 [11:22:59<4:16:10, 5.51s/it][2025-06-20 00:52:43,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:52:43,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.40 | bwd_microstep: 3375.03 | bwd_inner_microstep: 3374.11 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.36 [2025-06-20 00:52:43,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.40 | bwd: 3375.05 | bwd_inner: 3374.11 | bwd_allreduce: 0.89 | step: 7.37 72%|███████▏ | 7211/10000 [11:23:04<4:16:36, 5.52s/it] {'loss': 0.0012, 'grad_norm': 0.5813853144645691, 'learning_rate': 7.619392779080617e-06, 'epoch': 7.21} 72%|███████▏ | 7211/10000 [11:23:04<4:16:36, 5.52s/it][2025-06-20 00:52:49,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:52:49,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.82 | bwd_microstep: 3374.52 | bwd_inner_microstep: 3373.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 00:52:49,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.82 | bwd: 3374.53 | bwd_inner: 3373.72 | bwd_allreduce: 0.77 | step: 6.81 72%|███████▏ | 7212/10000 [11:23:10<4:16:57, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0028489266987890005, 'learning_rate': 7.614306205055408e-06, 'epoch': 7.21} 72%|███████▏ | 7212/10000 [11:23:10<4:16:57, 5.53s/it][2025-06-20 00:52:54,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:52:54,847] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2130.53 | bwd_microstep: 3368.10 | bwd_inner_microstep: 3367.12 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.19 [2025-06-20 00:52:54,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.53 | bwd: 3368.12 | bwd_inner: 3367.12 | bwd_allreduce: 0.94 | step: 7.19 72%|███████▏ | 7213/10000 [11:23:15<4:17:01, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.005830601789057255, 'learning_rate': 7.6092209302319975e-06, 'epoch': 7.21} 72%|███████▏ | 7213/10000 [11:23:15<4:17:01, 5.53s/it][2025-06-20 00:53:00,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:53:00,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.68 | bwd_microstep: 3322.61 | bwd_inner_microstep: 3321.80 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.28 [2025-06-20 00:53:00,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.68 | bwd: 3322.63 | bwd_inner: 3321.80 | bwd_allreduce: 0.79 | step: 7.27 72%|███████▏ | 7214/10000 [11:23:21<4:16:17, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00013078117626719177, 'learning_rate': 7.604136955143802e-06, 'epoch': 7.21} 72%|███████▏ | 7214/10000 [11:23:21<4:16:17, 5.52s/it][2025-06-20 00:53:05,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:53:05,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.23 | bwd_microstep: 3374.29 | bwd_inner_microstep: 3373.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-20 00:53:05,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.23 | bwd: 3374.31 | bwd_inner: 3373.50 | bwd_allreduce: 0.76 | step: 6.80 72%|███████▏ | 7215/10000 [11:23:26<4:16:33, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0053025418892502785, 'learning_rate': 7.5990542803241095e-06, 'epoch': 7.21} 72%|███████▏ | 7215/10000 
[11:23:26<4:16:33, 5.53s/it][2025-06-20 00:53:11,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:53:11,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.45 | bwd_microstep: 3368.74 | bwd_inner_microstep: 3367.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-20 00:53:11,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.45 | bwd: 3368.76 | bwd_inner: 3367.94 | bwd_allreduce: 0.78 | step: 7.18 72%|███████▏ | 7216/10000 [11:23:32<4:16:41, 5.53s/it] {'loss': 0.0002, 'grad_norm': 0.03379938751459122, 'learning_rate': 7.593972906306069e-06, 'epoch': 7.22} 72%|███████▏ | 7216/10000 [11:23:32<4:16:41, 5.53s/it][2025-06-20 00:53:16,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:53:16,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.86 | bwd_microstep: 3331.18 | bwd_inner_microstep: 3330.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 00:53:16,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.86 | bwd: 3331.19 | bwd_inner: 3330.38 | bwd_allreduce: 0.77 | step: 6.71 72%|███████▏ | 7217/10000 [11:23:37<4:15:52, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00030147688812576234, 'learning_rate': 7.588892833622699e-06, 'epoch': 7.22} 72%|███████▏ | 7217/10000 [11:23:37<4:15:52, 5.52s/it][2025-06-20 00:53:22,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:53:22,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.09 | bwd_microstep: 3324.16 | bwd_inner_microstep: 3323.22 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.51 [2025-06-20 00:53:22,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.09 | bwd: 
3324.17 | bwd_inner: 3323.22 | bwd_allreduce: 0.91 | step: 7.51 72%|███████▏ | 7218/10000 [11:23:43<4:15:12, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.004942318424582481, 'learning_rate': 7.58381406280686e-06, 'epoch': 7.22} 72%|███████▏ | 7218/10000 [11:23:43<4:15:12, 5.50s/it][2025-06-20 00:53:27,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:53:27,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.18 | bwd_microstep: 3384.26 | bwd_inner_microstep: 3383.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 00:53:27,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.18 | bwd: 3384.27 | bwd_inner: 3383.46 | bwd_allreduce: 0.77 | step: 6.76 72%|███████▏ | 7219/10000 [11:23:48<4:15:48, 5.52s/it] {'loss': 0.0005, 'grad_norm': 0.08111962676048279, 'learning_rate': 7.5787365943913025e-06, 'epoch': 7.22} 72%|███████▏ | 7219/10000 [11:23:48<4:15:48, 5.52s/it][2025-06-20 00:53:33,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:53:33,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.63 | bwd_microstep: 3326.96 | bwd_inner_microstep: 3326.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-20 00:53:33,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.63 | bwd: 3326.98 | bwd_inner: 3326.16 | bwd_allreduce: 0.77 | step: 7.02 72%|███████▏ | 7220/10000 [11:23:54<4:15:05, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.003947907593101263, 'learning_rate': 7.573660428908629e-06, 'epoch': 7.22} 72%|███████▏ | 7220/10000 [11:23:54<4:15:05, 5.51s/it][2025-06-20 00:53:38,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:53:38,881] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2107.68 | bwd_microstep: 3326.23 | bwd_inner_microstep: 3325.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 00:53:38,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.88 | bwd: 3326.24 | bwd_inner: 3325.44 | bwd_allreduce: 0.76 | step: 6.73 72%|███████▏ | 7221/10000 [11:23:59<4:14:32, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00654742494225502, 'learning_rate': 7.5685855668913025e-06, 'epoch': 7.22} 72%|███████▏ | 7221/10000 [11:23:59<4:14:32, 5.50s/it][2025-06-20 00:53:44,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:53:44,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.82 | bwd_microstep: 3327.21 | bwd_inner_microstep: 3326.38 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.89 [2025-06-20 00:53:44,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.82 | bwd: 3327.23 | bwd_inner: 3326.38 | bwd_allreduce: 0.80 | step: 6.90 72%|███████▏ | 7222/10000 [11:24:05<4:14:11, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.014204255305230618, 'learning_rate': 7.5635120088716605e-06, 'epoch': 7.22} 72%|███████▏ | 7222/10000 [11:24:05<4:14:11, 5.49s/it][2025-06-20 00:53:49,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:53:49,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.91 | bwd_microstep: 3320.92 | bwd_inner_microstep: 3320.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-20 00:53:49,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.91 | bwd: 3320.93 | bwd_inner: 3320.11 | bwd_allreduce: 0.78 | step: 7.22 72%|███████▏ | 7223/10000 [11:24:10<4:13:48, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0003890912630595267, 'learning_rate': 7.558439755381881e-06, 'epoch': 7.22} 72%|███████▏ | 7223/10000 
[11:24:10<4:13:48, 5.48s/it][2025-06-20 00:53:55,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:53:55,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.74 | bwd_microstep: 3372.82 | bwd_inner_microstep: 3371.92 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.37 [2025-06-20 00:53:55,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.74 | bwd: 3372.84 | bwd_inner: 3371.92 | bwd_allreduce: 0.86 | step: 7.37 72%|███████▏ | 7224/10000 [11:24:16<4:14:34, 5.50s/it] {'loss': 0.0021, 'grad_norm': 0.5108605623245239, 'learning_rate': 7.553368806954031e-06, 'epoch': 7.22} 72%|███████▏ | 7224/10000 [11:24:16<4:14:34, 5.50s/it][2025-06-20 00:54:00,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:54:00,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.59 | bwd_microstep: 3375.93 | bwd_inner_microstep: 3375.09 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.21 [2025-06-20 00:54:00,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.59 | bwd: 3375.95 | bwd_inner: 3375.09 | bwd_allreduce: 0.81 | step: 7.21 72%|███████▏ | 7225/10000 [11:24:21<4:15:10, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00038229685742408037, 'learning_rate': 7.5482991641200256e-06, 'epoch': 7.22} 72%|███████▏ | 7225/10000 [11:24:21<4:15:10, 5.52s/it][2025-06-20 00:54:06,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:54:06,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.87 | bwd_microstep: 3379.12 | bwd_inner_microstep: 3378.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-20 00:54:06,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.87 | bwd: 
3379.13 | bwd_inner: 3378.33 | bwd_allreduce: 0.76 | step: 6.80 72%|███████▏ | 7226/10000 [11:24:27<4:15:33, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0019882828928530216, 'learning_rate': 7.543230827411647e-06, 'epoch': 7.23} 72%|███████▏ | 7226/10000 [11:24:27<4:15:33, 5.53s/it][2025-06-20 00:54:11,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:54:11,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.04 | bwd_microstep: 3329.54 | bwd_inner_microstep: 3328.71 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.99 [2025-06-20 00:54:11,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.04 | bwd: 3329.56 | bwd_inner: 3328.71 | bwd_allreduce: 0.80 | step: 6.99 72%|███████▏ | 7227/10000 [11:24:32<4:14:44, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.008801553398370743, 'learning_rate': 7.538163797360549e-06, 'epoch': 7.23} 72%|███████▏ | 7227/10000 [11:24:32<4:14:44, 5.51s/it][2025-06-20 00:54:17,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:54:17,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.49 | bwd_microstep: 3378.93 | bwd_inner_microstep: 3377.99 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.15 [2025-06-20 00:54:17,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.50 | bwd: 3378.94 | bwd_inner: 3377.99 | bwd_allreduce: 0.91 | step: 7.15 72%|███████▏ | 7228/10000 [11:24:38<4:15:10, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.030252730473876, 'learning_rate': 7.533098074498224e-06, 'epoch': 7.23} 72%|███████▏ | 7228/10000 [11:24:38<4:15:10, 5.52s/it][2025-06-20 00:54:23,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:54:23,050] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2130.46 | bwd_microstep: 3380.25 | bwd_inner_microstep: 3379.41 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.77 [2025-06-20 00:54:23,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.46 | bwd: 3380.27 | bwd_inner: 3379.41 | bwd_allreduce: 0.81 | step: 6.77 72%|███████▏ | 7229/10000 [11:24:43<4:15:25, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0069069876335561275, 'learning_rate': 7.528033659356049e-06, 'epoch': 7.23} 72%|███████▏ | 7229/10000 [11:24:43<4:15:25, 5.53s/it][2025-06-20 00:54:28,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 00:54:28,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.11 | bwd_microstep: 3331.24 | bwd_inner_microstep: 3330.36 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.20 [2025-06-20 00:54:28,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.11 | bwd: 3331.26 | bwd_inner: 3330.36 | bwd_allreduce: 0.86 | step: 7.21 72%|███████▏ | 7230/10000 [11:24:49<4:14:41, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.007122365292161703, 'learning_rate': 7.5229705524652605e-06, 'epoch': 7.23} 72%|███████▏ | 7230/10000 [11:24:49<4:14:41, 5.52s/it][2025-06-20 00:54:34,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:54:34,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.87 | bwd_microstep: 3373.13 | bwd_inner_microstep: 3372.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-20 00:54:34,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.87 | bwd: 3373.14 | bwd_inner: 3372.34 | bwd_allreduce: 0.76 | step: 7.09 72%|███████▏ | 7231/10000 [11:24:54<4:15:13, 5.53s/it] {'loss': 0.0008, 'grad_norm': 0.24065513908863068, 'learning_rate': 7.517908754356953e-06, 'epoch': 7.23} 72%|███████▏ | 7231/10000 
[11:24:54<4:15:13, 5.53s/it][2025-06-20 00:54:39,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:54:39,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.64 | bwd_microstep: 3330.41 | bwd_inner_microstep: 3329.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-20 00:54:39,585] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.64 | bwd: 3330.43 | bwd_inner: 3329.61 | bwd_allreduce: 0.77 | step: 6.87 72%|███████▏ | 7232/10000 [11:25:00<4:14:33, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0007139061344787478, 'learning_rate': 7.512848265562087e-06, 'epoch': 7.23} 72%|███████▏ | 7232/10000 [11:25:00<4:14:33, 5.52s/it][2025-06-20 00:54:45,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-20 00:54:45,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.91 | bwd_microstep: 3326.19 | bwd_inner_microstep: 3324.99 | bwd_allreduce_microstep: 1.12 | step_microstep: 8.51 [2025-06-20 00:54:45,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.92 | bwd: 3326.22 | bwd_inner: 3324.99 | bwd_allreduce: 1.15 | step: 8.51 72%|███████▏ | 7233/10000 [11:25:05<4:13:58, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.01213997881859541, 'learning_rate': 7.5077890866114815e-06, 'epoch': 7.23} 72%|███████▏ | 7233/10000 [11:25:05<4:13:58, 5.51s/it][2025-06-20 00:54:50,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:54:50,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.34 | bwd_microstep: 3328.05 | bwd_inner_microstep: 3326.98 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.35 [2025-06-20 00:54:50,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.34 | bwd: 
3328.07 | bwd_inner: 3326.98 | bwd_allreduce: 1.03 | step: 7.35 72%|███████▏ | 7234/10000 [11:25:11<4:13:36, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.003916692920029163, 'learning_rate': 7.502731218035823e-06, 'epoch': 7.23} 72%|███████▏ | 7234/10000 [11:25:11<4:13:36, 5.50s/it][2025-06-20 00:54:56,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:54:56,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.19 | bwd_microstep: 3373.82 | bwd_inner_microstep: 3372.96 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.30 [2025-06-20 00:54:56,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.19 | bwd: 3373.83 | bwd_inner: 3372.96 | bwd_allreduce: 0.82 | step: 7.30 72%|███████▏ | 7235/10000 [11:25:16<4:14:12, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0005778067861683667, 'learning_rate': 7.497674660365659e-06, 'epoch': 7.24} 72%|███████▏ | 7235/10000 [11:25:16<4:14:12, 5.52s/it][2025-06-20 00:55:01,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:55:01,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.59 | bwd_microstep: 3320.67 | bwd_inner_microstep: 3319.89 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 00:55:01,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.59 | bwd: 3320.68 | bwd_inner: 3319.89 | bwd_allreduce: 0.75 | step: 6.70 72%|███████▏ | 7236/10000 [11:25:22<4:13:31, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001730467309243977, 'learning_rate': 7.492619414131399e-06, 'epoch': 7.24} 72%|███████▏ | 7236/10000 [11:25:22<4:13:31, 5.50s/it][2025-06-20 00:55:07,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:55:07,052] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2107.08 | bwd_microstep: 3324.86 | bwd_inner_microstep: 3324.02 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.30 [2025-06-20 00:55:07,053] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.08 | bwd: 3324.87 | bwd_inner: 3324.02 | bwd_allreduce: 0.81 | step: 7.30 72%|███████▏ | 7237/10000 [11:25:27<4:13:01, 5.49s/it] {'loss': 0.0959, 'grad_norm': 7.6891632080078125, 'learning_rate': 7.487565479863319e-06, 'epoch': 7.24} 72%|███████▏ | 7237/10000 [11:25:27<4:13:01, 5.49s/it][2025-06-20 00:55:12,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:55:12,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.49 | bwd_microstep: 3331.49 | bwd_inner_microstep: 3330.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 00:55:12,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.49 | bwd: 3331.50 | bwd_inner: 3330.70 | bwd_allreduce: 0.76 | step: 6.69 72%|███████▏ | 7238/10000 [11:25:33<4:12:47, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.24106718599796295, 'learning_rate': 7.482512858091542e-06, 'epoch': 7.24} 72%|███████▏ | 7238/10000 [11:25:33<4:12:47, 5.49s/it][2025-06-20 00:55:18,012] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:55:18,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.01 | bwd_microstep: 3325.43 | bwd_inner_microstep: 3324.49 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.59 [2025-06-20 00:55:18,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.01 | bwd: 3325.45 | bwd_inner: 3324.49 | bwd_allreduce: 0.91 | step: 7.59 72%|███████▏ | 7239/10000 [11:25:38<4:12:30, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0027163969352841377, 'learning_rate': 7.47746154934607e-06, 'epoch': 7.24} 72%|███████▏ | 7239/10000 
[11:25:38<4:12:30, 5.49s/it][2025-06-20 00:55:23,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:55:23,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.11 | bwd_microstep: 3329.99 | bwd_inner_microstep: 3329.02 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.58 [2025-06-20 00:55:23,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.10 | bwd: 3330.01 | bwd_inner: 3329.02 | bwd_allreduce: 0.93 | step: 7.59 72%|███████▏ | 7240/10000 [11:25:44<4:12:30, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006126808002591133, 'learning_rate': 7.472411554156762e-06, 'epoch': 7.24} 72%|███████▏ | 7240/10000 [11:25:44<4:12:30, 5.49s/it][2025-06-20 00:55:28,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:55:28,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.66 | bwd_microstep: 3321.94 | bwd_inner_microstep: 3321.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-20 00:55:28,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.66 | bwd: 3321.96 | bwd_inner: 3321.13 | bwd_allreduce: 0.78 | step: 7.24 72%|███████▏ | 7241/10000 [11:25:49<4:12:16, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.04150211438536644, 'learning_rate': 7.467362873053341e-06, 'epoch': 7.24} 72%|███████▏ | 7241/10000 [11:25:49<4:12:16, 5.49s/it][2025-06-20 00:55:34,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 00:55:34,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.22 | bwd_microstep: 3319.07 | bwd_inner_microstep: 3317.97 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.52 [2025-06-20 00:55:34,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.22 | bwd: 
3319.09 | bwd_inner: 3317.97 | bwd_allreduce: 1.06 | step: 7.53 72%|███████▏ | 7242/10000 [11:25:55<4:11:58, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.03474905341863632, 'learning_rate': 7.462315506565389e-06, 'epoch': 7.24} 72%|███████▏ | 7242/10000 [11:25:55<4:11:58, 5.48s/it][2025-06-20 00:55:39,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:55:39,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.90 | bwd_microstep: 3330.01 | bwd_inner_microstep: 3329.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-20 00:55:39,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.90 | bwd: 3330.03 | bwd_inner: 3329.20 | bwd_allreduce: 0.78 | step: 6.93 72%|███████▏ | 7243/10000 [11:26:00<4:11:57, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0018937388667836785, 'learning_rate': 7.457269455222358e-06, 'epoch': 7.24} 72%|███████▏ | 7243/10000 [11:26:00<4:11:57, 5.48s/it][2025-06-20 00:55:45,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:55:45,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.72 | bwd_microstep: 3329.55 | bwd_inner_microstep: 3328.65 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.28 [2025-06-20 00:55:45,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.73 | bwd: 3329.56 | bwd_inner: 3328.65 | bwd_allreduce: 0.87 | step: 7.29 72%|███████▏ | 7244/10000 [11:26:06<4:11:43, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00033447102759964764, 'learning_rate': 7.45222471955354e-06, 'epoch': 7.24} 72%|███████▏ | 7244/10000 [11:26:06<4:11:43, 5.48s/it][2025-06-20 00:55:50,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 00:55:50,899] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2111.19 | bwd_microstep: 3326.07 | bwd_inner_microstep: 3325.08 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.32 [2025-06-20 00:55:50,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.19 | bwd: 3326.09 | bwd_inner: 3325.08 | bwd_allreduce: 0.95 | step: 7.32 72%|███████▏ | 7245/10000 [11:26:11<4:11:38, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.010247552767395973, 'learning_rate': 7.447181300088115e-06, 'epoch': 7.25} 72%|███████▏ | 7245/10000 [11:26:11<4:11:38, 5.48s/it][2025-06-20 00:55:56,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:55:56,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.88 | bwd_microstep: 3326.03 | bwd_inner_microstep: 3324.87 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.04 [2025-06-20 00:55:56,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.88 | bwd: 3326.05 | bwd_inner: 3324.87 | bwd_allreduce: 1.12 | step: 8.04 72%|███████▏ | 7246/10000 [11:26:17<4:11:30, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.007070409599691629, 'learning_rate': 7.442139197355112e-06, 'epoch': 7.25} 72%|███████▏ | 7246/10000 [11:26:17<4:11:30, 5.48s/it][2025-06-20 00:56:01,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:56:01,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.96 | bwd_microstep: 3371.29 | bwd_inner_microstep: 3370.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 00:56:01,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.96 | bwd: 3371.31 | bwd_inner: 3370.51 | bwd_allreduce: 0.76 | step: 6.70 72%|███████▏ | 7247/10000 [11:26:22<4:12:20, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0024168614763766527, 'learning_rate': 7.437098411883423e-06, 'epoch': 7.25} 72%|███████▏ | 7247/10000 
[11:26:22<4:12:20, 5.50s/it][2025-06-20 00:56:07,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:56:07,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.10 | bwd_microstep: 3319.07 | bwd_inner_microstep: 3318.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-20 00:56:07,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.10 | bwd: 3319.09 | bwd_inner: 3318.28 | bwd_allreduce: 0.76 | step: 7.13 72%|███████▏ | 7248/10000 [11:26:28<4:11:43, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.007253583986312151, 'learning_rate': 7.432058944201805e-06, 'epoch': 7.25} 72%|███████▏ | 7248/10000 [11:26:28<4:11:43, 5.49s/it][2025-06-20 00:56:12,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:56:12,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.51 | bwd_microstep: 3323.51 | bwd_inner_microstep: 3322.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 00:56:12,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.51 | bwd: 3323.53 | bwd_inner: 3322.72 | bwd_allreduce: 0.77 | step: 6.73 72%|███████▏ | 7249/10000 [11:26:33<4:11:23, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.01542640384286642, 'learning_rate': 7.427020794838875e-06, 'epoch': 7.25} 72%|███████▏ | 7249/10000 [11:26:33<4:11:23, 5.48s/it][2025-06-20 00:56:18,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.86 [2025-06-20 00:56:18,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.65 | bwd_microstep: 3385.94 | bwd_inner_microstep: 3385.01 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.41 [2025-06-20 00:56:18,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.65 | bwd: 
3385.96 | bwd_inner: 3385.01 | bwd_allreduce: 0.91 | step: 7.41 72%|███████▎ | 7250/10000 [11:26:39<4:12:29, 5.51s/it] {'loss': 0.0012, 'grad_norm': 0.6585630774497986, 'learning_rate': 7.421983964323109e-06, 'epoch': 7.25} 72%|███████▎ | 7250/10000 [11:26:39<4:12:29, 5.51s/it][2025-06-20 00:56:23,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:56:23,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.07 | bwd_microstep: 3368.18 | bwd_inner_microstep: 3367.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-20 00:56:23,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.07 | bwd: 3368.19 | bwd_inner: 3367.40 | bwd_allreduce: 0.75 | step: 6.55 73%|███████▎ | 7251/10000 [11:26:44<4:12:53, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.044541433453559875, 'learning_rate': 7.41694845318285e-06, 'epoch': 7.25} 73%|███████▎ | 7251/10000 [11:26:44<4:12:53, 5.52s/it][2025-06-20 00:56:29,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:56:29,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.37 | bwd_microstep: 3324.83 | bwd_inner_microstep: 3324.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 00:56:29,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.37 | bwd: 3324.85 | bwd_inner: 3324.06 | bwd_allreduce: 0.75 | step: 6.62 73%|███████▎ | 7252/10000 [11:26:50<4:12:05, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.03302754461765289, 'learning_rate': 7.411914261946298e-06, 'epoch': 7.25} 73%|███████▎ | 7252/10000 [11:26:50<4:12:05, 5.50s/it][2025-06-20 00:56:34,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:56:34,993] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2140.06 | bwd_microstep: 3376.25 | bwd_inner_microstep: 3375.30 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.42 [2025-06-20 00:56:34,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.06 | bwd: 3376.27 | bwd_inner: 3375.30 | bwd_allreduce: 0.93 | step: 7.42 73%|███████▎ | 7253/10000 [11:26:55<4:12:42, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.006513561587780714, 'learning_rate': 7.406881391141522e-06, 'epoch': 7.25} 73%|███████▎ | 7253/10000 [11:26:55<4:12:42, 5.52s/it][2025-06-20 00:56:40,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:56:40,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.71 | bwd_microstep: 3377.87 | bwd_inner_microstep: 3377.04 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.92 [2025-06-20 00:56:40,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.71 | bwd: 3377.89 | bwd_inner: 3377.04 | bwd_allreduce: 0.80 | step: 6.92 73%|███████▎ | 7254/10000 [11:27:01<4:13:15, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0014796857722103596, 'learning_rate': 7.401849841296438e-06, 'epoch': 7.25} 73%|███████▎ | 7254/10000 [11:27:01<4:13:15, 5.53s/it][2025-06-20 00:56:46,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 00:56:46,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.94 | bwd_microstep: 3330.17 | bwd_inner_microstep: 3329.25 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.00 [2025-06-20 00:56:46,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.94 | bwd: 3330.18 | bwd_inner: 3329.25 | bwd_allreduce: 0.89 | step: 7.00 73%|███████▎ | 7255/10000 [11:27:06<4:12:26, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0042543476447463036, 'learning_rate': 7.396819612938835e-06, 'epoch': 7.25} 73%|███████▎ | 7255/10000 
[11:27:06<4:12:26, 5.52s/it][2025-06-20 00:56:51,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:56:51,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.28 | bwd_microstep: 3322.02 | bwd_inner_microstep: 3321.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.84 [2025-06-20 00:56:51,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.28 | bwd: 3322.04 | bwd_inner: 3321.24 | bwd_allreduce: 0.75 | step: 6.84 73%|███████▎ | 7256/10000 [11:27:12<4:11:40, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0009992754785344005, 'learning_rate': 7.39179070659636e-06, 'epoch': 7.26} 73%|███████▎ | 7256/10000 [11:27:12<4:11:40, 5.50s/it][2025-06-20 00:56:56,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:56:56,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.20 | bwd_microstep: 3330.58 | bwd_inner_microstep: 3329.73 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.29 [2025-06-20 00:56:56,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.20 | bwd: 3330.60 | bwd_inner: 3329.73 | bwd_allreduce: 0.82 | step: 7.29 73%|███████▎ | 7257/10000 [11:27:17<4:11:18, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.0632622167468071, 'learning_rate': 7.386763122796525e-06, 'epoch': 7.26} 73%|███████▎ | 7257/10000 [11:27:17<4:11:18, 5.50s/it][2025-06-20 00:57:02,468] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:57:02,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.97 | bwd_microstep: 3322.17 | bwd_inner_microstep: 3321.09 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.48 [2025-06-20 00:57:02,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.97 | bwd: 
3322.19 | bwd_inner: 3321.09 | bwd_allreduce: 1.04 | step: 7.49 73%|███████▎ | 7258/10000 [11:27:23<4:10:57, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00023991437046788633, 'learning_rate': 7.381736862066704e-06, 'epoch': 7.26} 73%|███████▎ | 7258/10000 [11:27:23<4:10:57, 5.49s/it][2025-06-20 00:57:08,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:57:08,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.15 | bwd_microstep: 3374.82 | bwd_inner_microstep: 3373.86 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.09 [2025-06-20 00:57:08,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.15 | bwd: 3374.84 | bwd_inner: 3373.86 | bwd_allreduce: 0.92 | step: 7.10 73%|███████▎ | 7259/10000 [11:27:28<4:11:44, 5.51s/it] {'loss': 0.0246, 'grad_norm': 3.969151735305786, 'learning_rate': 7.376711924934117e-06, 'epoch': 7.26} 73%|███████▎ | 7259/10000 [11:27:28<4:11:44, 5.51s/it][2025-06-20 00:57:13,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 00:57:13,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.20 | bwd_microstep: 3360.66 | bwd_inner_microstep: 3359.86 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-20 00:57:13,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.20 | bwd: 3360.68 | bwd_inner: 3359.86 | bwd_allreduce: 0.77 | step: 7.16 73%|███████▎ | 7260/10000 [11:27:34<4:11:52, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0040664831176400185, 'learning_rate': 7.371688311925862e-06, 'epoch': 7.26} 73%|███████▎ | 7260/10000 [11:27:34<4:11:52, 5.52s/it][2025-06-20 00:57:19,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:57:19,083] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2126.10 | bwd_microstep: 3366.46 | bwd_inner_microstep: 3365.56 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.94 [2025-06-20 00:57:19,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.10 | bwd: 3366.47 | bwd_inner: 3365.56 | bwd_allreduce: 0.87 | step: 6.94 73%|███████▎ | 7261/10000 [11:27:39<4:11:58, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.054555341601371765, 'learning_rate': 7.366666023568894e-06, 'epoch': 7.26} 73%|███████▎ | 7261/10000 [11:27:39<4:11:58, 5.52s/it][2025-06-20 00:57:24,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:57:24,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.51 | bwd_microstep: 3316.39 | bwd_inner_microstep: 3315.59 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-20 00:57:24,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.51 | bwd: 3316.41 | bwd_inner: 3315.59 | bwd_allreduce: 0.77 | step: 7.08 73%|███████▎ | 7262/10000 [11:27:45<4:11:11, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.322631299495697, 'learning_rate': 7.361645060390026e-06, 'epoch': 7.26} 73%|███████▎ | 7262/10000 [11:27:45<4:11:11, 5.50s/it][2025-06-20 00:57:30,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:57:30,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.07 | bwd_microstep: 3376.40 | bwd_inner_microstep: 3375.41 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.82 [2025-06-20 00:57:30,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.07 | bwd: 3376.42 | bwd_inner: 3375.41 | bwd_allreduce: 0.97 | step: 7.83 73%|███████▎ | 7263/10000 [11:27:50<4:11:48, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.026320140808820724, 'learning_rate': 7.356625422915942e-06, 'epoch': 7.26} 73%|███████▎ | 7263/10000 
[11:27:50<4:11:48, 5.52s/it][2025-06-20 00:57:35,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:57:35,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.72 | bwd_microstep: 3379.70 | bwd_inner_microstep: 3378.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.30 [2025-06-20 00:57:35,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.72 | bwd: 3379.72 | bwd_inner: 3378.89 | bwd_allreduce: 0.78 | step: 7.30 73%|███████▎ | 7264/10000 [11:27:56<4:12:12, 5.53s/it] {'loss': 0.0005, 'grad_norm': 0.29259705543518066, 'learning_rate': 7.351607111673164e-06, 'epoch': 7.26} 73%|███████▎ | 7264/10000 [11:27:56<4:12:12, 5.53s/it][2025-06-20 00:57:41,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-20 00:57:41,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.60 | bwd_microstep: 3366.04 | bwd_inner_microstep: 3365.23 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-20 00:57:41,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.60 | bwd: 3366.05 | bwd_inner: 3365.23 | bwd_allreduce: 0.78 | step: 6.99 73%|███████▎ | 7265/10000 [11:28:01<4:12:03, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0017494920175522566, 'learning_rate': 7.346590127188098e-06, 'epoch': 7.26} 73%|███████▎ | 7265/10000 [11:28:01<4:12:03, 5.53s/it][2025-06-20 00:57:46,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:57:46,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.66 | bwd_microstep: 3330.24 | bwd_inner_microstep: 3329.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-20 00:57:46,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.66 | bwd: 
3330.26 | bwd_inner: 3329.44 | bwd_allreduce: 0.77 | step: 6.98 73%|███████▎ | 7266/10000 [11:28:07<4:11:20, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.08137242496013641, 'learning_rate': 7.341574469987003e-06, 'epoch': 7.27} 73%|███████▎ | 7266/10000 [11:28:07<4:11:20, 5.52s/it][2025-06-20 00:57:52,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:57:52,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.76 | bwd_microstep: 3312.52 | bwd_inner_microstep: 3311.60 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.09 [2025-06-20 00:57:52,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.76 | bwd: 3312.54 | bwd_inner: 3311.60 | bwd_allreduce: 0.89 | step: 7.10 73%|███████▎ | 7267/10000 [11:28:12<4:10:23, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0028943077195435762, 'learning_rate': 7.336560140595988e-06, 'epoch': 7.27} 73%|███████▎ | 7267/10000 [11:28:12<4:10:23, 5.50s/it][2025-06-20 00:57:57,592] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 00:57:57,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.06 | bwd_microstep: 3319.58 | bwd_inner_microstep: 3318.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-20 00:57:57,593] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.06 | bwd: 3319.60 | bwd_inner: 3318.77 | bwd_allreduce: 0.78 | step: 7.25 73%|███████▎ | 7268/10000 [11:28:18<4:09:51, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.20023523271083832, 'learning_rate': 7.331547139541053e-06, 'epoch': 7.27} 73%|███████▎ | 7268/10000 [11:28:18<4:09:51, 5.49s/it][2025-06-20 00:58:03,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:58:03,043] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2101.30 | bwd_microstep: 3310.56 | bwd_inner_microstep: 3309.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 00:58:03,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.30 | bwd: 3310.57 | bwd_inner: 3309.76 | bwd_allreduce: 0.76 | step: 6.82 73%|███████▎ | 7269/10000 [11:28:23<4:09:15, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002198952017351985, 'learning_rate': 7.326535467348021e-06, 'epoch': 7.27} 73%|███████▎ | 7269/10000 [11:28:23<4:09:15, 5.48s/it][2025-06-20 00:58:08,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:58:08,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.22 | bwd_microstep: 3314.41 | bwd_inner_microstep: 3313.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-20 00:58:08,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.22 | bwd: 3314.42 | bwd_inner: 3313.60 | bwd_allreduce: 0.78 | step: 7.09 73%|███████▎ | 7270/10000 [11:28:29<4:08:49, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.003352733328938484, 'learning_rate': 7.321525124542601e-06, 'epoch': 7.27} 73%|███████▎ | 7270/10000 [11:28:29<4:08:49, 5.47s/it][2025-06-20 00:58:13,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:58:13,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.02 | bwd_microstep: 3318.35 | bwd_inner_microstep: 3317.35 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.26 [2025-06-20 00:58:13,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.02 | bwd: 3318.37 | bwd_inner: 3317.35 | bwd_allreduce: 0.96 | step: 7.25 73%|███████▎ | 7271/10000 [11:28:34<4:08:38, 5.47s/it] {'loss': 0.0, 'grad_norm': 4.0409344364888966e-05, 'learning_rate': 7.3165161116503534e-06, 'epoch': 7.27} 73%|███████▎ | 7271/10000 
[11:28:34<4:08:38, 5.47s/it][2025-06-20 00:58:19,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:58:19,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.52 | bwd_microstep: 3370.72 | bwd_inner_microstep: 3369.52 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.61 [2025-06-20 00:58:19,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.52 | bwd: 3370.74 | bwd_inner: 3369.52 | bwd_allreduce: 1.15 | step: 7.62 73%|███████▎ | 7272/10000 [11:28:40<4:09:41, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004284472670406103, 'learning_rate': 7.311508429196699e-06, 'epoch': 7.27} 73%|███████▎ | 7272/10000 [11:28:40<4:09:41, 5.49s/it][2025-06-20 00:58:24,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 00:58:24,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.65 | bwd_microstep: 3316.71 | bwd_inner_microstep: 3315.73 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.48 [2025-06-20 00:58:24,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.65 | bwd: 3316.72 | bwd_inner: 3315.73 | bwd_allreduce: 0.95 | step: 7.49 73%|███████▎ | 7273/10000 [11:28:45<4:09:18, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.000522852351423353, 'learning_rate': 7.306502077706925e-06, 'epoch': 7.27} 73%|███████▎ | 7273/10000 [11:28:45<4:09:18, 5.49s/it][2025-06-20 00:58:30,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:58:30,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.13 | bwd_microstep: 3317.99 | bwd_inner_microstep: 3317.12 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.99 [2025-06-20 00:58:30,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.13 | bwd: 
3318.00 | bwd_inner: 3317.12 | bwd_allreduce: 0.83 | step: 6.98 73%|███████▎ | 7274/10000 [11:28:51<4:09:02, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00015837889804970473, 'learning_rate': 7.301497057706168e-06, 'epoch': 7.27} 73%|███████▎ | 7274/10000 [11:28:51<4:09:02, 5.48s/it][2025-06-20 00:58:35,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:58:35,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.98 | bwd_microstep: 3372.14 | bwd_inner_microstep: 3371.33 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.79 [2025-06-20 00:58:35,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.98 | bwd: 3372.15 | bwd_inner: 3371.33 | bwd_allreduce: 0.79 | step: 6.79 73%|███████▎ | 7275/10000 [11:28:56<4:09:42, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0044554793275892735, 'learning_rate': 7.296493369719433e-06, 'epoch': 7.28} 73%|███████▎ | 7275/10000 [11:28:56<4:09:42, 5.50s/it][2025-06-20 00:58:41,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:58:41,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.98 | bwd_microstep: 3324.58 | bwd_inner_microstep: 3323.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 00:58:41,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.98 | bwd: 3324.60 | bwd_inner: 3323.79 | bwd_allreduce: 0.76 | step: 6.70 73%|███████▎ | 7276/10000 [11:29:02<4:09:21, 5.49s/it] {'loss': 0.0119, 'grad_norm': 1.8430110216140747, 'learning_rate': 7.2914910142715835e-06, 'epoch': 7.28} 73%|███████▎ | 7276/10000 [11:29:02<4:09:21, 5.49s/it][2025-06-20 00:58:46,938] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 00:58:46,939] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2110.24 | bwd_microstep: 3321.66 | bwd_inner_microstep: 3320.55 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.69 [2025-06-20 00:58:46,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.24 | bwd: 3321.68 | bwd_inner: 3320.55 | bwd_allreduce: 1.07 | step: 7.69 73%|███████▎ | 7277/10000 [11:29:07<4:09:02, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0009444707538932562, 'learning_rate': 7.286489991887344e-06, 'epoch': 7.28} 73%|███████▎ | 7277/10000 [11:29:07<4:09:02, 5.49s/it][2025-06-20 00:58:52,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 00:58:52,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.86 | bwd_microstep: 3367.64 | bwd_inner_microstep: 3366.81 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.27 [2025-06-20 00:58:52,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.86 | bwd: 3367.66 | bwd_inner: 3366.81 | bwd_allreduce: 0.80 | step: 7.27 73%|███████▎ | 7278/10000 [11:29:13<4:09:42, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005472138989716768, 'learning_rate': 7.281490303091307e-06, 'epoch': 7.28} 73%|███████▎ | 7278/10000 [11:29:13<4:09:42, 5.50s/it][2025-06-20 00:58:57,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-20 00:58:57,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.63 | bwd_microstep: 3309.41 | bwd_inner_microstep: 3308.56 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.47 [2025-06-20 00:58:57,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.63 | bwd: 3309.42 | bwd_inner: 3308.56 | bwd_allreduce: 0.82 | step: 7.48 73%|███████▎ | 7279/10000 [11:29:18<4:09:01, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0012101155007258058, 'learning_rate': 7.276491948407902e-06, 'epoch': 7.28} 73%|███████▎ | 7279/10000 
[11:29:18<4:09:01, 5.49s/it][2025-06-20 00:59:03,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 00:59:03,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.64 | bwd_microstep: 3320.96 | bwd_inner_microstep: 3320.00 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.41 [2025-06-20 00:59:03,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.64 | bwd: 3320.97 | bwd_inner: 3320.00 | bwd_allreduce: 0.93 | step: 7.41 73%|███████▎ | 7280/10000 [11:29:24<4:08:40, 5.49s/it] {'loss': 0.0078, 'grad_norm': 2.3734214305877686, 'learning_rate': 7.27149492836144e-06, 'epoch': 7.28} 73%|███████▎ | 7280/10000 [11:29:24<4:08:40, 5.49s/it][2025-06-20 00:59:08,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 00:59:08,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.86 | bwd_microstep: 3319.76 | bwd_inner_microstep: 3318.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 00:59:08,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.86 | bwd: 3319.77 | bwd_inner: 3318.97 | bwd_allreduce: 0.76 | step: 6.80 73%|███████▎ | 7281/10000 [11:29:29<4:08:21, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.01766994409263134, 'learning_rate': 7.2664992434760864e-06, 'epoch': 7.28} 73%|███████▎ | 7281/10000 [11:29:29<4:08:21, 5.48s/it][2025-06-20 00:59:14,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:59:14,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.81 | bwd_microstep: 3306.08 | bwd_inner_microstep: 3305.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 00:59:14,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.81 | bwd: 
3306.10 | bwd_inner: 3305.29 | bwd_allreduce: 0.76 | step: 6.71 73%|███████▎ | 7282/10000 [11:29:35<4:07:47, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0004742471210192889, 'learning_rate': 7.261504894275863e-06, 'epoch': 7.28} 73%|███████▎ | 7282/10000 [11:29:35<4:07:47, 5.47s/it][2025-06-20 00:59:19,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:59:19,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.29 | bwd_microstep: 3316.58 | bwd_inner_microstep: 3315.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-20 00:59:19,795] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.29 | bwd: 3316.60 | bwd_inner: 3315.78 | bwd_allreduce: 0.77 | step: 7.08 73%|███████▎ | 7283/10000 [11:29:40<4:07:35, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0024130581878125668, 'learning_rate': 7.256511881284656e-06, 'epoch': 7.28} 73%|███████▎ | 7283/10000 [11:29:40<4:07:35, 5.47s/it][2025-06-20 00:59:25,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:59:25,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.08 | bwd_microstep: 3319.27 | bwd_inner_microstep: 3318.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 00:59:25,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.08 | bwd: 3319.28 | bwd_inner: 3318.48 | bwd_allreduce: 0.76 | step: 6.80 73%|███████▎ | 7284/10000 [11:29:46<4:07:22, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0019462746568024158, 'learning_rate': 7.251520205026206e-06, 'epoch': 7.28} 73%|███████▎ | 7284/10000 [11:29:46<4:07:22, 5.46s/it][2025-06-20 00:59:30,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.87 [2025-06-20 00:59:30,776] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2121.39 | bwd_microstep: 3364.39 | bwd_inner_microstep: 3363.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-20 00:59:30,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.39 | bwd: 3364.41 | bwd_inner: 3363.60 | bwd_allreduce: 0.76 | step: 6.82 73%|███████▎ | 7285/10000 [11:29:51<4:08:03, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.014272564090788364, 'learning_rate': 7.2465298660241215e-06, 'epoch': 7.29} 73%|███████▎ | 7285/10000 [11:29:51<4:08:03, 5.48s/it][2025-06-20 00:59:36,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 00:59:36,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.26 | bwd_microstep: 3324.62 | bwd_inner_microstep: 3323.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.96 [2025-06-20 00:59:36,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.26 | bwd: 3324.63 | bwd_inner: 3323.84 | bwd_allreduce: 0.75 | step: 6.97 73%|███████▎ | 7286/10000 [11:29:57<4:07:42, 5.48s/it] {'loss': 0.0012, 'grad_norm': 0.20790337026119232, 'learning_rate': 7.2415408648018616e-06, 'epoch': 7.29} 73%|███████▎ | 7286/10000 [11:29:57<4:07:42, 5.48s/it][2025-06-20 00:59:41,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 00:59:41,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.41 | bwd_microstep: 3316.96 | bwd_inner_microstep: 3316.15 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.91 [2025-06-20 00:59:41,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.41 | bwd: 3316.97 | bwd_inner: 3316.15 | bwd_allreduce: 0.78 | step: 6.91 73%|███████▎ | 7287/10000 [11:30:02<4:07:24, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.040897201746702194, 'learning_rate': 7.236553201882753e-06, 'epoch': 7.29} 73%|███████▎ | 
7287/10000 [11:30:02<4:07:24, 5.47s/it][2025-06-20 00:59:47,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:59:47,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.64 | bwd_microstep: 3312.19 | bwd_inner_microstep: 3311.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-20 00:59:47,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.64 | bwd: 3312.20 | bwd_inner: 3311.39 | bwd_allreduce: 0.77 | step: 7.14 73%|███████▎ | 7288/10000 [11:30:07<4:07:01, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.017360910773277283, 'learning_rate': 7.231566877789981e-06, 'epoch': 7.29} 73%|███████▎ | 7288/10000 [11:30:07<4:07:01, 5.47s/it][2025-06-20 00:59:52,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.63 | optimizer_step: 2.73 [2025-06-20 00:59:52,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.04 | bwd_microstep: 3320.18 | bwd_inner_microstep: 3319.35 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.94 [2025-06-20 00:59:52,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.05 | bwd: 3320.20 | bwd_inner: 3319.35 | bwd_allreduce: 0.81 | step: 6.95 73%|███████▎ | 7289/10000 [11:30:13<4:06:54, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0002920411352533847, 'learning_rate': 7.226581893046578e-06, 'epoch': 7.29} 73%|███████▎ | 7289/10000 [11:30:13<4:06:54, 5.46s/it][2025-06-20 00:59:58,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 00:59:58,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.98 | bwd_microstep: 3372.03 | bwd_inner_microstep: 3371.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-20 00:59:58,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2130.98 | bwd: 3372.04 | bwd_inner: 3371.23 | bwd_allreduce: 0.77 | step: 6.99 73%|███████▎ | 7290/10000 [11:30:18<4:07:51, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.044495370239019394, 'learning_rate': 7.22159824817545e-06, 'epoch': 7.29} 73%|███████▎ | 7290/10000 [11:30:18<4:07:51, 5.49s/it][2025-06-20 01:00:03,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:00:03,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.71 | bwd_microstep: 3317.81 | bwd_inner_microstep: 3317.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-20 01:00:03,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.71 | bwd: 3317.82 | bwd_inner: 3317.01 | bwd_allreduce: 0.76 | step: 6.87 73%|███████▎ | 7291/10000 [11:30:24<4:07:31, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.11141183972358704, 'learning_rate': 7.216615943699361e-06, 'epoch': 7.29} 73%|███████▎ | 7291/10000 [11:30:24<4:07:31, 5.48s/it][2025-06-20 01:00:09,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:00:09,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.93 | bwd_microstep: 3370.81 | bwd_inner_microstep: 3370.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 01:00:09,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.93 | bwd: 3370.83 | bwd_inner: 3370.02 | bwd_allreduce: 0.76 | step: 6.64 73%|███████▎ | 7292/10000 [11:30:29<4:08:08, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.006605878472328186, 'learning_rate': 7.211634980140929e-06, 'epoch': 7.29} 73%|███████▎ | 7292/10000 [11:30:29<4:08:08, 5.50s/it][2025-06-20 01:00:14,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:00:14,686] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.10 | bwd_microstep: 3367.53 | bwd_inner_microstep: 3366.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 01:00:14,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.10 | bwd: 3367.54 | bwd_inner: 3366.74 | bwd_allreduce: 0.76 | step: 6.67 73%|███████▎ | 7293/10000 [11:30:35<4:08:27, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.034778665751218796, 'learning_rate': 7.20665535802264e-06, 'epoch': 7.29} 73%|███████▎ | 7293/10000 [11:30:35<4:08:27, 5.51s/it][2025-06-20 01:00:20,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:00:20,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.68 | bwd_microstep: 3321.04 | bwd_inner_microstep: 3320.14 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.93 [2025-06-20 01:00:20,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.68 | bwd: 3321.06 | bwd_inner: 3320.14 | bwd_allreduce: 0.86 | step: 6.93 73%|███████▎ | 7294/10000 [11:30:40<4:07:50, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0006751783657819033, 'learning_rate': 7.2016770778668245e-06, 'epoch': 7.29} 73%|███████▎ | 7294/10000 [11:30:40<4:07:50, 5.50s/it][2025-06-20 01:00:25,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:00:25,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.65 | bwd_microstep: 3377.44 | bwd_inner_microstep: 3376.46 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.10 [2025-06-20 01:00:25,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.65 | bwd: 3377.46 | bwd_inner: 3376.46 | bwd_allreduce: 0.95 | step: 7.10 73%|███████▎ | 7295/10000 [11:30:46<4:08:22, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00799928605556488, 'learning_rate': 7.196700140195685e-06, 'epoch': 
7.29} 73%|███████▎ | 7295/10000 [11:30:46<4:08:22, 5.51s/it][2025-06-20 01:00:31,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:00:31,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.57 | bwd_microstep: 3365.22 | bwd_inner_microstep: 3364.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-20 01:00:31,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.57 | bwd: 3365.23 | bwd_inner: 3364.41 | bwd_allreduce: 0.78 | step: 6.89 73%|███████▎ | 7296/10000 [11:30:52<4:08:37, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.001591746462509036, 'learning_rate': 7.191724545531276e-06, 'epoch': 7.3} 73%|███████▎ | 7296/10000 [11:30:52<4:08:37, 5.52s/it][2025-06-20 01:00:36,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:00:36,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.41 | bwd_microstep: 3317.18 | bwd_inner_microstep: 3316.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 01:00:36,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.41 | bwd: 3317.19 | bwd_inner: 3316.39 | bwd_allreduce: 0.76 | step: 6.72 73%|███████▎ | 7297/10000 [11:30:57<4:07:44, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.008551256731152534, 'learning_rate': 7.186750294395519e-06, 'epoch': 7.3} 73%|███████▎ | 7297/10000 [11:30:57<4:07:44, 5.50s/it][2025-06-20 01:00:42,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:00:42,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.25 | bwd_microstep: 3310.95 | bwd_inner_microstep: 3309.98 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.16 [2025-06-20 01:00:42,144] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2106.25 | bwd: 3310.96 | bwd_inner: 3309.98 | bwd_allreduce: 0.94 | step: 7.16 73%|███████▎ | 7298/10000 [11:31:02<4:07:05, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0010019403416663408, 'learning_rate': 7.1817773873101934e-06, 'epoch': 7.3} 73%|███████▎ | 7298/10000 [11:31:02<4:07:05, 5.49s/it][2025-06-20 01:00:47,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:00:47,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.84 | bwd_microstep: 3315.24 | bwd_inner_microstep: 3314.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-20 01:00:47,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.84 | bwd: 3315.26 | bwd_inner: 3314.44 | bwd_allreduce: 0.77 | step: 6.66 73%|███████▎ | 7299/10000 [11:31:08<4:06:47, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.044968076050281525, 'learning_rate': 7.176805824796924e-06, 'epoch': 7.3} 73%|███████▎ | 7299/10000 [11:31:08<4:06:47, 5.48s/it][2025-06-20 01:00:53,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:00:53,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.02 | bwd_microstep: 3311.69 | bwd_inner_microstep: 3310.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-20 01:00:53,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.02 | bwd: 3311.71 | bwd_inner: 3310.90 | bwd_allreduce: 0.77 | step: 6.91 73%|███████▎ | 7300/10000 [11:31:13<4:06:22, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0032573805656284094, 'learning_rate': 7.171835607377206e-06, 'epoch': 7.3} 73%|███████▎ | 7300/10000 [11:31:13<4:06:22, 5.48s/it][2025-06-20 01:00:58,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 01:00:58,541] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.63 | bwd_microstep: 3315.28 | bwd_inner_microstep: 3314.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 01:00:58,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.63 | bwd: 3315.29 | bwd_inner: 3314.50 | bwd_allreduce: 0.75 | step: 6.63 73%|███████▎ | 7301/10000 [11:31:19<4:06:08, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0008405103581026196, 'learning_rate': 7.1668667355723995e-06, 'epoch': 7.3} 73%|███████▎ | 7301/10000 [11:31:19<4:06:08, 5.47s/it][2025-06-20 01:01:04,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:01:04,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.64 | bwd_microstep: 3370.83 | bwd_inner_microstep: 3369.75 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.24 [2025-06-20 01:01:04,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.64 | bwd: 3370.85 | bwd_inner: 3369.75 | bwd_allreduce: 1.04 | step: 7.24 73%|███████▎ | 7302/10000 [11:31:24<4:06:53, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.0232302937656641, 'learning_rate': 7.161899209903702e-06, 'epoch': 7.3} 73%|███████▎ | 7302/10000 [11:31:24<4:06:53, 5.49s/it][2025-06-20 01:01:09,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:01:09,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.91 | bwd_microstep: 3323.00 | bwd_inner_microstep: 3322.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-20 01:01:09,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.91 | bwd: 3323.01 | bwd_inner: 3322.21 | bwd_allreduce: 0.76 | step: 6.69 73%|███████▎ | 7303/10000 [11:31:30<4:06:29, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001240054378286004, 'learning_rate': 7.156933030892202e-06, 'epoch': 7.3} 
73%|███████▎ | 7303/10000 [11:31:30<4:06:29, 5.48s/it][2025-06-20 01:01:15,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:01:15,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.11 | bwd_microstep: 3319.80 | bwd_inner_microstep: 3318.83 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.49 [2025-06-20 01:01:15,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.11 | bwd: 3319.81 | bwd_inner: 3318.83 | bwd_allreduce: 0.94 | step: 7.50 73%|███████▎ | 7304/10000 [11:31:35<4:06:06, 5.48s/it] {'loss': 0.0245, 'grad_norm': 3.031235933303833, 'learning_rate': 7.151968199058827e-06, 'epoch': 7.3} 73%|███████▎ | 7304/10000 [11:31:35<4:06:06, 5.48s/it][2025-06-20 01:01:20,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:01:20,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.92 | bwd_microstep: 3319.02 | bwd_inner_microstep: 3318.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-20 01:01:20,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.92 | bwd: 3319.03 | bwd_inner: 3318.23 | bwd_allreduce: 0.76 | step: 6.64 73%|███████▎ | 7305/10000 [11:31:41<4:05:48, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.002142710844054818, 'learning_rate': 7.147004714924353e-06, 'epoch': 7.3} 73%|███████▎ | 7305/10000 [11:31:41<4:05:48, 5.47s/it][2025-06-20 01:01:25,934] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:01:25,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.14 | bwd_microstep: 3330.12 | bwd_inner_microstep: 3329.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 01:01:25,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd: 2102.14 | bwd: 3330.13 | bwd_inner: 3329.34 | bwd_allreduce: 0.76 | step: 6.63 73%|███████▎ | 7306/10000 [11:31:46<4:05:39, 5.47s/it] {'loss': 0.0174, 'grad_norm': 8.098892211914062, 'learning_rate': 7.142042579009429e-06, 'epoch': 7.31} 73%|███████▎ | 7306/10000 [11:31:46<4:05:39, 5.47s/it][2025-06-20 01:01:31,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:01:31,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.86 | bwd_microstep: 3327.52 | bwd_inner_microstep: 3326.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 01:01:31,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.86 | bwd: 3327.54 | bwd_inner: 3326.73 | bwd_allreduce: 0.76 | step: 6.69 73%|███████▎ | 7307/10000 [11:31:52<4:05:33, 5.47s/it] {'loss': 0.0226, 'grad_norm': 3.5020577907562256, 'learning_rate': 7.137081791834566e-06, 'epoch': 7.31} 73%|███████▎ | 7307/10000 [11:31:52<4:05:33, 5.47s/it][2025-06-20 01:01:36,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:01:36,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.17 | bwd_microstep: 3323.74 | bwd_inner_microstep: 3322.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 01:01:36,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.17 | bwd: 3323.76 | bwd_inner: 3322.95 | bwd_allreduce: 0.76 | step: 6.81 73%|███████▎ | 7308/10000 [11:31:57<4:05:24, 5.47s/it] {'loss': 0.0005, 'grad_norm': 0.11782088875770569, 'learning_rate': 7.132122353920121e-06, 'epoch': 7.31} 73%|███████▎ | 7308/10000 [11:31:57<4:05:24, 5.47s/it][2025-06-20 01:01:42,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:01:42,406] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.75 | bwd_microstep: 3366.75 | bwd_inner_microstep: 3365.77 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.63 [2025-06-20 01:01:42,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.75 | bwd: 3366.77 | bwd_inner: 3365.77 | bwd_allreduce: 0.95 | step: 7.63 73%|███████▎ | 7309/10000 [11:32:03<4:06:12, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.6563687324523926, 'learning_rate': 7.127164265786328e-06, 'epoch': 7.31} 73%|███████▎ | 7309/10000 [11:32:03<4:06:12, 5.49s/it][2025-06-20 01:01:47,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:01:47,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.26 | bwd_microstep: 3378.74 | bwd_inner_microstep: 3377.79 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.88 [2025-06-20 01:01:47,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.26 | bwd: 3378.75 | bwd_inner: 3377.79 | bwd_allreduce: 0.92 | step: 6.89 73%|███████▎ | 7310/10000 [11:32:08<4:07:05, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0018679507775232196, 'learning_rate': 7.1222075279532535e-06, 'epoch': 7.31} 73%|███████▎ | 7310/10000 [11:32:08<4:07:05, 5.51s/it][2025-06-20 01:01:53,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:01:53,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.05 | bwd_microstep: 3320.01 | bwd_inner_microstep: 3319.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-20 01:01:53,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.05 | bwd: 3320.03 | bwd_inner: 3319.23 | bwd_allreduce: 0.76 | step: 6.72 73%|███████▎ | 7311/10000 [11:32:14<4:06:28, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.000952119124121964, 'learning_rate': 7.11725214094084e-06, 'epoch': 
7.31} 73%|███████▎ | 7311/10000 [11:32:14<4:06:28, 5.50s/it][2025-06-20 01:01:58,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:01:58,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.05 | bwd_microstep: 3330.42 | bwd_inner_microstep: 3329.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 01:01:58,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.05 | bwd: 3330.43 | bwd_inner: 3329.64 | bwd_allreduce: 0.75 | step: 6.65 73%|███████▎ | 7312/10000 [11:32:19<4:06:05, 5.49s/it] {'loss': 0.0119, 'grad_norm': 2.65580153465271, 'learning_rate': 7.112298105268886e-06, 'epoch': 7.31} 73%|███████▎ | 7312/10000 [11:32:19<4:06:05, 5.49s/it][2025-06-20 01:02:04,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:02:04,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.48 | bwd_microstep: 3375.53 | bwd_inner_microstep: 3374.61 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.23 [2025-06-20 01:02:04,464] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.48 | bwd: 3375.55 | bwd_inner: 3374.61 | bwd_allreduce: 0.89 | step: 7.23 73%|███████▎ | 7313/10000 [11:32:25<4:06:42, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0019204685231670737, 'learning_rate': 7.107345421457048e-06, 'epoch': 7.31} 73%|███████▎ | 7313/10000 [11:32:25<4:06:42, 5.51s/it][2025-06-20 01:02:09,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 01:02:09,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.12 | bwd_microstep: 3333.47 | bwd_inner_microstep: 3332.40 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.60 [2025-06-20 01:02:09,955] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2114.12 | bwd: 3333.49 | bwd_inner: 3332.40 | bwd_allreduce: 1.03 | step: 7.61 73%|███████▎ | 7314/10000 [11:32:30<4:06:24, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.022242479026317596, 'learning_rate': 7.102394090024845e-06, 'epoch': 7.31} 73%|███████▎ | 7314/10000 [11:32:30<4:06:24, 5.50s/it][2025-06-20 01:02:15,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:02:15,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.80 | bwd_microstep: 3328.31 | bwd_inner_microstep: 3327.14 | bwd_allreduce_microstep: 1.11 | step_microstep: 7.29 [2025-06-20 01:02:15,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.80 | bwd: 3328.33 | bwd_inner: 3327.14 | bwd_allreduce: 1.14 | step: 7.29 73%|███████▎ | 7315/10000 [11:32:36<4:06:06, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00848374143242836, 'learning_rate': 7.097444111491636e-06, 'epoch': 7.32} 73%|███████▎ | 7315/10000 [11:32:36<4:06:06, 5.50s/it][2025-06-20 01:02:21,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:02:21,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.27 | bwd_microstep: 3376.67 | bwd_inner_microstep: 3375.89 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 01:02:21,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.27 | bwd: 3376.68 | bwd_inner: 3375.89 | bwd_allreduce: 0.75 | step: 6.58 73%|███████▎ | 7316/10000 [11:32:41<4:06:46, 5.52s/it] {'loss': 0.0051, 'grad_norm': 1.4409360885620117, 'learning_rate': 7.092495486376656e-06, 'epoch': 7.32} 73%|███████▎ | 7316/10000 [11:32:41<4:06:46, 5.52s/it][2025-06-20 01:02:26,478] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:02:26,479] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.21 | bwd_microstep: 3326.66 | bwd_inner_microstep: 3325.71 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.27 [2025-06-20 01:02:26,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.21 | bwd: 3326.67 | bwd_inner: 3325.71 | bwd_allreduce: 0.92 | step: 7.28 73%|███████▎ | 7317/10000 [11:32:47<4:06:09, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.0200046319514513, 'learning_rate': 7.087548215198994e-06, 'epoch': 7.32} 73%|███████▎ | 7317/10000 [11:32:47<4:06:09, 5.50s/it][2025-06-20 01:02:31,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:02:31,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.81 | bwd_microstep: 3336.33 | bwd_inner_microstep: 3335.38 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.36 [2025-06-20 01:02:31,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.81 | bwd: 3336.35 | bwd_inner: 3335.38 | bwd_allreduce: 0.92 | step: 7.36 73%|███████▎ | 7318/10000 [11:32:52<4:05:53, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0020604233723133802, 'learning_rate': 7.082602298477596e-06, 'epoch': 7.32} 73%|███████▎ | 7318/10000 [11:32:52<4:05:53, 5.50s/it][2025-06-20 01:02:37,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 01:02:37,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.88 | bwd_microstep: 3339.11 | bwd_inner_microstep: 3338.00 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.59 [2025-06-20 01:02:37,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.88 | bwd: 3339.13 | bwd_inner: 3338.00 | bwd_allreduce: 1.08 | step: 7.59 73%|███████▎ | 7319/10000 [11:32:58<4:05:48, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.008063218556344509, 'learning_rate': 7.077657736731267e-06, 'epoch': 
7.32} 73%|███████▎ | 7319/10000 [11:32:58<4:05:48, 5.50s/it][2025-06-20 01:02:43,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:02:43,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.54 | bwd_microstep: 3385.38 | bwd_inner_microstep: 3384.56 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.73 [2025-06-20 01:02:43,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.54 | bwd: 3385.40 | bwd_inner: 3384.56 | bwd_allreduce: 0.79 | step: 6.73 73%|███████▎ | 7320/10000 [11:33:03<4:06:35, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.014697366394102573, 'learning_rate': 7.072714530478666e-06, 'epoch': 7.32} 73%|███████▎ | 7320/10000 [11:33:03<4:06:35, 5.52s/it][2025-06-20 01:02:48,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:02:48,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.96 | bwd_microstep: 3337.19 | bwd_inner_microstep: 3336.39 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-20 01:02:48,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.96 | bwd: 3337.20 | bwd_inner: 3336.39 | bwd_allreduce: 0.77 | step: 6.79 73%|███████▎ | 7321/10000 [11:33:09<4:06:02, 5.51s/it] {'loss': 0.0018, 'grad_norm': 0.35630548000335693, 'learning_rate': 7.067772680238311e-06, 'epoch': 7.32} 73%|███████▎ | 7321/10000 [11:33:09<4:06:02, 5.51s/it][2025-06-20 01:02:54,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:02:54,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.70 | bwd_microstep: 3379.15 | bwd_inner_microstep: 3378.33 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.17 [2025-06-20 01:02:54,084] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2139.70 | bwd: 3379.17 | bwd_inner: 3378.33 | bwd_allreduce: 0.79 | step: 7.17 73%|███████▎ | 7322/10000 [11:33:14<4:06:34, 5.52s/it] {'loss': 0.0244, 'grad_norm': 3.721820831298828, 'learning_rate': 7.062832186528585e-06, 'epoch': 7.32} 73%|███████▎ | 7322/10000 [11:33:14<4:06:34, 5.52s/it][2025-06-20 01:02:59,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:02:59,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.25 | bwd_microstep: 3372.68 | bwd_inner_microstep: 3371.89 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-20 01:02:59,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.25 | bwd: 3372.69 | bwd_inner: 3371.89 | bwd_allreduce: 0.76 | step: 6.76 73%|███████▎ | 7323/10000 [11:33:20<4:06:43, 5.53s/it] {'loss': 0.0032, 'grad_norm': 0.6460714340209961, 'learning_rate': 7.057893049867717e-06, 'epoch': 7.32} 73%|███████▎ | 7323/10000 [11:33:20<4:06:43, 5.53s/it][2025-06-20 01:03:05,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:03:05,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.23 | bwd_microstep: 3373.62 | bwd_inner_microstep: 3372.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 01:03:05,170] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.23 | bwd: 3373.63 | bwd_inner: 3372.83 | bwd_allreduce: 0.76 | step: 6.70 73%|███████▎ | 7324/10000 [11:33:25<4:06:48, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0003305888967588544, 'learning_rate': 7.052955270773809e-06, 'epoch': 7.32} 73%|███████▎ | 7324/10000 [11:33:25<4:06:48, 5.53s/it][2025-06-20 01:03:10,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:03:10,655] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.34 | bwd_microstep: 3336.25 | bwd_inner_microstep: 3335.40 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.24 [2025-06-20 01:03:10,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.34 | bwd: 3336.27 | bwd_inner: 3335.40 | bwd_allreduce: 0.82 | step: 7.24 73%|███████▎ | 7325/10000 [11:33:31<4:06:04, 5.52s/it] {'loss': 0.0006, 'grad_norm': 0.13960325717926025, 'learning_rate': 7.0480188497648e-06, 'epoch': 7.33} 73%|███████▎ | 7325/10000 [11:33:31<4:06:04, 5.52s/it][2025-06-20 01:03:16,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:03:16,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.42 | bwd_microstep: 3320.48 | bwd_inner_microstep: 3319.68 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 01:03:16,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.42 | bwd: 3320.50 | bwd_inner: 3319.68 | bwd_allreduce: 0.77 | step: 6.81 73%|███████▎ | 7326/10000 [11:33:36<4:05:18, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.003984123934060335, 'learning_rate': 7.0430837873584975e-06, 'epoch': 7.33} 73%|███████▎ | 7326/10000 [11:33:36<4:05:18, 5.50s/it][2025-06-20 01:03:21,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:03:21,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.94 | bwd_microstep: 3386.24 | bwd_inner_microstep: 3385.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-20 01:03:21,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.94 | bwd: 3386.25 | bwd_inner: 3385.43 | bwd_allreduce: 0.78 | step: 7.16 73%|███████▎ | 7327/10000 [11:33:42<4:06:00, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0045157792046666145, 'learning_rate': 7.0381500840725725e-06, 'epoch': 
7.33} 73%|███████▎ | 7327/10000 [11:33:42<4:06:00, 5.52s/it][2025-06-20 01:03:27,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:03:27,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.02 | bwd_microstep: 3327.35 | bwd_inner_microstep: 3326.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 01:03:27,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.02 | bwd: 3327.37 | bwd_inner: 3326.56 | bwd_allreduce: 0.76 | step: 6.78 73%|███████▎ | 7328/10000 [11:33:47<4:05:17, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0001864963851403445, 'learning_rate': 7.0332177404245476e-06, 'epoch': 7.33} 73%|███████▎ | 7328/10000 [11:33:47<4:05:17, 5.51s/it][2025-06-20 01:03:32,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:03:32,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.72 | bwd_microstep: 3374.28 | bwd_inner_microstep: 3373.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 01:03:32,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.72 | bwd: 3374.30 | bwd_inner: 3373.48 | bwd_allreduce: 0.77 | step: 6.76 73%|███████▎ | 7329/10000 [11:33:53<4:05:38, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.024503014981746674, 'learning_rate': 7.028286756931806e-06, 'epoch': 7.33} 73%|███████▎ | 7329/10000 [11:33:53<4:05:38, 5.52s/it][2025-06-20 01:03:38,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:03:38,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.17 | bwd_microstep: 3327.89 | bwd_inner_microstep: 3327.07 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.16 [2025-06-20 01:03:38,183] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2108.17 | bwd: 3327.90 | bwd_inner: 3327.07 | bwd_allreduce: 0.79 | step: 7.17 73%|███████▎ | 7330/10000 [11:33:58<4:05:02, 5.51s/it] {'loss': 0.0425, 'grad_norm': 8.68882942199707, 'learning_rate': 7.023357134111572e-06, 'epoch': 7.33} 73%|███████▎ | 7330/10000 [11:33:58<4:05:02, 5.51s/it][2025-06-20 01:03:43,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:03:43,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.13 | bwd_microstep: 3328.88 | bwd_inner_microstep: 3328.05 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.05 [2025-06-20 01:03:43,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.13 | bwd: 3328.90 | bwd_inner: 3328.05 | bwd_allreduce: 0.81 | step: 7.05 73%|███████▎ | 7331/10000 [11:34:04<4:04:34, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.021968290209770203, 'learning_rate': 7.018428872480951e-06, 'epoch': 7.33} 73%|███████▎ | 7331/10000 [11:34:04<4:04:34, 5.50s/it][2025-06-20 01:03:49,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:03:49,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.19 | bwd_microstep: 3335.02 | bwd_inner_microstep: 3334.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 01:03:49,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.19 | bwd: 3335.03 | bwd_inner: 3334.23 | bwd_allreduce: 0.76 | step: 6.79 73%|███████▎ | 7332/10000 [11:34:09<4:04:17, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0035087617579847574, 'learning_rate': 7.013501972556891e-06, 'epoch': 7.33} 73%|███████▎ | 7332/10000 [11:34:09<4:04:17, 5.49s/it][2025-06-20 01:03:54,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:03:54,729] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.91 | bwd_microstep: 3403.56 | bwd_inner_microstep: 3402.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-20 01:03:54,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.91 | bwd: 3403.58 | bwd_inner: 3402.76 | bwd_allreduce: 0.77 | step: 7.00 73%|███████▎ | 7333/10000 [11:34:15<4:05:23, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0023403477389365435, 'learning_rate': 7.0085764348562e-06, 'epoch': 7.33} 73%|███████▎ | 7333/10000 [11:34:15<4:05:23, 5.52s/it][2025-06-20 01:04:00,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:04:00,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.95 | bwd_microstep: 3398.89 | bwd_inner_microstep: 3398.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 01:04:00,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.95 | bwd: 3398.90 | bwd_inner: 3398.09 | bwd_allreduce: 0.77 | step: 7.02 73%|███████▎ | 7334/10000 [11:34:21<4:06:10, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0013196779182180762, 'learning_rate': 7.0036522598955545e-06, 'epoch': 7.33} 73%|███████▎ | 7334/10000 [11:34:21<4:06:10, 5.54s/it][2025-06-20 01:04:05,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:04:05,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.90 | bwd_microstep: 3319.31 | bwd_inner_microstep: 3318.22 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.29 [2025-06-20 01:04:05,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.90 | bwd: 3319.33 | bwd_inner: 3318.22 | bwd_allreduce: 1.05 | step: 7.29 73%|███████▎ | 7335/10000 [11:34:26<4:05:06, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00018451821233611554, 'learning_rate': 6.998729448191461e-06, 'epoch': 
7.33} 73%|███████▎ | 7335/10000 [11:34:26<4:05:06, 5.52s/it][2025-06-20 01:04:11,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:04:11,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.38 | bwd_microstep: 3335.34 | bwd_inner_microstep: 3334.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-20 01:04:11,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.38 | bwd: 3335.35 | bwd_inner: 3334.53 | bwd_allreduce: 0.78 | step: 7.27 73%|███████▎ | 7336/10000 [11:34:32<4:04:40, 5.51s/it] {'loss': 0.0119, 'grad_norm': 2.14512038230896, 'learning_rate': 6.993808000260303e-06, 'epoch': 7.34} 73%|███████▎ | 7336/10000 [11:34:32<4:04:40, 5.51s/it][2025-06-20 01:04:16,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:04:16,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.65 | bwd_microstep: 3332.63 | bwd_inner_microstep: 3331.53 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.23 [2025-06-20 01:04:16,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.66 | bwd: 3332.65 | bwd_inner: 3331.53 | bwd_allreduce: 1.07 | step: 7.22 73%|███████▎ | 7337/10000 [11:34:37<4:04:11, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0008854677435010672, 'learning_rate': 6.988887916618328e-06, 'epoch': 7.34} 73%|███████▎ | 7337/10000 [11:34:37<4:04:11, 5.50s/it][2025-06-20 01:04:22,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:04:22,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.50 | bwd_microstep: 3331.40 | bwd_inner_microstep: 3330.59 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-20 01:04:22,236] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2106.50 | bwd: 3331.42 | bwd_inner: 3330.59 | bwd_allreduce: 0.78 | step: 7.12 73%|███████▎ | 7338/10000 [11:34:43<4:03:47, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.06838325411081314, 'learning_rate': 6.983969197781624e-06, 'epoch': 7.34} 73%|███████▎ | 7338/10000 [11:34:43<4:03:47, 5.49s/it][2025-06-20 01:04:27,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:04:27,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.89 | bwd_microstep: 3320.04 | bwd_inner_microstep: 3319.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-20 01:04:27,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.89 | bwd: 3320.05 | bwd_inner: 3319.25 | bwd_allreduce: 0.76 | step: 6.78 73%|███████▎ | 7339/10000 [11:34:48<4:03:18, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.042699724435806274, 'learning_rate': 6.979051844266147e-06, 'epoch': 7.34} 73%|███████▎ | 7339/10000 [11:34:48<4:03:18, 5.49s/it][2025-06-20 01:04:33,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:04:33,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.33 | bwd_microstep: 3337.15 | bwd_inner_microstep: 3336.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-20 01:04:33,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.33 | bwd: 3337.17 | bwd_inner: 3336.34 | bwd_allreduce: 0.78 | step: 6.94 73%|███████▎ | 7340/10000 [11:34:53<4:03:13, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002740760101005435, 'learning_rate': 6.974135856587696e-06, 'epoch': 7.34} 73%|███████▎ | 7340/10000 [11:34:53<4:03:13, 5.49s/it][2025-06-20 01:04:38,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:04:38,659] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.97 | bwd_microstep: 3322.44 | bwd_inner_microstep: 3321.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 01:04:38,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.97 | bwd: 3322.46 | bwd_inner: 3321.65 | bwd_allreduce: 0.76 | step: 6.86 73%|███████▎ | 7341/10000 [11:34:59<4:02:55, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0014977280516177416, 'learning_rate': 6.969221235261936e-06, 'epoch': 7.34} 73%|███████▎ | 7341/10000 [11:34:59<4:02:55, 5.48s/it][2025-06-20 01:04:44,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:04:44,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.89 | bwd_microstep: 3336.86 | bwd_inner_microstep: 3336.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 01:04:44,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.89 | bwd: 3336.88 | bwd_inner: 3336.08 | bwd_allreduce: 0.76 | step: 6.61 73%|███████▎ | 7342/10000 [11:35:04<4:02:48, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.003806498134508729, 'learning_rate': 6.9643079808043926e-06, 'epoch': 7.34} 73%|███████▎ | 7342/10000 [11:35:04<4:02:48, 5.48s/it][2025-06-20 01:04:49,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:04:49,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.73 | bwd_microstep: 3330.57 | bwd_inner_microstep: 3329.73 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.96 [2025-06-20 01:04:49,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.73 | bwd: 3330.58 | bwd_inner: 3329.73 | bwd_allreduce: 0.81 | step: 6.96 73%|███████▎ | 7343/10000 [11:35:10<4:02:37, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00148267752956599, 'learning_rate': 6.95939609373044e-06, 'epoch': 7.34} 
73%|███████▎ | 7343/10000 [11:35:10<4:02:37, 5.48s/it][2025-06-20 01:04:55,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:04:55,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.00 | bwd_microstep: 3324.21 | bwd_inner_microstep: 3323.31 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.05 [2025-06-20 01:04:55,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.00 | bwd: 3324.23 | bwd_inner: 3323.31 | bwd_allreduce: 0.87 | step: 7.05 73%|███████▎ | 7344/10000 [11:35:15<4:02:30, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0004868465766776353, 'learning_rate': 6.954485574555321e-06, 'epoch': 7.34} 73%|███████▎ | 7344/10000 [11:35:15<4:02:30, 5.48s/it][2025-06-20 01:05:00,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:05:00,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.73 | bwd_microstep: 3323.54 | bwd_inner_microstep: 3322.72 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.61 [2025-06-20 01:05:00,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.73 | bwd: 3323.56 | bwd_inner: 3322.72 | bwd_allreduce: 0.80 | step: 7.61 73%|███████▎ | 7345/10000 [11:35:21<4:02:20, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.005858240649104118, 'learning_rate': 6.949576423794113e-06, 'epoch': 7.34} 73%|███████▎ | 7345/10000 [11:35:21<4:02:20, 5.48s/it][2025-06-20 01:05:06,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:05:06,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.70 | bwd_microstep: 3378.10 | bwd_inner_microstep: 3377.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 01:05:06,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) 
| fwd: 2132.70 | bwd: 3378.11 | bwd_inner: 3377.31 | bwd_allreduce: 0.76 | step: 6.61 73%|███████▎ | 7346/10000 [11:35:26<4:03:12, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.03806721419095993, 'learning_rate': 6.944668641961767e-06, 'epoch': 7.35} 73%|███████▎ | 7346/10000 [11:35:26<4:03:12, 5.50s/it][2025-06-20 01:05:11,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:05:11,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.14 | bwd_microstep: 3328.22 | bwd_inner_microstep: 3327.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 01:05:11,586] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.14 | bwd: 3328.23 | bwd_inner: 3327.43 | bwd_allreduce: 0.76 | step: 6.69 73%|███████▎ | 7347/10000 [11:35:32<4:02:48, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004650604911148548, 'learning_rate': 6.939762229573091e-06, 'epoch': 7.35} 73%|███████▎ | 7347/10000 [11:35:32<4:02:48, 5.49s/it][2025-06-20 01:05:17,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:05:17,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.92 | bwd_microstep: 3367.73 | bwd_inner_microstep: 3366.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-20 01:05:17,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.92 | bwd: 3367.74 | bwd_inner: 3366.92 | bwd_allreduce: 0.78 | step: 7.22 73%|███████▎ | 7348/10000 [11:35:37<4:03:24, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0016510230489075184, 'learning_rate': 6.934857187142743e-06, 'epoch': 7.35} 73%|███████▎ | 7348/10000 [11:35:37<4:03:24, 5.51s/it][2025-06-20 01:05:22,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:05:22,597] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.31 | bwd_microstep: 3316.89 | bwd_inner_microstep: 3316.04 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.93 [2025-06-20 01:05:22,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.31 | bwd: 3316.90 | bwd_inner: 3316.04 | bwd_allreduce: 0.82 | step: 6.94 73%|███████▎ | 7349/10000 [11:35:43<4:02:47, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002626464469358325, 'learning_rate': 6.929953515185243e-06, 'epoch': 7.35} 73%|███████▎ | 7349/10000 [11:35:43<4:02:47, 5.50s/it][2025-06-20 01:05:28,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:05:28,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.57 | bwd_microstep: 3317.57 | bwd_inner_microstep: 3316.49 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.30 [2025-06-20 01:05:28,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.57 | bwd: 3317.58 | bwd_inner: 3316.49 | bwd_allreduce: 1.05 | step: 7.32 74%|███████▎ | 7350/10000 [11:35:48<4:02:19, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0005169589421711862, 'learning_rate': 6.925051214214955e-06, 'epoch': 7.35} 74%|███████▎ | 7350/10000 [11:35:48<4:02:19, 5.49s/it][2025-06-20 01:05:33,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:05:33,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.57 | bwd_microstep: 3325.83 | bwd_inner_microstep: 3325.03 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-20 01:05:33,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.57 | bwd: 3325.85 | bwd_inner: 3325.03 | bwd_allreduce: 0.78 | step: 7.22 74%|███████▎ | 7351/10000 [11:35:54<4:02:07, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0030786804854869843, 'learning_rate': 6.920150284746112e-06, 'epoch': 
7.35} 74%|███████▎ | 7351/10000 [11:35:54<4:02:07, 5.48s/it][2025-06-20 01:05:39,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:05:39,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.11 | bwd_microstep: 3363.25 | bwd_inner_microstep: 3362.45 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-20 01:05:39,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.11 | bwd: 3363.27 | bwd_inner: 3362.45 | bwd_allreduce: 0.78 | step: 6.93 74%|███████▎ | 7352/10000 [11:35:59<4:02:43, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0024129352532327175, 'learning_rate': 6.915250727292799e-06, 'epoch': 7.35} 74%|███████▎ | 7352/10000 [11:35:59<4:02:43, 5.50s/it][2025-06-20 01:05:44,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:05:44,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.40 | bwd_microstep: 3391.87 | bwd_inner_microstep: 3391.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 01:05:44,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.40 | bwd: 3391.89 | bwd_inner: 3391.07 | bwd_allreduce: 0.77 | step: 6.81 74%|███████▎ | 7353/10000 [11:36:05<4:03:30, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0021651671268045902, 'learning_rate': 6.910352542368954e-06, 'epoch': 7.35} 74%|███████▎ | 7353/10000 [11:36:05<4:03:30, 5.52s/it][2025-06-20 01:05:50,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:05:50,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.33 | bwd_microstep: 3393.31 | bwd_inner_microstep: 3392.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-20 01:05:50,211] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2133.33 | bwd: 3393.32 | bwd_inner: 3392.50 | bwd_allreduce: 0.77 | step: 6.84 74%|███████▎ | 7354/10000 [11:36:11<4:04:02, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.006172714754939079, 'learning_rate': 6.905455730488379e-06, 'epoch': 7.35} 74%|███████▎ | 7354/10000 [11:36:11<4:04:02, 5.53s/it][2025-06-20 01:05:55,667] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:05:55,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.55 | bwd_microstep: 3316.58 | bwd_inner_microstep: 3315.76 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.27 [2025-06-20 01:05:55,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.55 | bwd: 3316.59 | bwd_inner: 3315.76 | bwd_allreduce: 0.79 | step: 7.27 74%|███████▎ | 7355/10000 [11:36:16<4:02:56, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.003589990083128214, 'learning_rate': 6.900560292164722e-06, 'epoch': 7.36} 74%|███████▎ | 7355/10000 [11:36:16<4:02:56, 5.51s/it][2025-06-20 01:06:01,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:06:01,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.30 | bwd_microstep: 3372.27 | bwd_inner_microstep: 3371.47 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 01:06:01,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.30 | bwd: 3372.28 | bwd_inner: 3371.47 | bwd_allreduce: 0.77 | step: 6.82 74%|███████▎ | 7356/10000 [11:36:22<4:03:12, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00363254570402205, 'learning_rate': 6.895666227911495e-06, 'epoch': 7.36} 74%|███████▎ | 7356/10000 [11:36:22<4:03:12, 5.52s/it][2025-06-20 01:06:06,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:06:06,664] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.23 | bwd_microstep: 3314.75 | bwd_inner_microstep: 3313.89 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.04 [2025-06-20 01:06:06,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.23 | bwd: 3314.77 | bwd_inner: 3313.89 | bwd_allreduce: 0.83 | step: 7.03 74%|███████▎ | 7357/10000 [11:36:27<4:02:18, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.041940342634916306, 'learning_rate': 6.890773538242061e-06, 'epoch': 7.36} 74%|███████▎ | 7357/10000 [11:36:27<4:02:18, 5.50s/it][2025-06-20 01:06:12,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:06:12,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.54 | bwd_microstep: 3325.47 | bwd_inner_microstep: 3324.54 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.11 [2025-06-20 01:06:12,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.54 | bwd: 3325.49 | bwd_inner: 3324.54 | bwd_allreduce: 0.91 | step: 7.11 74%|███████▎ | 7358/10000 [11:36:32<4:01:53, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006762630771845579, 'learning_rate': 6.885882223669644e-06, 'epoch': 7.36} 74%|███████▎ | 7358/10000 [11:36:32<4:01:53, 5.49s/it][2025-06-20 01:06:17,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:06:17,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.96 | bwd_microstep: 3372.58 | bwd_inner_microstep: 3371.77 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-20 01:06:17,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.96 | bwd: 3372.60 | bwd_inner: 3371.77 | bwd_allreduce: 0.78 | step: 7.12 74%|███████▎ | 7359/10000 [11:36:38<4:02:27, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.011884627863764763, 'learning_rate': 6.880992284707322e-06, 'epoch': 
7.36} 74%|███████▎ | 7359/10000 [11:36:38<4:02:27, 5.51s/it][2025-06-20 01:06:23,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:06:23,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.18 | bwd_microstep: 3314.75 | bwd_inner_microstep: 3313.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 01:06:23,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.19 | bwd: 3314.77 | bwd_inner: 3313.96 | bwd_allreduce: 0.76 | step: 6.77 74%|███████▎ | 7360/10000 [11:36:43<4:01:46, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0016164795961230993, 'learning_rate': 6.876103721868015e-06, 'epoch': 7.36} 74%|███████▎ | 7360/10000 [11:36:43<4:01:46, 5.49s/it][2025-06-20 01:06:28,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:06:28,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.62 | bwd_microstep: 3361.45 | bwd_inner_microstep: 3360.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 01:06:28,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.63 | bwd: 3361.46 | bwd_inner: 3360.66 | bwd_allreduce: 0.76 | step: 6.66 74%|███████▎ | 7361/10000 [11:36:49<4:02:08, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.010318208485841751, 'learning_rate': 6.871216535664518e-06, 'epoch': 7.36} 74%|███████▎ | 7361/10000 [11:36:49<4:02:08, 5.51s/it][2025-06-20 01:06:34,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:06:34,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.65 | bwd_microstep: 3364.72 | bwd_inner_microstep: 3363.81 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.09 [2025-06-20 01:06:34,215] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2134.65 | bwd: 3364.73 | bwd_inner: 3363.81 | bwd_allreduce: 0.88 | step: 7.10 74%|███████▎ | 7362/10000 [11:36:55<4:02:29, 5.52s/it] {'loss': 0.0012, 'grad_norm': 0.41685307025909424, 'learning_rate': 6.866330726609471e-06, 'epoch': 7.36} 74%|███████▎ | 7362/10000 [11:36:55<4:02:29, 5.52s/it][2025-06-20 01:06:39,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:06:39,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.34 | bwd_microstep: 3317.68 | bwd_inner_microstep: 3316.84 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.93 [2025-06-20 01:06:39,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.34 | bwd: 3317.69 | bwd_inner: 3316.84 | bwd_allreduce: 0.79 | step: 6.93 74%|███████▎ | 7363/10000 [11:37:00<4:01:53, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002131971064954996, 'learning_rate': 6.861446295215379e-06, 'epoch': 7.36} 74%|███████▎ | 7363/10000 [11:37:00<4:01:53, 5.50s/it][2025-06-20 01:06:45,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:06:45,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.06 | bwd_microstep: 3318.73 | bwd_inner_microstep: 3317.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 01:06:45,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.06 | bwd: 3318.75 | bwd_inner: 3317.94 | bwd_allreduce: 0.77 | step: 6.85 74%|███████▎ | 7364/10000 [11:37:05<4:01:11, 5.49s/it] {'loss': 0.0012, 'grad_norm': 0.4784833490848541, 'learning_rate': 6.856563241994592e-06, 'epoch': 7.36} 74%|███████▎ | 7364/10000 [11:37:05<4:01:11, 5.49s/it][2025-06-20 01:06:50,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:06:50,608] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.25 | bwd_microstep: 3312.71 | bwd_inner_microstep: 3311.78 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.08 [2025-06-20 01:06:50,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.25 | bwd: 3312.72 | bwd_inner: 3311.78 | bwd_allreduce: 0.90 | step: 7.08 74%|███████▎ | 7365/10000 [11:37:11<4:00:40, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002294055884703994, 'learning_rate': 6.851681567459327e-06, 'epoch': 7.37} 74%|███████▎ | 7365/10000 [11:37:11<4:00:40, 5.48s/it][2025-06-20 01:06:56,067] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:06:56,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.14 | bwd_microstep: 3308.57 | bwd_inner_microstep: 3307.74 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.75 [2025-06-20 01:06:56,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.14 | bwd: 3308.58 | bwd_inner: 3307.74 | bwd_allreduce: 0.80 | step: 7.76 74%|███████▎ | 7366/10000 [11:37:16<4:00:18, 5.47s/it] {'loss': 0.0012, 'grad_norm': 0.22659628093242645, 'learning_rate': 6.8468012721216346e-06, 'epoch': 7.37} 74%|███████▎ | 7366/10000 [11:37:16<4:00:18, 5.47s/it][2025-06-20 01:07:01,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:07:01,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.74 | bwd_microstep: 3364.43 | bwd_inner_microstep: 3363.47 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.58 [2025-06-20 01:07:01,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.74 | bwd: 3364.44 | bwd_inner: 3363.47 | bwd_allreduce: 0.93 | step: 6.58 74%|███████▎ | 7367/10000 [11:37:22<4:00:56, 5.49s/it] {'loss': 0.0051, 'grad_norm': 1.255481481552124, 'learning_rate': 6.841922356493445e-06, 'epoch': 
7.37} 74%|███████▎ | 7367/10000 [11:37:22<4:00:56, 5.49s/it][2025-06-20 01:07:07,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:07:07,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.33 | bwd_microstep: 3369.56 | bwd_inner_microstep: 3368.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 01:07:07,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.33 | bwd: 3369.58 | bwd_inner: 3368.77 | bwd_allreduce: 0.77 | step: 6.75 74%|███████▎ | 7368/10000 [11:37:27<4:01:28, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.015365853905677795, 'learning_rate': 6.837044821086532e-06, 'epoch': 7.37} 74%|███████▎ | 7368/10000 [11:37:27<4:01:28, 5.50s/it][2025-06-20 01:07:12,595] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:07:12,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.76 | bwd_microstep: 3320.87 | bwd_inner_microstep: 3320.07 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.38 [2025-06-20 01:07:12,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.76 | bwd: 3320.89 | bwd_inner: 3320.07 | bwd_allreduce: 0.78 | step: 7.38 74%|███████▎ | 7369/10000 [11:37:33<4:00:49, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.4909753203392029, 'learning_rate': 6.8321686664125246e-06, 'epoch': 7.37} 74%|███████▎ | 7369/10000 [11:37:33<4:00:49, 5.49s/it][2025-06-20 01:07:18,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:07:18,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.24 | bwd_microstep: 3356.67 | bwd_inner_microstep: 3355.70 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.31 [2025-06-20 01:07:18,114] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2122.24 | bwd: 3356.69 | bwd_inner: 3355.70 | bwd_allreduce: 0.94 | step: 7.32 74%|███████▎ | 7370/10000 [11:37:38<4:01:03, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0017162295989692211, 'learning_rate': 6.827293892982916e-06, 'epoch': 7.37} 74%|███████▎ | 7370/10000 [11:37:38<4:01:03, 5.50s/it][2025-06-20 01:07:23,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:07:23,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.38 | bwd_microstep: 3311.43 | bwd_inner_microstep: 3310.56 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.79 [2025-06-20 01:07:23,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.38 | bwd: 3311.45 | bwd_inner: 3310.56 | bwd_allreduce: 0.84 | step: 6.79 74%|███████▎ | 7371/10000 [11:37:44<4:00:30, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004691243637353182, 'learning_rate': 6.822420501309028e-06, 'epoch': 7.37} 74%|███████▎ | 7371/10000 [11:37:44<4:00:30, 5.49s/it][2025-06-20 01:07:29,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:07:29,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.80 | bwd_microstep: 3360.97 | bwd_inner_microstep: 3360.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-20 01:07:29,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.80 | bwd: 3360.98 | bwd_inner: 3360.18 | bwd_allreduce: 0.77 | step: 6.99 74%|███████▎ | 7372/10000 [11:37:49<4:00:53, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.025514941662549973, 'learning_rate': 6.817548491902079e-06, 'epoch': 7.37} 74%|███████▎ | 7372/10000 [11:37:49<4:00:53, 5.50s/it][2025-06-20 01:07:34,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:07:34,555] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.34 | bwd_microstep: 3316.26 | bwd_inner_microstep: 3315.48 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.49 [2025-06-20 01:07:34,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.34 | bwd: 3316.27 | bwd_inner: 3315.48 | bwd_allreduce: 0.75 | step: 6.50 74%|███████▎ | 7373/10000 [11:37:55<4:00:08, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.08553090691566467, 'learning_rate': 6.812677865273112e-06, 'epoch': 7.37} 74%|███████▎ | 7373/10000 [11:37:55<4:00:08, 5.48s/it][2025-06-20 01:07:40,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:07:40,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.58 | bwd_microstep: 3361.41 | bwd_inner_microstep: 3360.56 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.88 [2025-06-20 01:07:40,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.58 | bwd: 3361.42 | bwd_inner: 3360.56 | bwd_allreduce: 0.82 | step: 6.88 74%|███████▎ | 7374/10000 [11:38:00<4:00:33, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.007792017888277769, 'learning_rate': 6.80780862193303e-06, 'epoch': 7.37} 74%|███████▎ | 7374/10000 [11:38:00<4:00:33, 5.50s/it][2025-06-20 01:07:45,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:07:45,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.50 | bwd_microstep: 3319.00 | bwd_inner_microstep: 3318.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 01:07:45,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.50 | bwd: 3319.01 | bwd_inner: 3318.20 | bwd_allreduce: 0.76 | step: 6.73 74%|███████▍ | 7375/10000 [11:38:06<3:59:57, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.053809210658073425, 'learning_rate': 6.802940762392605e-06, 'epoch': 
7.38} 74%|███████▍ | 7375/10000 [11:38:06<3:59:57, 5.48s/it][2025-06-20 01:07:50,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:07:50,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.20 | bwd_microstep: 3319.37 | bwd_inner_microstep: 3318.46 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.50 [2025-06-20 01:07:50,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.20 | bwd: 3319.38 | bwd_inner: 3318.46 | bwd_allreduce: 0.88 | step: 7.51 74%|███████▍ | 7376/10000 [11:38:11<3:59:29, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.016755828633904457, 'learning_rate': 6.798074287162437e-06, 'epoch': 7.38} 74%|███████▍ | 7376/10000 [11:38:11<3:59:29, 5.48s/it][2025-06-20 01:07:56,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:07:56,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.02 | bwd_microstep: 3317.02 | bwd_inner_microstep: 3316.17 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.96 [2025-06-20 01:07:56,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.02 | bwd: 3317.04 | bwd_inner: 3316.17 | bwd_allreduce: 0.81 | step: 6.95 74%|███████▍ | 7377/10000 [11:38:17<3:59:08, 5.47s/it] {'loss': 0.0053, 'grad_norm': 1.2335039377212524, 'learning_rate': 6.793209196753006e-06, 'epoch': 7.38} 74%|███████▍ | 7377/10000 [11:38:17<3:59:08, 5.47s/it][2025-06-20 01:08:01,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:08:01,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.70 | bwd_microstep: 3316.79 | bwd_inner_microstep: 3315.98 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-20 01:08:01,908] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2100.70 | bwd: 3316.80 | bwd_inner: 3315.98 | bwd_allreduce: 0.78 | step: 7.19 74%|███████▍ | 7378/10000 [11:38:22<3:58:56, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.014360691420733929, 'learning_rate': 6.788345491674635e-06, 'epoch': 7.38} 74%|███████▍ | 7378/10000 [11:38:22<3:58:56, 5.47s/it][2025-06-20 01:08:07,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:08:07,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.49 | bwd_microstep: 3316.19 | bwd_inner_microstep: 3315.01 | bwd_allreduce_microstep: 1.11 | step_microstep: 7.39 [2025-06-20 01:08:07,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.49 | bwd: 3316.22 | bwd_inner: 3315.01 | bwd_allreduce: 1.14 | step: 7.39 74%|███████▍ | 7379/10000 [11:38:28<3:58:46, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.032873764634132385, 'learning_rate': 6.783483172437504e-06, 'epoch': 7.38} 74%|███████▍ | 7379/10000 [11:38:28<3:58:46, 5.47s/it][2025-06-20 01:08:12,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:08:12,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.04 | bwd_microstep: 3316.75 | bwd_inner_microstep: 3315.73 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.60 [2025-06-20 01:08:12,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.04 | bwd: 3316.76 | bwd_inner: 3315.73 | bwd_allreduce: 0.99 | step: 7.61 74%|███████▍ | 7380/10000 [11:38:33<3:58:41, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0040222532115876675, 'learning_rate': 6.778622239551655e-06, 'epoch': 7.38} 74%|███████▍ | 7380/10000 [11:38:33<3:58:41, 5.47s/it][2025-06-20 01:08:18,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:08:18,363] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.00 | bwd_microstep: 3362.42 | bwd_inner_microstep: 3361.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 01:08:18,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.00 | bwd: 3362.43 | bwd_inner: 3361.63 | bwd_allreduce: 0.76 | step: 6.76 74%|███████▍ | 7381/10000 [11:38:39<3:59:22, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.05987691879272461, 'learning_rate': 6.773762693526967e-06, 'epoch': 7.38} 74%|███████▍ | 7381/10000 [11:38:39<3:59:22, 5.48s/it][2025-06-20 01:08:23,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:08:23,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.58 | bwd_microstep: 3316.78 | bwd_inner_microstep: 3315.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-20 01:08:23,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.58 | bwd: 3316.80 | bwd_inner: 3315.98 | bwd_allreduce: 0.78 | step: 7.16 74%|███████▍ | 7382/10000 [11:38:44<3:58:53, 5.48s/it] {'loss': 0.0895, 'grad_norm': 10.461544036865234, 'learning_rate': 6.7689045348731845e-06, 'epoch': 7.38} 74%|███████▍ | 7382/10000 [11:38:44<3:58:53, 5.48s/it][2025-06-20 01:08:29,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:08:29,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.21 | bwd_microstep: 3321.62 | bwd_inner_microstep: 3320.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 01:08:29,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.21 | bwd: 3321.64 | bwd_inner: 3320.82 | bwd_allreduce: 0.77 | step: 6.81 74%|███████▍ | 7383/10000 [11:38:50<3:58:42, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.005295102950185537, 'learning_rate': 6.764047764099908e-06, 'epoch': 
7.38} 74%|███████▍ | 7383/10000 [11:38:50<3:58:42, 5.47s/it][2025-06-20 01:08:34,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:08:34,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.23 | bwd_microstep: 3316.38 | bwd_inner_microstep: 3315.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-20 01:08:34,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.23 | bwd: 3316.39 | bwd_inner: 3315.57 | bwd_allreduce: 0.78 | step: 6.98 74%|███████▍ | 7384/10000 [11:38:55<3:58:27, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0009979961905628443, 'learning_rate': 6.759192381716593e-06, 'epoch': 7.38} 74%|███████▍ | 7384/10000 [11:38:55<3:58:27, 5.47s/it][2025-06-20 01:08:40,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:08:40,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.87 | bwd_microstep: 3366.29 | bwd_inner_microstep: 3365.48 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.14 [2025-06-20 01:08:40,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.88 | bwd: 3366.31 | bwd_inner: 3365.47 | bwd_allreduce: 0.79 | step: 7.14 74%|███████▍ | 7385/10000 [11:39:01<3:59:15, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0001981529640033841, 'learning_rate': 6.754338388232551e-06, 'epoch': 7.38} 74%|███████▍ | 7385/10000 [11:39:01<3:59:15, 5.49s/it][2025-06-20 01:08:45,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:08:45,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.69 | bwd_microstep: 3371.47 | bwd_inner_microstep: 3370.66 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.25 [2025-06-20 01:08:45,827] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2133.69 | bwd: 3371.49 | bwd_inner: 3370.66 | bwd_allreduce: 0.79 | step: 7.25 74%|███████▍ | 7386/10000 [11:39:06<3:59:52, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.035100605338811874, 'learning_rate': 6.74948578415693e-06, 'epoch': 7.39} 74%|███████▍ | 7386/10000 [11:39:06<3:59:52, 5.51s/it][2025-06-20 01:08:51,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:08:51,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.91 | bwd_microstep: 3356.44 | bwd_inner_microstep: 3355.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-20 01:08:51,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.91 | bwd: 3356.45 | bwd_inner: 3355.64 | bwd_allreduce: 0.77 | step: 6.83 74%|███████▍ | 7387/10000 [11:39:12<3:59:57, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.009886707179248333, 'learning_rate': 6.744634569998754e-06, 'epoch': 7.39} 74%|███████▍ | 7387/10000 [11:39:12<3:59:57, 5.51s/it][2025-06-20 01:08:56,875] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:08:56,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.60 | bwd_microstep: 3361.85 | bwd_inner_microstep: 3361.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-20 01:08:56,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.60 | bwd: 3361.87 | bwd_inner: 3361.06 | bwd_allreduce: 0.76 | step: 6.65 74%|███████▍ | 7388/10000 [11:39:17<4:00:06, 5.52s/it] {'loss': 0.0008, 'grad_norm': 0.14592844247817993, 'learning_rate': 6.7397847462668905e-06, 'epoch': 7.39} 74%|███████▍ | 7388/10000 [11:39:17<4:00:06, 5.52s/it][2025-06-20 01:09:02,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:09:02,399] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.02 | bwd_microstep: 3360.02 | bwd_inner_microstep: 3359.21 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-20 01:09:02,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.02 | bwd: 3360.04 | bwd_inner: 3359.21 | bwd_allreduce: 0.78 | step: 6.98 74%|███████▍ | 7389/10000 [11:39:23<4:00:07, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.03582063689827919, 'learning_rate': 6.7349363134700664e-06, 'epoch': 7.39} 74%|███████▍ | 7389/10000 [11:39:23<4:00:07, 5.52s/it][2025-06-20 01:09:07,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:09:07,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.14 | bwd_microstep: 3307.65 | bwd_inner_microstep: 3306.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-20 01:09:07,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.14 | bwd: 3307.67 | bwd_inner: 3306.85 | bwd_allreduce: 0.77 | step: 6.97 74%|███████▍ | 7390/10000 [11:39:28<3:59:04, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.5969419479370117, 'learning_rate': 6.730089272116855e-06, 'epoch': 7.39} 74%|███████▍ | 7390/10000 [11:39:28<3:59:04, 5.50s/it][2025-06-20 01:09:13,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:09:13,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.50 | bwd_microstep: 3320.02 | bwd_inner_microstep: 3319.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 01:09:13,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.50 | bwd: 3320.03 | bwd_inner: 3319.23 | bwd_allreduce: 0.76 | step: 6.69 74%|███████▍ | 7391/10000 [11:39:34<3:58:32, 5.49s/it] {'loss': 0.0, 'grad_norm': 3.861971345031634e-05, 'learning_rate': 6.725243622715693e-06, 'epoch': 
7.39} 74%|███████▍ | 7391/10000 [11:39:34<3:58:32, 5.49s/it][2025-06-20 01:09:18,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:09:18,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.48 | bwd_microstep: 3360.76 | bwd_inner_microstep: 3359.93 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-20 01:09:18,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.48 | bwd: 3360.78 | bwd_inner: 3359.93 | bwd_allreduce: 0.79 | step: 7.31 74%|███████▍ | 7392/10000 [11:39:39<3:58:55, 5.50s/it] {'loss': 0.0021, 'grad_norm': 0.7501225471496582, 'learning_rate': 6.720399365774866e-06, 'epoch': 7.39} 74%|███████▍ | 7392/10000 [11:39:39<3:58:55, 5.50s/it][2025-06-20 01:09:24,283] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:09:24,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.50 | bwd_microstep: 3311.17 | bwd_inner_microstep: 3310.35 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-20 01:09:24,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.50 | bwd: 3311.18 | bwd_inner: 3310.35 | bwd_allreduce: 0.78 | step: 7.18 74%|███████▍ | 7393/10000 [11:39:45<3:58:18, 5.48s/it] {'loss': 0.0006, 'grad_norm': 0.09539010375738144, 'learning_rate': 6.715556501802511e-06, 'epoch': 7.39} 74%|███████▍ | 7393/10000 [11:39:45<3:58:18, 5.48s/it][2025-06-20 01:09:29,737] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:09:29,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.88 | bwd_microstep: 3303.52 | bwd_inner_microstep: 3302.71 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-20 01:09:29,738] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2111.88 | bwd: 3303.54 | bwd_inner: 3302.71 | bwd_allreduce: 0.78 | step: 6.97 74%|███████▍ | 7394/10000 [11:39:50<3:57:49, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.007242864463478327, 'learning_rate': 6.710715031306625e-06, 'epoch': 7.39} 74%|███████▍ | 7394/10000 [11:39:50<3:57:49, 5.48s/it][2025-06-20 01:09:35,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:09:35,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.76 | bwd_microstep: 3392.20 | bwd_inner_microstep: 3391.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-20 01:09:35,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.76 | bwd: 3392.22 | bwd_inner: 3391.41 | bwd_allreduce: 0.77 | step: 6.93 74%|███████▍ | 7395/10000 [11:39:56<3:58:53, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.011362512595951557, 'learning_rate': 6.70587495479506e-06, 'epoch': 7.39} 74%|███████▍ | 7395/10000 [11:39:56<3:58:53, 5.50s/it][2025-06-20 01:09:40,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:09:40,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.06 | bwd_microstep: 3308.76 | bwd_inner_microstep: 3307.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 01:09:40,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.06 | bwd: 3308.78 | bwd_inner: 3307.97 | bwd_allreduce: 0.76 | step: 6.73 74%|███████▍ | 7396/10000 [11:40:01<3:58:02, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.03979698568582535, 'learning_rate': 6.701036272775507e-06, 'epoch': 7.4} 74%|███████▍ | 7396/10000 [11:40:01<3:58:02, 5.48s/it][2025-06-20 01:09:46,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:09:46,272] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.13 | bwd_microstep: 3359.38 | bwd_inner_microstep: 3358.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-20 01:09:46,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.13 | bwd: 3359.40 | bwd_inner: 3358.59 | bwd_allreduce: 0.76 | step: 6.89 74%|███████▍ | 7397/10000 [11:40:07<3:58:27, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.14964404702186584, 'learning_rate': 6.696198985755527e-06, 'epoch': 7.4} 74%|███████▍ | 7397/10000 [11:40:07<3:58:27, 5.50s/it][2025-06-20 01:09:51,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:09:51,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.05 | bwd_microstep: 3315.93 | bwd_inner_microstep: 3315.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 01:09:51,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.05 | bwd: 3315.94 | bwd_inner: 3315.14 | bwd_allreduce: 0.76 | step: 6.72 74%|███████▍ | 7398/10000 [11:40:12<3:57:50, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0006270617595873773, 'learning_rate': 6.691363094242527e-06, 'epoch': 7.4} 74%|███████▍ | 7398/10000 [11:40:12<3:57:50, 5.48s/it][2025-06-20 01:09:57,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:09:57,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.39 | bwd_microstep: 3307.68 | bwd_inner_microstep: 3306.77 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.61 [2025-06-20 01:09:57,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.39 | bwd: 3307.70 | bwd_inner: 3306.77 | bwd_allreduce: 0.88 | step: 7.61 74%|███████▍ | 7399/10000 [11:40:17<3:57:16, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0006977740558795631, 'learning_rate': 6.686528598743771e-06, 'epoch': 
7.4} 74%|███████▍ | 7399/10000 [11:40:17<3:57:16, 5.47s/it][2025-06-20 01:10:02,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.73 [2025-06-20 01:10:02,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.21 | bwd_microstep: 3362.90 | bwd_inner_microstep: 3361.84 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.29 [2025-06-20 01:10:02,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.21 | bwd: 3362.91 | bwd_inner: 3361.84 | bwd_allreduce: 1.03 | step: 7.30 74%|███████▍ | 7400/10000 [11:40:23<3:58:00, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.05051626265048981, 'learning_rate': 6.681695499766383e-06, 'epoch': 7.4} 74%|███████▍ | 7400/10000 [11:40:23<3:58:00, 5.49s/it][2025-06-20 01:10:08,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:10:08,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.10 | bwd_microstep: 3314.66 | bwd_inner_microstep: 3313.85 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-20 01:10:08,174] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.10 | bwd: 3314.67 | bwd_inner: 3313.85 | bwd_allreduce: 0.78 | step: 7.14 74%|███████▍ | 7401/10000 [11:40:28<3:57:31, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.020519984886050224, 'learning_rate': 6.676863797817319e-06, 'epoch': 7.4} 74%|███████▍ | 7401/10000 [11:40:28<3:57:31, 5.48s/it][2025-06-20 01:10:13,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:10:13,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.27 | bwd_microstep: 3322.59 | bwd_inner_microstep: 3321.63 | bwd_allreduce_microstep: 0.91 | step_microstep: 6.63 [2025-06-20 01:10:13,645] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2105.27 | bwd: 3322.60 | bwd_inner: 3321.63 | bwd_allreduce: 0.93 | step: 6.63 74%|███████▍ | 7402/10000 [11:40:34<3:57:15, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.049181751906871796, 'learning_rate': 6.672033493403407e-06, 'epoch': 7.4} 74%|███████▍ | 7402/10000 [11:40:34<3:57:15, 5.48s/it][2025-06-20 01:10:19,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.89 [2025-06-20 01:10:19,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.13 | bwd_microstep: 3326.29 | bwd_inner_microstep: 3325.35 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.28 [2025-06-20 01:10:19,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.13 | bwd: 3326.31 | bwd_inner: 3325.35 | bwd_allreduce: 0.92 | step: 7.28 74%|███████▍ | 7403/10000 [11:40:39<3:57:05, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.011659154668450356, 'learning_rate': 6.667204587031324e-06, 'epoch': 7.4} 74%|███████▍ | 7403/10000 [11:40:39<3:57:05, 5.48s/it][2025-06-20 01:10:24,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:10:24,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.97 | bwd_microstep: 3317.46 | bwd_inner_microstep: 3316.56 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.00 [2025-06-20 01:10:24,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.97 | bwd: 3317.47 | bwd_inner: 3316.56 | bwd_allreduce: 0.87 | step: 7.00 74%|███████▍ | 7404/10000 [11:40:45<3:56:48, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0009156748419627547, 'learning_rate': 6.6623770792076005e-06, 'epoch': 7.4} 74%|███████▍ | 7404/10000 [11:40:45<3:56:48, 5.47s/it][2025-06-20 01:10:30,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:10:30,050] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.12 | bwd_microstep: 3318.19 | bwd_inner_microstep: 3317.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-20 01:10:30,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.12 | bwd: 3318.20 | bwd_inner: 3317.39 | bwd_allreduce: 0.76 | step: 6.95 74%|███████▍ | 7405/10000 [11:40:50<3:56:39, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0021302008535712957, 'learning_rate': 6.657550970438629e-06, 'epoch': 7.41} 74%|███████▍ | 7405/10000 [11:40:50<3:56:39, 5.47s/it][2025-06-20 01:10:35,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 01:10:35,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.57 | bwd_microstep: 3313.35 | bwd_inner_microstep: 3312.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 01:10:35,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.57 | bwd: 3313.37 | bwd_inner: 3312.57 | bwd_allreduce: 0.75 | step: 6.59 74%|███████▍ | 7406/10000 [11:40:56<3:56:20, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004628892056643963, 'learning_rate': 6.65272626123062e-06, 'epoch': 7.41} 74%|███████▍ | 7406/10000 [11:40:56<3:56:20, 5.47s/it][2025-06-20 01:10:40,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:10:40,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.52 | bwd_microstep: 3317.79 | bwd_inner_microstep: 3316.85 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.03 [2025-06-20 01:10:40,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.52 | bwd: 3317.81 | bwd_inner: 3316.85 | bwd_allreduce: 0.91 | step: 7.03 74%|███████▍ | 7407/10000 [11:41:01<3:56:12, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.03399183228611946, 'learning_rate': 6.647902952089686e-06, 'epoch': 
7.41} 74%|███████▍ | 7407/10000 [11:41:01<3:56:12, 5.47s/it][2025-06-20 01:10:46,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:10:46,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.67 | bwd_microstep: 3325.98 | bwd_inner_microstep: 3325.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-20 01:10:46,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.67 | bwd: 3326.00 | bwd_inner: 3325.20 | bwd_allreduce: 0.75 | step: 6.71 74%|███████▍ | 7408/10000 [11:41:07<3:56:09, 5.47s/it] {'loss': 0.004, 'grad_norm': 0.9488377571105957, 'learning_rate': 6.643081043521766e-06, 'epoch': 7.41} 74%|███████▍ | 7408/10000 [11:41:07<3:56:09, 5.47s/it][2025-06-20 01:10:52,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:10:52,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.93 | bwd_microstep: 3400.59 | bwd_inner_microstep: 3399.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-20 01:10:52,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.93 | bwd: 3400.61 | bwd_inner: 3399.78 | bwd_allreduce: 0.78 | step: 6.83 74%|███████▍ | 7409/10000 [11:41:12<3:57:30, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0003636517212726176, 'learning_rate': 6.63826053603265e-06, 'epoch': 7.41} 74%|███████▍ | 7409/10000 [11:41:12<3:57:30, 5.50s/it][2025-06-20 01:10:57,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:10:57,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.17 | bwd_microstep: 3371.30 | bwd_inner_microstep: 3370.27 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.88 [2025-06-20 01:10:57,552] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2126.17 | bwd: 3371.32 | bwd_inner: 3370.27 | bwd_allreduce: 1.01 | step: 7.90 74%|███████▍ | 7410/10000 [11:41:18<3:57:56, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.011160152964293957, 'learning_rate': 6.633441430128001e-06, 'epoch': 7.41} 74%|███████▍ | 7410/10000 [11:41:18<3:57:56, 5.51s/it][2025-06-20 01:11:03,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:11:03,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.37 | bwd_microstep: 3377.50 | bwd_inner_microstep: 3376.66 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.89 [2025-06-20 01:11:03,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.37 | bwd: 3377.52 | bwd_inner: 3376.66 | bwd_allreduce: 0.80 | step: 6.89 74%|███████▍ | 7411/10000 [11:41:23<3:58:28, 5.53s/it] {'loss': 0.0005, 'grad_norm': 0.08783853054046631, 'learning_rate': 6.628623726313299e-06, 'epoch': 7.41} 74%|███████▍ | 7411/10000 [11:41:23<3:58:28, 5.53s/it][2025-06-20 01:11:08,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:11:08,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.83 | bwd_microstep: 3323.68 | bwd_inner_microstep: 3322.69 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.08 [2025-06-20 01:11:08,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.83 | bwd: 3323.69 | bwd_inner: 3322.69 | bwd_allreduce: 0.96 | step: 7.08 74%|███████▍ | 7412/10000 [11:41:29<3:57:36, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0011103537399321795, 'learning_rate': 6.623807425093911e-06, 'epoch': 7.41} 74%|███████▍ | 7412/10000 [11:41:29<3:57:36, 5.51s/it][2025-06-20 01:11:14,116] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:11:14,117] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.47 | bwd_microstep: 3361.76 | bwd_inner_microstep: 3360.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-20 01:11:14,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.47 | bwd: 3361.77 | bwd_inner: 3360.96 | bwd_allreduce: 0.77 | step: 6.93 74%|███████▍ | 7413/10000 [11:41:34<3:57:51, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0058727324940264225, 'learning_rate': 6.618992526975043e-06, 'epoch': 7.41} 74%|███████▍ | 7413/10000 [11:41:34<3:57:51, 5.52s/it][2025-06-20 01:11:19,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:11:19,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.44 | bwd_microstep: 3328.95 | bwd_inner_microstep: 3328.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 01:11:19,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.44 | bwd: 3328.97 | bwd_inner: 3328.17 | bwd_allreduce: 0.75 | step: 6.67 74%|███████▍ | 7414/10000 [11:41:40<3:57:20, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0003730136959347874, 'learning_rate': 6.614179032461756e-06, 'epoch': 7.41} 74%|███████▍ | 7414/10000 [11:41:40<3:57:20, 5.51s/it][2025-06-20 01:11:25,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:11:25,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.39 | bwd_microstep: 3375.43 | bwd_inner_microstep: 3374.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 01:11:25,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.39 | bwd: 3375.45 | bwd_inner: 3374.65 | bwd_allreduce: 0.76 | step: 6.60 74%|███████▍ | 7415/10000 [11:41:45<3:57:48, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.05967523157596588, 'learning_rate': 6.60936694205897e-06, 'epoch': 
7.42} 74%|███████▍ | 7415/10000 [11:41:45<3:57:48, 5.52s/it][2025-06-20 01:11:30,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:11:30,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.39 | bwd_microstep: 3374.51 | bwd_inner_microstep: 3373.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 01:11:30,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.39 | bwd: 3374.52 | bwd_inner: 3373.72 | bwd_allreduce: 0.75 | step: 6.71 74%|███████▍ | 7416/10000 [11:41:51<3:58:06, 5.53s/it] {'loss': 0.0015, 'grad_norm': 0.4093617796897888, 'learning_rate': 6.604556256271437e-06, 'epoch': 7.42} 74%|███████▍ | 7416/10000 [11:41:51<3:58:06, 5.53s/it][2025-06-20 01:11:36,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:11:36,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.23 | bwd_microstep: 3329.13 | bwd_inner_microstep: 3328.17 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.65 [2025-06-20 01:11:36,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.23 | bwd: 3329.15 | bwd_inner: 3328.17 | bwd_allreduce: 0.93 | step: 6.66 74%|███████▍ | 7417/10000 [11:41:56<3:57:26, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.023170536383986473, 'learning_rate': 6.599746975603782e-06, 'epoch': 7.42} 74%|███████▍ | 7417/10000 [11:41:56<3:57:26, 5.52s/it][2025-06-20 01:11:41,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:11:41,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.57 | bwd_microstep: 3313.70 | bwd_inner_microstep: 3312.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-20 01:11:41,644] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2105.57 | bwd: 3313.71 | bwd_inner: 3312.91 | bwd_allreduce: 0.76 | step: 7.06 74%|███████▍ | 7418/10000 [11:42:02<3:56:36, 5.50s/it] {'loss': 0.0, 'grad_norm': 9.533137927064672e-05, 'learning_rate': 6.594939100560478e-06, 'epoch': 7.42} 74%|███████▍ | 7418/10000 [11:42:02<3:56:36, 5.50s/it][2025-06-20 01:11:47,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:11:47,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.86 | bwd_microstep: 3378.32 | bwd_inner_microstep: 3377.32 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.24 [2025-06-20 01:11:47,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.86 | bwd: 3378.34 | bwd_inner: 3377.32 | bwd_allreduce: 0.97 | step: 7.25 74%|███████▍ | 7419/10000 [11:42:07<3:57:15, 5.52s/it] {'loss': 0.0244, 'grad_norm': 4.108769416809082, 'learning_rate': 6.590132631645847e-06, 'epoch': 7.42} 74%|███████▍ | 7419/10000 [11:42:07<3:57:15, 5.52s/it][2025-06-20 01:11:52,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:11:52,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.03 | bwd_microstep: 3325.77 | bwd_inner_microstep: 3324.68 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.31 [2025-06-20 01:11:52,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.03 | bwd: 3325.79 | bwd_inner: 3324.68 | bwd_allreduce: 1.06 | step: 7.31 74%|███████▍ | 7420/10000 [11:42:13<3:56:42, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.008819581009447575, 'learning_rate': 6.585327569364073e-06, 'epoch': 7.42} 74%|███████▍ | 7420/10000 [11:42:13<3:56:42, 5.50s/it][2025-06-20 01:11:58,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:11:58,169] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.57 | bwd_microstep: 3328.63 | bwd_inner_microstep: 3327.65 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.49 [2025-06-20 01:11:58,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.57 | bwd: 3328.65 | bwd_inner: 3327.65 | bwd_allreduce: 0.95 | step: 7.50 74%|███████▍ | 7421/10000 [11:42:18<3:56:25, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00027241421048529446, 'learning_rate': 6.58052391421917e-06, 'epoch': 7.42} 74%|███████▍ | 7421/10000 [11:42:18<3:56:25, 5.50s/it][2025-06-20 01:12:03,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:12:03,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.89 | bwd_microstep: 3319.03 | bwd_inner_microstep: 3318.10 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.95 [2025-06-20 01:12:03,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.89 | bwd: 3319.05 | bwd_inner: 3318.10 | bwd_allreduce: 0.91 | step: 6.95 74%|███████▍ | 7422/10000 [11:42:24<3:55:56, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.011088509112596512, 'learning_rate': 6.575721666715027e-06, 'epoch': 7.42} 74%|███████▍ | 7422/10000 [11:42:24<3:55:56, 5.49s/it][2025-06-20 01:12:09,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:12:09,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.16 | bwd_microstep: 3376.02 | bwd_inner_microstep: 3375.08 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.21 [2025-06-20 01:12:09,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.16 | bwd: 3376.03 | bwd_inner: 3375.08 | bwd_allreduce: 0.92 | step: 7.21 74%|███████▍ | 7423/10000 [11:42:29<3:56:34, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00047045320388861, 'learning_rate': 6.570920827355378e-06, 'epoch': 7.42} 
74%|███████▍ | 7423/10000 [11:42:29<3:56:34, 5.51s/it][2025-06-20 01:12:14,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:12:14,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.37 | bwd_microstep: 3326.75 | bwd_inner_microstep: 3325.79 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.93 [2025-06-20 01:12:14,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.37 | bwd: 3326.77 | bwd_inner: 3325.79 | bwd_allreduce: 0.93 | step: 6.93 74%|███████▍ | 7424/10000 [11:42:35<3:56:04, 5.50s/it] {'loss': 0.0004, 'grad_norm': 0.07765819877386093, 'learning_rate': 6.5661213966438075e-06, 'epoch': 7.42} 74%|███████▍ | 7424/10000 [11:42:35<3:56:04, 5.50s/it][2025-06-20 01:12:20,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:12:20,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.80 | bwd_microstep: 3329.52 | bwd_inner_microstep: 3328.62 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.92 [2025-06-20 01:12:20,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.80 | bwd: 3329.53 | bwd_inner: 3328.62 | bwd_allreduce: 0.86 | step: 6.92 74%|███████▍ | 7425/10000 [11:42:40<3:55:51, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.015537670813500881, 'learning_rate': 6.561323375083752e-06, 'epoch': 7.42} 74%|███████▍ | 7425/10000 [11:42:40<3:55:51, 5.50s/it][2025-06-20 01:12:25,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:12:25,629] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.98 | bwd_microstep: 3328.56 | bwd_inner_microstep: 3327.67 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.00 [2025-06-20 01:12:25,630] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2107.98 | bwd: 3328.57 | bwd_inner: 3327.67 | bwd_allreduce: 0.86 | step: 7.01 74%|███████▍ | 7426/10000 [11:42:46<3:55:31, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.008945000357925892, 'learning_rate': 6.556526763178506e-06, 'epoch': 7.43} 74%|███████▍ | 7426/10000 [11:42:46<3:55:31, 5.49s/it][2025-06-20 01:12:31,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:12:31,100] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.42 | bwd_microstep: 3326.05 | bwd_inner_microstep: 3324.95 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.85 [2025-06-20 01:12:31,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.42 | bwd: 3326.07 | bwd_inner: 3324.95 | bwd_allreduce: 1.06 | step: 7.85 74%|███████▍ | 7427/10000 [11:42:51<3:55:12, 5.48s/it] {'loss': 0.0063, 'grad_norm': 1.1667143106460571, 'learning_rate': 6.551731561431209e-06, 'epoch': 7.43} 74%|███████▍ | 7427/10000 [11:42:51<3:55:12, 5.48s/it][2025-06-20 01:12:36,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:12:36,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.23 | bwd_microstep: 3389.33 | bwd_inner_microstep: 3388.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 01:12:36,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.23 | bwd: 3389.34 | bwd_inner: 3388.54 | bwd_allreduce: 0.76 | step: 6.79 74%|███████▍ | 7428/10000 [11:42:57<3:56:06, 5.51s/it] {'loss': 0.001, 'grad_norm': 0.17697785794734955, 'learning_rate': 6.546937770344855e-06, 'epoch': 7.43} 74%|███████▍ | 7428/10000 [11:42:57<3:56:06, 5.51s/it][2025-06-20 01:12:42,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:12:42,247] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.53 | bwd_microstep: 3404.35 | bwd_inner_microstep: 3403.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-20 01:12:42,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.53 | bwd: 3404.36 | bwd_inner: 3403.56 | bwd_allreduce: 0.77 | step: 6.65 74%|███████▍ | 7429/10000 [11:43:03<3:56:59, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.00016770462389104068, 'learning_rate': 6.542145390422292e-06, 'epoch': 7.43} 74%|███████▍ | 7429/10000 [11:43:03<3:56:59, 5.53s/it][2025-06-20 01:12:47,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 01:12:47,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.06 | bwd_microstep: 3330.39 | bwd_inner_microstep: 3329.48 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.69 [2025-06-20 01:12:47,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.06 | bwd: 3330.40 | bwd_inner: 3329.48 | bwd_allreduce: 0.88 | step: 7.70 74%|███████▍ | 7430/10000 [11:43:08<3:56:19, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00015960438759066164, 'learning_rate': 6.537354422166218e-06, 'epoch': 7.43} 74%|███████▍ | 7430/10000 [11:43:08<3:56:19, 5.52s/it][2025-06-20 01:12:53,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:12:53,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.93 | bwd_microstep: 3396.30 | bwd_inner_microstep: 3395.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-20 01:12:53,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.93 | bwd: 3396.31 | bwd_inner: 3395.52 | bwd_allreduce: 0.75 | step: 6.77 74%|███████▍ | 7431/10000 [11:43:14<3:57:08, 5.54s/it] {'loss': 0.0002, 'grad_norm': 0.052537813782691956, 'learning_rate': 6.532564866079189e-06, 
'epoch': 7.43} 74%|███████▍ | 7431/10000 [11:43:14<3:57:08, 5.54s/it][2025-06-20 01:12:58,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.86 [2025-06-20 01:12:58,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.83 | bwd_microstep: 3366.98 | bwd_inner_microstep: 3366.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.89 [2025-06-20 01:12:58,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.83 | bwd: 3366.99 | bwd_inner: 3366.20 | bwd_allreduce: 0.75 | step: 6.89 74%|███████▍ | 7432/10000 [11:43:19<3:56:59, 5.54s/it] {'loss': 0.0083, 'grad_norm': 2.215729236602783, 'learning_rate': 6.527776722663595e-06, 'epoch': 7.43} 74%|███████▍ | 7432/10000 [11:43:19<3:56:59, 5.54s/it][2025-06-20 01:13:04,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:13:04,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.21 | bwd_microstep: 3328.84 | bwd_inner_microstep: 3328.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 01:13:04,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.21 | bwd: 3328.86 | bwd_inner: 3328.04 | bwd_allreduce: 0.76 | step: 6.77 74%|███████▍ | 7433/10000 [11:43:25<3:56:08, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.009815205819904804, 'learning_rate': 6.5229899924216955e-06, 'epoch': 7.43} 74%|███████▍ | 7433/10000 [11:43:25<3:56:08, 5.52s/it][2025-06-20 01:13:09,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:13:09,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.35 | bwd_microstep: 3373.61 | bwd_inner_microstep: 3372.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-20 01:13:09,882] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2136.35 | bwd: 3373.62 | bwd_inner: 3372.81 | bwd_allreduce: 0.77 | step: 6.94 74%|███████▍ | 7434/10000 [11:43:30<3:56:24, 5.53s/it] {'loss': 0.0002, 'grad_norm': 0.04554271697998047, 'learning_rate': 6.518204675855597e-06, 'epoch': 7.43} 74%|███████▍ | 7434/10000 [11:43:30<3:56:24, 5.53s/it][2025-06-20 01:13:15,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:13:15,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.55 | bwd_microstep: 3328.62 | bwd_inner_microstep: 3327.81 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.76 [2025-06-20 01:13:15,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.55 | bwd: 3328.64 | bwd_inner: 3327.81 | bwd_allreduce: 0.78 | step: 6.76 74%|███████▍ | 7435/10000 [11:43:36<3:55:37, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0001023181393975392, 'learning_rate': 6.513420773467258e-06, 'epoch': 7.43} 74%|███████▍ | 7435/10000 [11:43:36<3:55:37, 5.51s/it][2025-06-20 01:13:20,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:13:20,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.30 | bwd_microstep: 3323.74 | bwd_inner_microstep: 3322.97 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.54 [2025-06-20 01:13:20,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.30 | bwd: 3323.76 | bwd_inner: 3322.97 | bwd_allreduce: 0.75 | step: 6.54 74%|███████▍ | 7436/10000 [11:43:41<3:55:00, 5.50s/it] {'loss': 0.0329, 'grad_norm': 15.220653533935547, 'learning_rate': 6.508638285758493e-06, 'epoch': 7.44} 74%|███████▍ | 7436/10000 [11:43:41<3:55:00, 5.50s/it][2025-06-20 01:13:26,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 
01:13:26,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.75 | bwd_microstep: 3363.98 | bwd_inner_microstep: 3363.21 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.59 [2025-06-20 01:13:26,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.75 | bwd: 3364.00 | bwd_inner: 3363.21 | bwd_allreduce: 0.75 | step: 6.59 74%|███████▍ | 7437/10000 [11:43:47<3:55:14, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.005366304889321327, 'learning_rate': 6.5038572132309505e-06, 'epoch': 7.44} 74%|███████▍ | 7437/10000 [11:43:47<3:55:14, 5.51s/it][2025-06-20 01:13:31,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:13:31,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.36 | bwd_microstep: 3365.62 | bwd_inner_microstep: 3364.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.04 [2025-06-20 01:13:31,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.36 | bwd: 3365.63 | bwd_inner: 3364.82 | bwd_allreduce: 0.77 | step: 7.05 74%|███████▍ | 7438/10000 [11:43:52<3:55:31, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00040649535367265344, 'learning_rate': 6.499077556386148e-06, 'epoch': 7.44} 74%|███████▍ | 7438/10000 [11:43:52<3:55:31, 5.52s/it][2025-06-20 01:13:37,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:13:37,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2147.32 | bwd_microstep: 3409.26 | bwd_inner_microstep: 3408.44 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.87 [2025-06-20 01:13:37,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2147.31 | bwd: 3409.27 | bwd_inner: 3408.44 | bwd_allreduce: 0.79 | step: 6.87 74%|███████▍ | 7439/10000 [11:43:58<3:56:28, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.010404554195702076, 'learning_rate': 
6.49429931572545e-06, 'epoch': 7.44} 74%|███████▍ | 7439/10000 [11:43:58<3:56:28, 5.54s/it][2025-06-20 01:13:42,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:13:42,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.39 | bwd_microstep: 3333.04 | bwd_inner_microstep: 3332.21 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.77 [2025-06-20 01:13:42,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.39 | bwd: 3333.06 | bwd_inner: 3332.21 | bwd_allreduce: 0.79 | step: 6.78 74%|███████▍ | 7440/10000 [11:44:03<3:55:39, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.06100912019610405, 'learning_rate': 6.489522491750073e-06, 'epoch': 7.44} 74%|███████▍ | 7440/10000 [11:44:03<3:55:39, 5.52s/it][2025-06-20 01:13:48,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.77 [2025-06-20 01:13:48,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.19 | bwd_microstep: 3312.55 | bwd_inner_microstep: 3311.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 01:13:48,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.19 | bwd: 3312.56 | bwd_inner: 3311.76 | bwd_allreduce: 0.76 | step: 6.61 74%|███████▍ | 7441/10000 [11:44:09<3:54:36, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00017984106671065092, 'learning_rate': 6.484747084961083e-06, 'epoch': 7.44} 74%|███████▍ | 7441/10000 [11:44:09<3:54:36, 5.50s/it][2025-06-20 01:13:53,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:13:53,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.07 | bwd_microstep: 3322.16 | bwd_inner_microstep: 3321.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 01:13:53,885] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.07 | bwd: 3322.17 | bwd_inner: 3321.37 | bwd_allreduce: 0.76 | step: 6.80 74%|███████▍ | 7442/10000 [11:44:14<3:54:06, 5.49s/it] {'loss': 0.002, 'grad_norm': 0.4454216957092285, 'learning_rate': 6.4799730958593955e-06, 'epoch': 7.44} 74%|███████▍ | 7442/10000 [11:44:14<3:54:06, 5.49s/it][2025-06-20 01:13:59,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:13:59,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.27 | bwd_microstep: 3368.35 | bwd_inner_microstep: 3367.54 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.60 [2025-06-20 01:13:59,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.27 | bwd: 3368.37 | bwd_inner: 3367.54 | bwd_allreduce: 0.78 | step: 7.60 74%|███████▍ | 7443/10000 [11:44:20<3:54:36, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.000156732538016513, 'learning_rate': 6.475200524945782e-06, 'epoch': 7.44} 74%|███████▍ | 7443/10000 [11:44:20<3:54:36, 5.51s/it][2025-06-20 01:14:04,977] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:14:04,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.83 | bwd_microstep: 3376.88 | bwd_inner_microstep: 3376.06 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.75 [2025-06-20 01:14:04,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.83 | bwd: 3376.90 | bwd_inner: 3376.06 | bwd_allreduce: 0.79 | step: 6.76 74%|███████▍ | 7444/10000 [11:44:25<3:55:08, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00012361930566839874, 'learning_rate': 6.470429372720863e-06, 'epoch': 7.44} 74%|███████▍ | 7444/10000 [11:44:25<3:55:08, 5.52s/it][2025-06-20 01:14:10,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 
[2025-06-20 01:14:10,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.48 | bwd_microstep: 3365.33 | bwd_inner_microstep: 3364.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-20 01:14:10,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.48 | bwd: 3365.35 | bwd_inner: 3364.51 | bwd_allreduce: 0.79 | step: 6.86 74%|███████▍ | 7445/10000 [11:44:31<3:55:15, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.007440030574798584, 'learning_rate': 6.465659639685111e-06, 'epoch': 7.45} 74%|███████▍ | 7445/10000 [11:44:31<3:55:15, 5.52s/it][2025-06-20 01:14:15,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:14:15,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.63 | bwd_microstep: 3323.95 | bwd_inner_microstep: 3323.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-20 01:14:15,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.63 | bwd: 3323.97 | bwd_inner: 3323.14 | bwd_allreduce: 0.78 | step: 7.17 74%|███████▍ | 7446/10000 [11:44:36<3:54:37, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.009475532919168472, 'learning_rate': 6.460891326338854e-06, 'epoch': 7.45} 74%|███████▍ | 7446/10000 [11:44:36<3:54:37, 5.51s/it][2025-06-20 01:14:21,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:14:21,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.10 | bwd_microstep: 3369.89 | bwd_inner_microstep: 3369.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 01:14:21,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.10 | bwd: 3369.90 | bwd_inner: 3369.10 | bwd_allreduce: 0.76 | step: 6.85 74%|███████▍ | 7447/10000 [11:44:42<3:54:51, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.029827911406755447, 
'learning_rate': 6.4561244331822534e-06, 'epoch': 7.45} 74%|███████▍ | 7447/10000 [11:44:42<3:54:51, 5.52s/it][2025-06-20 01:14:27,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:14:27,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.18 | bwd_microstep: 3379.70 | bwd_inner_microstep: 3378.85 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.29 [2025-06-20 01:14:27,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.18 | bwd: 3379.72 | bwd_inner: 3378.85 | bwd_allreduce: 0.82 | step: 7.29 74%|███████▍ | 7448/10000 [11:44:47<3:55:04, 5.53s/it] {'loss': 0.0013, 'grad_norm': 0.5077102184295654, 'learning_rate': 6.451358960715341e-06, 'epoch': 7.45} 74%|███████▍ | 7448/10000 [11:44:47<3:55:04, 5.53s/it][2025-06-20 01:14:32,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:14:32,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.94 | bwd_microstep: 3375.55 | bwd_inner_microstep: 3374.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-20 01:14:32,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.94 | bwd: 3375.56 | bwd_inner: 3374.76 | bwd_allreduce: 0.76 | step: 6.71 74%|███████▍ | 7449/10000 [11:44:53<3:55:19, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0004918626509606838, 'learning_rate': 6.446594909437993e-06, 'epoch': 7.45} 74%|███████▍ | 7449/10000 [11:44:53<3:55:19, 5.53s/it][2025-06-20 01:14:38,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:14:38,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.72 | bwd_microstep: 3367.06 | bwd_inner_microstep: 3366.23 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.94 [2025-06-20 
01:14:38,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.72 | bwd: 3367.07 | bwd_inner: 3366.23 | bwd_allreduce: 0.79 | step: 6.94 74%|███████▍ | 7450/10000 [11:44:58<3:55:13, 5.53s/it] {'loss': 0.0151, 'grad_norm': 2.0141701698303223, 'learning_rate': 6.4418322798499355e-06, 'epoch': 7.45} 74%|███████▍ | 7450/10000 [11:44:58<3:55:13, 5.53s/it][2025-06-20 01:14:43,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:14:43,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.37 | bwd_microstep: 3365.90 | bwd_inner_microstep: 3365.08 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.27 [2025-06-20 01:14:43,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.37 | bwd: 3365.92 | bwd_inner: 3365.08 | bwd_allreduce: 0.79 | step: 7.27 75%|███████▍ | 7451/10000 [11:45:04<3:55:12, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0177988950163126, 'learning_rate': 6.437071072450751e-06, 'epoch': 7.45} 75%|███████▍ | 7451/10000 [11:45:04<3:55:12, 5.54s/it][2025-06-20 01:14:49,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:14:49,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.31 | bwd_microstep: 3313.46 | bwd_inner_microstep: 3312.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 01:14:49,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.31 | bwd: 3313.48 | bwd_inner: 3312.67 | bwd_allreduce: 0.77 | step: 6.82 75%|███████▍ | 7452/10000 [11:45:09<3:54:07, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.03961346670985222, 'learning_rate': 6.432311287739861e-06, 'epoch': 7.45} 75%|███████▍ | 7452/10000 [11:45:09<3:54:07, 5.51s/it][2025-06-20 01:14:54,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-20 01:14:54,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.42 | bwd_microstep: 3367.75 | bwd_inner_microstep: 3366.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-20 01:14:54,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.42 | bwd: 3367.76 | bwd_inner: 3366.95 | bwd_allreduce: 0.77 | step: 6.88 75%|███████▍ | 7453/10000 [11:45:15<3:54:18, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0005443589179776609, 'learning_rate': 6.427552926216543e-06, 'epoch': 7.45} 75%|███████▍ | 7453/10000 [11:45:15<3:54:18, 5.52s/it][2025-06-20 01:15:00,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:15:00,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.34 | bwd_microstep: 3312.90 | bwd_inner_microstep: 3312.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-20 01:15:00,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.34 | bwd: 3312.91 | bwd_inner: 3312.11 | bwd_allreduce: 0.76 | step: 6.88 75%|███████▍ | 7454/10000 [11:45:20<3:53:28, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.007844635285437107, 'learning_rate': 6.422795988379933e-06, 'epoch': 7.45} 75%|███████▍ | 7454/10000 [11:45:20<3:53:28, 5.50s/it][2025-06-20 01:15:05,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:15:05,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.63 | bwd_microstep: 3312.81 | bwd_inner_microstep: 3311.93 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.98 [2025-06-20 01:15:05,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.63 | bwd: 3312.82 | bwd_inner: 3311.93 | bwd_allreduce: 0.85 | step: 6.98 75%|███████▍ | 7455/10000 [11:45:26<3:52:44, 5.49s/it] {'loss': 0.0, 'grad_norm': 
0.00012459864956326783, 'learning_rate': 6.41804047472901e-06, 'epoch': 7.46} 75%|███████▍ | 7455/10000 [11:45:26<3:52:44, 5.49s/it][2025-06-20 01:15:11,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:15:11,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.09 | bwd_microstep: 3367.29 | bwd_inner_microstep: 3366.48 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-20 01:15:11,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.09 | bwd: 3367.30 | bwd_inner: 3366.48 | bwd_allreduce: 0.78 | step: 7.06 75%|███████▍ | 7456/10000 [11:45:31<3:53:17, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.019819140434265137, 'learning_rate': 6.4132863857626135e-06, 'epoch': 7.46} 75%|███████▍ | 7456/10000 [11:45:31<3:53:17, 5.50s/it][2025-06-20 01:15:16,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:15:16,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.76 | bwd_microstep: 3329.81 | bwd_inner_microstep: 3328.86 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.19 [2025-06-20 01:15:16,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.76 | bwd: 3329.82 | bwd_inner: 3328.86 | bwd_allreduce: 0.92 | step: 7.19 75%|███████▍ | 7457/10000 [11:45:37<3:52:50, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00041982930270023644, 'learning_rate': 6.4085337219794085e-06, 'epoch': 7.46} 75%|███████▍ | 7457/10000 [11:45:37<3:52:50, 5.49s/it][2025-06-20 01:15:22,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:15:22,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.59 | bwd_microstep: 3363.61 | bwd_inner_microstep: 3362.82 | bwd_allreduce_microstep: 0.74 | 
step_microstep: 6.64 [2025-06-20 01:15:22,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.59 | bwd: 3363.62 | bwd_inner: 3362.82 | bwd_allreduce: 0.76 | step: 6.65 75%|███████▍ | 7458/10000 [11:45:42<3:53:14, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0006010149954818189, 'learning_rate': 6.403782483877938e-06, 'epoch': 7.46} 75%|███████▍ | 7458/10000 [11:45:42<3:53:14, 5.51s/it][2025-06-20 01:15:27,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:15:27,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.60 | bwd_microstep: 3365.29 | bwd_inner_microstep: 3364.49 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-20 01:15:27,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.60 | bwd: 3365.31 | bwd_inner: 3364.49 | bwd_allreduce: 0.78 | step: 6.98 75%|███████▍ | 7459/10000 [11:45:48<3:53:28, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0008329494739882648, 'learning_rate': 6.399032671956582e-06, 'epoch': 7.46} 75%|███████▍ | 7459/10000 [11:45:48<3:53:28, 5.51s/it][2025-06-20 01:15:33,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:15:33,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.80 | bwd_microstep: 3313.86 | bwd_inner_microstep: 3313.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 01:15:33,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.80 | bwd: 3313.88 | bwd_inner: 3313.08 | bwd_allreduce: 0.76 | step: 6.77 75%|███████▍ | 7460/10000 [11:45:53<3:52:40, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0025189872831106186, 'learning_rate': 6.394284286713574e-06, 'epoch': 7.46} 75%|███████▍ | 7460/10000 [11:45:53<3:52:40, 5.50s/it][2025-06-20 01:15:38,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-20 01:15:38,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.30 | bwd_microstep: 3391.32 | bwd_inner_microstep: 3390.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 01:15:38,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.30 | bwd: 3391.34 | bwd_inner: 3390.54 | bwd_allreduce: 0.75 | step: 6.63 75%|███████▍ | 7461/10000 [11:45:59<3:53:23, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.06040075793862343, 'learning_rate': 6.389537328647e-06, 'epoch': 7.46} 75%|███████▍ | 7461/10000 [11:45:59<3:53:23, 5.52s/it][2025-06-20 01:15:44,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:15:44,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.49 | bwd_microstep: 3363.88 | bwd_inner_microstep: 3363.03 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.92 [2025-06-20 01:15:44,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.49 | bwd: 3363.90 | bwd_inner: 3363.03 | bwd_allreduce: 0.80 | step: 6.92 75%|███████▍ | 7462/10000 [11:46:05<3:53:28, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0015428034821525216, 'learning_rate': 6.38479179825479e-06, 'epoch': 7.46} 75%|███████▍ | 7462/10000 [11:46:05<3:53:28, 5.52s/it][2025-06-20 01:15:49,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:15:49,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.26 | bwd_microstep: 3314.20 | bwd_inner_microstep: 3313.39 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-20 01:15:49,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.26 | bwd: 3314.22 | bwd_inner: 3313.39 | bwd_allreduce: 0.78 | step: 7.13 75%|███████▍ | 7463/10000 [11:46:10<3:52:40, 5.50s/it] {'loss': 
0.0004, 'grad_norm': 0.07950782030820847, 'learning_rate': 6.38004769603473e-06, 'epoch': 7.46} 75%|███████▍ | 7463/10000 [11:46:10<3:52:40, 5.50s/it][2025-06-20 01:15:55,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:15:55,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.53 | bwd_microstep: 3316.14 | bwd_inner_microstep: 3315.06 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.41 [2025-06-20 01:15:55,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.53 | bwd: 3316.16 | bwd_inner: 3315.06 | bwd_allreduce: 1.05 | step: 7.42 75%|███████▍ | 7464/10000 [11:46:15<3:52:00, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002041527768597007, 'learning_rate': 6.375305022484457e-06, 'epoch': 7.46} 75%|███████▍ | 7464/10000 [11:46:15<3:52:00, 5.49s/it][2025-06-20 01:16:00,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:16:00,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.91 | bwd_microstep: 3315.22 | bwd_inner_microstep: 3314.23 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.51 [2025-06-20 01:16:00,616] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.91 | bwd: 3315.23 | bwd_inner: 3314.23 | bwd_allreduce: 0.95 | step: 7.52 75%|███████▍ | 7465/10000 [11:46:21<3:51:35, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.06654422730207443, 'learning_rate': 6.370563778101451e-06, 'epoch': 7.46} 75%|███████▍ | 7465/10000 [11:46:21<3:51:35, 5.48s/it][2025-06-20 01:16:06,156] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:16:06,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.64 | bwd_microstep: 3363.35 | bwd_inner_microstep: 3362.26 | bwd_allreduce_microstep: 1.03 | 
step_microstep: 7.66 [2025-06-20 01:16:06,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.64 | bwd: 3363.36 | bwd_inner: 3362.26 | bwd_allreduce: 1.06 | step: 7.67 75%|███████▍ | 7466/10000 [11:46:26<3:52:17, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005312102730385959, 'learning_rate': 6.365823963383055e-06, 'epoch': 7.47} 75%|███████▍ | 7466/10000 [11:46:26<3:52:17, 5.50s/it][2025-06-20 01:16:11,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:16:11,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.25 | bwd_microstep: 3315.45 | bwd_inner_microstep: 3314.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.89 [2025-06-20 01:16:11,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.25 | bwd: 3315.46 | bwd_inner: 3314.66 | bwd_allreduce: 0.76 | step: 6.90 75%|███████▍ | 7467/10000 [11:46:32<3:51:49, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002492150291800499, 'learning_rate': 6.361085578826442e-06, 'epoch': 7.47} 75%|███████▍ | 7467/10000 [11:46:32<3:51:49, 5.49s/it][2025-06-20 01:16:17,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:16:17,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.12 | bwd_microstep: 3319.64 | bwd_inner_microstep: 3318.85 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 01:16:17,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.12 | bwd: 3319.65 | bwd_inner: 3318.85 | bwd_allreduce: 0.76 | step: 6.64 75%|███████▍ | 7468/10000 [11:46:37<3:51:25, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00023608146875631064, 'learning_rate': 6.356348624928652e-06, 'epoch': 7.47} 75%|███████▍ | 7468/10000 [11:46:37<3:51:25, 5.48s/it][2025-06-20 01:16:22,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:16:22,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.52 | bwd_microstep: 3308.17 | bwd_inner_microstep: 3307.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.17 [2025-06-20 01:16:22,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.52 | bwd: 3308.18 | bwd_inner: 3307.36 | bwd_allreduce: 0.78 | step: 7.18 75%|███████▍ | 7469/10000 [11:46:43<3:50:53, 5.47s/it] {'loss': 0.0006, 'grad_norm': 0.11602508276700974, 'learning_rate': 6.351613102186566e-06, 'epoch': 7.47} 75%|███████▍ | 7469/10000 [11:46:43<3:50:53, 5.47s/it][2025-06-20 01:16:27,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:16:27,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.07 | bwd_microstep: 3304.45 | bwd_inner_microstep: 3303.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-20 01:16:27,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.07 | bwd: 3304.47 | bwd_inner: 3303.65 | bwd_allreduce: 0.77 | step: 6.87 75%|███████▍ | 7470/10000 [11:46:48<3:50:24, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.0388178825378418, 'learning_rate': 6.346879011096925e-06, 'epoch': 7.47} 75%|███████▍ | 7470/10000 [11:46:48<3:50:24, 5.46s/it][2025-06-20 01:16:33,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 01:16:33,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.68 | bwd_microstep: 3314.68 | bwd_inner_microstep: 3313.64 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.75 [2025-06-20 01:16:33,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.68 | bwd: 3314.70 | bwd_inner: 3313.64 | bwd_allreduce: 1.00 | step: 7.76 75%|███████▍ | 7471/10000 [11:46:54<3:50:16, 5.46s/it] {'loss': 
0.0, 'grad_norm': 0.0012369133764877915, 'learning_rate': 6.342146352156315e-06, 'epoch': 7.47} 75%|███████▍ | 7471/10000 [11:46:54<3:50:16, 5.46s/it][2025-06-20 01:16:38,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-20 01:16:38,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.25 | bwd_microstep: 3315.94 | bwd_inner_microstep: 3314.73 | bwd_allreduce_microstep: 1.14 | step_microstep: 8.16 [2025-06-20 01:16:38,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.25 | bwd: 3315.97 | bwd_inner: 3314.73 | bwd_allreduce: 1.17 | step: 8.17 75%|███████▍ | 7472/10000 [11:46:59<3:50:17, 5.47s/it] {'loss': 0.0014, 'grad_norm': 0.33564963936805725, 'learning_rate': 6.3374151258611574e-06, 'epoch': 7.47} 75%|███████▍ | 7472/10000 [11:46:59<3:50:17, 5.47s/it][2025-06-20 01:16:44,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:16:44,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.08 | bwd_microstep: 3311.85 | bwd_inner_microstep: 3311.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-20 01:16:44,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.08 | bwd: 3311.87 | bwd_inner: 3311.05 | bwd_allreduce: 0.77 | step: 6.84 75%|███████▍ | 7473/10000 [11:47:05<3:50:02, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.03514110669493675, 'learning_rate': 6.332685332707744e-06, 'epoch': 7.47} 75%|███████▍ | 7473/10000 [11:47:05<3:50:02, 5.46s/it][2025-06-20 01:16:49,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:16:49,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.80 | bwd_microstep: 3310.56 | bwd_inner_microstep: 3309.67 | bwd_allreduce_microstep: 0.84 
| step_microstep: 7.67 [2025-06-20 01:16:49,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.80 | bwd: 3310.58 | bwd_inner: 3309.67 | bwd_allreduce: 0.87 | step: 7.68 75%|███████▍ | 7474/10000 [11:47:10<3:49:54, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.00037920085014775395, 'learning_rate': 6.327956973192206e-06, 'epoch': 7.47} 75%|███████▍ | 7474/10000 [11:47:10<3:49:54, 5.46s/it][2025-06-20 01:16:55,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:16:55,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.96 | bwd_microstep: 3312.21 | bwd_inner_microstep: 3311.31 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.25 [2025-06-20 01:16:55,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.96 | bwd: 3312.23 | bwd_inner: 3311.31 | bwd_allreduce: 0.88 | step: 7.26 75%|███████▍ | 7475/10000 [11:47:16<3:49:48, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.010007242672145367, 'learning_rate': 6.323230047810529e-06, 'epoch': 7.47} 75%|███████▍ | 7475/10000 [11:47:16<3:49:48, 5.46s/it][2025-06-20 01:17:00,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:17:00,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.28 | bwd_microstep: 3314.66 | bwd_inner_microstep: 3313.86 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.73 [2025-06-20 01:17:00,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.28 | bwd: 3314.67 | bwd_inner: 3313.86 | bwd_allreduce: 0.77 | step: 6.74 75%|███████▍ | 7476/10000 [11:47:21<3:49:46, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.001518533448688686, 'learning_rate': 6.318504557058543e-06, 'epoch': 7.48} 75%|███████▍ | 7476/10000 [11:47:21<3:49:46, 5.46s/it][2025-06-20 01:17:06,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:17:06,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.85 | bwd_microstep: 3321.88 | bwd_inner_microstep: 3321.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 01:17:06,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.85 | bwd: 3321.90 | bwd_inner: 3321.08 | bwd_allreduce: 0.77 | step: 6.85 75%|███████▍ | 7477/10000 [11:47:27<3:49:46, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0010052386205643415, 'learning_rate': 6.313780501431932e-06, 'epoch': 7.48} 75%|███████▍ | 7477/10000 [11:47:27<3:49:46, 5.46s/it][2025-06-20 01:17:11,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:17:11,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.98 | bwd_microstep: 3322.64 | bwd_inner_microstep: 3321.84 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-20 01:17:11,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.98 | bwd: 3322.65 | bwd_inner: 3321.84 | bwd_allreduce: 0.77 | step: 7.06 75%|███████▍ | 7478/10000 [11:47:32<3:49:40, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.011366810649633408, 'learning_rate': 6.3090578814262256e-06, 'epoch': 7.48} 75%|███████▍ | 7478/10000 [11:47:32<3:49:40, 5.46s/it][h264 @ 0x4a0957c0] Reference 5 >= 5 [h264 @ 0x4a0957c0] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x48c0f440] left block unavailable for requested intra mode [h264 @ 0x48c0f440] error while decoding MB 0 25, bytestream 45493 [2025-06-20 01:17:17,153] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:17:17,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.11 | bwd_microstep: 3318.07 | bwd_inner_microstep: 3317.14 | bwd_allreduce_microstep: 0.87 | 
step_microstep: 7.08 [2025-06-20 01:17:17,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.11 | bwd: 3318.08 | bwd_inner: 3317.14 | bwd_allreduce: 0.89 | step: 7.09 75%|███████▍ | 7479/10000 [11:47:37<3:49:32, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.01934090442955494, 'learning_rate': 6.304336697536806e-06, 'epoch': 7.48} 75%|███████▍ | 7479/10000 [11:47:37<3:49:32, 5.46s/it][h264 @ 0x48baa300] Reference 5 >= 5 [h264 @ 0x48baa300] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x48baa300] left block unavailable for requested intra mode [h264 @ 0x48baa300] error while decoding MB 0 25, bytestream 45493 [2025-06-20 01:17:22,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:17:22,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.01 | bwd_microstep: 3372.84 | bwd_inner_microstep: 3371.98 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.93 [2025-06-20 01:17:22,695] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.01 | bwd: 3372.85 | bwd_inner: 3371.98 | bwd_allreduce: 0.83 | step: 6.93 75%|███████▍ | 7480/10000 [11:47:43<3:50:26, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.013902422040700912, 'learning_rate': 6.299616950258905e-06, 'epoch': 7.48} 75%|███████▍ | 7480/10000 [11:47:43<3:50:26, 5.49s/it][2025-06-20 01:17:28,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:17:28,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.98 | bwd_microstep: 3315.27 | bwd_inner_microstep: 3314.46 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.25 [2025-06-20 01:17:28,164] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.98 | bwd: 3315.29 | bwd_inner: 3314.46 | bwd_allreduce: 0.79 | step: 7.25 75%|███████▍ | 7481/10000 [11:47:48<3:50:08, 5.48s/it] {'loss': 0.0, 
'grad_norm': 0.000275196332950145, 'learning_rate': 6.294898640087606e-06, 'epoch': 7.48} 75%|███████▍ | 7481/10000 [11:47:48<3:50:08, 5.48s/it][2025-06-20 01:17:33,688] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:17:33,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.49 | bwd_microstep: 3362.25 | bwd_inner_microstep: 3361.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 01:17:33,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.49 | bwd: 3362.26 | bwd_inner: 3361.46 | bwd_allreduce: 0.76 | step: 6.61 75%|███████▍ | 7482/10000 [11:47:54<3:50:34, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.1372789889574051, 'learning_rate': 6.290181767517824e-06, 'epoch': 7.48} 75%|███████▍ | 7482/10000 [11:47:54<3:50:34, 5.49s/it][2025-06-20 01:17:39,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:17:39,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.56 | bwd_microstep: 3361.00 | bwd_inner_microstep: 3360.05 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.25 [2025-06-20 01:17:39,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.56 | bwd: 3361.02 | bwd_inner: 3360.05 | bwd_allreduce: 0.93 | step: 7.26 75%|███████▍ | 7483/10000 [11:48:00<3:50:48, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0012106470530852675, 'learning_rate': 6.285466333044348e-06, 'epoch': 7.48} 75%|███████▍ | 7483/10000 [11:48:00<3:50:48, 5.50s/it][2025-06-20 01:17:44,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:17:44,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.24 | bwd_microstep: 3312.90 | bwd_inner_microstep: 3312.07 | bwd_allreduce_microstep: 0.78 | 
step_microstep: 6.98 [2025-06-20 01:17:44,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.24 | bwd: 3312.92 | bwd_inner: 3312.07 | bwd_allreduce: 0.80 | step: 6.98 75%|███████▍ | 7484/10000 [11:48:05<3:50:07, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0006336274091154337, 'learning_rate': 6.280752337161801e-06, 'epoch': 7.48} 75%|███████▍ | 7484/10000 [11:48:05<3:50:07, 5.49s/it][2025-06-20 01:17:50,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:17:50,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.87 | bwd_microstep: 3312.69 | bwd_inner_microstep: 3311.88 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.12 [2025-06-20 01:17:50,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.87 | bwd: 3312.70 | bwd_inner: 3311.88 | bwd_allreduce: 0.79 | step: 7.12 75%|███████▍ | 7485/10000 [11:48:10<3:49:37, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004968822933733463, 'learning_rate': 6.2760397803646625e-06, 'epoch': 7.49} 75%|███████▍ | 7485/10000 [11:48:10<3:49:37, 5.48s/it][2025-06-20 01:17:55,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:17:55,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.93 | bwd_microstep: 3320.33 | bwd_inner_microstep: 3319.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 01:17:55,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.93 | bwd: 3320.35 | bwd_inner: 3319.54 | bwd_allreduce: 0.76 | step: 6.72 75%|███████▍ | 7486/10000 [11:48:16<3:49:22, 5.47s/it] {'loss': 0.0016, 'grad_norm': 0.2862686514854431, 'learning_rate': 6.271328663147254e-06, 'epoch': 7.49} 75%|███████▍ | 7486/10000 [11:48:16<3:49:22, 5.47s/it][2025-06-20 01:18:01,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:18:01,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.25 | bwd_microstep: 3313.78 | bwd_inner_microstep: 3312.97 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-20 01:18:01,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.25 | bwd: 3313.80 | bwd_inner: 3312.97 | bwd_allreduce: 0.78 | step: 7.14 75%|███████▍ | 7487/10000 [11:48:21<3:49:04, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.05505749583244324, 'learning_rate': 6.266618986003759e-06, 'epoch': 7.49} 75%|███████▍ | 7487/10000 [11:48:21<3:49:04, 5.47s/it][2025-06-20 01:18:06,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:18:06,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.80 | bwd_microstep: 3303.17 | bwd_inner_microstep: 3302.17 | bwd_allreduce_microstep: 0.96 | step_microstep: 6.90 [2025-06-20 01:18:06,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.80 | bwd: 3303.18 | bwd_inner: 3302.17 | bwd_allreduce: 0.97 | step: 6.91 75%|███████▍ | 7488/10000 [11:48:27<3:48:43, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0004592407785821706, 'learning_rate': 6.261910749428188e-06, 'epoch': 7.49} 75%|███████▍ | 7488/10000 [11:48:27<3:48:43, 5.46s/it][2025-06-20 01:18:11,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:18:11,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.77 | bwd_microstep: 3314.83 | bwd_inner_microstep: 3313.78 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.13 [2025-06-20 01:18:11,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.77 | bwd: 3314.84 | bwd_inner: 3313.78 | bwd_allreduce: 1.01 | step: 7.13 75%|███████▍ | 7489/10000 [11:48:32<3:48:34, 5.46s/it] {'loss': 
0.0, 'grad_norm': 0.004962252918630838, 'learning_rate': 6.257203953914419e-06, 'epoch': 7.49} 75%|███████▍ | 7489/10000 [11:48:32<3:48:34, 5.46s/it][2025-06-20 01:18:17,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:18:17,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.36 | bwd_microstep: 3363.38 | bwd_inner_microstep: 3362.36 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.68 [2025-06-20 01:18:17,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.36 | bwd: 3363.39 | bwd_inner: 3362.36 | bwd_allreduce: 0.99 | step: 7.68 75%|███████▍ | 7490/10000 [11:48:38<3:49:24, 5.48s/it] {'loss': 0.001, 'grad_norm': 0.19710558652877808, 'learning_rate': 6.252498599956172e-06, 'epoch': 7.49} 75%|███████▍ | 7490/10000 [11:48:38<3:49:24, 5.48s/it][2025-06-20 01:18:23,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:18:23,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.64 | bwd_microstep: 3363.94 | bwd_inner_microstep: 3363.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 01:18:23,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.64 | bwd: 3363.96 | bwd_inner: 3363.16 | bwd_allreduce: 0.76 | step: 6.62 75%|███████▍ | 7491/10000 [11:48:43<3:49:53, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.41805994510650635, 'learning_rate': 6.2477946880470196e-06, 'epoch': 7.49} 75%|███████▍ | 7491/10000 [11:48:43<3:49:53, 5.50s/it][2025-06-20 01:18:28,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:18:28,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.57 | bwd_microstep: 3360.27 | bwd_inner_microstep: 3359.46 | bwd_allreduce_microstep: 0.76 | 
step_microstep: 6.74 [2025-06-20 01:18:28,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.58 | bwd: 3360.28 | bwd_inner: 3359.46 | bwd_allreduce: 0.78 | step: 6.74 75%|███████▍ | 7492/10000 [11:48:49<3:50:04, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00012182041973574087, 'learning_rate': 6.2430922186803825e-06, 'epoch': 7.49} 75%|███████▍ | 7492/10000 [11:48:49<3:50:04, 5.50s/it][2025-06-20 01:18:34,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:18:34,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.40 | bwd_microstep: 3368.35 | bwd_inner_microstep: 3367.53 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-20 01:18:34,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.40 | bwd: 3368.36 | bwd_inner: 3367.53 | bwd_allreduce: 0.79 | step: 7.23 75%|███████▍ | 7493/10000 [11:48:54<3:50:24, 5.51s/it] {'loss': 0.0244, 'grad_norm': 5.385951042175293, 'learning_rate': 6.23839119234952e-06, 'epoch': 7.49} 75%|███████▍ | 7493/10000 [11:48:54<3:50:24, 5.51s/it][2025-06-20 01:18:39,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:18:39,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.53 | bwd_microstep: 3367.01 | bwd_inner_microstep: 3366.05 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.16 [2025-06-20 01:18:39,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.53 | bwd: 3367.03 | bwd_inner: 3366.05 | bwd_allreduce: 0.93 | step: 7.16 75%|███████▍ | 7494/10000 [11:49:00<3:50:40, 5.52s/it] {'loss': 0.0005, 'grad_norm': 0.12939409911632538, 'learning_rate': 6.23369160954755e-06, 'epoch': 7.49} 75%|███████▍ | 7494/10000 [11:49:00<3:50:40, 5.52s/it][2025-06-20 01:18:45,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.78 [2025-06-20 01:18:45,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.43 | bwd_microstep: 3311.41 | bwd_inner_microstep: 3310.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-20 01:18:45,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.43 | bwd: 3311.42 | bwd_inner: 3310.62 | bwd_allreduce: 0.75 | step: 6.77 75%|███████▍ | 7495/10000 [11:49:05<3:49:40, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001137442304752767, 'learning_rate': 6.228993470767439e-06, 'epoch': 7.5} 75%|███████▍ | 7495/10000 [11:49:05<3:49:40, 5.50s/it][2025-06-20 01:18:50,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:18:50,626] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.74 | bwd_microstep: 3387.20 | bwd_inner_microstep: 3386.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 01:18:50,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.74 | bwd: 3387.21 | bwd_inner: 3386.41 | bwd_allreduce: 0.76 | step: 6.63 75%|███████▍ | 7496/10000 [11:49:11<3:50:19, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0049878740683197975, 'learning_rate': 6.224296776501999e-06, 'epoch': 7.5} 75%|███████▍ | 7496/10000 [11:49:11<3:50:19, 5.52s/it][2025-06-20 01:18:56,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:18:56,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.78 | bwd_microstep: 3327.06 | bwd_inner_microstep: 3326.19 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.50 [2025-06-20 01:18:56,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.78 | bwd: 3327.08 | bwd_inner: 3326.19 | bwd_allreduce: 0.83 | step: 7.50 75%|███████▍ | 7497/10000 [11:49:16<3:49:38, 5.50s/it] {'loss': 0.0, 
'grad_norm': 0.001472136122174561, 'learning_rate': 6.219601527243893e-06, 'epoch': 7.5} 75%|███████▍ | 7497/10000 [11:49:16<3:49:38, 5.50s/it][2025-06-20 01:19:01,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:19:01,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.51 | bwd_microstep: 3320.25 | bwd_inner_microstep: 3319.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 01:19:01,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.51 | bwd: 3320.26 | bwd_inner: 3319.44 | bwd_allreduce: 0.78 | step: 6.82 75%|███████▍ | 7498/10000 [11:49:22<3:49:04, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.009782901033759117, 'learning_rate': 6.21490772348563e-06, 'epoch': 7.5} 75%|███████▍ | 7498/10000 [11:49:22<3:49:04, 5.49s/it][2025-06-20 01:19:07,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:19:07,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.12 | bwd_microstep: 3323.14 | bwd_inner_microstep: 3322.22 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.11 [2025-06-20 01:19:07,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.12 | bwd: 3323.16 | bwd_inner: 3322.22 | bwd_allreduce: 0.88 | step: 7.11 75%|███████▍ | 7499/10000 [11:49:27<3:50:08, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.03376942500472069, 'learning_rate': 6.21021536571957e-06, 'epoch': 7.5} 75%|███████▍ | 7499/10000 [11:49:27<3:50:08, 5.52s/it][2025-06-20 01:19:12,612] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:19:12,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.86 | bwd_microstep: 3317.05 | bwd_inner_microstep: 3316.24 | bwd_allreduce_microstep: 0.76 | 
step_microstep: 7.05 [2025-06-20 01:19:12,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.86 | bwd: 3317.06 | bwd_inner: 3316.24 | bwd_allreduce: 0.77 | step: 7.05 75%|███████▌ | 7500/10000 [11:49:33<3:49:18, 5.50s/it] {'loss': 0.0173, 'grad_norm': 6.634639263153076, 'learning_rate': 6.2055244544379145e-06, 'epoch': 7.5} 75%|███████▌ | 7500/10000 [11:49:33<3:49:18, 5.50s/it][2025-06-20 01:19:18,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:19:18,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.08 | bwd_microstep: 3367.38 | bwd_inner_microstep: 3366.59 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-20 01:19:18,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.08 | bwd: 3367.40 | bwd_inner: 3366.59 | bwd_allreduce: 0.76 | step: 6.77 75%|███████▌ | 7501/10000 [11:49:38<3:49:34, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0013642450794577599, 'learning_rate': 6.2008349901327225e-06, 'epoch': 7.5} 75%|███████▌ | 7501/10000 [11:49:38<3:49:34, 5.51s/it][2025-06-20 01:19:23,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:19:23,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.35 | bwd_microstep: 3363.32 | bwd_inner_microstep: 3362.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 01:19:23,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.35 | bwd: 3363.34 | bwd_inner: 3362.53 | bwd_allreduce: 0.76 | step: 6.70 75%|███████▌ | 7502/10000 [11:49:44<3:49:45, 5.52s/it] {'loss': 0.0025, 'grad_norm': 0.676811695098877, 'learning_rate': 6.196146973295905e-06, 'epoch': 7.5} 75%|███████▌ | 7502/10000 [11:49:44<3:49:45, 5.52s/it][2025-06-20 01:19:29,216] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 
| optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:19:29,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.28 | bwd_microstep: 3372.76 | bwd_inner_microstep: 3371.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-20 01:19:29,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.28 | bwd: 3372.77 | bwd_inner: 3371.97 | bwd_allreduce: 0.76 | step: 6.63 75%|███████▌ | 7503/10000 [11:49:50<3:49:54, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.01798504777252674, 'learning_rate': 6.191460404419194e-06, 'epoch': 7.5} 75%|███████▌ | 7503/10000 [11:49:50<3:49:54, 5.52s/it][2025-06-20 01:19:34,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.95 [2025-06-20 01:19:34,675] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.96 | bwd_microstep: 3312.54 | bwd_inner_microstep: 3311.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.06 [2025-06-20 01:19:34,676] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.96 | bwd: 3312.56 | bwd_inner: 3311.76 | bwd_allreduce: 0.76 | step: 7.07 75%|███████▌ | 7504/10000 [11:49:55<3:49:00, 5.50s/it] {'loss': 0.0007, 'grad_norm': 0.6648096442222595, 'learning_rate': 6.1867752839942016e-06, 'epoch': 7.5} 75%|███████▌ | 7504/10000 [11:49:55<3:49:00, 5.50s/it][2025-06-20 01:19:40,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.85 [2025-06-20 01:19:40,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.81 | bwd_microstep: 3324.00 | bwd_inner_microstep: 3323.19 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-20 01:19:40,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.81 | bwd: 3324.02 | bwd_inner: 3323.19 | bwd_allreduce: 0.78 | step: 6.98 75%|███████▌ | 7505/10000 [11:50:00<3:48:24, 5.49s/it] {'loss': 0.0, 
'grad_norm': 0.00023673796385992318, 'learning_rate': 6.182091612512373e-06, 'epoch': 7.5} 75%|███████▌ | 7505/10000 [11:50:00<3:48:24, 5.49s/it][2025-06-20 01:19:45,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:19:45,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.17 | bwd_microstep: 3407.08 | bwd_inner_microstep: 3406.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-20 01:19:45,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.17 | bwd: 3407.10 | bwd_inner: 3406.29 | bwd_allreduce: 0.77 | step: 6.84 75%|███████▌ | 7506/10000 [11:50:06<3:49:25, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.05703936144709587, 'learning_rate': 6.177409390465003e-06, 'epoch': 7.51} 75%|███████▌ | 7506/10000 [11:50:06<3:49:25, 5.52s/it][2025-06-20 01:19:51,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:19:51,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.80 | bwd_microstep: 3383.21 | bwd_inner_microstep: 3382.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.88 [2025-06-20 01:19:51,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.80 | bwd: 3383.23 | bwd_inner: 3382.38 | bwd_allreduce: 0.80 | step: 6.89 75%|███████▌ | 7507/10000 [11:50:12<3:49:50, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.004723557271063328, 'learning_rate': 6.172728618343242e-06, 'epoch': 7.51} 75%|███████▌ | 7507/10000 [11:50:12<3:49:50, 5.53s/it][2025-06-20 01:19:56,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 01:19:56,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.71 | bwd_microstep: 3321.33 | bwd_inner_microstep: 3320.31 | bwd_allreduce_microstep: 0.96 | 
step_microstep: 7.32 [2025-06-20 01:19:56,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.71 | bwd: 3321.35 | bwd_inner: 3320.30 | bwd_allreduce: 0.99 | step: 7.32 75%|███████▌ | 7508/10000 [11:50:17<3:48:53, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.06436293572187424, 'learning_rate': 6.1680492966380675e-06, 'epoch': 7.51} 75%|███████▌ | 7508/10000 [11:50:17<3:48:53, 5.51s/it][2025-06-20 01:20:02,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:20:02,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.62 | bwd_microstep: 3372.78 | bwd_inner_microstep: 3372.01 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.59 [2025-06-20 01:20:02,288] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.62 | bwd: 3372.80 | bwd_inner: 3372.01 | bwd_allreduce: 0.75 | step: 6.59 75%|███████▌ | 7509/10000 [11:50:23<3:49:11, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.010902122594416142, 'learning_rate': 6.1633714258403254e-06, 'epoch': 7.51} 75%|███████▌ | 7509/10000 [11:50:23<3:49:11, 5.52s/it][2025-06-20 01:20:07,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:20:07,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.74 | bwd_microstep: 3323.05 | bwd_inner_microstep: 3322.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-20 01:20:07,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.74 | bwd: 3323.06 | bwd_inner: 3322.24 | bwd_allreduce: 0.78 | step: 7.12 75%|███████▌ | 7510/10000 [11:50:28<3:48:24, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.03682028129696846, 'learning_rate': 6.158695006440703e-06, 'epoch': 7.51} 75%|███████▌ | 7510/10000 [11:50:28<3:48:24, 5.50s/it][2025-06-20 01:20:13,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:20:13,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.48 | bwd_microstep: 3317.16 | bwd_inner_microstep: 3316.28 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.08 [2025-06-20 01:20:13,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.48 | bwd: 3317.18 | bwd_inner: 3316.28 | bwd_allreduce: 0.84 | step: 7.08 75%|███████▌ | 7511/10000 [11:50:34<3:47:55, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004589228890836239, 'learning_rate': 6.1540200389297355e-06, 'epoch': 7.51} 75%|███████▌ | 7511/10000 [11:50:34<3:47:55, 5.49s/it][2025-06-20 01:20:18,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-20 01:20:18,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.88 | bwd_microstep: 3369.20 | bwd_inner_microstep: 3368.24 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.02 [2025-06-20 01:20:18,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.88 | bwd: 3369.22 | bwd_inner: 3368.24 | bwd_allreduce: 0.93 | step: 7.02 75%|███████▌ | 7512/10000 [11:50:39<3:48:23, 5.51s/it] {'loss': 0.0036, 'grad_norm': 0.9617035984992981, 'learning_rate': 6.149346523797806e-06, 'epoch': 7.51} 75%|███████▌ | 7512/10000 [11:50:39<3:48:23, 5.51s/it][2025-06-20 01:20:24,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:20:24,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.23 | bwd_microstep: 3384.20 | bwd_inner_microstep: 3383.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-20 01:20:24,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.23 | bwd: 3384.21 | bwd_inner: 3383.40 | bwd_allreduce: 0.77 | step: 6.87 75%|███████▌ | 7513/10000 [11:50:45<3:48:54, 
5.52s/it] {'loss': 0.0, 'grad_norm': 0.007472701836377382, 'learning_rate': 6.14467446153514e-06, 'epoch': 7.51} 75%|███████▌ | 7513/10000 [11:50:45<3:48:54, 5.52s/it][2025-06-20 01:20:29,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:20:29,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.46 | bwd_microstep: 3323.90 | bwd_inner_microstep: 3323.09 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-20 01:20:29,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.46 | bwd: 3323.91 | bwd_inner: 3323.09 | bwd_allreduce: 0.78 | step: 6.80 75%|███████▌ | 7514/10000 [11:50:50<3:48:10, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00018665670359041542, 'learning_rate': 6.140003852631822e-06, 'epoch': 7.51} 75%|███████▌ | 7514/10000 [11:50:50<3:48:10, 5.51s/it][2025-06-20 01:20:35,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:20:35,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.72 | bwd_microstep: 3383.39 | bwd_inner_microstep: 3382.50 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.85 [2025-06-20 01:20:35,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.72 | bwd: 3383.40 | bwd_inner: 3382.50 | bwd_allreduce: 0.86 | step: 6.85 75%|███████▌ | 7515/10000 [11:50:56<3:48:39, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.007174666505306959, 'learning_rate': 6.135334697577771e-06, 'epoch': 7.51} 75%|███████▌ | 7515/10000 [11:50:56<3:48:39, 5.52s/it][2025-06-20 01:20:40,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:20:40,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.19 | bwd_microstep: 3403.04 | bwd_inner_microstep: 3401.92 | 
bwd_allreduce_microstep: 1.06 | step_microstep: 7.55 [2025-06-20 01:20:40,929] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.19 | bwd: 3403.06 | bwd_inner: 3401.92 | bwd_allreduce: 1.08 | step: 7.55 75%|███████▌ | 7516/10000 [11:51:01<3:49:22, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0010180146200582385, 'learning_rate': 6.130666996862762e-06, 'epoch': 7.52} 75%|███████▌ | 7516/10000 [11:51:01<3:49:22, 5.54s/it][2025-06-20 01:20:46,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 01:20:46,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.89 | bwd_microstep: 3335.38 | bwd_inner_microstep: 3334.41 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.93 [2025-06-20 01:20:46,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.89 | bwd: 3335.39 | bwd_inner: 3334.41 | bwd_allreduce: 0.94 | step: 7.93 75%|███████▌ | 7517/10000 [11:51:07<3:48:49, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.010818944312632084, 'learning_rate': 6.126000750976424e-06, 'epoch': 7.52} 75%|███████▌ | 7517/10000 [11:51:07<3:48:49, 5.53s/it][2025-06-20 01:20:51,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.88 [2025-06-20 01:20:51,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.56 | bwd_microstep: 3334.20 | bwd_inner_microstep: 3333.38 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.03 [2025-06-20 01:20:51,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.56 | bwd: 3334.21 | bwd_inner: 3333.38 | bwd_allreduce: 0.80 | step: 7.03 75%|███████▌ | 7518/10000 [11:51:12<3:48:18, 5.52s/it] {'loss': 0.0026, 'grad_norm': 0.40725523233413696, 'learning_rate': 6.121335960408208e-06, 'epoch': 7.52} 75%|███████▌ | 7518/10000 [11:51:12<3:48:18, 5.52s/it][2025-06-20 01:20:57,404] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:20:57,404] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.89 | bwd_microstep: 3328.34 | bwd_inner_microstep: 3327.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.21 [2025-06-20 01:20:57,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.89 | bwd: 3328.35 | bwd_inner: 3327.53 | bwd_allreduce: 0.78 | step: 7.22 75%|███████▌ | 7519/10000 [11:51:18<3:47:40, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.04476635530591011, 'learning_rate': 6.116672625647438e-06, 'epoch': 7.52} 75%|███████▌ | 7519/10000 [11:51:18<3:47:40, 5.51s/it][2025-06-20 01:21:02,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:21:02,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.71 | bwd_microstep: 3386.08 | bwd_inner_microstep: 3385.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 01:21:02,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.71 | bwd: 3386.09 | bwd_inner: 3385.28 | bwd_allreduce: 0.77 | step: 6.69 75%|███████▌ | 7520/10000 [11:51:23<3:48:15, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.023280803114175797, 'learning_rate': 6.1120107471832744e-06, 'epoch': 7.52} 75%|███████▌ | 7520/10000 [11:51:23<3:48:15, 5.52s/it][2025-06-20 01:21:08,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:21:08,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.13 | bwd_microstep: 3337.42 | bwd_inner_microstep: 3336.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 01:21:08,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.14 | bwd: 3337.44 | bwd_inner: 3336.63 | bwd_allreduce: 0.76 | step: 6.71 75%|███████▌ | 7521/10000 
[11:51:29<3:47:42, 5.51s/it] {'loss': 0.0004, 'grad_norm': 0.07248756289482117, 'learning_rate': 6.107350325504729e-06, 'epoch': 7.52} 75%|███████▌ | 7521/10000 [11:51:29<3:47:42, 5.51s/it][2025-06-20 01:21:13,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:21:13,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.63 | bwd_microstep: 3318.98 | bwd_inner_microstep: 3318.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-20 01:21:13,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.64 | bwd: 3319.00 | bwd_inner: 3318.17 | bwd_allreduce: 0.78 | step: 6.99 75%|███████▌ | 7522/10000 [11:51:34<3:47:03, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002873876830562949, 'learning_rate': 6.102691361100661e-06, 'epoch': 7.52} 75%|███████▌ | 7522/10000 [11:51:34<3:47:03, 5.50s/it][2025-06-20 01:21:19,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:21:19,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.08 | bwd_microstep: 3327.01 | bwd_inner_microstep: 3326.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-20 01:21:19,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.08 | bwd: 3327.03 | bwd_inner: 3326.20 | bwd_allreduce: 0.78 | step: 7.11 75%|███████▌ | 7523/10000 [11:51:40<3:46:39, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.16804201900959015, 'learning_rate': 6.098033854459764e-06, 'epoch': 7.52} 75%|███████▌ | 7523/10000 [11:51:40<3:46:39, 5.49s/it][2025-06-20 01:21:24,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:21:24,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.03 | bwd_microstep: 3387.18 | bwd_inner_microstep: 
3386.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 01:21:24,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.03 | bwd: 3387.20 | bwd_inner: 3386.39 | bwd_allreduce: 0.76 | step: 6.69 75%|███████▌ | 7524/10000 [11:51:45<3:47:24, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.001857682247646153, 'learning_rate': 6.093377806070595e-06, 'epoch': 7.52} 75%|███████▌ | 7524/10000 [11:51:45<3:47:24, 5.51s/it][2025-06-20 01:21:30,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:21:30,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.30 | bwd_microstep: 3394.56 | bwd_inner_microstep: 3393.74 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.19 [2025-06-20 01:21:30,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.30 | bwd: 3394.58 | bwd_inner: 3393.74 | bwd_allreduce: 0.79 | step: 7.20 75%|███████▌ | 7525/10000 [11:51:51<3:48:03, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0040712314657866955, 'learning_rate': 6.088723216421551e-06, 'epoch': 7.53} 75%|███████▌ | 7525/10000 [11:51:51<3:48:03, 5.53s/it][2025-06-20 01:21:36,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:21:36,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.85 | bwd_microstep: 3375.77 | bwd_inner_microstep: 3374.90 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.31 [2025-06-20 01:21:36,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.85 | bwd: 3375.79 | bwd_inner: 3374.90 | bwd_allreduce: 0.83 | step: 7.32 75%|███████▌ | 7526/10000 [11:51:56<3:48:21, 5.54s/it] {'loss': 0.0001, 'grad_norm': 0.027798276394605637, 'learning_rate': 6.0840700860008776e-06, 'epoch': 7.53} 75%|███████▌ | 7526/10000 [11:51:56<3:48:21, 5.54s/it][2025-06-20 01:21:41,553] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:21:41,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.31 | bwd_microstep: 3318.94 | bwd_inner_microstep: 3318.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-20 01:21:41,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.31 | bwd: 3318.95 | bwd_inner: 3318.13 | bwd_allreduce: 0.78 | step: 7.19 75%|███████▌ | 7527/10000 [11:52:02<3:47:28, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.002823882969096303, 'learning_rate': 6.079418415296672e-06, 'epoch': 7.53} 75%|███████▌ | 7527/10000 [11:52:02<3:47:28, 5.52s/it][2025-06-20 01:21:47,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:21:47,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.42 | bwd_microstep: 3328.32 | bwd_inner_microstep: 3327.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-20 01:21:47,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.42 | bwd: 3328.34 | bwd_inner: 3327.52 | bwd_allreduce: 0.78 | step: 6.77 75%|███████▌ | 7528/10000 [11:52:07<3:46:53, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.004381472710520029, 'learning_rate': 6.074768204796862e-06, 'epoch': 7.53} 75%|███████▌ | 7528/10000 [11:52:07<3:46:53, 5.51s/it][2025-06-20 01:21:52,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:21:52,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.57 | bwd_microstep: 3318.68 | bwd_inner_microstep: 3317.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-20 01:21:52,507] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.57 | bwd: 3318.69 | bwd_inner: 3317.87 | bwd_allreduce: 0.78 | step: 7.06 75%|███████▌ | 7529/10000 
[11:52:13<3:46:23, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00022579636424779892, 'learning_rate': 6.070119454989238e-06, 'epoch': 7.53} 75%|███████▌ | 7529/10000 [11:52:13<3:46:23, 5.50s/it][2025-06-20 01:21:57,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:21:57,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.00 | bwd_microstep: 3319.55 | bwd_inner_microstep: 3318.55 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.48 [2025-06-20 01:21:57,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.00 | bwd: 3319.57 | bwd_inner: 3318.55 | bwd_allreduce: 0.97 | step: 7.48 75%|███████▌ | 7530/10000 [11:52:18<3:45:59, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.24905672669410706, 'learning_rate': 6.065472166361432e-06, 'epoch': 7.53} 75%|███████▌ | 7530/10000 [11:52:18<3:45:59, 5.49s/it][2025-06-20 01:22:03,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:22:03,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.90 | bwd_microstep: 3326.67 | bwd_inner_microstep: 3325.89 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 01:22:03,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.90 | bwd: 3326.69 | bwd_inner: 3325.89 | bwd_allreduce: 0.75 | step: 6.61 75%|███████▌ | 7531/10000 [11:52:24<3:45:46, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.03520781546831131, 'learning_rate': 6.060826339400925e-06, 'epoch': 7.53} 75%|███████▌ | 7531/10000 [11:52:24<3:45:46, 5.49s/it][2025-06-20 01:22:08,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:22:08,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.41 | bwd_microstep: 3322.44 | bwd_inner_microstep: 
3321.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 01:22:08,922] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.41 | bwd: 3322.46 | bwd_inner: 3321.65 | bwd_allreduce: 0.76 | step: 6.80 75%|███████▌ | 7532/10000 [11:52:29<3:45:23, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0013488717377185822, 'learning_rate': 6.056181974595039e-06, 'epoch': 7.53} 75%|███████▌ | 7532/10000 [11:52:29<3:45:23, 5.48s/it][2025-06-20 01:22:14,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:22:14,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.05 | bwd_microstep: 3378.73 | bwd_inner_microstep: 3377.80 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.41 [2025-06-20 01:22:14,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.05 | bwd: 3378.75 | bwd_inner: 3377.80 | bwd_allreduce: 0.90 | step: 7.41 75%|███████▌ | 7533/10000 [11:52:35<3:46:11, 5.50s/it] {'loss': 0.0, 'grad_norm': 6.0383994423318654e-05, 'learning_rate': 6.05153907243095e-06, 'epoch': 7.53} 75%|███████▌ | 7533/10000 [11:52:35<3:46:11, 5.50s/it][2025-06-20 01:22:19,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:22:19,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.11 | bwd_microstep: 3320.83 | bwd_inner_microstep: 3319.87 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.13 [2025-06-20 01:22:19,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.11 | bwd: 3320.84 | bwd_inner: 3319.88 | bwd_allreduce: 0.92 | step: 7.13 75%|███████▌ | 7534/10000 [11:52:40<3:45:53, 5.50s/it] {'loss': 0.0006, 'grad_norm': 0.10051538795232773, 'learning_rate': 6.046897633395676e-06, 'epoch': 7.53} 75%|███████▌ | 7534/10000 [11:52:40<3:45:53, 5.50s/it][2025-06-20 01:22:25,520] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 01:22:25,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.02 | bwd_microstep: 3379.03 | bwd_inner_microstep: 3377.82 | bwd_allreduce_microstep: 1.13 | step_microstep: 8.12 [2025-06-20 01:22:25,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.02 | bwd: 3379.05 | bwd_inner: 3377.82 | bwd_allreduce: 1.16 | step: 8.15 75%|███████▌ | 7535/10000 [11:52:46<3:46:40, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.07585659623146057, 'learning_rate': 6.042257657976081e-06, 'epoch': 7.54} 75%|███████▌ | 7535/10000 [11:52:46<3:46:40, 5.52s/it][2025-06-20 01:22:30,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:22:30,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.61 | bwd_microstep: 3315.62 | bwd_inner_microstep: 3314.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-20 01:22:30,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.61 | bwd: 3315.64 | bwd_inner: 3314.83 | bwd_allreduce: 0.76 | step: 6.92 75%|███████▌ | 7536/10000 [11:52:51<3:45:56, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.1471073180437088, 'learning_rate': 6.0376191466588775e-06, 'epoch': 7.54} 75%|███████▌ | 7536/10000 [11:52:51<3:45:56, 5.50s/it][2025-06-20 01:22:36,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:22:36,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.79 | bwd_microstep: 3330.46 | bwd_inner_microstep: 3329.53 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.97 [2025-06-20 01:22:36,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.79 | bwd: 3330.48 | bwd_inner: 3329.53 | bwd_allreduce: 0.89 | step: 7.99 75%|███████▌ | 
7537/10000 [11:52:57<3:45:38, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.02932550385594368, 'learning_rate': 6.032982099930631e-06, 'epoch': 7.54} 75%|███████▌ | 7537/10000 [11:52:57<3:45:38, 5.50s/it][2025-06-20 01:22:41,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:22:41,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.61 | bwd_microstep: 3332.57 | bwd_inner_microstep: 3331.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 01:22:41,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.61 | bwd: 3332.59 | bwd_inner: 3331.79 | bwd_allreduce: 0.76 | step: 6.59 75%|███████▌ | 7538/10000 [11:53:02<3:45:21, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.0171751007437706, 'learning_rate': 6.028346518277734e-06, 'epoch': 7.54} 75%|███████▌ | 7538/10000 [11:53:02<3:45:21, 5.49s/it][2025-06-20 01:22:47,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:22:47,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.84 | bwd_microstep: 3317.71 | bwd_inner_microstep: 3316.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 01:22:47,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.84 | bwd: 3317.72 | bwd_inner: 3316.92 | bwd_allreduce: 0.76 | step: 6.64 75%|███████▌ | 7539/10000 [11:53:08<3:44:55, 5.48s/it] {'loss': 0.0063, 'grad_norm': 1.5408936738967896, 'learning_rate': 6.023712402186442e-06, 'epoch': 7.54} 75%|███████▌ | 7539/10000 [11:53:08<3:44:55, 5.48s/it][2025-06-20 01:22:52,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:22:52,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.45 | bwd_microstep: 3316.83 | 
bwd_inner_microstep: 3315.93 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.36 [2025-06-20 01:22:52,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.45 | bwd: 3316.86 | bwd_inner: 3315.93 | bwd_allreduce: 0.85 | step: 7.35 75%|███████▌ | 7540/10000 [11:53:13<3:44:42, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.005428242962807417, 'learning_rate': 6.019079752142854e-06, 'epoch': 7.54} 75%|███████▌ | 7540/10000 [11:53:13<3:44:42, 5.48s/it][2025-06-20 01:22:58,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 01:22:58,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.72 | bwd_microstep: 3371.97 | bwd_inner_microstep: 3371.20 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.65 [2025-06-20 01:22:58,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.72 | bwd: 3371.99 | bwd_inner: 3371.20 | bwd_allreduce: 0.75 | step: 6.66 75%|███████▌ | 7541/10000 [11:53:19<3:45:25, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0030079104471951723, 'learning_rate': 6.014448568632911e-06, 'epoch': 7.54} 75%|███████▌ | 7541/10000 [11:53:19<3:45:25, 5.50s/it][2025-06-20 01:23:03,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:23:03,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.17 | bwd_microstep: 3321.51 | bwd_inner_microstep: 3320.58 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.93 [2025-06-20 01:23:03,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.17 | bwd: 3321.53 | bwd_inner: 3320.58 | bwd_allreduce: 0.89 | step: 6.94 75%|███████▌ | 7542/10000 [11:53:24<3:44:55, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005475529003888369, 'learning_rate': 6.009818852142411e-06, 'epoch': 7.54} 75%|███████▌ | 7542/10000 [11:53:24<3:44:55, 5.49s/it][2025-06-20 01:23:09,374] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:23:09,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.93 | bwd_microstep: 3320.05 | bwd_inner_microstep: 3319.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.59 [2025-06-20 01:23:09,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.93 | bwd: 3320.06 | bwd_inner: 3319.26 | bwd_allreduce: 0.76 | step: 6.59 75%|███████▌ | 7543/10000 [11:53:30<3:44:31, 5.48s/it] {'loss': 0.0078, 'grad_norm': 2.1965789794921875, 'learning_rate': 6.005190603156976e-06, 'epoch': 7.54} 75%|███████▌ | 7543/10000 [11:53:30<3:44:31, 5.48s/it][2025-06-20 01:23:14,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:23:14,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.26 | bwd_microstep: 3315.17 | bwd_inner_microstep: 3314.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-20 01:23:14,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.26 | bwd: 3315.19 | bwd_inner: 3314.37 | bwd_allreduce: 0.77 | step: 6.91 75%|███████▌ | 7544/10000 [11:53:35<3:44:11, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.07836231589317322, 'learning_rate': 6.000563822162095e-06, 'epoch': 7.54} 75%|███████▌ | 7544/10000 [11:53:35<3:44:11, 5.48s/it][2025-06-20 01:23:20,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:23:20,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.41 | bwd_microstep: 3323.74 | bwd_inner_microstep: 3322.93 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-20 01:23:20,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.41 | bwd: 3323.75 | bwd_inner: 3322.93 | bwd_allreduce: 0.78 | step: 
7.03 75%|███████▌ | 7545/10000 [11:53:41<3:43:59, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.005535628646612167, 'learning_rate': 5.995938509643096e-06, 'epoch': 7.54} 75%|███████▌ | 7545/10000 [11:53:41<3:43:59, 5.47s/it][2025-06-20 01:23:25,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:23:25,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.95 | bwd_microstep: 3317.83 | bwd_inner_microstep: 3317.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 01:23:25,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.95 | bwd: 3317.84 | bwd_inner: 3317.04 | bwd_allreduce: 0.76 | step: 6.64 75%|███████▌ | 7546/10000 [11:53:46<3:43:45, 5.47s/it] {'loss': 0.001, 'grad_norm': 0.4397992193698883, 'learning_rate': 5.991314666085153e-06, 'epoch': 7.55} 75%|███████▌ | 7546/10000 [11:53:46<3:43:45, 5.47s/it][2025-06-20 01:23:31,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:23:31,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.03 | bwd_microstep: 3362.41 | bwd_inner_microstep: 3361.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.06 [2025-06-20 01:23:31,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.03 | bwd: 3362.43 | bwd_inner: 3361.61 | bwd_allreduce: 0.77 | step: 7.06 75%|███████▌ | 7547/10000 [11:53:52<3:44:26, 5.49s/it] {'loss': 0.0025, 'grad_norm': 1.2014379501342773, 'learning_rate': 5.986692291973284e-06, 'epoch': 7.55} 75%|███████▌ | 7547/10000 [11:53:52<3:44:26, 5.49s/it][2025-06-20 01:23:36,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:23:36,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.60 | bwd_microstep: 3314.40 
| bwd_inner_microstep: 3313.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 01:23:36,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.60 | bwd: 3314.42 | bwd_inner: 3313.62 | bwd_allreduce: 0.75 | step: 6.70 75%|███████▌ | 7548/10000 [11:53:57<3:44:01, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0004302675079088658, 'learning_rate': 5.9820713877923565e-06, 'epoch': 7.55} 75%|███████▌ | 7548/10000 [11:53:57<3:44:01, 5.48s/it][2025-06-20 01:23:42,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:23:42,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.31 | bwd_microstep: 3318.34 | bwd_inner_microstep: 3317.39 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.14 [2025-06-20 01:23:42,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.31 | bwd: 3318.36 | bwd_inner: 3317.39 | bwd_allreduce: 0.92 | step: 7.15 75%|███████▌ | 7549/10000 [11:54:03<3:43:39, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00029551947955042124, 'learning_rate': 5.977451954027083e-06, 'epoch': 7.55} 75%|███████▌ | 7549/10000 [11:54:03<3:43:39, 5.47s/it][2025-06-20 01:23:47,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:23:47,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.91 | bwd_microstep: 3364.21 | bwd_inner_microstep: 3363.35 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.15 [2025-06-20 01:23:47,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.91 | bwd: 3364.23 | bwd_inner: 3363.35 | bwd_allreduce: 0.81 | step: 7.15 76%|███████▌ | 7550/10000 [11:54:08<3:44:19, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00015354888455476612, 'learning_rate': 5.972833991162017e-06, 'epoch': 7.55} 76%|███████▌ | 7550/10000 [11:54:08<3:44:19, 5.49s/it][2025-06-20 01:23:53,293] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:23:53,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.13 | bwd_microstep: 3367.39 | bwd_inner_microstep: 3366.52 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.43 [2025-06-20 01:23:53,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.13 | bwd: 3367.42 | bwd_inner: 3366.52 | bwd_allreduce: 0.84 | step: 7.44 76%|███████▌ | 7551/10000 [11:54:14<3:44:45, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.005941792856901884, 'learning_rate': 5.968217499681563e-06, 'epoch': 7.55} 76%|███████▌ | 7551/10000 [11:54:14<3:44:45, 5.51s/it][2025-06-20 01:23:58,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:23:58,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.43 | bwd_microstep: 3317.91 | bwd_inner_microstep: 3316.86 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.88 [2025-06-20 01:23:58,771] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.43 | bwd: 3317.93 | bwd_inner: 3316.86 | bwd_allreduce: 1.01 | step: 7.88 76%|███████▌ | 7552/10000 [11:54:19<3:44:16, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.020066363736987114, 'learning_rate': 5.96360248006997e-06, 'epoch': 7.55} 76%|███████▌ | 7552/10000 [11:54:19<3:44:16, 5.50s/it][2025-06-20 01:24:04,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-20 01:24:04,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.53 | bwd_microstep: 3369.33 | bwd_inner_microstep: 3368.09 | bwd_allreduce_microstep: 1.17 | step_microstep: 7.93 [2025-06-20 01:24:04,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.53 | bwd: 3369.35 | bwd_inner: 3368.09 | bwd_allreduce: 1.20 | step: 7.92 
76%|███████▌ | 7553/10000 [11:54:25<3:44:45, 5.51s/it] {'loss': 0.002, 'grad_norm': 0.7713295817375183, 'learning_rate': 5.958988932811338e-06, 'epoch': 7.55} 76%|███████▌ | 7553/10000 [11:54:25<3:44:45, 5.51s/it][2025-06-20 01:24:09,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:24:09,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.02 | bwd_microstep: 3320.03 | bwd_inner_microstep: 3319.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 01:24:09,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.02 | bwd: 3320.04 | bwd_inner: 3319.24 | bwd_allreduce: 0.76 | step: 6.83 76%|███████▌ | 7554/10000 [11:54:30<3:44:13, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.012187967076897621, 'learning_rate': 5.954376858389594e-06, 'epoch': 7.55} 76%|███████▌ | 7554/10000 [11:54:30<3:44:13, 5.50s/it][2025-06-20 01:24:15,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:24:15,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.67 | bwd_microstep: 3317.15 | bwd_inner_microstep: 3316.31 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.52 [2025-06-20 01:24:15,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.67 | bwd: 3317.16 | bwd_inner: 3316.31 | bwd_allreduce: 0.81 | step: 7.52 76%|███████▌ | 7555/10000 [11:54:36<3:43:45, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.013932593166828156, 'learning_rate': 5.949766257288532e-06, 'epoch': 7.55} 76%|███████▌ | 7555/10000 [11:54:36<3:43:45, 5.49s/it][2025-06-20 01:24:20,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:24:20,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.64 | bwd_microstep: 3367.74 
| bwd_inner_microstep: 3366.89 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.04 [2025-06-20 01:24:20,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.64 | bwd: 3367.76 | bwd_inner: 3366.89 | bwd_allreduce: 0.81 | step: 7.05 76%|███████▌ | 7556/10000 [11:54:41<3:44:17, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.02361132577061653, 'learning_rate': 5.945157129991779e-06, 'epoch': 7.56} 76%|███████▌ | 7556/10000 [11:54:41<3:44:17, 5.51s/it][2025-06-20 01:24:26,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-20 01:24:26,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.32 | bwd_microstep: 3365.85 | bwd_inner_microstep: 3364.74 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.69 [2025-06-20 01:24:26,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.32 | bwd: 3365.87 | bwd_inner: 3364.74 | bwd_allreduce: 1.09 | step: 7.69 76%|███████▌ | 7557/10000 [11:54:47<3:44:37, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0001478889462305233, 'learning_rate': 5.940549476982811e-06, 'epoch': 7.56} 76%|███████▌ | 7557/10000 [11:54:47<3:44:37, 5.52s/it][2025-06-20 01:24:31,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:24:31,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.97 | bwd_microstep: 3321.67 | bwd_inner_microstep: 3320.86 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-20 01:24:31,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.97 | bwd: 3321.68 | bwd_inner: 3320.86 | bwd_allreduce: 0.77 | step: 7.15 76%|███████▌ | 7558/10000 [11:54:52<3:44:05, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0009043096797540784, 'learning_rate': 5.935943298744959e-06, 'epoch': 7.56} 76%|███████▌ | 7558/10000 [11:54:52<3:44:05, 5.51s/it][2025-06-20 01:24:37,286] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:24:37,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.10 | bwd_microstep: 3309.56 | bwd_inner_microstep: 3308.74 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.76 [2025-06-20 01:24:37,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.10 | bwd: 3309.57 | bwd_inner: 3308.74 | bwd_allreduce: 0.79 | step: 6.76 76%|███████▌ | 7559/10000 [11:54:58<3:43:27, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0003683636605273932, 'learning_rate': 5.931338595761376e-06, 'epoch': 7.56} 76%|███████▌ | 7559/10000 [11:54:58<3:43:27, 5.49s/it][2025-06-20 01:24:42,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:24:42,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.11 | bwd_microstep: 3366.10 | bwd_inner_microstep: 3365.30 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-20 01:24:42,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.12 | bwd: 3366.12 | bwd_inner: 3365.30 | bwd_allreduce: 0.78 | step: 7.05 76%|███████▌ | 7560/10000 [11:55:03<3:43:48, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005646046483889222, 'learning_rate': 5.926735368515077e-06, 'epoch': 7.56} 76%|███████▌ | 7560/10000 [11:55:03<3:43:48, 5.50s/it][2025-06-20 01:24:48,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:24:48,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.47 | bwd_microstep: 3314.47 | bwd_inner_microstep: 3313.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-20 01:24:48,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.48 | bwd: 3314.48 | bwd_inner: 3313.69 | bwd_allreduce: 0.76 | step: 6.76 
76%|███████▌ | 7561/10000 [11:55:09<3:43:07, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.007860691286623478, 'learning_rate': 5.922133617488923e-06, 'epoch': 7.56} 76%|███████▌ | 7561/10000 [11:55:09<3:43:07, 5.49s/it][2025-06-20 01:24:53,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:24:53,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.22 | bwd_microstep: 3401.04 | bwd_inner_microstep: 3400.08 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.03 [2025-06-20 01:24:53,851] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.22 | bwd: 3401.05 | bwd_inner: 3400.08 | bwd_allreduce: 0.93 | step: 7.03 76%|███████▌ | 7562/10000 [11:55:14<3:44:09, 5.52s/it] {'loss': 0.0013, 'grad_norm': 0.9047313332557678, 'learning_rate': 5.917533343165618e-06, 'epoch': 7.56} 76%|███████▌ | 7562/10000 [11:55:14<3:44:09, 5.52s/it][2025-06-20 01:24:59,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.78 [2025-06-20 01:24:59,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.93 | bwd_microstep: 3360.41 | bwd_inner_microstep: 3359.54 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.58 [2025-06-20 01:24:59,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.93 | bwd: 3360.43 | bwd_inner: 3359.54 | bwd_allreduce: 0.83 | step: 7.59 76%|███████▌ | 7563/10000 [11:55:20<3:44:14, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.007125830743461847, 'learning_rate': 5.9129345460277085e-06, 'epoch': 7.56} 76%|███████▌ | 7563/10000 [11:55:20<3:44:14, 5.52s/it][2025-06-20 01:25:04,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:25:04,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.56 | bwd_microstep: 3315.88 | 
bwd_inner_microstep: 3315.02 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.86 [2025-06-20 01:25:04,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.56 | bwd: 3315.89 | bwd_inner: 3315.02 | bwd_allreduce: 0.83 | step: 6.86 76%|███████▌ | 7564/10000 [11:55:25<3:43:21, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.21833766996860504, 'learning_rate': 5.9083372265575814e-06, 'epoch': 7.56} 76%|███████▌ | 7564/10000 [11:55:25<3:43:21, 5.50s/it][2025-06-20 01:25:10,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:25:10,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.72 | bwd_microstep: 3316.96 | bwd_inner_microstep: 3316.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-20 01:25:10,297] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.72 | bwd: 3316.98 | bwd_inner: 3316.16 | bwd_allreduce: 0.78 | step: 6.97 76%|███████▌ | 7565/10000 [11:55:31<3:42:45, 5.49s/it] {'loss': 0.0007, 'grad_norm': 0.21968519687652588, 'learning_rate': 5.903741385237478e-06, 'epoch': 7.56} 76%|███████▌ | 7565/10000 [11:55:31<3:42:45, 5.49s/it][2025-06-20 01:25:15,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:25:15,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.27 | bwd_microstep: 3365.50 | bwd_inner_microstep: 3364.66 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.26 [2025-06-20 01:25:15,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.27 | bwd: 3365.52 | bwd_inner: 3364.66 | bwd_allreduce: 0.81 | step: 7.26 76%|███████▌ | 7566/10000 [11:55:36<3:43:13, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0020305311772972345, 'learning_rate': 5.899147022549478e-06, 'epoch': 7.57} 76%|███████▌ | 7566/10000 [11:55:36<3:43:13, 5.50s/it][2025-06-20 01:25:21,292] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:25:21,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.00 | bwd_microstep: 3313.53 | bwd_inner_microstep: 3312.52 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.24 [2025-06-20 01:25:21,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.00 | bwd: 3313.55 | bwd_inner: 3312.52 | bwd_allreduce: 0.98 | step: 7.24 76%|███████▌ | 7567/10000 [11:55:42<3:42:36, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.009836683981120586, 'learning_rate': 5.894554138975515e-06, 'epoch': 7.57} 76%|███████▌ | 7567/10000 [11:55:42<3:42:36, 5.49s/it][2025-06-20 01:25:26,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:25:26,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.89 | bwd_microstep: 3311.63 | bwd_inner_microstep: 3310.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 01:25:26,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.89 | bwd: 3311.65 | bwd_inner: 3310.83 | bwd_allreduce: 0.77 | step: 6.76 76%|███████▌ | 7568/10000 [11:55:47<3:42:07, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.04411713778972626, 'learning_rate': 5.889962734997354e-06, 'epoch': 7.57} 76%|███████▌ | 7568/10000 [11:55:47<3:42:07, 5.48s/it][2025-06-20 01:25:32,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.80 [2025-06-20 01:25:32,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.87 | bwd_microstep: 3363.60 | bwd_inner_microstep: 3362.79 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-20 01:25:32,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.87 | bwd: 3363.62 | bwd_inner: 3362.79 | bwd_allreduce: 0.78 | step: 7.44 
76%|███████▌ | 7569/10000 [11:55:53<3:42:38, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.02185366488993168, 'learning_rate': 5.885372811096617e-06, 'epoch': 7.57} 76%|███████▌ | 7569/10000 [11:55:53<3:42:38, 5.49s/it][2025-06-20 01:25:37,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:25:37,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.49 | bwd_microstep: 3312.79 | bwd_inner_microstep: 3311.99 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-20 01:25:37,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.49 | bwd: 3312.81 | bwd_inner: 3311.99 | bwd_allreduce: 0.78 | step: 6.86 76%|███████▌ | 7570/10000 [11:55:58<3:42:02, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001048706704750657, 'learning_rate': 5.880784367754764e-06, 'epoch': 7.57} 76%|███████▌ | 7570/10000 [11:55:58<3:42:02, 5.48s/it][2025-06-20 01:25:43,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:25:43,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.22 | bwd_microstep: 3309.85 | bwd_inner_microstep: 3309.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-20 01:25:43,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.22 | bwd: 3309.86 | bwd_inner: 3309.04 | bwd_allreduce: 0.78 | step: 7.20 76%|███████▌ | 7571/10000 [11:56:03<3:41:36, 5.47s/it] {'loss': 0.0027, 'grad_norm': 0.5544184446334839, 'learning_rate': 5.876197405453101e-06, 'epoch': 7.57} 76%|███████▌ | 7571/10000 [11:56:03<3:41:36, 5.47s/it][2025-06-20 01:25:48,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:25:48,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.66 | bwd_microstep: 3305.35 | 
bwd_inner_microstep: 3304.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 01:25:48,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.66 | bwd: 3305.36 | bwd_inner: 3304.57 | bwd_allreduce: 0.75 | step: 6.59 76%|███████▌ | 7572/10000 [11:56:09<3:41:08, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0017307575326412916, 'learning_rate': 5.87161192467278e-06, 'epoch': 7.57} 76%|███████▌ | 7572/10000 [11:56:09<3:41:08, 5.46s/it][2025-06-20 01:25:54,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:25:54,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.66 | bwd_microstep: 3315.27 | bwd_inner_microstep: 3314.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 01:25:54,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.66 | bwd: 3315.29 | bwd_inner: 3314.49 | bwd_allreduce: 0.75 | step: 6.61 76%|███████▌ | 7573/10000 [11:56:14<3:40:53, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0021192103158682585, 'learning_rate': 5.867027925894802e-06, 'epoch': 7.57} 76%|███████▌ | 7573/10000 [11:56:14<3:40:53, 5.46s/it][2025-06-20 01:25:59,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:25:59,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.48 | bwd_microstep: 3361.51 | bwd_inner_microstep: 3360.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.98 [2025-06-20 01:25:59,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.48 | bwd: 3361.53 | bwd_inner: 3360.72 | bwd_allreduce: 0.76 | step: 6.99 76%|███████▌ | 7574/10000 [11:56:20<3:41:33, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.06171920895576477, 'learning_rate': 5.862445409599995e-06, 'epoch': 7.57} 76%|███████▌ | 7574/10000 [11:56:20<3:41:33, 5.48s/it][2025-06-20 01:26:05,127] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:26:05,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.53 | bwd_microstep: 3359.03 | bwd_inner_microstep: 3358.15 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.06 [2025-06-20 01:26:05,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.53 | bwd: 3359.04 | bwd_inner: 3358.15 | bwd_allreduce: 0.85 | step: 7.06 76%|███████▌ | 7575/10000 [11:56:25<3:41:59, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0015813646605238318, 'learning_rate': 5.857864376269051e-06, 'epoch': 7.58} 76%|███████▌ | 7575/10000 [11:56:25<3:41:59, 5.49s/it][2025-06-20 01:26:10,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:26:10,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.96 | bwd_microstep: 3308.82 | bwd_inner_microstep: 3308.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 01:26:10,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.96 | bwd: 3308.84 | bwd_inner: 3308.02 | bwd_allreduce: 0.77 | step: 6.86 76%|███████▌ | 7576/10000 [11:56:31<3:41:23, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00015123073535505682, 'learning_rate': 5.853284826382499e-06, 'epoch': 7.58} 76%|███████▌ | 7576/10000 [11:56:31<3:41:23, 5.48s/it][2025-06-20 01:26:16,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:26:16,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.10 | bwd_microstep: 3318.83 | bwd_inner_microstep: 3318.00 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.45 [2025-06-20 01:26:16,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.10 | bwd: 3318.84 | bwd_inner: 3318.00 | bwd_allreduce: 0.79 | step: 
7.45 76%|███████▌ | 7577/10000 [11:56:36<3:41:05, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.016359936445951462, 'learning_rate': 5.848706760420711e-06, 'epoch': 7.58} 76%|███████▌ | 7577/10000 [11:56:36<3:41:05, 5.48s/it][2025-06-20 01:26:21,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:26:21,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.70 | bwd_microstep: 3355.26 | bwd_inner_microstep: 3354.42 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.74 [2025-06-20 01:26:21,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.70 | bwd: 3355.27 | bwd_inner: 3354.42 | bwd_allreduce: 0.80 | step: 6.75 76%|███████▌ | 7578/10000 [11:56:42<3:41:35, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0002690411056391895, 'learning_rate': 5.8441301788639114e-06, 'epoch': 7.58} 76%|███████▌ | 7578/10000 [11:56:42<3:41:35, 5.49s/it][2025-06-20 01:26:27,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.73 [2025-06-20 01:26:27,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.37 | bwd_microstep: 3311.52 | bwd_inner_microstep: 3310.30 | bwd_allreduce_microstep: 1.17 | step_microstep: 7.40 [2025-06-20 01:26:27,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.37 | bwd: 3311.54 | bwd_inner: 3310.30 | bwd_allreduce: 1.19 | step: 7.40 76%|███████▌ | 7579/10000 [11:56:47<3:41:13, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.007849196903407574, 'learning_rate': 5.83955508219215e-06, 'epoch': 7.58} 76%|███████▌ | 7579/10000 [11:56:47<3:41:13, 5.48s/it][2025-06-20 01:26:32,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-20 01:26:32,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.50 | bwd_microstep: 
3309.77 | bwd_inner_microstep: 3308.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 01:26:32,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.50 | bwd: 3309.79 | bwd_inner: 3308.98 | bwd_allreduce: 0.76 | step: 6.73 76%|███████▌ | 7580/10000 [11:56:53<3:40:49, 5.48s/it] {'loss': 0.1017, 'grad_norm': 5.624526023864746, 'learning_rate': 5.834981470885339e-06, 'epoch': 7.58} 76%|███████▌ | 7580/10000 [11:56:53<3:40:49, 5.48s/it][2025-06-20 01:26:37,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:26:37,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.31 | bwd_microstep: 3319.14 | bwd_inner_microstep: 3318.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-20 01:26:37,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.31 | bwd: 3319.16 | bwd_inner: 3318.34 | bwd_allreduce: 0.78 | step: 7.13 76%|███████▌ | 7581/10000 [11:56:58<3:40:46, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0014070197939872742, 'learning_rate': 5.8304093454232315e-06, 'epoch': 7.58} 76%|███████▌ | 7581/10000 [11:56:58<3:40:46, 5.48s/it][2025-06-20 01:26:43,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.91 [2025-06-20 01:26:43,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.23 | bwd_microstep: 3315.14 | bwd_inner_microstep: 3314.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-20 01:26:43,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.23 | bwd: 3315.15 | bwd_inner: 3314.35 | bwd_allreduce: 0.76 | step: 6.79 76%|███████▌ | 7582/10000 [11:57:04<3:40:32, 5.47s/it] {'loss': 0.0037, 'grad_norm': 0.623163104057312, 'learning_rate': 5.82583870628542e-06, 'epoch': 7.58} 76%|███████▌ | 7582/10000 [11:57:04<3:40:32, 5.47s/it][2025-06-20 01:26:48,904] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:26:48,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.58 | bwd_microstep: 3328.23 | bwd_inner_microstep: 3327.18 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.76 [2025-06-20 01:26:48,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.58 | bwd: 3328.24 | bwd_inner: 3327.18 | bwd_allreduce: 1.01 | step: 7.76 76%|███████▌ | 7583/10000 [11:57:09<3:40:28, 5.47s/it] {'loss': 0.0005, 'grad_norm': 0.06271126121282578, 'learning_rate': 5.8212695539513455e-06, 'epoch': 7.58} 76%|███████▌ | 7583/10000 [11:57:09<3:40:28, 5.47s/it][2025-06-20 01:26:54,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:26:54,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.93 | bwd_microstep: 3368.03 | bwd_inner_microstep: 3367.12 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.94 [2025-06-20 01:26:54,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.93 | bwd: 3368.05 | bwd_inner: 3367.12 | bwd_allreduce: 0.88 | step: 6.94 76%|███████▌ | 7584/10000 [11:57:15<3:41:12, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.21689918637275696, 'learning_rate': 5.8167018889002865e-06, 'epoch': 7.58} 76%|███████▌ | 7584/10000 [11:57:15<3:41:12, 5.49s/it][2025-06-20 01:26:59,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 01:26:59,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.31 | bwd_microstep: 3319.76 | bwd_inner_microstep: 3318.65 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.98 [2025-06-20 01:26:59,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.31 | bwd: 3319.78 | bwd_inner: 3318.65 | bwd_allreduce: 1.08 | 
step: 7.99 76%|███████▌ | 7585/10000 [11:57:20<3:40:52, 5.49s/it] {'loss': 0.0009, 'grad_norm': 0.22835752367973328, 'learning_rate': 5.812135711611375e-06, 'epoch': 7.58} 76%|███████▌ | 7585/10000 [11:57:20<3:40:52, 5.49s/it][2025-06-20 01:27:05,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:27:05,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.27 | bwd_microstep: 3314.99 | bwd_inner_microstep: 3314.16 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.19 [2025-06-20 01:27:05,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.27 | bwd: 3315.00 | bwd_inner: 3314.16 | bwd_allreduce: 0.80 | step: 7.20 76%|███████▌ | 7586/10000 [11:57:26<3:40:27, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.007520026061683893, 'learning_rate': 5.807571022563579e-06, 'epoch': 7.59} 76%|███████▌ | 7586/10000 [11:57:26<3:40:27, 5.48s/it][2025-06-20 01:27:10,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:27:10,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.49 | bwd_microstep: 3368.79 | bwd_inner_microstep: 3368.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-20 01:27:10,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.49 | bwd: 3368.81 | bwd_inner: 3368.00 | bwd_allreduce: 0.77 | step: 6.99 76%|███████▌ | 7587/10000 [11:57:31<3:40:58, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00013929182023275644, 'learning_rate': 5.803007822235713e-06, 'epoch': 7.59} 76%|███████▌ | 7587/10000 [11:57:31<3:40:58, 5.49s/it][2025-06-20 01:27:16,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:27:16,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.81 | 
bwd_microstep: 3316.17 | bwd_inner_microstep: 3315.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 01:27:16,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.81 | bwd: 3316.18 | bwd_inner: 3315.39 | bwd_allreduce: 0.76 | step: 6.66 76%|███████▌ | 7588/10000 [11:57:37<3:40:24, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0016769791254773736, 'learning_rate': 5.798446111106442e-06, 'epoch': 7.59} 76%|███████▌ | 7588/10000 [11:57:37<3:40:24, 5.48s/it][2025-06-20 01:27:21,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:27:21,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.45 | bwd_microstep: 3309.42 | bwd_inner_microstep: 3308.56 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.43 [2025-06-20 01:27:21,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.46 | bwd: 3309.43 | bwd_inner: 3308.56 | bwd_allreduce: 0.83 | step: 7.43 76%|███████▌ | 7589/10000 [11:57:42<3:39:56, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.07210613042116165, 'learning_rate': 5.793885889654258e-06, 'epoch': 7.59} 76%|███████▌ | 7589/10000 [11:57:42<3:39:56, 5.47s/it][2025-06-20 01:27:27,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:27:27,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.42 | bwd_microstep: 3315.12 | bwd_inner_microstep: 3314.33 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-20 01:27:27,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.42 | bwd: 3315.14 | bwd_inner: 3314.33 | bwd_allreduce: 0.76 | step: 6.96 76%|███████▌ | 7590/10000 [11:57:48<3:39:37, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.006395366508513689, 'learning_rate': 5.789327158357509e-06, 'epoch': 7.59} 76%|███████▌ | 7590/10000 [11:57:48<3:39:37, 5.47s/it][2025-06-20 
01:27:32,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:27:32,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.22 | bwd_microstep: 3309.25 | bwd_inner_microstep: 3308.45 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-20 01:27:32,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.22 | bwd: 3309.26 | bwd_inner: 3308.45 | bwd_allreduce: 0.78 | step: 7.02 76%|███████▌ | 7591/10000 [11:57:53<3:39:24, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.016698608174920082, 'learning_rate': 5.784769917694388e-06, 'epoch': 7.59} 76%|███████▌ | 7591/10000 [11:57:53<3:39:24, 5.46s/it][2025-06-20 01:27:38,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:27:38,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.27 | bwd_microstep: 3309.27 | bwd_inner_microstep: 3308.49 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 01:27:38,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.27 | bwd: 3309.29 | bwd_inner: 3308.49 | bwd_allreduce: 0.75 | step: 6.60 76%|███████▌ | 7592/10000 [11:57:58<3:39:07, 5.46s/it] {'loss': 0.0012, 'grad_norm': 0.5252009630203247, 'learning_rate': 5.78021416814293e-06, 'epoch': 7.59} 76%|███████▌ | 7592/10000 [11:57:58<3:39:07, 5.46s/it][2025-06-20 01:27:43,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:27:43,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.62 | bwd_microstep: 3372.63 | bwd_inner_microstep: 3371.82 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.07 [2025-06-20 01:27:43,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.62 | bwd: 3372.64 | bwd_inner: 3371.82 | 
bwd_allreduce: 0.78 | step: 7.07 76%|███████▌ | 7593/10000 [11:58:04<3:40:03, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.01011249516159296, 'learning_rate': 5.775659910181013e-06, 'epoch': 7.59} 76%|███████▌ | 7593/10000 [11:58:04<3:40:03, 5.49s/it][2025-06-20 01:27:49,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:27:49,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.47 | bwd_microstep: 3315.93 | bwd_inner_microstep: 3315.06 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.08 [2025-06-20 01:27:49,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.47 | bwd: 3315.95 | bwd_inner: 3315.06 | bwd_allreduce: 0.85 | step: 7.09 76%|███████▌ | 7594/10000 [11:58:09<3:39:41, 5.48s/it] {'loss': 0.0033, 'grad_norm': 0.7224555611610413, 'learning_rate': 5.77110714428635e-06, 'epoch': 7.59} 76%|███████▌ | 7594/10000 [11:58:09<3:39:41, 5.48s/it][2025-06-20 01:27:54,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:27:54,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.28 | bwd_microstep: 3308.85 | bwd_inner_microstep: 3308.05 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.10 [2025-06-20 01:27:54,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.29 | bwd: 3308.86 | bwd_inner: 3308.05 | bwd_allreduce: 0.77 | step: 7.10 76%|███████▌ | 7595/10000 [11:58:15<3:39:14, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0010897985193878412, 'learning_rate': 5.766555870936508e-06, 'epoch': 7.59} 76%|███████▌ | 7595/10000 [11:58:15<3:39:14, 5.47s/it][2025-06-20 01:28:00,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:28:00,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2122.53 | bwd_microstep: 3367.46 | bwd_inner_microstep: 3366.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 01:28:00,163] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.53 | bwd: 3367.47 | bwd_inner: 3366.67 | bwd_allreduce: 0.76 | step: 6.66 76%|███████▌ | 7596/10000 [11:58:20<3:39:49, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.026805797591805458, 'learning_rate': 5.762006090608896e-06, 'epoch': 7.6} 76%|███████▌ | 7596/10000 [11:58:20<3:39:49, 5.49s/it][2025-06-20 01:28:05,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:28:05,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.70 | bwd_microstep: 3319.37 | bwd_inner_microstep: 3318.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-20 01:28:05,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.70 | bwd: 3319.39 | bwd_inner: 3318.57 | bwd_allreduce: 0.77 | step: 7.10 76%|███████▌ | 7597/10000 [11:58:26<3:39:24, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0018428605981171131, 'learning_rate': 5.757457803780766e-06, 'epoch': 7.6} 76%|███████▌ | 7597/10000 [11:58:26<3:39:24, 5.48s/it][2025-06-20 01:28:11,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:28:11,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.73 | bwd_microstep: 3316.01 | bwd_inner_microstep: 3315.22 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 01:28:11,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.73 | bwd: 3316.02 | bwd_inner: 3315.22 | bwd_allreduce: 0.76 | step: 6.62 76%|███████▌ | 7598/10000 [11:58:31<3:39:03, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0006702399696223438, 'learning_rate': 5.752911010929217e-06, 'epoch': 7.6} 76%|███████▌ | 7598/10000 [11:58:31<3:39:03, 
5.47s/it][2025-06-20 01:28:16,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:28:16,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.57 | bwd_microstep: 3363.58 | bwd_inner_microstep: 3362.66 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.31 [2025-06-20 01:28:16,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.57 | bwd: 3363.61 | bwd_inner: 3362.66 | bwd_allreduce: 0.88 | step: 7.31 76%|███████▌ | 7599/10000 [11:58:37<3:39:38, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0014700567116960883, 'learning_rate': 5.748365712531172e-06, 'epoch': 7.6} 76%|███████▌ | 7599/10000 [11:58:37<3:39:38, 5.49s/it][2025-06-20 01:28:22,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:28:22,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.73 | bwd_microstep: 3316.63 | bwd_inner_microstep: 3315.66 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.06 [2025-06-20 01:28:22,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.73 | bwd: 3316.65 | bwd_inner: 3315.66 | bwd_allreduce: 0.95 | step: 7.07 76%|███████▌ | 7600/10000 [11:58:42<3:39:11, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004023860674351454, 'learning_rate': 5.7438219090634205e-06, 'epoch': 7.6} 76%|███████▌ | 7600/10000 [11:58:42<3:39:11, 5.48s/it][2025-06-20 01:28:27,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:28:27,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.26 | bwd_microstep: 3369.68 | bwd_inner_microstep: 3368.62 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.33 [2025-06-20 01:28:27,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.26 | bwd: 3369.70 | bwd_inner: 
3368.62 | bwd_allreduce: 1.02 | step: 7.33 76%|███████▌ | 7601/10000 [11:58:48<3:39:49, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0071909306570887566, 'learning_rate': 5.739279601002585e-06, 'epoch': 7.6} 76%|███████▌ | 7601/10000 [11:58:48<3:39:49, 5.50s/it][2025-06-20 01:28:33,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:28:33,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.71 | bwd_microstep: 3312.98 | bwd_inner_microstep: 3312.13 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.15 [2025-06-20 01:28:33,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.71 | bwd: 3313.00 | bwd_inner: 3312.13 | bwd_allreduce: 0.82 | step: 7.15 76%|███████▌ | 7602/10000 [11:58:53<3:39:12, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.013962358236312866, 'learning_rate': 5.734738788825134e-06, 'epoch': 7.6} 76%|███████▌ | 7602/10000 [11:58:53<3:39:12, 5.48s/it][2025-06-20 01:28:38,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:28:38,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.58 | bwd_microstep: 3318.34 | bwd_inner_microstep: 3317.51 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.88 [2025-06-20 01:28:38,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.58 | bwd: 3318.35 | bwd_inner: 3317.51 | bwd_allreduce: 0.80 | step: 6.89 76%|███████▌ | 7603/10000 [11:58:59<3:38:46, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.000586426118388772, 'learning_rate': 5.730199473007376e-06, 'epoch': 7.6} 76%|███████▌ | 7603/10000 [11:58:59<3:38:46, 5.48s/it][2025-06-20 01:28:43,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:28:43,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2104.11 | bwd_microstep: 3318.89 | bwd_inner_microstep: 3318.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 01:28:43,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.11 | bwd: 3318.90 | bwd_inner: 3318.10 | bwd_allreduce: 0.76 | step: 6.68 76%|███████▌ | 7604/10000 [11:59:04<3:38:31, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.022094426676630974, 'learning_rate': 5.725661654025467e-06, 'epoch': 7.6} 76%|███████▌ | 7604/10000 [11:59:04<3:38:31, 5.47s/it][2025-06-20 01:28:49,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:28:49,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.85 | bwd_microstep: 3316.35 | bwd_inner_microstep: 3315.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 01:28:49,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.86 | bwd: 3316.37 | bwd_inner: 3315.55 | bwd_allreduce: 0.77 | step: 6.82 76%|███████▌ | 7605/10000 [11:59:10<3:38:17, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.01626770570874214, 'learning_rate': 5.721125332355399e-06, 'epoch': 7.61} 76%|███████▌ | 7605/10000 [11:59:10<3:38:17, 5.47s/it][2025-06-20 01:28:54,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:28:54,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.40 | bwd_microstep: 3366.50 | bwd_inner_microstep: 3365.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 01:28:54,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.40 | bwd: 3366.51 | bwd_inner: 3365.70 | bwd_allreduce: 0.76 | step: 6.77 76%|███████▌ | 7606/10000 [11:59:15<3:38:59, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.13546344637870789, 'learning_rate': 5.7165905084730166e-06, 'epoch': 7.61} 76%|███████▌ | 7606/10000 
[11:59:15<3:38:59, 5.49s/it][2025-06-20 01:29:00,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:29:00,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.96 | bwd_microstep: 3377.04 | bwd_inner_microstep: 3376.12 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.18 [2025-06-20 01:29:00,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.96 | bwd: 3377.06 | bwd_inner: 3376.12 | bwd_allreduce: 0.89 | step: 7.18 76%|███████▌ | 7607/10000 [11:59:21<3:39:34, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0005976047250442207, 'learning_rate': 5.712057182853996e-06, 'epoch': 7.61} 76%|███████▌ | 7607/10000 [11:59:21<3:39:34, 5.51s/it][2025-06-20 01:29:05,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:29:05,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.44 | bwd_microstep: 3318.04 | bwd_inner_microstep: 3317.18 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.94 [2025-06-20 01:29:05,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.44 | bwd: 3318.05 | bwd_inner: 3317.18 | bwd_allreduce: 0.83 | step: 6.94 76%|███████▌ | 7608/10000 [11:59:26<3:38:59, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0006663858075626194, 'learning_rate': 5.707525355973864e-06, 'epoch': 7.61} 76%|███████▌ | 7608/10000 [11:59:26<3:38:59, 5.49s/it][2025-06-20 01:29:11,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.80 [2025-06-20 01:29:11,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.16 | bwd_microstep: 3370.95 | bwd_inner_microstep: 3369.99 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.17 [2025-06-20 01:29:11,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.16 | bwd: 
3370.97 | bwd_inner: 3369.99 | bwd_allreduce: 0.93 | step: 7.17 76%|███████▌ | 7609/10000 [11:59:32<3:39:30, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.08299289643764496, 'learning_rate': 5.702995028307999e-06, 'epoch': 7.61} 76%|███████▌ | 7609/10000 [11:59:32<3:39:30, 5.51s/it][2025-06-20 01:29:17,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:29:17,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.29 | bwd_microstep: 3366.77 | bwd_inner_microstep: 3365.96 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-20 01:29:17,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.29 | bwd: 3366.78 | bwd_inner: 3365.96 | bwd_allreduce: 0.78 | step: 7.22 76%|███████▌ | 7610/10000 [11:59:37<3:39:43, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0013722239527851343, 'learning_rate': 5.69846620033159e-06, 'epoch': 7.61} 76%|███████▌ | 7610/10000 [11:59:37<3:39:43, 5.52s/it][2025-06-20 01:29:22,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:29:22,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.68 | bwd_microstep: 3317.53 | bwd_inner_microstep: 3316.55 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.44 [2025-06-20 01:29:22,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.68 | bwd: 3317.55 | bwd_inner: 3316.55 | bwd_allreduce: 0.95 | step: 7.44 76%|███████▌ | 7611/10000 [11:59:43<3:38:58, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0020128421019762754, 'learning_rate': 5.693938872519704e-06, 'epoch': 7.61} 76%|███████▌ | 7611/10000 [11:59:43<3:38:58, 5.50s/it][2025-06-20 01:29:27,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:29:27,986] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2107.25 | bwd_microstep: 3318.48 | bwd_inner_microstep: 3317.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 01:29:27,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.25 | bwd: 3318.50 | bwd_inner: 3317.69 | bwd_allreduce: 0.76 | step: 6.85 76%|███████▌ | 7612/10000 [11:59:48<3:38:27, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.027446256950497627, 'learning_rate': 5.689413045347232e-06, 'epoch': 7.61} 76%|███████▌ | 7612/10000 [11:59:48<3:38:27, 5.49s/it][2025-06-20 01:29:33,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:29:33,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.09 | bwd_microstep: 3324.69 | bwd_inner_microstep: 3323.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 01:29:33,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.09 | bwd: 3324.70 | bwd_inner: 3323.90 | bwd_allreduce: 0.76 | step: 6.73 76%|███████▌ | 7613/10000 [11:59:54<3:38:11, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.014635931700468063, 'learning_rate': 5.684888719288914e-06, 'epoch': 7.61} 76%|███████▌ | 7613/10000 [11:59:54<3:38:11, 5.48s/it][2025-06-20 01:29:38,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:29:38,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.30 | bwd_microstep: 3322.04 | bwd_inner_microstep: 3321.13 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.85 [2025-06-20 01:29:38,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.30 | bwd: 3322.06 | bwd_inner: 3321.13 | bwd_allreduce: 0.87 | step: 6.85 76%|███████▌ | 7614/10000 [11:59:59<3:37:52, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.019671428948640823, 'learning_rate': 5.680365894819339e-06, 'epoch': 7.61} 76%|███████▌ | 
7614/10000 [11:59:59<3:37:52, 5.48s/it][2025-06-20 01:29:44,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:29:44,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.51 | bwd_microstep: 3320.12 | bwd_inner_microstep: 3319.15 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.27 [2025-06-20 01:29:44,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.51 | bwd: 3320.14 | bwd_inner: 3319.15 | bwd_allreduce: 0.94 | step: 7.27 76%|███████▌ | 7615/10000 [12:00:05<3:37:40, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.05638767033815384, 'learning_rate': 5.675844572412914e-06, 'epoch': 7.62} 76%|███████▌ | 7615/10000 [12:00:05<3:37:40, 5.48s/it][2025-06-20 01:29:49,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:29:49,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.75 | bwd_microstep: 3374.74 | bwd_inner_microstep: 3373.88 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.12 [2025-06-20 01:29:49,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.75 | bwd: 3374.75 | bwd_inner: 3373.88 | bwd_allreduce: 0.82 | step: 7.12 76%|███████▌ | 7616/10000 [12:00:10<3:38:30, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005596526432782412, 'learning_rate': 5.671324752543914e-06, 'epoch': 7.62} 76%|███████▌ | 7616/10000 [12:00:10<3:38:30, 5.50s/it][2025-06-20 01:29:55,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:29:55,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.77 | bwd_microstep: 3324.98 | bwd_inner_microstep: 3324.02 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.20 [2025-06-20 01:29:55,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2112.78 | bwd: 3325.00 | bwd_inner: 3324.02 | bwd_allreduce: 0.93 | step: 7.20 76%|███████▌ | 7617/10000 [12:00:16<3:38:08, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0003865315520670265, 'learning_rate': 5.666806435686447e-06, 'epoch': 7.62} 76%|███████▌ | 7617/10000 [12:00:16<3:38:08, 5.49s/it][2025-06-20 01:30:00,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:30:00,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.98 | bwd_microstep: 3377.21 | bwd_inner_microstep: 3376.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.09 [2025-06-20 01:30:00,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.98 | bwd: 3377.23 | bwd_inner: 3376.41 | bwd_allreduce: 0.77 | step: 7.10 76%|███████▌ | 7618/10000 [12:00:21<3:38:43, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.002339486498385668, 'learning_rate': 5.662289622314461e-06, 'epoch': 7.62} 76%|███████▌ | 7618/10000 [12:00:21<3:38:43, 5.51s/it][2025-06-20 01:30:06,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:30:06,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.45 | bwd_microstep: 3338.97 | bwd_inner_microstep: 3338.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.22 [2025-06-20 01:30:06,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.45 | bwd: 3338.98 | bwd_inner: 3338.16 | bwd_allreduce: 0.78 | step: 7.22 76%|███████▌ | 7619/10000 [12:00:27<3:38:22, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.036696191877126694, 'learning_rate': 5.657774312901749e-06, 'epoch': 7.62} 76%|███████▌ | 7619/10000 [12:00:27<3:38:22, 5.50s/it][2025-06-20 01:30:11,956] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:30:11,957] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.07 | bwd_microstep: 3341.19 | bwd_inner_microstep: 3340.37 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-20 01:30:11,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.07 | bwd: 3341.20 | bwd_inner: 3340.37 | bwd_allreduce: 0.79 | step: 7.23 76%|███████▌ | 7620/10000 [12:00:32<3:38:11, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0033656505402177572, 'learning_rate': 5.653260507921949e-06, 'epoch': 7.62} 76%|███████▌ | 7620/10000 [12:00:32<3:38:11, 5.50s/it][2025-06-20 01:30:17,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:30:17,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.87 | bwd_microstep: 3386.33 | bwd_inner_microstep: 3385.40 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.48 [2025-06-20 01:30:17,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.87 | bwd: 3386.35 | bwd_inner: 3385.40 | bwd_allreduce: 0.91 | step: 7.29 76%|███████▌ | 7621/10000 [12:00:38<3:38:48, 5.52s/it] {'loss': 0.0376, 'grad_norm': 4.796508312225342, 'learning_rate': 5.648748207848538e-06, 'epoch': 7.62} 76%|███████▌ | 7621/10000 [12:00:38<3:38:48, 5.52s/it][2025-06-20 01:30:22,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:30:22,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.18 | bwd_microstep: 3324.55 | bwd_inner_microstep: 3323.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-20 01:30:22,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.18 | bwd: 3324.57 | bwd_inner: 3323.75 | bwd_allreduce: 0.78 | step: 7.02 76%|███████▌ | 7622/10000 [12:00:43<3:38:10, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.22187989950180054, 'learning_rate': 5.644237413154832e-06, 'epoch': 
7.62} 76%|███████▌ | 7622/10000 [12:00:43<3:38:10, 5.50s/it][2025-06-20 01:30:28,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:30:28,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.96 | bwd_microstep: 3342.35 | bwd_inner_microstep: 3341.41 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.22 [2025-06-20 01:30:28,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.95 | bwd: 3342.36 | bwd_inner: 3341.41 | bwd_allreduce: 0.91 | step: 7.23 76%|███████▌ | 7623/10000 [12:00:49<3:37:52, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.02050667442381382, 'learning_rate': 5.639728124313995e-06, 'epoch': 7.62} 76%|███████▌ | 7623/10000 [12:00:49<3:37:52, 5.50s/it][2025-06-20 01:30:33,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 01:30:33,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.32 | bwd_microstep: 3340.29 | bwd_inner_microstep: 3339.28 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.20 [2025-06-20 01:30:33,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.32 | bwd: 3340.31 | bwd_inner: 3339.28 | bwd_allreduce: 0.97 | step: 7.21 76%|███████▌ | 7624/10000 [12:00:54<3:37:44, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00010560957161942497, 'learning_rate': 5.635220341799035e-06, 'epoch': 7.62} 76%|███████▌ | 7624/10000 [12:00:54<3:37:44, 5.50s/it][2025-06-20 01:30:39,469] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:30:39,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.39 | bwd_microstep: 3335.95 | bwd_inner_microstep: 3335.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.71 [2025-06-20 01:30:39,470] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2118.39 | bwd: 3335.96 | bwd_inner: 3335.14 | bwd_allreduce: 0.77 | step: 6.71 76%|███████▋ | 7625/10000 [12:01:00<3:37:36, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.07861381769180298, 'learning_rate': 5.630714066082785e-06, 'epoch': 7.62} 76%|███████▋ | 7625/10000 [12:01:00<3:37:36, 5.50s/it][2025-06-20 01:30:44,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:30:44,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.24 | bwd_microstep: 3319.37 | bwd_inner_microstep: 3318.47 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.03 [2025-06-20 01:30:44,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.24 | bwd: 3319.39 | bwd_inner: 3318.47 | bwd_allreduce: 0.88 | step: 7.03 76%|███████▋ | 7626/10000 [12:01:05<3:37:09, 5.49s/it] {'loss': 0.005, 'grad_norm': 1.5023741722106934, 'learning_rate': 5.626209297637941e-06, 'epoch': 7.63} 76%|███████▋ | 7626/10000 [12:01:05<3:37:09, 5.49s/it][2025-06-20 01:30:50,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 01:30:50,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.19 | bwd_microstep: 3327.02 | bwd_inner_microstep: 3326.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 01:30:50,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.19 | bwd: 3327.03 | bwd_inner: 3326.24 | bwd_allreduce: 0.75 | step: 6.60 76%|███████▋ | 7627/10000 [12:01:11<3:36:55, 5.48s/it] {'loss': 0.0, 'grad_norm': 2.2083693693275563e-05, 'learning_rate': 5.621706036937029e-06, 'epoch': 7.63} 76%|███████▋ | 7627/10000 [12:01:11<3:36:55, 5.48s/it][2025-06-20 01:30:55,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:30:55,892] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.81 | bwd_microstep: 3324.83 | bwd_inner_microstep: 3324.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 01:30:55,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.81 | bwd: 3324.84 | bwd_inner: 3324.05 | bwd_allreduce: 0.75 | step: 6.59 76%|███████▋ | 7628/10000 [12:01:16<3:36:45, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.016998223960399628, 'learning_rate': 5.617204284452424e-06, 'epoch': 7.63} 76%|███████▋ | 7628/10000 [12:01:16<3:36:45, 5.48s/it][2025-06-20 01:31:01,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:31:01,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.90 | bwd_microstep: 3325.82 | bwd_inner_microstep: 3325.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 01:31:01,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.90 | bwd: 3325.83 | bwd_inner: 3325.03 | bwd_allreduce: 0.76 | step: 6.65 76%|███████▋ | 7629/10000 [12:01:22<3:36:34, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00019487352983560413, 'learning_rate': 5.612704040656343e-06, 'epoch': 7.63} 76%|███████▋ | 7629/10000 [12:01:22<3:36:34, 5.48s/it][2025-06-20 01:31:06,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:31:06,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.49 | bwd_microstep: 3377.88 | bwd_inner_microstep: 3377.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 01:31:06,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.49 | bwd: 3377.89 | bwd_inner: 3377.09 | bwd_allreduce: 0.76 | step: 6.73 76%|███████▋ | 7630/10000 [12:01:27<3:37:17, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0008483895217068493, 'learning_rate': 5.608205306020829e-06, 'epoch': 
7.63} 76%|███████▋ | 7630/10000 [12:01:27<3:37:17, 5.50s/it][2025-06-20 01:31:12,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:31:12,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.19 | bwd_microstep: 3368.59 | bwd_inner_microstep: 3367.58 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.40 [2025-06-20 01:31:12,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.19 | bwd: 3368.60 | bwd_inner: 3367.58 | bwd_allreduce: 0.98 | step: 7.40 76%|███████▋ | 7631/10000 [12:01:33<3:37:40, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.005484062246978283, 'learning_rate': 5.603708081017783e-06, 'epoch': 7.63} 76%|███████▋ | 7631/10000 [12:01:33<3:37:40, 5.51s/it][2025-06-20 01:31:17,932] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:31:17,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.80 | bwd_microstep: 3327.98 | bwd_inner_microstep: 3327.13 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.99 [2025-06-20 01:31:17,933] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.80 | bwd: 3328.00 | bwd_inner: 3327.13 | bwd_allreduce: 0.81 | step: 7.00 76%|███████▋ | 7632/10000 [12:01:38<3:37:09, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0010267802281305194, 'learning_rate': 5.599212366118943e-06, 'epoch': 7.63} 76%|███████▋ | 7632/10000 [12:01:38<3:37:09, 5.50s/it][2025-06-20 01:31:23,407] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:31:23,408] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.54 | bwd_microstep: 3324.68 | bwd_inner_microstep: 3323.70 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.13 [2025-06-20 01:31:23,408] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2111.54 | bwd: 3324.70 | bwd_inner: 3323.70 | bwd_allreduce: 0.95 | step: 7.13 76%|███████▋ | 7633/10000 [12:01:44<3:36:45, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.001314870547503233, 'learning_rate': 5.594718161795891e-06, 'epoch': 7.63} 76%|███████▋ | 7633/10000 [12:01:44<3:36:45, 5.49s/it][2025-06-20 01:31:28,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.89 [2025-06-20 01:31:28,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.92 | bwd_microstep: 3373.97 | bwd_inner_microstep: 3373.11 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.50 [2025-06-20 01:31:28,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.92 | bwd: 3373.99 | bwd_inner: 3373.11 | bwd_allreduce: 0.81 | step: 7.50 76%|███████▋ | 7634/10000 [12:01:49<3:37:15, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.012469163164496422, 'learning_rate': 5.590225468520052e-06, 'epoch': 7.63} 76%|███████▋ | 7634/10000 [12:01:49<3:37:15, 5.51s/it][2025-06-20 01:31:34,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:31:34,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.50 | bwd_microstep: 3315.75 | bwd_inner_microstep: 3314.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-20 01:31:34,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.50 | bwd: 3315.77 | bwd_inner: 3314.95 | bwd_allreduce: 0.77 | step: 7.17 76%|███████▋ | 7635/10000 [12:01:55<3:36:42, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.007353861816227436, 'learning_rate': 5.5857342867626784e-06, 'epoch': 7.63} 76%|███████▋ | 7635/10000 [12:01:55<3:36:42, 5.50s/it][2025-06-20 01:31:39,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:31:39,893] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.80 | bwd_microstep: 3319.99 | bwd_inner_microstep: 3319.21 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.64 [2025-06-20 01:31:39,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.80 | bwd: 3320.00 | bwd_inner: 3319.21 | bwd_allreduce: 0.75 | step: 6.65 76%|███████▋ | 7636/10000 [12:02:00<3:36:14, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0007234755903482437, 'learning_rate': 5.581244616994881e-06, 'epoch': 7.64} 76%|███████▋ | 7636/10000 [12:02:00<3:36:14, 5.49s/it][2025-06-20 01:31:45,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:31:45,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.37 | bwd_microstep: 3378.29 | bwd_inner_microstep: 3377.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-20 01:31:45,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.37 | bwd: 3378.30 | bwd_inner: 3377.51 | bwd_allreduce: 0.75 | step: 6.53 76%|███████▋ | 7637/10000 [12:02:06<3:36:55, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00046388048212975264, 'learning_rate': 5.5767564596876004e-06, 'epoch': 7.64} 76%|███████▋ | 7637/10000 [12:02:06<3:36:55, 5.51s/it][2025-06-20 01:31:50,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:31:50,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.32 | bwd_microstep: 3367.77 | bwd_inner_microstep: 3366.75 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.33 [2025-06-20 01:31:50,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.32 | bwd: 3367.79 | bwd_inner: 3366.75 | bwd_allreduce: 0.98 | step: 7.33 76%|███████▋ | 7638/10000 [12:02:11<3:37:14, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.09573710709810257, 'learning_rate': 5.572269815311628e-06, 'epoch': 
7.64} 76%|███████▋ | 7638/10000 [12:02:11<3:37:14, 5.52s/it][2025-06-20 01:31:56,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:31:56,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.59 | bwd_microstep: 3366.67 | bwd_inner_microstep: 3365.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 01:31:56,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.59 | bwd: 3366.68 | bwd_inner: 3365.88 | bwd_allreduce: 0.76 | step: 6.58 76%|███████▋ | 7639/10000 [12:02:17<3:37:19, 5.52s/it] {'loss': 0.0, 'grad_norm': 2.857886465790216e-05, 'learning_rate': 5.567784684337592e-06, 'epoch': 7.64} 76%|███████▋ | 7639/10000 [12:02:17<3:37:19, 5.52s/it][2025-06-20 01:32:01,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:32:01,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.40 | bwd_microstep: 3320.07 | bwd_inner_microstep: 3319.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-20 01:32:01,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.40 | bwd: 3320.08 | bwd_inner: 3319.28 | bwd_allreduce: 0.76 | step: 6.63 76%|███████▋ | 7640/10000 [12:02:22<3:36:28, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.03011602908372879, 'learning_rate': 5.56330106723596e-06, 'epoch': 7.64} 76%|███████▋ | 7640/10000 [12:02:22<3:36:28, 5.50s/it][2025-06-20 01:32:07,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:32:07,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.05 | bwd_microstep: 3326.89 | bwd_inner_microstep: 3325.87 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.08 [2025-06-20 01:32:07,454] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2109.05 | bwd: 3326.91 | bwd_inner: 3325.87 | bwd_allreduce: 0.98 | step: 7.08 76%|███████▋ | 7641/10000 [12:02:28<3:36:03, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.45452776551246643, 'learning_rate': 5.558818964477044e-06, 'epoch': 7.64} 76%|███████▋ | 7641/10000 [12:02:28<3:36:03, 5.50s/it][2025-06-20 01:32:13,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:32:13,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.10 | bwd_microstep: 3392.87 | bwd_inner_microstep: 3392.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 01:32:13,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.10 | bwd: 3392.88 | bwd_inner: 3392.08 | bwd_allreduce: 0.75 | step: 6.60 76%|███████▋ | 7642/10000 [12:02:33<3:36:53, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.000473821914056316, 'learning_rate': 5.554338376530992e-06, 'epoch': 7.64} 76%|███████▋ | 7642/10000 [12:02:33<3:36:53, 5.52s/it][2025-06-20 01:32:18,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:32:18,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.40 | bwd_microstep: 3325.53 | bwd_inner_microstep: 3324.51 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.04 [2025-06-20 01:32:18,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.40 | bwd: 3325.54 | bwd_inner: 3324.51 | bwd_allreduce: 0.99 | step: 7.04 76%|███████▋ | 7643/10000 [12:02:39<3:36:11, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00014143294538371265, 'learning_rate': 5.549859303867804e-06, 'epoch': 7.64} 76%|███████▋ | 7643/10000 [12:02:39<3:36:11, 5.50s/it][2025-06-20 01:32:23,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:32:23,961] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.95 | bwd_microstep: 3321.66 | bwd_inner_microstep: 3320.78 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.52 [2025-06-20 01:32:23,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.95 | bwd: 3321.68 | bwd_inner: 3320.78 | bwd_allreduce: 0.85 | step: 7.53 76%|███████▋ | 7644/10000 [12:02:44<3:35:39, 5.49s/it] {'loss': 0.0007, 'grad_norm': 0.12722094357013702, 'learning_rate': 5.545381746957312e-06, 'epoch': 7.64} 76%|███████▋ | 7644/10000 [12:02:44<3:35:39, 5.49s/it][2025-06-20 01:32:29,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:32:29,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.52 | bwd_microstep: 3370.79 | bwd_inner_microstep: 3369.93 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.22 [2025-06-20 01:32:29,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.52 | bwd: 3370.81 | bwd_inner: 3369.93 | bwd_allreduce: 0.83 | step: 7.23 76%|███████▋ | 7645/10000 [12:02:50<3:36:10, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.02063322439789772, 'learning_rate': 5.540905706269186e-06, 'epoch': 7.64} 76%|███████▋ | 7645/10000 [12:02:50<3:36:10, 5.51s/it][2025-06-20 01:32:34,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.88 [2025-06-20 01:32:34,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.75 | bwd_microstep: 3320.15 | bwd_inner_microstep: 3319.37 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-20 01:32:34,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.76 | bwd: 3320.16 | bwd_inner: 3319.37 | bwd_allreduce: 0.75 | step: 6.75 76%|███████▋ | 7646/10000 [12:02:55<3:35:37, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00013345137995202094, 'learning_rate': 5.5364311822729435e-06, 
'epoch': 7.65} 76%|███████▋ | 7646/10000 [12:02:55<3:35:37, 5.50s/it][2025-06-20 01:32:40,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:32:40,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.38 | bwd_microstep: 3319.56 | bwd_inner_microstep: 3318.49 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.09 [2025-06-20 01:32:40,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.38 | bwd: 3319.58 | bwd_inner: 3318.49 | bwd_allreduce: 1.03 | step: 7.10 76%|███████▋ | 7647/10000 [12:03:01<3:35:09, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.001718692947179079, 'learning_rate': 5.531958175437942e-06, 'epoch': 7.65} 76%|███████▋ | 7647/10000 [12:03:01<3:35:09, 5.49s/it][2025-06-20 01:32:45,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:32:45,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.94 | bwd_microstep: 3314.20 | bwd_inner_microstep: 3313.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 01:32:45,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.94 | bwd: 3314.22 | bwd_inner: 3313.42 | bwd_allreduce: 0.76 | step: 6.68 76%|███████▋ | 7648/10000 [12:03:06<3:34:43, 5.48s/it] {'loss': 0.0, 'grad_norm': 7.90617850725539e-05, 'learning_rate': 5.527486686233381e-06, 'epoch': 7.65} 76%|███████▋ | 7648/10000 [12:03:06<3:34:43, 5.48s/it][2025-06-20 01:32:51,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:32:51,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.41 | bwd_microstep: 3313.46 | bwd_inner_microstep: 3312.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 01:32:51,370] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2119.41 | bwd: 3313.47 | bwd_inner: 3312.67 | bwd_allreduce: 0.75 | step: 6.62 76%|███████▋ | 7649/10000 [12:03:12<3:34:33, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0017789429984986782, 'learning_rate': 5.523016715128302e-06, 'epoch': 7.65} 76%|███████▋ | 7649/10000 [12:03:12<3:34:33, 5.48s/it][2025-06-20 01:32:56,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 01:32:56,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.54 | bwd_microstep: 3320.55 | bwd_inner_microstep: 3319.77 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-20 01:32:56,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.55 | bwd: 3320.56 | bwd_inner: 3319.77 | bwd_allreduce: 0.75 | step: 6.54 76%|███████▋ | 7650/10000 [12:03:17<3:34:17, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.01037337351590395, 'learning_rate': 5.518548262591574e-06, 'epoch': 7.65} 76%|███████▋ | 7650/10000 [12:03:17<3:34:17, 5.47s/it][2025-06-20 01:33:02,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:33:02,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.06 | bwd_microstep: 3312.92 | bwd_inner_microstep: 3312.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 01:33:02,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.06 | bwd: 3312.94 | bwd_inner: 3312.14 | bwd_allreduce: 0.75 | step: 6.63 77%|███████▋ | 7651/10000 [12:03:23<3:33:57, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.04611982777714729, 'learning_rate': 5.514081329091922e-06, 'epoch': 7.65} 77%|███████▋ | 7651/10000 [12:03:23<3:33:57, 5.47s/it][2025-06-20 01:33:07,801] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:33:07,802] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.75 | bwd_microstep: 3359.93 | bwd_inner_microstep: 3358.99 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.34 [2025-06-20 01:33:07,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.75 | bwd: 3359.95 | bwd_inner: 3358.99 | bwd_allreduce: 0.92 | step: 7.34 77%|███████▋ | 7652/10000 [12:03:28<3:34:32, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.007062623742967844, 'learning_rate': 5.50961591509791e-06, 'epoch': 7.65} 77%|███████▋ | 7652/10000 [12:03:28<3:34:32, 5.48s/it][2025-06-20 01:33:13,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:33:13,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.94 | bwd_microstep: 3310.23 | bwd_inner_microstep: 3309.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-20 01:33:13,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.94 | bwd: 3310.24 | bwd_inner: 3309.42 | bwd_allreduce: 0.78 | step: 7.01 77%|███████▋ | 7653/10000 [12:03:34<3:34:15, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00023885142582003027, 'learning_rate': 5.505152021077935e-06, 'epoch': 7.65} 77%|███████▋ | 7653/10000 [12:03:34<3:34:15, 5.48s/it][2025-06-20 01:33:18,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:33:18,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.66 | bwd_microstep: 3316.28 | bwd_inner_microstep: 3315.37 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.96 [2025-06-20 01:33:18,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.66 | bwd: 3316.29 | bwd_inner: 3315.37 | bwd_allreduce: 0.88 | step: 6.96 77%|███████▋ | 7654/10000 [12:03:39<3:33:56, 5.47s/it] {'loss': 0.0078, 'grad_norm': 1.9380961656570435, 'learning_rate': 5.5006896475002415e-06, 
'epoch': 7.65} 77%|███████▋ | 7654/10000 [12:03:39<3:33:56, 5.47s/it][2025-06-20 01:33:24,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:33:24,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.93 | bwd_microstep: 3312.23 | bwd_inner_microstep: 3311.44 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 01:33:24,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.93 | bwd: 3312.24 | bwd_inner: 3311.44 | bwd_allreduce: 0.76 | step: 6.65 77%|███████▋ | 7655/10000 [12:03:44<3:33:37, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.010586542077362537, 'learning_rate': 5.4962287948329096e-06, 'epoch': 7.66} 77%|███████▋ | 7655/10000 [12:03:44<3:33:37, 5.47s/it][2025-06-20 01:33:29,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:33:29,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.91 | bwd_microstep: 3315.07 | bwd_inner_microstep: 3314.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-20 01:33:29,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.91 | bwd: 3315.09 | bwd_inner: 3314.28 | bwd_allreduce: 0.76 | step: 6.74 77%|███████▋ | 7656/10000 [12:03:50<3:33:30, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.006872973870486021, 'learning_rate': 5.491769463543862e-06, 'epoch': 7.66} 77%|███████▋ | 7656/10000 [12:03:50<3:33:30, 5.47s/it][2025-06-20 01:33:35,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:33:35,097] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.88 | bwd_microstep: 3317.25 | bwd_inner_microstep: 3316.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-20 01:33:35,098] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2100.88 | bwd: 3317.26 | bwd_inner: 3316.46 | bwd_allreduce: 0.76 | step: 6.86 77%|███████▋ | 7657/10000 [12:03:55<3:33:17, 5.46s/it] {'loss': 0.0, 'grad_norm': 8.627872011857107e-05, 'learning_rate': 5.487311654100864e-06, 'epoch': 7.66} 77%|███████▋ | 7657/10000 [12:03:55<3:33:17, 5.46s/it][2025-06-20 01:33:40,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:33:40,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.39 | bwd_microstep: 3393.39 | bwd_inner_microstep: 3392.44 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.64 [2025-06-20 01:33:40,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.39 | bwd: 3393.40 | bwd_inner: 3392.44 | bwd_allreduce: 0.92 | step: 7.65 77%|███████▋ | 7658/10000 [12:04:01<3:34:22, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0025687115266919136, 'learning_rate': 5.482855366971518e-06, 'epoch': 7.66} 77%|███████▋ | 7658/10000 [12:04:01<3:34:22, 5.49s/it][2025-06-20 01:33:46,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:33:46,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.70 | bwd_microstep: 3313.93 | bwd_inner_microstep: 3313.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 01:33:46,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.70 | bwd: 3313.94 | bwd_inner: 3313.13 | bwd_allreduce: 0.76 | step: 6.68 77%|███████▋ | 7659/10000 [12:04:06<3:33:47, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0016890746774151921, 'learning_rate': 5.4784006026232725e-06, 'epoch': 7.66} 77%|███████▋ | 7659/10000 [12:04:06<3:33:47, 5.48s/it][2025-06-20 01:33:51,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 
01:33:51,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.49 | bwd_microstep: 3314.94 | bwd_inner_microstep: 3314.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 01:33:51,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.49 | bwd: 3314.96 | bwd_inner: 3314.14 | bwd_allreduce: 0.77 | step: 6.79 77%|███████▋ | 7660/10000 [12:04:12<3:33:26, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0002501568815205246, 'learning_rate': 5.473947361523404e-06, 'epoch': 7.66} 77%|███████▋ | 7660/10000 [12:04:12<3:33:26, 5.47s/it][2025-06-20 01:33:57,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:33:57,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.36 | bwd_microstep: 3395.99 | bwd_inner_microstep: 3395.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 01:33:57,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.36 | bwd: 3396.01 | bwd_inner: 3395.20 | bwd_allreduce: 0.77 | step: 6.74 77%|███████▋ | 7661/10000 [12:04:17<3:34:29, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0016280206618830562, 'learning_rate': 5.469495644139038e-06, 'epoch': 7.66} 77%|███████▋ | 7661/10000 [12:04:17<3:34:29, 5.50s/it][2025-06-20 01:34:02,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:34:02,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.47 | bwd_microstep: 3321.44 | bwd_inner_microstep: 3320.63 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-20 01:34:02,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.47 | bwd: 3321.45 | bwd_inner: 3320.63 | bwd_allreduce: 0.78 | step: 7.09 77%|███████▋ | 7662/10000 [12:04:23<3:34:01, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0002502113638911396, 'learning_rate': 
5.465045450937141e-06, 'epoch': 7.66} 77%|███████▋ | 7662/10000 [12:04:23<3:34:01, 5.49s/it][2025-06-20 01:34:08,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:34:08,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.81 | bwd_microstep: 3360.73 | bwd_inner_microstep: 3359.82 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.12 [2025-06-20 01:34:08,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.81 | bwd: 3360.75 | bwd_inner: 3359.82 | bwd_allreduce: 0.88 | step: 7.13 77%|███████▋ | 7663/10000 [12:04:28<3:34:16, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.39671197533607483, 'learning_rate': 5.460596782384515e-06, 'epoch': 7.66} 77%|███████▋ | 7663/10000 [12:04:28<3:34:16, 5.50s/it][2025-06-20 01:34:13,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:34:13,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.49 | bwd_microstep: 3312.95 | bwd_inner_microstep: 3312.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 01:34:13,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.49 | bwd: 3312.96 | bwd_inner: 3312.15 | bwd_allreduce: 0.76 | step: 6.69 77%|███████▋ | 7664/10000 [12:04:34<3:33:37, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00031803594902157784, 'learning_rate': 5.456149638947816e-06, 'epoch': 7.66} 77%|███████▋ | 7664/10000 [12:04:34<3:33:37, 5.49s/it][2025-06-20 01:34:19,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:34:19,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.59 | bwd_microstep: 3320.00 | bwd_inner_microstep: 3319.15 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.25 [2025-06-20 01:34:19,039] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.59 | bwd: 3320.01 | bwd_inner: 3319.15 | bwd_allreduce: 0.82 | step: 7.26 77%|███████▋ | 7665/10000 [12:04:39<3:33:10, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.034508321434259415, 'learning_rate': 5.451704021093513e-06, 'epoch': 7.67} 77%|███████▋ | 7665/10000 [12:04:39<3:33:10, 5.48s/it][2025-06-20 01:34:24,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:34:24,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.19 | bwd_microstep: 3367.25 | bwd_inner_microstep: 3366.43 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.21 [2025-06-20 01:34:24,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.19 | bwd: 3367.26 | bwd_inner: 3366.43 | bwd_allreduce: 0.78 | step: 7.21 77%|███████▋ | 7666/10000 [12:04:45<3:33:48, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0050776246935129166, 'learning_rate': 5.447259929287938e-06, 'epoch': 7.67} 77%|███████▋ | 7666/10000 [12:04:45<3:33:48, 5.50s/it][2025-06-20 01:34:30,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:34:30,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.79 | bwd_microstep: 3314.46 | bwd_inner_microstep: 3313.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 01:34:30,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.79 | bwd: 3314.48 | bwd_inner: 3313.67 | bwd_allreduce: 0.77 | step: 6.81 77%|███████▋ | 7667/10000 [12:04:50<3:33:13, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001365477917715907, 'learning_rate': 5.4428173639972544e-06, 'epoch': 7.67} 77%|███████▋ | 7667/10000 [12:04:50<3:33:13, 5.48s/it][2025-06-20 01:34:35,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 
2.73 [2025-06-20 01:34:35,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.86 | bwd_microstep: 3314.90 | bwd_inner_microstep: 3313.90 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.08 [2025-06-20 01:34:35,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.86 | bwd: 3314.91 | bwd_inner: 3313.90 | bwd_allreduce: 0.97 | step: 7.09 77%|███████▋ | 7668/10000 [12:04:56<3:32:51, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.03459477424621582, 'learning_rate': 5.438376325687467e-06, 'epoch': 7.67} 77%|███████▋ | 7668/10000 [12:04:56<3:32:51, 5.48s/it][2025-06-20 01:34:40,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:34:40,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.53 | bwd_microstep: 3315.02 | bwd_inner_microstep: 3314.20 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.32 [2025-06-20 01:34:40,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.53 | bwd: 3315.03 | bwd_inner: 3314.20 | bwd_allreduce: 0.79 | step: 7.33 77%|███████▋ | 7669/10000 [12:05:01<3:32:32, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.1575075089931488, 'learning_rate': 5.433936814824421e-06, 'epoch': 7.67} 77%|███████▋ | 7669/10000 [12:05:01<3:32:32, 5.47s/it][2025-06-20 01:34:46,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:34:46,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.58 | bwd_microstep: 3320.75 | bwd_inner_microstep: 3319.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-20 01:34:46,420] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.58 | bwd: 3320.77 | bwd_inner: 3319.95 | bwd_allreduce: 0.78 | step: 6.85 77%|███████▋ | 7670/10000 [12:05:07<3:32:25, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.016729163005948067, 
'learning_rate': 5.429498831873808e-06, 'epoch': 7.67} 77%|███████▋ | 7670/10000 [12:05:07<3:32:25, 5.47s/it][2025-06-20 01:34:51,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.89 [2025-06-20 01:34:51,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.71 | bwd_microstep: 3360.22 | bwd_inner_microstep: 3359.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.33 [2025-06-20 01:34:51,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.71 | bwd: 3360.23 | bwd_inner: 3359.41 | bwd_allreduce: 0.78 | step: 7.34 77%|███████▋ | 7671/10000 [12:05:12<3:33:05, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004439112730324268, 'learning_rate': 5.425062377301133e-06, 'epoch': 7.67} 77%|███████▋ | 7671/10000 [12:05:12<3:33:05, 5.49s/it][2025-06-20 01:34:57,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:34:57,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.22 | bwd_microstep: 3362.02 | bwd_inner_microstep: 3361.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-20 01:34:57,477] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.22 | bwd: 3362.04 | bwd_inner: 3361.22 | bwd_allreduce: 0.77 | step: 7.13 77%|███████▋ | 7672/10000 [12:05:18<3:33:22, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005069078877568245, 'learning_rate': 5.4206274515717735e-06, 'epoch': 7.67} 77%|███████▋ | 7672/10000 [12:05:18<3:33:22, 5.50s/it][2025-06-20 01:35:02,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:35:02,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.61 | bwd_microstep: 3315.82 | bwd_inner_microstep: 3314.84 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.08 [2025-06-20 
01:35:02,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.61 | bwd: 3315.83 | bwd_inner: 3314.84 | bwd_allreduce: 0.95 | step: 7.09 77%|███████▋ | 7673/10000 [12:05:23<3:32:45, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0003924126795027405, 'learning_rate': 5.4161940551509295e-06, 'epoch': 7.67} 77%|███████▋ | 7673/10000 [12:05:23<3:32:45, 5.49s/it][2025-06-20 01:35:08,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:35:08,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.00 | bwd_microstep: 3314.53 | bwd_inner_microstep: 3313.59 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.22 [2025-06-20 01:35:08,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.00 | bwd: 3314.55 | bwd_inner: 3313.59 | bwd_allreduce: 0.92 | step: 7.22 77%|███████▋ | 7674/10000 [12:05:29<3:32:20, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.018541986122727394, 'learning_rate': 5.411762188503642e-06, 'epoch': 7.67} 77%|███████▋ | 7674/10000 [12:05:29<3:32:20, 5.48s/it][2025-06-20 01:35:13,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:35:13,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.47 | bwd_microstep: 3315.62 | bwd_inner_microstep: 3314.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-20 01:35:13,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.47 | bwd: 3315.63 | bwd_inner: 3314.81 | bwd_allreduce: 0.77 | step: 6.90 77%|███████▋ | 7675/10000 [12:05:34<3:32:08, 5.47s/it] {'loss': 0.0, 'grad_norm': 9.822563151828945e-05, 'learning_rate': 5.407331852094795e-06, 'epoch': 7.67} 77%|███████▋ | 7675/10000 [12:05:34<3:32:08, 5.47s/it][2025-06-20 01:35:19,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.70 | optimizer_step: 2.73 [2025-06-20 01:35:19,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.76 | bwd_microstep: 3309.04 | bwd_inner_microstep: 3307.87 | bwd_allreduce_microstep: 1.12 | step_microstep: 7.46 [2025-06-20 01:35:19,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.76 | bwd: 3309.06 | bwd_inner: 3307.87 | bwd_allreduce: 1.14 | step: 7.46 77%|███████▋ | 7676/10000 [12:05:40<3:31:44, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00010677087993826717, 'learning_rate': 5.4029030463891115e-06, 'epoch': 7.68} 77%|███████▋ | 7676/10000 [12:05:40<3:31:44, 5.47s/it][2025-06-20 01:35:24,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:35:24,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.16 | bwd_microstep: 3367.01 | bwd_inner_microstep: 3366.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 01:35:24,838] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.16 | bwd: 3367.02 | bwd_inner: 3366.22 | bwd_allreduce: 0.77 | step: 6.76 77%|███████▋ | 7677/10000 [12:05:45<3:32:24, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.0229757372289896, 'learning_rate': 5.398475771851151e-06, 'epoch': 7.68} 77%|███████▋ | 7677/10000 [12:05:45<3:32:24, 5.49s/it][2025-06-20 01:35:30,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:35:30,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.27 | bwd_microstep: 3323.72 | bwd_inner_microstep: 3322.74 | bwd_allreduce_microstep: 0.93 | step_microstep: 8.02 [2025-06-20 01:35:30,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.27 | bwd: 3323.74 | bwd_inner: 3322.74 | bwd_allreduce: 0.95 | step: 8.03 77%|███████▋ | 7678/10000 [12:05:51<3:32:07, 5.48s/it] {'loss': 0.0002, 'grad_norm': 
0.08265678584575653, 'learning_rate': 5.394050028945313e-06, 'epoch': 7.68} 77%|███████▋ | 7678/10000 [12:05:51<3:32:07, 5.48s/it][2025-06-20 01:35:35,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:35:35,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.66 | bwd_microstep: 3310.50 | bwd_inner_microstep: 3309.67 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.79 [2025-06-20 01:35:35,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.66 | bwd: 3310.51 | bwd_inner: 3309.67 | bwd_allreduce: 0.80 | step: 6.80 77%|███████▋ | 7679/10000 [12:05:56<3:31:44, 5.47s/it] {'loss': 0.0012, 'grad_norm': 0.23856058716773987, 'learning_rate': 5.38962581813584e-06, 'epoch': 7.68} 77%|███████▋ | 7679/10000 [12:05:56<3:31:44, 5.47s/it][2025-06-20 01:35:41,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:35:41,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.02 | bwd_microstep: 3313.11 | bwd_inner_microstep: 3312.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 01:35:41,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.02 | bwd: 3313.13 | bwd_inner: 3312.31 | bwd_allreduce: 0.77 | step: 6.69 77%|███████▋ | 7680/10000 [12:06:02<3:31:28, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0005951550556346774, 'learning_rate': 5.385203139886814e-06, 'epoch': 7.68} 77%|███████▋ | 7680/10000 [12:06:02<3:31:28, 5.47s/it][2025-06-20 01:35:46,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:35:46,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.49 | bwd_microstep: 3313.10 | bwd_inner_microstep: 3312.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 
[2025-06-20 01:35:46,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.49 | bwd: 3313.11 | bwd_inner: 3312.30 | bwd_allreduce: 0.77 | step: 6.91 77%|███████▋ | 7681/10000 [12:06:07<3:31:14, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.011706185527145863, 'learning_rate': 5.380781994662145e-06, 'epoch': 7.68} 77%|███████▋ | 7681/10000 [12:06:07<3:31:14, 5.47s/it][2025-06-20 01:35:52,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:35:52,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.43 | bwd_microstep: 3315.56 | bwd_inner_microstep: 3314.77 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 01:35:52,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.43 | bwd: 3315.57 | bwd_inner: 3314.77 | bwd_allreduce: 0.76 | step: 6.70 77%|███████▋ | 7682/10000 [12:06:12<3:31:00, 5.46s/it] {'loss': 0.0002, 'grad_norm': 0.03153666853904724, 'learning_rate': 5.376362382925595e-06, 'epoch': 7.68} 77%|███████▋ | 7682/10000 [12:06:12<3:31:00, 5.46s/it][2025-06-20 01:35:57,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:35:57,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.17 | bwd_microstep: 3315.54 | bwd_inner_microstep: 3314.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 01:35:57,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.17 | bwd: 3315.55 | bwd_inner: 3314.75 | bwd_allreduce: 0.76 | step: 6.64 77%|███████▋ | 7683/10000 [12:06:18<3:30:50, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.010517291724681854, 'learning_rate': 5.371944305140759e-06, 'epoch': 7.68} 77%|███████▋ | 7683/10000 [12:06:18<3:30:50, 5.46s/it][2025-06-20 01:36:03,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:36:03,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.82 | bwd_microstep: 3362.43 | bwd_inner_microstep: 3361.60 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.94 [2025-06-20 01:36:03,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.82 | bwd: 3362.45 | bwd_inner: 3361.60 | bwd_allreduce: 0.80 | step: 6.94 77%|███████▋ | 7684/10000 [12:06:23<3:31:28, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0011338910553604364, 'learning_rate': 5.367527761771076e-06, 'epoch': 7.68} 77%|███████▋ | 7684/10000 [12:06:23<3:31:28, 5.48s/it][2025-06-20 01:36:08,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:36:08,641] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.32 | bwd_microstep: 3363.85 | bwd_inner_microstep: 3362.99 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.43 [2025-06-20 01:36:08,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.32 | bwd: 3363.88 | bwd_inner: 3362.99 | bwd_allreduce: 0.82 | step: 7.43 77%|███████▋ | 7685/10000 [12:06:29<3:32:02, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.04909352585673332, 'learning_rate': 5.3631127532798245e-06, 'epoch': 7.69} 77%|███████▋ | 7685/10000 [12:06:29<3:32:02, 5.50s/it][2025-06-20 01:36:14,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:36:14,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.19 | bwd_microstep: 3365.48 | bwd_inner_microstep: 3364.55 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.03 [2025-06-20 01:36:14,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.19 | bwd: 3365.50 | bwd_inner: 3364.55 | bwd_allreduce: 0.90 | step: 7.04 77%|███████▋ | 7686/10000 [12:06:34<3:32:22, 5.51s/it] {'loss': 0.0, 
'grad_norm': 0.0011590952053666115, 'learning_rate': 5.358699280130106e-06, 'epoch': 7.69} 77%|███████▋ | 7686/10000 [12:06:34<3:32:22, 5.51s/it][2025-06-20 01:36:19,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:36:19,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.60 | bwd_microstep: 3315.98 | bwd_inner_microstep: 3315.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.51 [2025-06-20 01:36:19,640] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.60 | bwd: 3315.99 | bwd_inner: 3315.20 | bwd_allreduce: 0.75 | step: 6.52 77%|███████▋ | 7687/10000 [12:06:40<3:31:45, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004371717106550932, 'learning_rate': 5.354287342784883e-06, 'epoch': 7.69} 77%|███████▋ | 7687/10000 [12:06:40<3:31:45, 5.49s/it][2025-06-20 01:36:25,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:36:25,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.26 | bwd_microstep: 3362.58 | bwd_inner_microstep: 3361.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-20 01:36:25,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.26 | bwd: 3362.59 | bwd_inner: 3361.80 | bwd_allreduce: 0.75 | step: 6.77 77%|███████▋ | 7688/10000 [12:06:45<3:32:15, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.05805404484272003, 'learning_rate': 5.349876941706944e-06, 'epoch': 7.69} 77%|███████▋ | 7688/10000 [12:06:45<3:32:15, 5.51s/it][2025-06-20 01:36:30,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:36:30,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.35 | bwd_microstep: 3320.22 | bwd_inner_microstep: 3319.44 | bwd_allreduce_microstep: 0.74 | 
step_microstep: 6.64 [2025-06-20 01:36:30,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.36 | bwd: 3320.23 | bwd_inner: 3319.43 | bwd_allreduce: 0.76 | step: 6.64 77%|███████▋ | 7689/10000 [12:06:51<3:31:37, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0005118914414197206, 'learning_rate': 5.345468077358922e-06, 'epoch': 7.69} 77%|███████▋ | 7689/10000 [12:06:51<3:31:37, 5.49s/it][2025-06-20 01:36:36,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.81 [2025-06-20 01:36:36,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.88 | bwd_microstep: 3361.46 | bwd_inner_microstep: 3360.56 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.15 [2025-06-20 01:36:36,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.89 | bwd: 3361.47 | bwd_inner: 3360.56 | bwd_allreduce: 0.87 | step: 7.15 77%|███████▋ | 7690/10000 [12:06:56<3:31:53, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.13312120735645294, 'learning_rate': 5.341060750203282e-06, 'epoch': 7.69} 77%|███████▋ | 7690/10000 [12:06:56<3:31:53, 5.50s/it][2025-06-20 01:36:41,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:36:41,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.46 | bwd_microstep: 3308.67 | bwd_inner_microstep: 3307.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-20 01:36:41,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.46 | bwd: 3308.68 | bwd_inner: 3307.87 | bwd_allreduce: 0.77 | step: 6.88 77%|███████▋ | 7691/10000 [12:07:02<3:31:09, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006536468863487244, 'learning_rate': 5.336654960702336e-06, 'epoch': 7.69} 77%|███████▋ | 7691/10000 [12:07:02<3:31:09, 5.49s/it][2025-06-20 01:36:47,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-20 01:36:47,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.10 | bwd_microstep: 3324.86 | bwd_inner_microstep: 3324.04 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.86 [2025-06-20 01:36:47,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.10 | bwd: 3324.87 | bwd_inner: 3324.04 | bwd_allreduce: 0.78 | step: 7.88 77%|███████▋ | 7692/10000 [12:07:07<3:30:50, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.059966254979372025, 'learning_rate': 5.3322507093182315e-06, 'epoch': 7.69} 77%|███████▋ | 7692/10000 [12:07:07<3:30:50, 5.48s/it][2025-06-20 01:36:52,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:36:52,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.47 | bwd_microstep: 3324.30 | bwd_inner_microstep: 3323.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-20 01:36:52,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.47 | bwd: 3324.31 | bwd_inner: 3323.51 | bwd_allreduce: 0.76 | step: 6.63 77%|███████▋ | 7693/10000 [12:07:13<3:30:38, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.032493945211172104, 'learning_rate': 5.327847996512951e-06, 'epoch': 7.69} 77%|███████▋ | 7693/10000 [12:07:13<3:30:38, 5.48s/it][2025-06-20 01:36:58,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:36:58,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.86 | bwd_microstep: 3309.22 | bwd_inner_microstep: 3308.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 01:36:58,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.86 | bwd: 3309.23 | bwd_inner: 3308.43 | bwd_allreduce: 0.76 | step: 6.72 77%|███████▋ | 7694/10000 [12:07:18<3:30:10, 5.47s/it] 
{'loss': 0.0001, 'grad_norm': 0.05043921619653702, 'learning_rate': 5.323446822748322e-06, 'epoch': 7.69} 77%|███████▋ | 7694/10000 [12:07:18<3:30:10, 5.47s/it][2025-06-20 01:37:03,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 01:37:03,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.03 | bwd_microstep: 3361.20 | bwd_inner_microstep: 3360.14 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.61 [2025-06-20 01:37:03,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.04 | bwd: 3361.21 | bwd_inner: 3360.14 | bwd_allreduce: 1.03 | step: 7.61 77%|███████▋ | 7695/10000 [12:07:24<3:30:44, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.14702314138412476, 'learning_rate': 5.319047188486009e-06, 'epoch': 7.7} 77%|███████▋ | 7695/10000 [12:07:24<3:30:44, 5.49s/it][2025-06-20 01:37:08,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:37:08,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.21 | bwd_microstep: 3312.89 | bwd_inner_microstep: 3312.09 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-20 01:37:08,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.21 | bwd: 3312.90 | bwd_inner: 3312.09 | bwd_allreduce: 0.77 | step: 6.98 77%|███████▋ | 7696/10000 [12:07:29<3:30:18, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004617284517735243, 'learning_rate': 5.3146490941875055e-06, 'epoch': 7.7} 77%|███████▋ | 7696/10000 [12:07:29<3:30:18, 5.48s/it][2025-06-20 01:37:14,512] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:37:14,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.00 | bwd_microstep: 3363.41 | bwd_inner_microstep: 3362.61 | bwd_allreduce_microstep: 
0.75 | step_microstep: 6.65 [2025-06-20 01:37:14,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.00 | bwd: 3363.42 | bwd_inner: 3362.61 | bwd_allreduce: 0.77 | step: 6.65 77%|███████▋ | 7697/10000 [12:07:35<3:30:48, 5.49s/it] {'loss': 0.0013, 'grad_norm': 0.5525612831115723, 'learning_rate': 5.3102525403141535e-06, 'epoch': 7.7} 77%|███████▋ | 7697/10000 [12:07:35<3:30:48, 5.49s/it][2025-06-20 01:37:19,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:37:19,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.83 | bwd_microstep: 3312.17 | bwd_inner_microstep: 3311.34 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.93 [2025-06-20 01:37:19,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.83 | bwd: 3312.19 | bwd_inner: 3311.34 | bwd_allreduce: 0.80 | step: 6.93 77%|███████▋ | 7698/10000 [12:07:40<3:30:17, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00016320464783348143, 'learning_rate': 5.305857527327134e-06, 'epoch': 7.7} 77%|███████▋ | 7698/10000 [12:07:40<3:30:17, 5.48s/it][2025-06-20 01:37:25,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:37:25,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.69 | bwd_microstep: 3363.94 | bwd_inner_microstep: 3363.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-20 01:37:25,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.69 | bwd: 3363.96 | bwd_inner: 3363.13 | bwd_allreduce: 0.78 | step: 6.78 77%|███████▋ | 7699/10000 [12:07:46<3:30:42, 5.49s/it] {'loss': 0.0, 'grad_norm': 9.065267659025267e-05, 'learning_rate': 5.301464055687459e-06, 'epoch': 7.7} 77%|███████▋ | 7699/10000 [12:07:46<3:30:42, 5.49s/it][2025-06-20 01:37:31,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:37:31,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.64 | bwd_microstep: 3368.16 | bwd_inner_microstep: 3367.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-20 01:37:31,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.64 | bwd: 3368.18 | bwd_inner: 3367.35 | bwd_allreduce: 0.78 | step: 6.91 77%|███████▋ | 7700/10000 [12:07:51<3:31:00, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00020636561384890229, 'learning_rate': 5.297072125855998e-06, 'epoch': 7.7} 77%|███████▋ | 7700/10000 [12:07:51<3:31:00, 5.50s/it][2025-06-20 01:37:36,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:37:36,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.02 | bwd_microstep: 3315.59 | bwd_inner_microstep: 3314.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-20 01:37:36,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.02 | bwd: 3315.61 | bwd_inner: 3314.81 | bwd_allreduce: 0.76 | step: 6.77 77%|███████▋ | 7701/10000 [12:07:57<3:30:25, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00011975195229751989, 'learning_rate': 5.292681738293421e-06, 'epoch': 7.7} 77%|███████▋ | 7701/10000 [12:07:57<3:30:25, 5.49s/it][2025-06-20 01:37:41,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:37:41,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.74 | bwd_microstep: 3324.37 | bwd_inner_microstep: 3323.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 01:37:41,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.74 | bwd: 3324.39 | bwd_inner: 3323.58 | bwd_allreduce: 0.76 | step: 6.76 77%|███████▋ | 7702/10000 [12:08:02<3:30:05, 
5.49s/it] {'loss': 0.0, 'grad_norm': 0.006094863638281822, 'learning_rate': 5.288292893460276e-06, 'epoch': 7.7} 77%|███████▋ | 7702/10000 [12:08:02<3:30:05, 5.49s/it][2025-06-20 01:37:47,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:37:47,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.10 | bwd_microstep: 3366.60 | bwd_inner_microstep: 3365.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-20 01:37:47,487] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.10 | bwd: 3366.61 | bwd_inner: 3365.81 | bwd_allreduce: 0.76 | step: 6.64 77%|███████▋ | 7703/10000 [12:08:08<3:30:31, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0019594430923461914, 'learning_rate': 5.283905591816924e-06, 'epoch': 7.7} 77%|███████▋ | 7703/10000 [12:08:08<3:30:31, 5.50s/it][2025-06-20 01:37:52,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:37:52,957] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.73 | bwd_microstep: 3326.32 | bwd_inner_microstep: 3325.45 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.93 [2025-06-20 01:37:52,958] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.73 | bwd: 3326.34 | bwd_inner: 3325.45 | bwd_allreduce: 0.83 | step: 6.94 77%|███████▋ | 7704/10000 [12:08:13<3:30:06, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0003930058446712792, 'learning_rate': 5.279519833823576e-06, 'epoch': 7.7} 77%|███████▋ | 7704/10000 [12:08:13<3:30:06, 5.49s/it][2025-06-20 01:37:58,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:37:58,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.17 | bwd_microstep: 3314.61 | bwd_inner_microstep: 3313.81 | 
bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 01:37:58,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.17 | bwd: 3314.62 | bwd_inner: 3313.81 | bwd_allreduce: 0.77 | step: 6.71 77%|███████▋ | 7705/10000 [12:08:19<3:29:40, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.005998568143695593, 'learning_rate': 5.275135619940286e-06, 'epoch': 7.71} 77%|███████▋ | 7705/10000 [12:08:19<3:29:40, 5.48s/it][2025-06-20 01:38:03,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:38:03,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.78 | bwd_microstep: 3313.39 | bwd_inner_microstep: 3312.52 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.84 [2025-06-20 01:38:03,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.78 | bwd: 3313.41 | bwd_inner: 3312.52 | bwd_allreduce: 0.84 | step: 6.85 77%|███████▋ | 7706/10000 [12:08:24<3:29:15, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.006701089441776276, 'learning_rate': 5.270752950626921e-06, 'epoch': 7.71} 77%|███████▋ | 7706/10000 [12:08:24<3:29:15, 5.47s/it][2025-06-20 01:38:09,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:38:09,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.98 | bwd_microstep: 3392.20 | bwd_inner_microstep: 3391.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-20 01:38:09,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.98 | bwd: 3392.21 | bwd_inner: 3391.38 | bwd_allreduce: 0.79 | step: 7.23 77%|███████▋ | 7707/10000 [12:08:30<3:30:18, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.02325507067143917, 'learning_rate': 5.266371826343213e-06, 'epoch': 7.71} 77%|███████▋ | 7707/10000 [12:08:30<3:30:18, 5.50s/it][2025-06-20 01:38:14,917] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:38:14,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.49 | bwd_microstep: 3329.69 | bwd_inner_microstep: 3328.71 | bwd_allreduce_microstep: 0.94 | step_microstep: 6.63 [2025-06-20 01:38:14,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.49 | bwd: 3329.70 | bwd_inner: 3328.71 | bwd_allreduce: 0.95 | step: 6.64 77%|███████▋ | 7708/10000 [12:08:35<3:29:53, 5.49s/it] {'loss': 0.0016, 'grad_norm': 0.3268240690231323, 'learning_rate': 5.261992247548717e-06, 'epoch': 7.71} 77%|███████▋ | 7708/10000 [12:08:35<3:29:53, 5.49s/it][2025-06-20 01:38:20,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:38:20,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.37 | bwd_microstep: 3312.03 | bwd_inner_microstep: 3311.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-20 01:38:20,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.37 | bwd: 3312.05 | bwd_inner: 3311.24 | bwd_allreduce: 0.76 | step: 6.77 77%|███████▋ | 7709/10000 [12:08:41<3:29:17, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0003265540872234851, 'learning_rate': 5.257614214702835e-06, 'epoch': 7.71} 77%|███████▋ | 7709/10000 [12:08:41<3:29:17, 5.48s/it][2025-06-20 01:38:25,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:38:25,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.59 | bwd_microstep: 3372.43 | bwd_inner_microstep: 3371.57 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.44 [2025-06-20 01:38:25,909] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.59 | bwd: 3372.44 | bwd_inner: 3371.57 | bwd_allreduce: 0.83 | step: 7.44 77%|███████▋ | 7710/10000 
[12:08:46<3:29:53, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00431586429476738, 'learning_rate': 5.2532377282648e-06, 'epoch': 7.71} 77%|███████▋ | 7710/10000 [12:08:46<3:29:53, 5.50s/it][2025-06-20 01:38:31,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:38:31,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.75 | bwd_microstep: 3322.23 | bwd_inner_microstep: 3321.32 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.04 [2025-06-20 01:38:31,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.75 | bwd: 3322.25 | bwd_inner: 3321.32 | bwd_allreduce: 0.88 | step: 7.05 77%|███████▋ | 7711/10000 [12:08:52<3:29:27, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.008885609917342663, 'learning_rate': 5.248862788693685e-06, 'epoch': 7.71} 77%|███████▋ | 7711/10000 [12:08:52<3:29:27, 5.49s/it][2025-06-20 01:38:36,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:38:36,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.58 | bwd_microstep: 3332.29 | bwd_inner_microstep: 3331.37 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.12 [2025-06-20 01:38:36,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.58 | bwd: 3332.30 | bwd_inner: 3331.37 | bwd_allreduce: 0.89 | step: 7.13 77%|███████▋ | 7712/10000 [12:08:57<3:29:19, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.15307657420635223, 'learning_rate': 5.2444893964484e-06, 'epoch': 7.71} 77%|███████▋ | 7712/10000 [12:08:57<3:29:19, 5.49s/it][2025-06-20 01:38:42,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 01:38:42,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.18 | bwd_microstep: 3322.86 | bwd_inner_microstep: 3321.78 
| bwd_allreduce_microstep: 1.00 | step_microstep: 7.62 [2025-06-20 01:38:42,344] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.18 | bwd: 3322.88 | bwd_inner: 3321.78 | bwd_allreduce: 1.04 | step: 7.63 77%|███████▋ | 7713/10000 [12:09:03<3:29:10, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0012526442296802998, 'learning_rate': 5.240117551987698e-06, 'epoch': 7.71} 77%|███████▋ | 7713/10000 [12:09:03<3:29:10, 5.49s/it][2025-06-20 01:38:47,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:38:47,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.78 | bwd_microstep: 3371.43 | bwd_inner_microstep: 3370.34 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.64 [2025-06-20 01:38:47,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.78 | bwd: 3371.45 | bwd_inner: 3370.34 | bwd_allreduce: 1.05 | step: 7.65 77%|███████▋ | 7714/10000 [12:09:08<3:29:43, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0010188602609559894, 'learning_rate': 5.2357472557701585e-06, 'epoch': 7.71} 77%|███████▋ | 7714/10000 [12:09:08<3:29:43, 5.50s/it][2025-06-20 01:38:53,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:38:53,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.82 | bwd_microstep: 3319.68 | bwd_inner_microstep: 3318.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-20 01:38:53,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.82 | bwd: 3319.69 | bwd_inner: 3318.89 | bwd_allreduce: 0.76 | step: 6.60 77%|███████▋ | 7715/10000 [12:09:14<3:29:13, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0014021226670593023, 'learning_rate': 5.231378508254217e-06, 'epoch': 7.71} 77%|███████▋ | 7715/10000 [12:09:14<3:29:13, 5.49s/it][2025-06-20 01:38:58,832] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:38:58,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.43 | bwd_microstep: 3326.16 | bwd_inner_microstep: 3325.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 01:38:58,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.44 | bwd: 3326.17 | bwd_inner: 3325.35 | bwd_allreduce: 0.78 | step: 6.82 77%|███████▋ | 7716/10000 [12:09:19<3:28:51, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.016792723909020424, 'learning_rate': 5.22701130989812e-06, 'epoch': 7.72} 77%|███████▋ | 7716/10000 [12:09:19<3:28:51, 5.49s/it][2025-06-20 01:39:04,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:39:04,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.60 | bwd_microstep: 3370.17 | bwd_inner_microstep: 3369.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-20 01:39:04,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.60 | bwd: 3370.19 | bwd_inner: 3369.37 | bwd_allreduce: 0.77 | step: 6.94 77%|███████▋ | 7717/10000 [12:09:25<3:29:25, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.004890648648142815, 'learning_rate': 5.222645661159973e-06, 'epoch': 7.72} 77%|███████▋ | 7717/10000 [12:09:25<3:29:25, 5.50s/it][2025-06-20 01:39:09,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:39:09,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.40 | bwd_microstep: 3377.62 | bwd_inner_microstep: 3376.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-20 01:39:09,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.40 | bwd: 3377.64 | bwd_inner: 3376.83 | bwd_allreduce: 0.77 | step: 6.96 77%|███████▋ | 7718/10000 
[12:09:30<3:29:47, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.001584459445439279, 'learning_rate': 5.2182815624977135e-06, 'epoch': 7.72} 77%|███████▋ | 7718/10000 [12:09:30<3:29:47, 5.52s/it][2025-06-20 01:39:15,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:39:15,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.03 | bwd_microstep: 3405.34 | bwd_inner_microstep: 3404.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 01:39:15,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.03 | bwd: 3405.36 | bwd_inner: 3404.56 | bwd_allreduce: 0.76 | step: 6.62 77%|███████▋ | 7719/10000 [12:09:36<3:30:23, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.003134120022878051, 'learning_rate': 5.213919014369113e-06, 'epoch': 7.72} 77%|███████▋ | 7719/10000 [12:09:36<3:30:23, 5.53s/it][2025-06-20 01:39:21,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:39:21,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.61 | bwd_microstep: 3377.49 | bwd_inner_microstep: 3376.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 01:39:21,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.61 | bwd: 3377.51 | bwd_inner: 3376.70 | bwd_allreduce: 0.76 | step: 6.65 77%|███████▋ | 7720/10000 [12:09:41<3:30:25, 5.54s/it] {'loss': 0.0012, 'grad_norm': 0.31047478318214417, 'learning_rate': 5.20955801723179e-06, 'epoch': 7.72} 77%|███████▋ | 7720/10000 [12:09:41<3:30:25, 5.54s/it][2025-06-20 01:39:26,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:39:26,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.54 | bwd_microstep: 3334.81 | bwd_inner_microstep: 
3334.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.33 [2025-06-20 01:39:26,522] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.54 | bwd: 3334.82 | bwd_inner: 3334.00 | bwd_allreduce: 0.78 | step: 7.33 77%|███████▋ | 7721/10000 [12:09:47<3:29:40, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0038341854233294725, 'learning_rate': 5.20519857154318e-06, 'epoch': 7.72} 77%|███████▋ | 7721/10000 [12:09:47<3:29:40, 5.52s/it][2025-06-20 01:39:32,073] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 01:39:32,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.12 | bwd_microstep: 3379.94 | bwd_inner_microstep: 3378.81 | bwd_allreduce_microstep: 1.06 | step_microstep: 7.41 [2025-06-20 01:39:32,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.12 | bwd: 3379.96 | bwd_inner: 3378.81 | bwd_allreduce: 1.09 | step: 7.41 77%|███████▋ | 7722/10000 [12:09:52<3:29:59, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0037247471045702696, 'learning_rate': 5.200840677760577e-06, 'epoch': 7.72} 77%|███████▋ | 7722/10000 [12:09:52<3:29:59, 5.53s/it][2025-06-20 01:39:37,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:39:37,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.02 | bwd_microstep: 3317.31 | bwd_inner_microstep: 3316.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 01:39:37,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.02 | bwd: 3317.32 | bwd_inner: 3316.52 | bwd_allreduce: 0.76 | step: 6.61 77%|███████▋ | 7723/10000 [12:09:58<3:29:09, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.043945927172899246, 'learning_rate': 5.1964843363410985e-06, 'epoch': 7.72} 77%|███████▋ | 7723/10000 [12:09:58<3:29:09, 5.51s/it][2025-06-20 01:39:43,011] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:39:43,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.07 | bwd_microstep: 3323.45 | bwd_inner_microstep: 3322.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 01:39:43,011] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.07 | bwd: 3323.46 | bwd_inner: 3322.66 | bwd_allreduce: 0.76 | step: 6.72 77%|███████▋ | 7724/10000 [12:10:03<3:28:33, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0016072707949206233, 'learning_rate': 5.192129547741711e-06, 'epoch': 7.72} 77%|███████▋ | 7724/10000 [12:10:03<3:28:33, 5.50s/it][2025-06-20 01:39:48,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:39:48,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.76 | bwd_microstep: 3338.69 | bwd_inner_microstep: 3337.61 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.71 [2025-06-20 01:39:48,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.76 | bwd: 3338.72 | bwd_inner: 3337.61 | bwd_allreduce: 1.05 | step: 7.72 77%|███████▋ | 7725/10000 [12:10:09<3:28:22, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0025096654426306486, 'learning_rate': 5.187776312419206e-06, 'epoch': 7.72} 77%|███████▋ | 7725/10000 [12:10:09<3:28:22, 5.50s/it][2025-06-20 01:39:53,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:39:53,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.17 | bwd_microstep: 3331.44 | bwd_inner_microstep: 3330.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-20 01:39:53,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.17 | bwd: 3331.46 | bwd_inner: 3330.66 | bwd_allreduce: 0.75 | step: 6.53 77%|███████▋ | 7726/10000 
[12:10:14<3:28:12, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004345057997852564, 'learning_rate': 5.18342463083022e-06, 'epoch': 7.73} 77%|███████▋ | 7726/10000 [12:10:14<3:28:12, 5.49s/it][2025-06-20 01:39:59,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 01:39:59,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.93 | bwd_microstep: 3328.99 | bwd_inner_microstep: 3328.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-20 01:39:59,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.93 | bwd: 3329.00 | bwd_inner: 3328.21 | bwd_allreduce: 0.75 | step: 6.53 77%|███████▋ | 7727/10000 [12:10:20<3:28:00, 5.49s/it] {'loss': 0.0006, 'grad_norm': 0.11117543280124664, 'learning_rate': 5.1790745034312275e-06, 'epoch': 7.73} 77%|███████▋ | 7727/10000 [12:10:20<3:28:00, 5.49s/it][2025-06-20 01:40:04,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:40:04,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.66 | bwd_microstep: 3327.28 | bwd_inner_microstep: 3326.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 01:40:04,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.66 | bwd: 3327.30 | bwd_inner: 3326.50 | bwd_allreduce: 0.75 | step: 6.58 77%|███████▋ | 7728/10000 [12:10:25<3:27:45, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.003329485422000289, 'learning_rate': 5.174725930678531e-06, 'epoch': 7.73} 77%|███████▋ | 7728/10000 [12:10:25<3:27:45, 5.49s/it][2025-06-20 01:40:10,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:40:10,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.62 | bwd_microstep: 3334.55 | bwd_inner_microstep: 
3333.65 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.99 [2025-06-20 01:40:10,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.62 | bwd: 3334.58 | bwd_inner: 3333.65 | bwd_allreduce: 0.87 | step: 7.00 77%|███████▋ | 7729/10000 [12:10:31<3:27:44, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.33061710000038147, 'learning_rate': 5.170378913028278e-06, 'epoch': 7.73} 77%|███████▋ | 7729/10000 [12:10:31<3:27:44, 5.49s/it][2025-06-20 01:40:15,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:40:15,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.38 | bwd_microstep: 3325.67 | bwd_inner_microstep: 3324.80 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.54 [2025-06-20 01:40:15,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.38 | bwd: 3325.70 | bwd_inner: 3324.80 | bwd_allreduce: 0.84 | step: 7.54 77%|███████▋ | 7730/10000 [12:10:36<3:27:37, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.010061088018119335, 'learning_rate': 5.166033450936452e-06, 'epoch': 7.73} 77%|███████▋ | 7730/10000 [12:10:36<3:27:37, 5.49s/it][2025-06-20 01:40:21,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:40:21,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.56 | bwd_microstep: 3368.96 | bwd_inner_microstep: 3368.16 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-20 01:40:21,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.56 | bwd: 3368.98 | bwd_inner: 3368.16 | bwd_allreduce: 0.77 | step: 6.82 77%|███████▋ | 7731/10000 [12:10:42<3:28:11, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.046057820320129395, 'learning_rate': 5.161689544858876e-06, 'epoch': 7.73} 77%|███████▋ | 7731/10000 [12:10:42<3:28:11, 5.51s/it][2025-06-20 01:40:26,942] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:40:26,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.39 | bwd_microstep: 3323.16 | bwd_inner_microstep: 3322.16 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.19 [2025-06-20 01:40:26,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.39 | bwd: 3323.17 | bwd_inner: 3322.16 | bwd_allreduce: 0.97 | step: 7.19 77%|███████▋ | 7732/10000 [12:10:47<3:27:40, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.05296339839696884, 'learning_rate': 5.157347195251194e-06, 'epoch': 7.73} 77%|███████▋ | 7732/10000 [12:10:47<3:27:40, 5.49s/it][2025-06-20 01:40:32,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:40:32,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.35 | bwd_microstep: 3376.51 | bwd_inner_microstep: 3375.59 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.04 [2025-06-20 01:40:32,491] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.35 | bwd: 3376.52 | bwd_inner: 3375.59 | bwd_allreduce: 0.88 | step: 7.04 77%|███████▋ | 7733/10000 [12:10:53<3:28:11, 5.51s/it] {'loss': 0.0, 'grad_norm': 5.6252869399031624e-05, 'learning_rate': 5.153006402568905e-06, 'epoch': 7.73} 77%|███████▋ | 7733/10000 [12:10:53<3:28:11, 5.51s/it][2025-06-20 01:40:37,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:40:37,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.08 | bwd_microstep: 3324.05 | bwd_inner_microstep: 3323.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-20 01:40:37,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.08 | bwd: 3324.06 | bwd_inner: 3323.24 | bwd_allreduce: 0.78 | step: 7.07 77%|███████▋ | 
7734/10000 [12:10:58<3:27:40, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.06031076982617378, 'learning_rate': 5.148667167267336e-06, 'epoch': 7.73} 77%|███████▋ | 7734/10000 [12:10:58<3:27:40, 5.50s/it][2025-06-20 01:40:43,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:40:43,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.08 | bwd_microstep: 3323.83 | bwd_inner_microstep: 3323.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 01:40:43,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.08 | bwd: 3323.84 | bwd_inner: 3323.04 | bwd_allreduce: 0.76 | step: 6.65 77%|███████▋ | 7735/10000 [12:11:04<3:27:15, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.08700637519359589, 'learning_rate': 5.144329489801654e-06, 'epoch': 7.74} 77%|███████▋ | 7735/10000 [12:11:04<3:27:15, 5.49s/it][2025-06-20 01:40:48,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.86 [2025-06-20 01:40:48,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.40 | bwd_microstep: 3372.71 | bwd_inner_microstep: 3371.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-20 01:40:48,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.40 | bwd: 3372.73 | bwd_inner: 3371.90 | bwd_allreduce: 0.78 | step: 6.98 77%|███████▋ | 7736/10000 [12:11:09<3:27:42, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0014957473613321781, 'learning_rate': 5.139993370626868e-06, 'epoch': 7.74} 77%|███████▋ | 7736/10000 [12:11:09<3:27:42, 5.50s/it][2025-06-20 01:40:54,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:40:54,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.39 | bwd_microstep: 3320.91 | 
bwd_inner_microstep: 3320.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-20 01:40:54,437] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.39 | bwd: 3320.93 | bwd_inner: 3320.12 | bwd_allreduce: 0.76 | step: 6.88 77%|███████▋ | 7737/10000 [12:11:15<3:27:09, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.007154590915888548, 'learning_rate': 5.1356588101978035e-06, 'epoch': 7.74} 77%|███████▋ | 7737/10000 [12:11:15<3:27:09, 5.49s/it][2025-06-20 01:41:00,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:41:00,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.65 | bwd_microstep: 3405.69 | bwd_inner_microstep: 3404.87 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.15 [2025-06-20 01:41:00,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.65 | bwd: 3405.71 | bwd_inner: 3404.87 | bwd_allreduce: 0.79 | step: 7.15 77%|███████▋ | 7738/10000 [12:11:20<3:28:07, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0014887730358168483, 'learning_rate': 5.131325808969143e-06, 'epoch': 7.74} 77%|███████▋ | 7738/10000 [12:11:20<3:28:07, 5.52s/it][2025-06-20 01:41:05,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:41:05,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.74 | bwd_microstep: 3319.86 | bwd_inner_microstep: 3319.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 01:41:05,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.74 | bwd: 3319.88 | bwd_inner: 3319.07 | bwd_allreduce: 0.76 | step: 6.77 77%|███████▋ | 7739/10000 [12:11:26<3:27:29, 5.51s/it] {'loss': 0.0007, 'grad_norm': 0.30683475732803345, 'learning_rate': 5.126994367395397e-06, 'epoch': 7.74} 77%|███████▋ | 7739/10000 [12:11:26<3:27:29, 5.51s/it][2025-06-20 01:41:10,954] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:41:10,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.57 | bwd_microstep: 3318.90 | bwd_inner_microstep: 3317.95 | bwd_allreduce_microstep: 0.90 | step_microstep: 6.86 [2025-06-20 01:41:10,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.57 | bwd: 3318.92 | bwd_inner: 3317.95 | bwd_allreduce: 0.92 | step: 6.87 77%|███████▋ | 7740/10000 [12:11:31<3:26:52, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.014502352103590965, 'learning_rate': 5.1226644859309145e-06, 'epoch': 7.74} 77%|███████▋ | 7740/10000 [12:11:31<3:26:52, 5.49s/it][2025-06-20 01:41:16,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:41:16,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.99 | bwd_microstep: 3319.25 | bwd_inner_microstep: 3318.43 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.96 [2025-06-20 01:41:16,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.99 | bwd: 3319.27 | bwd_inner: 3318.43 | bwd_allreduce: 0.78 | step: 6.97 77%|███████▋ | 7741/10000 [12:11:37<3:26:28, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.04379231482744217, 'learning_rate': 5.1183361650298846e-06, 'epoch': 7.74} 77%|███████▋ | 7741/10000 [12:11:37<3:26:28, 5.48s/it][2025-06-20 01:41:21,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:41:21,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.33 | bwd_microstep: 3333.11 | bwd_inner_microstep: 3332.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.12 [2025-06-20 01:41:21,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.33 | bwd: 3333.12 | bwd_inner: 3332.31 | bwd_allreduce: 0.77 | step: 
7.13 77%|███████▋ | 7742/10000 [12:11:42<3:26:20, 5.48s/it] {'loss': 0.002, 'grad_norm': 0.8598336577415466, 'learning_rate': 5.114009405146319e-06, 'epoch': 7.74} 77%|███████▋ | 7742/10000 [12:11:42<3:26:20, 5.48s/it][2025-06-20 01:41:27,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:41:27,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.00 | bwd_microstep: 3320.19 | bwd_inner_microstep: 3319.41 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 01:41:27,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.00 | bwd: 3320.20 | bwd_inner: 3319.41 | bwd_allreduce: 0.75 | step: 6.63 77%|███████▋ | 7743/10000 [12:11:48<3:25:59, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.023806992918252945, 'learning_rate': 5.109684206734076e-06, 'epoch': 7.74} 77%|███████▋ | 7743/10000 [12:11:48<3:25:59, 5.48s/it][2025-06-20 01:41:32,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 01:41:32,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.30 | bwd_microstep: 3317.49 | bwd_inner_microstep: 3316.72 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.54 [2025-06-20 01:41:32,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.30 | bwd: 3317.51 | bwd_inner: 3316.72 | bwd_allreduce: 0.75 | step: 6.54 77%|███████▋ | 7744/10000 [12:11:53<3:25:44, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0021348954178392887, 'learning_rate': 5.1053605702468535e-06, 'epoch': 7.74} 77%|███████▋ | 7744/10000 [12:11:53<3:25:44, 5.47s/it][2025-06-20 01:41:38,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:41:38,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.47 | bwd_microstep: 
3317.92 | bwd_inner_microstep: 3317.13 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 01:41:38,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.47 | bwd: 3317.93 | bwd_inner: 3317.13 | bwd_allreduce: 0.76 | step: 6.68 77%|███████▋ | 7745/10000 [12:11:59<3:25:28, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00033335373154841363, 'learning_rate': 5.101038496138173e-06, 'epoch': 7.75} 77%|███████▋ | 7745/10000 [12:11:59<3:25:28, 5.47s/it][2025-06-20 01:41:43,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:41:43,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.51 | bwd_microstep: 3394.11 | bwd_inner_microstep: 3393.13 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.48 [2025-06-20 01:41:43,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.51 | bwd: 3394.13 | bwd_inner: 3393.13 | bwd_allreduce: 0.95 | step: 7.48 77%|███████▋ | 7746/10000 [12:12:04<3:26:32, 5.50s/it] {'loss': 0.0051, 'grad_norm': 1.0301003456115723, 'learning_rate': 5.096717984861417e-06, 'epoch': 7.75} 77%|███████▋ | 7746/10000 [12:12:04<3:26:32, 5.50s/it][2025-06-20 01:41:49,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:41:49,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.98 | bwd_microstep: 3373.05 | bwd_inner_microstep: 3372.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 01:41:49,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.98 | bwd: 3373.07 | bwd_inner: 3372.26 | bwd_allreduce: 0.76 | step: 6.73 77%|███████▋ | 7747/10000 [12:12:10<3:26:55, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.011054105125367641, 'learning_rate': 5.092399036869771e-06, 'epoch': 7.75} 77%|███████▋ | 7747/10000 [12:12:10<3:26:55, 5.51s/it][2025-06-20 01:41:54,924] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:41:54,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.26 | bwd_microstep: 3371.09 | bwd_inner_microstep: 3370.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-20 01:41:54,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.26 | bwd: 3371.10 | bwd_inner: 3370.30 | bwd_allreduce: 0.76 | step: 6.75 77%|███████▋ | 7748/10000 [12:12:15<3:27:06, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.01280492078512907, 'learning_rate': 5.088081652616277e-06, 'epoch': 7.75} 77%|███████▋ | 7748/10000 [12:12:15<3:27:06, 5.52s/it][2025-06-20 01:42:00,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:42:00,460] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.89 | bwd_microstep: 3373.16 | bwd_inner_microstep: 3372.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 01:42:00,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.89 | bwd: 3373.18 | bwd_inner: 3372.38 | bwd_allreduce: 0.76 | step: 6.71 77%|███████▋ | 7749/10000 [12:12:21<3:27:13, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.060615602880716324, 'learning_rate': 5.0837658325538105e-06, 'epoch': 7.75} 77%|███████▋ | 7749/10000 [12:12:21<3:27:13, 5.52s/it][2025-06-20 01:42:05,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:42:05,987] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.19 | bwd_microstep: 3364.66 | bwd_inner_microstep: 3363.85 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-20 01:42:05,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.19 | bwd: 3364.68 | bwd_inner: 3363.85 | bwd_allreduce: 0.78 | 
step: 7.31 78%|███████▊ | 7750/10000 [12:12:26<3:27:11, 5.52s/it] {'loss': 0.0012, 'grad_norm': 0.2178065925836563, 'learning_rate': 5.079451577135079e-06, 'epoch': 7.75} 78%|███████▊ | 7750/10000 [12:12:26<3:27:11, 5.52s/it][2025-06-20 01:42:11,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:42:11,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.49 | bwd_microstep: 3315.90 | bwd_inner_microstep: 3315.08 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.82 [2025-06-20 01:42:11,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.49 | bwd: 3315.91 | bwd_inner: 3315.08 | bwd_allreduce: 0.79 | step: 6.83 78%|███████▊ | 7751/10000 [12:12:32<3:26:20, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0025501360651105642, 'learning_rate': 5.075138886812634e-06, 'epoch': 7.75} 78%|███████▊ | 7751/10000 [12:12:32<3:26:20, 5.50s/it][2025-06-20 01:42:16,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:42:16,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.52 | bwd_microstep: 3364.88 | bwd_inner_microstep: 3364.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 01:42:16,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.52 | bwd: 3364.89 | bwd_inner: 3364.09 | bwd_allreduce: 0.75 | step: 6.65 78%|███████▊ | 7752/10000 [12:12:37<3:26:33, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.022615130990743637, 'learning_rate': 5.070827762038846e-06, 'epoch': 7.75} 78%|███████▊ | 7752/10000 [12:12:37<3:26:33, 5.51s/it][2025-06-20 01:42:22,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:42:22,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.34 | 
bwd_microstep: 3367.42 | bwd_inner_microstep: 3366.47 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.51 [2025-06-20 01:42:22,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.34 | bwd: 3367.44 | bwd_inner: 3366.47 | bwd_allreduce: 0.92 | step: 7.51 78%|███████▊ | 7753/10000 [12:12:43<3:26:43, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.006252450868487358, 'learning_rate': 5.066518203265938e-06, 'epoch': 7.75} 78%|███████▊ | 7753/10000 [12:12:43<3:26:43, 5.52s/it][2025-06-20 01:42:28,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:42:28,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.60 | bwd_microstep: 3373.68 | bwd_inner_microstep: 3372.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 01:42:28,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.60 | bwd: 3373.70 | bwd_inner: 3372.90 | bwd_allreduce: 0.76 | step: 6.66 78%|███████▊ | 7754/10000 [12:12:48<3:26:53, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.00031144495005719364, 'learning_rate': 5.062210210945961e-06, 'epoch': 7.75} 78%|███████▊ | 7754/10000 [12:12:48<3:26:53, 5.53s/it][2025-06-20 01:42:33,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:42:33,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.92 | bwd_microstep: 3318.21 | bwd_inner_microstep: 3317.35 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.11 [2025-06-20 01:42:33,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.92 | bwd: 3318.23 | bwd_inner: 3317.35 | bwd_allreduce: 0.82 | step: 7.11 78%|███████▊ | 7755/10000 [12:12:54<3:26:03, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0009430811041966081, 'learning_rate': 5.057903785530804e-06, 'epoch': 7.75} 78%|███████▊ | 7755/10000 [12:12:54<3:26:03, 5.51s/it][2025-06-20 
01:42:38,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:42:38,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.94 | bwd_microstep: 3318.96 | bwd_inner_microstep: 3318.07 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.32 [2025-06-20 01:42:38,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.94 | bwd: 3318.98 | bwd_inner: 3318.07 | bwd_allreduce: 0.86 | step: 7.33 78%|███████▊ | 7756/10000 [12:12:59<3:25:29, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0004581485118251294, 'learning_rate': 5.053598927472198e-06, 'epoch': 7.76} 78%|███████▊ | 7756/10000 [12:12:59<3:25:29, 5.49s/it][2025-06-20 01:42:44,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:42:44,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.56 | bwd_microstep: 3363.72 | bwd_inner_microstep: 3362.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 01:42:44,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.56 | bwd: 3363.73 | bwd_inner: 3362.93 | bwd_allreduce: 0.76 | step: 6.62 78%|███████▊ | 7757/10000 [12:13:05<3:25:48, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.006826123222708702, 'learning_rate': 5.049295637221692e-06, 'epoch': 7.76} 78%|███████▊ | 7757/10000 [12:13:05<3:25:48, 5.51s/it][2025-06-20 01:42:49,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:42:49,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.09 | bwd_microstep: 3320.59 | bwd_inner_microstep: 3319.78 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-20 01:42:49,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.09 | bwd: 3320.60 | bwd_inner: 3319.78 | 
bwd_allreduce: 0.78 | step: 6.75 78%|███████▊ | 7758/10000 [12:13:10<3:25:16, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.022364744916558266, 'learning_rate': 5.044993915230683e-06, 'epoch': 7.76} 78%|███████▊ | 7758/10000 [12:13:10<3:25:16, 5.49s/it][2025-06-20 01:42:55,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:42:55,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.56 | bwd_microstep: 3324.96 | bwd_inner_microstep: 3323.94 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.20 [2025-06-20 01:42:55,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.56 | bwd: 3324.98 | bwd_inner: 3323.94 | bwd_allreduce: 0.98 | step: 7.21 78%|███████▊ | 7759/10000 [12:13:16<3:25:01, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.026682747527956963, 'learning_rate': 5.040693761950406e-06, 'epoch': 7.76} 78%|███████▊ | 7759/10000 [12:13:16<3:25:01, 5.49s/it][2025-06-20 01:43:00,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:43:00,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.88 | bwd_microstep: 3310.67 | bwd_inner_microstep: 3309.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-20 01:43:00,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.88 | bwd: 3310.69 | bwd_inner: 3309.87 | bwd_allreduce: 0.77 | step: 6.93 78%|███████▊ | 7760/10000 [12:13:21<3:24:32, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0011010802118107677, 'learning_rate': 5.036395177831923e-06, 'epoch': 7.76} 78%|███████▊ | 7760/10000 [12:13:21<3:24:32, 5.48s/it][2025-06-20 01:43:06,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.82 [2025-06-20 01:43:06,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2104.26 | bwd_microstep: 3322.04 | bwd_inner_microstep: 3321.08 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.35 [2025-06-20 01:43:06,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.26 | bwd: 3322.06 | bwd_inner: 3321.08 | bwd_allreduce: 0.93 | step: 7.35 78%|███████▊ | 7761/10000 [12:13:27<3:24:17, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00035017068148590624, 'learning_rate': 5.032098163326138e-06, 'epoch': 7.76} 78%|███████▊ | 7761/10000 [12:13:27<3:24:17, 5.47s/it][2025-06-20 01:43:11,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:43:11,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.51 | bwd_microstep: 3323.47 | bwd_inner_microstep: 3322.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 01:43:11,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.51 | bwd: 3323.48 | bwd_inner: 3322.68 | bwd_allreduce: 0.76 | step: 6.58 78%|███████▊ | 7762/10000 [12:13:32<3:24:11, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0008352695149369538, 'learning_rate': 5.027802718883788e-06, 'epoch': 7.76} 78%|███████▊ | 7762/10000 [12:13:32<3:24:11, 5.47s/it][2025-06-20 01:43:17,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 01:43:17,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.49 | bwd_microstep: 3312.96 | bwd_inner_microstep: 3311.97 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.29 [2025-06-20 01:43:17,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.49 | bwd: 3312.98 | bwd_inner: 3311.97 | bwd_allreduce: 0.96 | step: 7.29 78%|███████▊ | 7763/10000 [12:13:38<3:23:51, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0004935759352520108, 'learning_rate': 5.0235088449554444e-06, 'epoch': 7.76} 78%|███████▊ | 7763/10000 
[12:13:38<3:23:51, 5.47s/it][2025-06-20 01:43:22,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:43:22,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.94 | bwd_microstep: 3404.14 | bwd_inner_microstep: 3403.25 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.89 [2025-06-20 01:43:22,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.94 | bwd: 3404.15 | bwd_inner: 3403.25 | bwd_allreduce: 0.86 | step: 6.89 78%|███████▊ | 7764/10000 [12:13:43<3:25:06, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01655128225684166, 'learning_rate': 5.019216541991516e-06, 'epoch': 7.76} 78%|███████▊ | 7764/10000 [12:13:43<3:25:06, 5.50s/it][2025-06-20 01:43:28,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:43:28,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.06 | bwd_microstep: 3309.25 | bwd_inner_microstep: 3308.19 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.42 [2025-06-20 01:43:28,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.06 | bwd: 3309.26 | bwd_inner: 3308.19 | bwd_allreduce: 1.02 | step: 7.42 78%|███████▊ | 7765/10000 [12:13:49<3:24:24, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.003064232412725687, 'learning_rate': 5.014925810442244e-06, 'epoch': 7.76} 78%|███████▊ | 7765/10000 [12:13:49<3:24:24, 5.49s/it][2025-06-20 01:43:33,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:43:33,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.64 | bwd_microstep: 3365.05 | bwd_inner_microstep: 3364.23 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-20 01:43:33,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.64 | bwd: 
3365.07 | bwd_inner: 3364.23 | bwd_allreduce: 0.78 | step: 6.92 78%|███████▊ | 7766/10000 [12:13:54<3:24:51, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.026957189664244652, 'learning_rate': 5.010636650757712e-06, 'epoch': 7.77} 78%|███████▊ | 7766/10000 [12:13:54<3:24:51, 5.50s/it][2025-06-20 01:43:39,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-20 01:43:39,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.46 | bwd_microstep: 3314.32 | bwd_inner_microstep: 3313.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-20 01:43:39,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.46 | bwd: 3314.33 | bwd_inner: 3313.54 | bwd_allreduce: 0.75 | step: 6.55 78%|███████▊ | 7767/10000 [12:14:00<3:24:23, 5.49s/it] {'loss': 0.0532, 'grad_norm': 10.30605697631836, 'learning_rate': 5.006349063387821e-06, 'epoch': 7.77} 78%|███████▊ | 7767/10000 [12:14:00<3:24:23, 5.49s/it][2025-06-20 01:43:44,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:43:44,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.77 | bwd_microstep: 3365.47 | bwd_inner_microstep: 3364.46 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.08 [2025-06-20 01:43:44,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.77 | bwd: 3365.49 | bwd_inner: 3364.46 | bwd_allreduce: 0.98 | step: 7.08 78%|███████▊ | 7768/10000 [12:14:05<3:24:43, 5.50s/it] {'loss': 0.0051, 'grad_norm': 1.8029522895812988, 'learning_rate': 5.002063048782326e-06, 'epoch': 7.77} 78%|███████▊ | 7768/10000 [12:14:05<3:24:43, 5.50s/it][2025-06-20 01:43:50,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:43:50,411] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2132.08 | bwd_microstep: 3360.89 | bwd_inner_microstep: 3360.11 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.53 [2025-06-20 01:43:50,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.08 | bwd: 3360.90 | bwd_inner: 3360.11 | bwd_allreduce: 0.75 | step: 6.54 78%|███████▊ | 7769/10000 [12:14:11<3:24:57, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.003960520029067993, 'learning_rate': 4.997778607390809e-06, 'epoch': 7.77} 78%|███████▊ | 7769/10000 [12:14:11<3:24:57, 5.51s/it][2025-06-20 01:43:55,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:43:55,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.77 | bwd_microstep: 3366.51 | bwd_inner_microstep: 3365.58 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.94 [2025-06-20 01:43:55,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.77 | bwd: 3366.53 | bwd_inner: 3365.58 | bwd_allreduce: 0.91 | step: 6.94 78%|███████▊ | 7770/10000 [12:14:16<3:25:08, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.003233529394492507, 'learning_rate': 4.99349573966269e-06, 'epoch': 7.77} 78%|███████▊ | 7770/10000 [12:14:16<3:25:08, 5.52s/it][2025-06-20 01:44:01,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:44:01,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.46 | bwd_microstep: 3369.01 | bwd_inner_microstep: 3368.17 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.29 [2025-06-20 01:44:01,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.46 | bwd: 3369.03 | bwd_inner: 3368.17 | bwd_allreduce: 0.82 | step: 7.30 78%|███████▊ | 7771/10000 [12:14:22<3:25:16, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.009147239848971367, 'learning_rate': 4.989214446047226e-06, 'epoch': 7.77} 78%|███████▊ | 7771/10000 
[12:14:22<3:25:16, 5.53s/it][2025-06-20 01:44:07,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:44:07,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.84 | bwd_microstep: 3368.71 | bwd_inner_microstep: 3367.75 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.51 [2025-06-20 01:44:07,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.84 | bwd: 3368.73 | bwd_inner: 3367.75 | bwd_allreduce: 0.93 | step: 7.51 78%|███████▊ | 7772/10000 [12:14:27<3:25:21, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.003563148668035865, 'learning_rate': 4.984934726993494e-06, 'epoch': 7.77} 78%|███████▊ | 7772/10000 [12:14:27<3:25:21, 5.53s/it][2025-06-20 01:44:12,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:44:12,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.49 | bwd_microstep: 3359.13 | bwd_inner_microstep: 3358.32 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-20 01:44:12,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.49 | bwd: 3359.14 | bwd_inner: 3358.32 | bwd_allreduce: 0.78 | step: 7.04 78%|███████▊ | 7773/10000 [12:14:33<3:25:15, 5.53s/it] {'loss': 0.0012, 'grad_norm': 0.3888019323348999, 'learning_rate': 4.980656582950421e-06, 'epoch': 7.77} 78%|███████▊ | 7773/10000 [12:14:33<3:25:15, 5.53s/it][2025-06-20 01:44:18,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:44:18,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.18 | bwd_microstep: 3316.48 | bwd_inner_microstep: 3315.65 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.12 [2025-06-20 01:44:18,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.18 | bwd: 
3316.49 | bwd_inner: 3315.65 | bwd_allreduce: 0.80 | step: 7.13 78%|███████▊ | 7774/10000 [12:14:38<3:24:31, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.06558871269226074, 'learning_rate': 4.976380014366766e-06, 'epoch': 7.77} 78%|███████▊ | 7774/10000 [12:14:38<3:24:31, 5.51s/it][2025-06-20 01:44:23,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.72 [2025-06-20 01:44:23,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.24 | bwd_microstep: 3315.09 | bwd_inner_microstep: 3314.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 01:44:23,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.24 | bwd: 3315.10 | bwd_inner: 3314.29 | bwd_allreduce: 0.77 | step: 6.82 78%|███████▊ | 7775/10000 [12:14:44<3:23:47, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0008030537865124643, 'learning_rate': 4.97210502169112e-06, 'epoch': 7.78} 78%|███████▊ | 7775/10000 [12:14:44<3:23:47, 5.50s/it][2025-06-20 01:44:29,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:44:29,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.98 | bwd_microstep: 3364.77 | bwd_inner_microstep: 3363.82 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.11 [2025-06-20 01:44:29,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.98 | bwd: 3364.78 | bwd_inner: 3363.82 | bwd_allreduce: 0.91 | step: 7.11 78%|███████▊ | 7776/10000 [12:14:49<3:24:05, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.005260071717202663, 'learning_rate': 4.967831605371918e-06, 'epoch': 7.78} 78%|███████▊ | 7776/10000 [12:14:49<3:24:05, 5.51s/it][2025-06-20 01:44:34,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:44:34,543] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2128.93 | bwd_microstep: 3356.84 | bwd_inner_microstep: 3356.02 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.29 [2025-06-20 01:44:34,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.93 | bwd: 3356.85 | bwd_inner: 3356.02 | bwd_allreduce: 0.79 | step: 7.29 78%|███████▊ | 7777/10000 [12:14:55<3:24:14, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0031184011604636908, 'learning_rate': 4.963559765857406e-06, 'epoch': 7.78} 78%|███████▊ | 7777/10000 [12:14:55<3:24:14, 5.51s/it][2025-06-20 01:44:40,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:44:40,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.76 | bwd_microstep: 3362.81 | bwd_inner_microstep: 3362.01 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-20 01:44:40,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.76 | bwd: 3362.82 | bwd_inner: 3362.01 | bwd_allreduce: 0.77 | step: 7.05 78%|███████▊ | 7778/10000 [12:15:00<3:24:23, 5.52s/it] {'loss': 0.1005, 'grad_norm': 20.205238342285156, 'learning_rate': 4.95928950359569e-06, 'epoch': 7.78} 78%|███████▊ | 7778/10000 [12:15:00<3:24:23, 5.52s/it][2025-06-20 01:44:45,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 01:44:45,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.88 | bwd_microstep: 3318.82 | bwd_inner_microstep: 3317.70 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.39 [2025-06-20 01:44:45,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.88 | bwd: 3318.84 | bwd_inner: 3317.70 | bwd_allreduce: 1.07 | step: 7.38 78%|███████▊ | 7779/10000 [12:15:06<3:23:40, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00907586794346571, 'learning_rate': 4.955020819034699e-06, 'epoch': 7.78} 78%|███████▊ | 7779/10000 
[12:15:06<3:23:40, 5.50s/it][2025-06-20 01:44:50,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:44:50,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.06 | bwd_microstep: 3306.46 | bwd_inner_microstep: 3305.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-20 01:44:50,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.06 | bwd: 3306.48 | bwd_inner: 3305.67 | bwd_allreduce: 0.76 | step: 6.92 78%|███████▊ | 7780/10000 [12:15:11<3:22:55, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0001613010826986283, 'learning_rate': 4.950753712622194e-06, 'epoch': 7.78} 78%|███████▊ | 7780/10000 [12:15:11<3:22:55, 5.48s/it][2025-06-20 01:44:56,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:44:56,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.72 | bwd_microstep: 3315.14 | bwd_inner_microstep: 3314.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 01:44:56,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.73 | bwd: 3315.15 | bwd_inner: 3314.35 | bwd_allreduce: 0.76 | step: 6.72 78%|███████▊ | 7781/10000 [12:15:17<3:22:32, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001596256042830646, 'learning_rate': 4.946488184805787e-06, 'epoch': 7.78} 78%|███████▊ | 7781/10000 [12:15:17<3:22:32, 5.48s/it][2025-06-20 01:45:01,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 01:45:01,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.29 | bwd_microstep: 3313.74 | bwd_inner_microstep: 3312.64 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.83 [2025-06-20 01:45:01,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.29 | bwd: 
3313.77 | bwd_inner: 3312.64 | bwd_allreduce: 1.06 | step: 7.83 78%|███████▊ | 7782/10000 [12:15:22<3:22:13, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0015682632802054286, 'learning_rate': 4.9422242360329e-06, 'epoch': 7.78} 78%|███████▊ | 7782/10000 [12:15:22<3:22:13, 5.47s/it][2025-06-20 01:45:07,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:45:07,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.09 | bwd_microstep: 3311.49 | bwd_inner_microstep: 3310.43 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.70 [2025-06-20 01:45:07,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.09 | bwd: 3311.52 | bwd_inner: 3310.43 | bwd_allreduce: 1.02 | step: 7.71 78%|███████▊ | 7783/10000 [12:15:28<3:21:58, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.040811605751514435, 'learning_rate': 4.9379618667508044e-06, 'epoch': 7.78} 78%|███████▊ | 7783/10000 [12:15:28<3:21:58, 5.47s/it][2025-06-20 01:45:12,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:45:12,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.64 | bwd_microstep: 3314.19 | bwd_inner_microstep: 3313.24 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.31 [2025-06-20 01:45:12,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.64 | bwd: 3314.20 | bwd_inner: 3313.24 | bwd_allreduce: 0.91 | step: 7.32 78%|███████▊ | 7784/10000 [12:15:33<3:21:46, 5.46s/it] {'loss': 0.0007, 'grad_norm': 0.09790325164794922, 'learning_rate': 4.933701077406605e-06, 'epoch': 7.78} 78%|███████▊ | 7784/10000 [12:15:33<3:21:46, 5.46s/it][2025-06-20 01:45:18,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:45:18,340] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2125.98 | bwd_microstep: 3363.45 | bwd_inner_microstep: 3362.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 01:45:18,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.98 | bwd: 3363.47 | bwd_inner: 3362.66 | bwd_allreduce: 0.76 | step: 6.69 78%|███████▊ | 7785/10000 [12:15:39<3:22:24, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.008794983848929405, 'learning_rate': 4.929441868447238e-06, 'epoch': 7.79} 78%|███████▊ | 7785/10000 [12:15:39<3:22:24, 5.48s/it][2025-06-20 01:45:23,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:45:23,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.91 | bwd_microstep: 3311.96 | bwd_inner_microstep: 3311.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-20 01:45:23,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.91 | bwd: 3311.97 | bwd_inner: 3311.15 | bwd_allreduce: 0.78 | step: 7.04 78%|███████▊ | 7786/10000 [12:15:44<3:22:03, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0037532763089984655, 'learning_rate': 4.925184240319478e-06, 'epoch': 7.79} 78%|███████▊ | 7786/10000 [12:15:44<3:22:03, 5.48s/it][2025-06-20 01:45:29,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:45:29,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.32 | bwd_microstep: 3307.96 | bwd_inner_microstep: 3307.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 01:45:29,249] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.32 | bwd: 3307.98 | bwd_inner: 3307.18 | bwd_allreduce: 0.75 | step: 6.71 78%|███████▊ | 7787/10000 [12:15:50<3:21:40, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.09235217422246933, 'learning_rate': 4.920928193469923e-06, 'epoch': 7.79} 78%|███████▊ | 7787/10000 
[12:15:50<3:21:40, 5.47s/it][2025-06-20 01:45:34,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:45:34,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.06 | bwd_microstep: 3357.59 | bwd_inner_microstep: 3356.77 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.79 [2025-06-20 01:45:34,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.06 | bwd: 3357.60 | bwd_inner: 3356.77 | bwd_allreduce: 0.79 | step: 6.79 78%|███████▊ | 7788/10000 [12:15:55<3:22:07, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001254912349395454, 'learning_rate': 4.916673728345016e-06, 'epoch': 7.79} 78%|███████▊ | 7788/10000 [12:15:55<3:22:07, 5.48s/it][2025-06-20 01:45:40,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:45:40,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2095.93 | bwd_microstep: 3303.54 | bwd_inner_microstep: 3302.58 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.32 [2025-06-20 01:45:40,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2095.93 | bwd: 3303.55 | bwd_inner: 3302.58 | bwd_allreduce: 0.93 | step: 7.32 78%|███████▊ | 7789/10000 [12:16:01<3:21:33, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.001417305669747293, 'learning_rate': 4.91242084539103e-06, 'epoch': 7.79} 78%|███████▊ | 7789/10000 [12:16:01<3:21:33, 5.47s/it][2025-06-20 01:45:45,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:45:45,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.66 | bwd_microstep: 3319.67 | bwd_inner_microstep: 3318.74 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.03 [2025-06-20 01:45:45,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.66 | bwd: 
3319.68 | bwd_inner: 3318.74 | bwd_allreduce: 0.89 | step: 7.03 78%|███████▊ | 7790/10000 [12:16:06<3:21:19, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.06593241542577744, 'learning_rate': 4.908169545054075e-06, 'epoch': 7.79} 78%|███████▊ | 7790/10000 [12:16:06<3:21:19, 5.47s/it][2025-06-20 01:45:51,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:45:51,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.14 | bwd_microstep: 3311.59 | bwd_inner_microstep: 3310.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 01:45:51,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.14 | bwd: 3311.61 | bwd_inner: 3310.80 | bwd_allreduce: 0.76 | step: 6.83 78%|███████▊ | 7791/10000 [12:16:11<3:21:02, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.000889544899109751, 'learning_rate': 4.9039198277800926e-06, 'epoch': 7.79} 78%|███████▊ | 7791/10000 [12:16:11<3:21:02, 5.46s/it][2025-06-20 01:45:56,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:45:56,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.29 | bwd_microstep: 3312.72 | bwd_inner_microstep: 3311.83 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.82 [2025-06-20 01:45:56,563] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.29 | bwd: 3312.74 | bwd_inner: 3311.83 | bwd_allreduce: 0.87 | step: 6.82 78%|███████▊ | 7792/10000 [12:16:17<3:20:51, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.017461657524108887, 'learning_rate': 4.89967169401486e-06, 'epoch': 7.79} 78%|███████▊ | 7792/10000 [12:16:17<3:20:51, 5.46s/it][2025-06-20 01:46:02,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:46:02,084] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2123.78 | bwd_microstep: 3361.18 | bwd_inner_microstep: 3360.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-20 01:46:02,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.78 | bwd: 3361.19 | bwd_inner: 3360.39 | bwd_allreduce: 0.76 | step: 6.78 78%|███████▊ | 7793/10000 [12:16:22<3:21:27, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0004230812191963196, 'learning_rate': 4.89542514420398e-06, 'epoch': 7.79} 78%|███████▊ | 7793/10000 [12:16:22<3:21:27, 5.48s/it][2025-06-20 01:46:07,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:46:07,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.72 | bwd_microstep: 3316.46 | bwd_inner_microstep: 3315.67 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 01:46:07,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.72 | bwd: 3316.47 | bwd_inner: 3315.67 | bwd_allreduce: 0.76 | step: 6.77 78%|███████▊ | 7794/10000 [12:16:28<3:21:06, 5.47s/it] {'loss': 0.0012, 'grad_norm': 0.3015228807926178, 'learning_rate': 4.891180178792898e-06, 'epoch': 7.79} 78%|███████▊ | 7794/10000 [12:16:28<3:21:06, 5.47s/it][2025-06-20 01:46:12,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:46:12,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.51 | bwd_microstep: 3314.41 | bwd_inner_microstep: 3313.49 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.06 [2025-06-20 01:46:12,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.51 | bwd: 3314.43 | bwd_inner: 3313.49 | bwd_allreduce: 0.89 | step: 7.06 78%|███████▊ | 7795/10000 [12:16:33<3:20:52, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.006198324263095856, 'learning_rate': 4.886936798226895e-06, 'epoch': 7.79} 78%|███████▊ | 7795/10000 
[12:16:33<3:20:52, 5.47s/it][2025-06-20 01:46:18,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:46:18,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.30 | bwd_microstep: 3316.95 | bwd_inner_microstep: 3316.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 01:46:18,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.30 | bwd: 3316.96 | bwd_inner: 3316.16 | bwd_allreduce: 0.76 | step: 6.60 78%|███████▊ | 7796/10000 [12:16:39<3:20:41, 5.46s/it] {'loss': 0.0012, 'grad_norm': 0.3536384701728821, 'learning_rate': 4.882695002951079e-06, 'epoch': 7.8} 78%|███████▊ | 7796/10000 [12:16:39<3:20:41, 5.46s/it][2025-06-20 01:46:23,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:46:23,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.19 | bwd_microstep: 3309.18 | bwd_inner_microstep: 3308.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 01:46:23,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.19 | bwd: 3309.19 | bwd_inner: 3308.40 | bwd_allreduce: 0.76 | step: 6.60 78%|███████▊ | 7797/10000 [12:16:44<3:20:22, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0005277626914903522, 'learning_rate': 4.8784547934103966e-06, 'epoch': 7.8} 78%|███████▊ | 7797/10000 [12:16:44<3:20:22, 5.46s/it][2025-06-20 01:46:29,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:46:29,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.17 | bwd_microstep: 3363.42 | bwd_inner_microstep: 3362.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 01:46:29,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.17 | bwd: 
3363.62 | bwd_inner: 3362.63 | bwd_allreduce: 0.76 | step: 6.62 78%|███████▊ | 7798/10000 [12:16:50<3:21:01, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.219100221991539, 'learning_rate': 4.874216170049624e-06, 'epoch': 7.8} 78%|███████▊ | 7798/10000 [12:16:50<3:21:01, 5.48s/it][2025-06-20 01:46:34,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:46:34,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.23 | bwd_microstep: 3317.76 | bwd_inner_microstep: 3316.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 01:46:34,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.23 | bwd: 3317.78 | bwd_inner: 3316.98 | bwd_allreduce: 0.76 | step: 6.63 78%|███████▊ | 7799/10000 [12:16:55<3:20:54, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.031758032739162445, 'learning_rate': 4.869979133313374e-06, 'epoch': 7.8} 78%|███████▊ | 7799/10000 [12:16:55<3:20:54, 5.48s/it][2025-06-20 01:46:40,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:46:40,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.79 | bwd_microstep: 3317.79 | bwd_inner_microstep: 3316.93 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.07 [2025-06-20 01:46:40,359] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.79 | bwd: 3317.81 | bwd_inner: 3316.93 | bwd_allreduce: 0.83 | step: 7.07 78%|███████▊ | 7800/10000 [12:17:01<3:20:43, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.001042509451508522, 'learning_rate': 4.865743683646094e-06, 'epoch': 7.8} 78%|███████▊ | 7800/10000 [12:17:01<3:20:43, 5.47s/it][2025-06-20 01:46:45,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:46:45,831] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2108.70 | bwd_microstep: 3317.77 | bwd_inner_microstep: 3316.74 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.45 [2025-06-20 01:46:45,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.70 | bwd: 3317.79 | bwd_inner: 3316.74 | bwd_allreduce: 1.00 | step: 7.45 78%|███████▊ | 7801/10000 [12:17:06<3:20:35, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.015474236570298672, 'learning_rate': 4.861509821492061e-06, 'epoch': 7.8} 78%|███████▊ | 7801/10000 [12:17:06<3:20:35, 5.47s/it][2025-06-20 01:46:51,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:46:51,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.56 | bwd_microstep: 3323.15 | bwd_inner_microstep: 3322.37 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 01:46:51,301] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.56 | bwd: 3323.16 | bwd_inner: 3322.37 | bwd_allreduce: 0.75 | step: 6.63 78%|███████▊ | 7802/10000 [12:17:12<3:20:26, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.01229773461818695, 'learning_rate': 4.857277547295392e-06, 'epoch': 7.8} 78%|███████▊ | 7802/10000 [12:17:12<3:20:26, 5.47s/it][2025-06-20 01:46:56,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:46:56,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.37 | bwd_microstep: 3321.16 | bwd_inner_microstep: 3320.32 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.76 [2025-06-20 01:46:56,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.37 | bwd: 3321.19 | bwd_inner: 3320.32 | bwd_allreduce: 0.80 | step: 6.77 78%|███████▊ | 7803/10000 [12:17:17<3:20:19, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0014688584487885237, 'learning_rate': 4.8530468615000235e-06, 'epoch': 7.8} 78%|███████▊ | 7803/10000 
[12:17:17<3:20:19, 5.47s/it][2025-06-20 01:47:02,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:47:02,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.74 | bwd_microstep: 3320.31 | bwd_inner_microstep: 3319.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 01:47:02,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.74 | bwd: 3320.32 | bwd_inner: 3319.52 | bwd_allreduce: 0.76 | step: 6.77 78%|███████▊ | 7804/10000 [12:17:23<3:20:05, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.09404334425926208, 'learning_rate': 4.84881776454974e-06, 'epoch': 7.8} 78%|███████▊ | 7804/10000 [12:17:23<3:20:05, 5.47s/it][2025-06-20 01:47:07,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-20 01:47:07,774] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.62 | bwd_microstep: 3373.14 | bwd_inner_microstep: 3372.10 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.54 [2025-06-20 01:47:07,775] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.62 | bwd: 3373.15 | bwd_inner: 3372.10 | bwd_allreduce: 1.00 | step: 7.54 78%|███████▊ | 7805/10000 [12:17:28<3:20:53, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.04547737166285515, 'learning_rate': 4.844590256888155e-06, 'epoch': 7.8} 78%|███████▊ | 7805/10000 [12:17:28<3:20:53, 5.49s/it][2025-06-20 01:47:13,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:47:13,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.42 | bwd_microstep: 3323.30 | bwd_inner_microstep: 3322.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 01:47:13,254] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.42 | bwd: 
3323.32 | bwd_inner: 3322.50 | bwd_allreduce: 0.77 | step: 6.82 78%|███████▊ | 7806/10000 [12:17:34<3:20:39, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00020863117242697626, 'learning_rate': 4.840364338958714e-06, 'epoch': 7.81} 78%|███████▊ | 7806/10000 [12:17:34<3:20:39, 5.49s/it][2025-06-20 01:47:18,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:47:18,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.41 | bwd_microstep: 3325.31 | bwd_inner_microstep: 3324.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-20 01:47:18,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.41 | bwd: 3325.32 | bwd_inner: 3324.51 | bwd_allreduce: 0.76 | step: 6.96 78%|███████▊ | 7807/10000 [12:17:39<3:20:19, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.013660802505910397, 'learning_rate': 4.8361400112047e-06, 'epoch': 7.81} 78%|███████▊ | 7807/10000 [12:17:39<3:20:19, 5.48s/it][2025-06-20 01:47:24,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:47:24,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.66 | bwd_microstep: 3323.50 | bwd_inner_microstep: 3322.70 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 01:47:24,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.67 | bwd: 3323.51 | bwd_inner: 3322.70 | bwd_allreduce: 0.77 | step: 6.75 78%|███████▊ | 7808/10000 [12:17:44<3:20:06, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.09358659386634827, 'learning_rate': 4.831917274069218e-06, 'epoch': 7.81} 78%|███████▊ | 7808/10000 [12:17:44<3:20:06, 5.48s/it][2025-06-20 01:47:29,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:47:29,741] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2132.74 | bwd_microstep: 3375.66 | bwd_inner_microstep: 3374.71 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.15 [2025-06-20 01:47:29,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.74 | bwd: 3375.67 | bwd_inner: 3374.71 | bwd_allreduce: 0.91 | step: 7.15 78%|███████▊ | 7809/10000 [12:17:50<3:20:50, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0010180027456954122, 'learning_rate': 4.827696127995214e-06, 'epoch': 7.81} 78%|███████▊ | 7809/10000 [12:17:50<3:20:50, 5.50s/it][2025-06-20 01:47:35,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:47:35,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.84 | bwd_microstep: 3395.79 | bwd_inner_microstep: 3394.94 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.45 [2025-06-20 01:47:35,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.84 | bwd: 3395.81 | bwd_inner: 3394.94 | bwd_allreduce: 0.81 | step: 7.46 78%|███████▊ | 7810/10000 [12:17:56<3:21:41, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0005758693441748619, 'learning_rate': 4.82347657342547e-06, 'epoch': 7.81} 78%|███████▊ | 7810/10000 [12:17:56<3:21:41, 5.53s/it][2025-06-20 01:47:40,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:47:40,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.66 | bwd_microstep: 3319.08 | bwd_inner_microstep: 3318.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 01:47:40,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.66 | bwd: 3319.10 | bwd_inner: 3318.30 | bwd_allreduce: 0.75 | step: 6.64 78%|███████▊ | 7811/10000 [12:18:01<3:20:55, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.009624956175684929, 'learning_rate': 4.819258610802598e-06, 'epoch': 7.81} 78%|███████▊ | 7811/10000 
[12:18:01<3:20:55, 5.51s/it][2025-06-20 01:47:46,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.89 [2025-06-20 01:47:46,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.75 | bwd_microstep: 3369.72 | bwd_inner_microstep: 3368.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-20 01:47:46,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.75 | bwd: 3369.73 | bwd_inner: 3368.92 | bwd_allreduce: 0.76 | step: 6.93 78%|███████▊ | 7812/10000 [12:18:07<3:21:09, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0008507159654982388, 'learning_rate': 4.815042240569046e-06, 'epoch': 7.81} 78%|███████▊ | 7812/10000 [12:18:07<3:21:09, 5.52s/it][2025-06-20 01:47:51,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.84 [2025-06-20 01:47:51,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.60 | bwd_microstep: 3378.18 | bwd_inner_microstep: 3377.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.22 [2025-06-20 01:47:51,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.60 | bwd: 3378.19 | bwd_inner: 3377.39 | bwd_allreduce: 0.76 | step: 7.22 78%|███████▊ | 7813/10000 [12:18:12<3:21:22, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.03248605877161026, 'learning_rate': 4.810827463167082e-06, 'epoch': 7.81} 78%|███████▊ | 7813/10000 [12:18:12<3:21:22, 5.52s/it][2025-06-20 01:47:57,446] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:47:57,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.47 | bwd_microstep: 3397.81 | bwd_inner_microstep: 3396.99 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.70 [2025-06-20 01:47:57,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.47 | bwd: 
3397.83 | bwd_inner: 3396.99 | bwd_allreduce: 0.78 | step: 6.71 78%|███████▊ | 7814/10000 [12:18:18<3:21:50, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0062446692027151585, 'learning_rate': 4.806614279038822e-06, 'epoch': 7.81} 78%|███████▊ | 7814/10000 [12:18:18<3:21:50, 5.54s/it][2025-06-20 01:48:02,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:48:02,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.34 | bwd_microstep: 3326.18 | bwd_inner_microstep: 3325.40 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.54 [2025-06-20 01:48:02,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.34 | bwd: 3326.19 | bwd_inner: 3325.40 | bwd_allreduce: 0.75 | step: 6.55 78%|███████▊ | 7815/10000 [12:18:23<3:21:02, 5.52s/it] {'loss': 0.0244, 'grad_norm': 5.2816162109375, 'learning_rate': 4.802402688626204e-06, 'epoch': 7.81} 78%|███████▊ | 7815/10000 [12:18:23<3:21:02, 5.52s/it][2025-06-20 01:48:08,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:48:08,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.71 | bwd_microstep: 3370.36 | bwd_inner_microstep: 3369.33 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.12 [2025-06-20 01:48:08,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.71 | bwd: 3370.38 | bwd_inner: 3369.33 | bwd_allreduce: 1.00 | step: 7.12 78%|███████▊ | 7816/10000 [12:18:29<3:21:08, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.008937296457588673, 'learning_rate': 4.798192692371015e-06, 'epoch': 7.82} 78%|███████▊ | 7816/10000 [12:18:29<3:21:08, 5.53s/it][2025-06-20 01:48:13,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:48:13,944] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2108.81 | bwd_microstep: 3333.62 | bwd_inner_microstep: 3332.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-20 01:48:13,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.81 | bwd: 3333.63 | bwd_inner: 3332.81 | bwd_allreduce: 0.77 | step: 7.00 78%|███████▊ | 7817/10000 [12:18:34<3:20:34, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00027239733026362956, 'learning_rate': 4.793984290714866e-06, 'epoch': 7.82} 78%|███████▊ | 7817/10000 [12:18:34<3:20:34, 5.51s/it][2025-06-20 01:48:19,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:48:19,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.12 | bwd_microstep: 3385.20 | bwd_inner_microstep: 3384.30 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.95 [2025-06-20 01:48:19,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.12 | bwd: 3385.22 | bwd_inner: 3384.30 | bwd_allreduce: 0.87 | step: 6.95 78%|███████▊ | 7818/10000 [12:18:40<3:20:56, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.00010106662375619635, 'learning_rate': 4.789777484099185e-06, 'epoch': 7.82} 78%|███████▊ | 7818/10000 [12:18:40<3:20:56, 5.53s/it][2025-06-20 01:48:25,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:48:25,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.58 | bwd_microstep: 3383.25 | bwd_inner_microstep: 3382.11 | bwd_allreduce_microstep: 1.08 | step_microstep: 7.44 [2025-06-20 01:48:25,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.58 | bwd: 3383.29 | bwd_inner: 3382.11 | bwd_allreduce: 1.11 | step: 7.44 78%|███████▊ | 7819/10000 [12:18:45<3:21:17, 5.54s/it] {'loss': 0.0013, 'grad_norm': 0.3159259259700775, 'learning_rate': 4.785572272965253e-06, 'epoch': 7.82} 78%|███████▊ | 7819/10000 
[12:18:45<3:21:17, 5.54s/it][2025-06-20 01:48:30,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:48:30,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.38 | bwd_microstep: 3340.21 | bwd_inner_microstep: 3339.43 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.56 [2025-06-20 01:48:30,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.38 | bwd: 3340.22 | bwd_inner: 3339.43 | bwd_allreduce: 0.75 | step: 6.56 78%|███████▊ | 7820/10000 [12:18:51<3:20:47, 5.53s/it] {'loss': 0.005, 'grad_norm': 1.8689631223678589, 'learning_rate': 4.781368657754178e-06, 'epoch': 7.82} 78%|███████▊ | 7820/10000 [12:18:51<3:20:47, 5.53s/it][2025-06-20 01:48:36,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:48:36,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.33 | bwd_microstep: 3327.05 | bwd_inner_microstep: 3326.21 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.73 [2025-06-20 01:48:36,035] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.33 | bwd: 3327.06 | bwd_inner: 3326.21 | bwd_allreduce: 0.81 | step: 6.74 78%|███████▊ | 7821/10000 [12:18:56<3:20:04, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.01304591540247202, 'learning_rate': 4.777166638906898e-06, 'epoch': 7.82} 78%|███████▊ | 7821/10000 [12:18:56<3:20:04, 5.51s/it][2025-06-20 01:48:41,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:48:41,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.28 | bwd_microstep: 3332.51 | bwd_inner_microstep: 3331.59 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.92 [2025-06-20 01:48:41,528] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.28 | bwd: 
3332.52 | bwd_inner: 3331.59 | bwd_allreduce: 0.88 | step: 6.92 78%|███████▊ | 7822/10000 [12:19:02<3:19:48, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.009703448973596096, 'learning_rate': 4.772966216864192e-06, 'epoch': 7.82} 78%|███████▊ | 7822/10000 [12:19:02<3:19:48, 5.50s/it][2025-06-20 01:48:47,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:48:47,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.36 | bwd_microstep: 3340.73 | bwd_inner_microstep: 3339.77 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.32 [2025-06-20 01:48:47,022] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.36 | bwd: 3340.75 | bwd_inner: 3339.77 | bwd_allreduce: 0.92 | step: 7.33 78%|███████▊ | 7823/10000 [12:19:07<3:19:38, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005672217346727848, 'learning_rate': 4.768767392066653e-06, 'epoch': 7.82} 78%|███████▊ | 7823/10000 [12:19:07<3:19:38, 5.50s/it][2025-06-20 01:48:52,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:48:52,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.39 | bwd_microstep: 3324.87 | bwd_inner_microstep: 3324.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-20 01:48:52,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.39 | bwd: 3324.89 | bwd_inner: 3324.09 | bwd_allreduce: 0.76 | step: 6.75 78%|███████▊ | 7824/10000 [12:19:13<3:19:15, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00012571591651067138, 'learning_rate': 4.7645701649547246e-06, 'epoch': 7.82} 78%|███████▊ | 7824/10000 [12:19:13<3:19:15, 5.49s/it][2025-06-20 01:48:57,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:48:57,967] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2108.53 | bwd_microstep: 3320.61 | bwd_inner_microstep: 3319.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 01:48:57,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.53 | bwd: 3320.62 | bwd_inner: 3319.83 | bwd_allreduce: 0.75 | step: 6.67 78%|███████▊ | 7825/10000 [12:19:18<3:18:51, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00016171247989404947, 'learning_rate': 4.760374535968677e-06, 'epoch': 7.83} 78%|███████▊ | 7825/10000 [12:19:18<3:18:51, 5.49s/it][2025-06-20 01:49:03,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:49:03,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.58 | bwd_microstep: 3330.81 | bwd_inner_microstep: 3329.96 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.84 [2025-06-20 01:49:03,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.58 | bwd: 3330.83 | bwd_inner: 3329.96 | bwd_allreduce: 0.82 | step: 6.85 78%|███████▊ | 7826/10000 [12:19:24<3:18:44, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.009470927529036999, 'learning_rate': 4.756180505548611e-06, 'epoch': 7.83} 78%|███████▊ | 7826/10000 [12:19:24<3:18:44, 5.49s/it][2025-06-20 01:49:08,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:49:08,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.62 | bwd_microstep: 3331.26 | bwd_inner_microstep: 3330.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-20 01:49:08,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.62 | bwd: 3331.27 | bwd_inner: 3330.47 | bwd_allreduce: 0.76 | step: 6.76 78%|███████▊ | 7827/10000 [12:19:29<3:18:39, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005922859068959951, 'learning_rate': 4.751988074134466e-06, 'epoch': 7.83} 78%|███████▊ | 7827/10000 
[12:19:29<3:18:39, 5.49s/it][2025-06-20 01:49:14,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:49:14,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.18 | bwd_microstep: 3326.39 | bwd_inner_microstep: 3325.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 01:49:14,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.18 | bwd: 3326.40 | bwd_inner: 3325.60 | bwd_allreduce: 0.76 | step: 6.70 78%|███████▊ | 7828/10000 [12:19:35<3:18:27, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0003299613017588854, 'learning_rate': 4.7477972421659994e-06, 'epoch': 7.83} 78%|███████▊ | 7828/10000 [12:19:35<3:18:27, 5.48s/it][2025-06-20 01:49:19,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:49:19,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.74 | bwd_microstep: 3324.85 | bwd_inner_microstep: 3324.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 01:49:19,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.74 | bwd: 3324.86 | bwd_inner: 3324.06 | bwd_allreduce: 0.76 | step: 6.67 78%|███████▊ | 7829/10000 [12:19:40<3:18:13, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0005857788491994143, 'learning_rate': 4.743608010082812e-06, 'epoch': 7.83} 78%|███████▊ | 7829/10000 [12:19:40<3:18:13, 5.48s/it][2025-06-20 01:49:25,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:49:25,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.64 | bwd_microstep: 3378.84 | bwd_inner_microstep: 3377.78 | bwd_allreduce_microstep: 1.01 | step_microstep: 6.79 [2025-06-20 01:49:25,444] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.64 | bwd: 
3378.85 | bwd_inner: 3377.78 | bwd_allreduce: 1.02 | step: 6.79 78%|███████▊ | 7830/10000 [12:19:46<3:19:02, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.000299700943287462, 'learning_rate': 4.73942037832434e-06, 'epoch': 7.83} 78%|███████▊ | 7830/10000 [12:19:46<3:19:02, 5.50s/it][2025-06-20 01:49:30,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:49:30,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.58 | bwd_microstep: 3316.96 | bwd_inner_microstep: 3316.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 01:49:30,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.58 | bwd: 3316.97 | bwd_inner: 3316.17 | bwd_allreduce: 0.76 | step: 6.65 78%|███████▊ | 7831/10000 [12:19:51<3:18:31, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0005255861906334758, 'learning_rate': 4.7352343473298424e-06, 'epoch': 7.83} 78%|███████▊ | 7831/10000 [12:19:51<3:18:31, 5.49s/it][2025-06-20 01:49:36,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:49:36,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.83 | bwd_microstep: 3318.57 | bwd_inner_microstep: 3317.60 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.05 [2025-06-20 01:49:36,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.83 | bwd: 3318.59 | bwd_inner: 3317.60 | bwd_allreduce: 0.94 | step: 7.06 78%|███████▊ | 7832/10000 [12:19:57<3:18:11, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0014693301636725664, 'learning_rate': 4.731049917538415e-06, 'epoch': 7.83} 78%|███████▊ | 7832/10000 [12:19:57<3:18:11, 5.49s/it][2025-06-20 01:49:41,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:49:41,861] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2111.05 | bwd_microstep: 3323.31 | bwd_inner_microstep: 3322.46 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.08 [2025-06-20 01:49:41,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.05 | bwd: 3323.32 | bwd_inner: 3322.46 | bwd_allreduce: 0.82 | step: 7.09 78%|███████▊ | 7833/10000 [12:20:02<3:18:05, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.03683772310614586, 'learning_rate': 4.726867089388987e-06, 'epoch': 7.83} 78%|███████▊ | 7833/10000 [12:20:02<3:18:05, 5.48s/it][2025-06-20 01:49:47,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:49:47,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.76 | bwd_microstep: 3375.09 | bwd_inner_microstep: 3374.23 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.21 [2025-06-20 01:49:47,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.76 | bwd: 3375.11 | bwd_inner: 3374.23 | bwd_allreduce: 0.83 | step: 7.21 78%|███████▊ | 7834/10000 [12:20:08<3:18:39, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0008883282425813377, 'learning_rate': 4.7226858633203154e-06, 'epoch': 7.83} 78%|███████▊ | 7834/10000 [12:20:08<3:18:39, 5.50s/it][2025-06-20 01:49:52,883] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:49:52,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.24 | bwd_microstep: 3329.01 | bwd_inner_microstep: 3328.23 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 01:49:52,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.24 | bwd: 3329.03 | bwd_inner: 3328.23 | bwd_allreduce: 0.76 | step: 6.62 78%|███████▊ | 7835/10000 [12:20:13<3:18:16, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.08234044909477234, 'learning_rate': 4.718506239770995e-06, 'epoch': 7.83} 78%|███████▊ | 7835/10000 
[12:20:13<3:18:16, 5.50s/it][2025-06-20 01:49:58,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:49:58,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.44 | bwd_microstep: 3372.65 | bwd_inner_microstep: 3371.84 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.03 [2025-06-20 01:49:58,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.44 | bwd: 3372.67 | bwd_inner: 3371.84 | bwd_allreduce: 0.78 | step: 7.03 78%|███████▊ | 7836/10000 [12:20:19<3:18:42, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0025413446128368378, 'learning_rate': 4.714328219179445e-06, 'epoch': 7.84} 78%|███████▊ | 7836/10000 [12:20:19<3:18:42, 5.51s/it][2025-06-20 01:50:03,902] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:50:03,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.46 | bwd_microstep: 3311.95 | bwd_inner_microstep: 3310.87 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.95 [2025-06-20 01:50:03,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.46 | bwd: 3311.97 | bwd_inner: 3310.87 | bwd_allreduce: 1.03 | step: 7.95 78%|███████▊ | 7837/10000 [12:20:24<3:18:17, 5.50s/it] {'loss': 0.0007, 'grad_norm': 0.3015868365764618, 'learning_rate': 4.710151801983926e-06, 'epoch': 7.84} 78%|███████▊ | 7837/10000 [12:20:24<3:18:17, 5.50s/it][2025-06-20 01:50:09,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:50:09,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.72 | bwd_microstep: 3358.61 | bwd_inner_microstep: 3357.73 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.15 [2025-06-20 01:50:09,436] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.72 | bwd: 
3358.64 | bwd_inner: 3357.73 | bwd_allreduce: 0.84 | step: 7.16 78%|███████▊ | 7838/10000 [12:20:30<3:18:33, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0029021690133959055, 'learning_rate': 4.705976988622518e-06, 'epoch': 7.84} 78%|███████▊ | 7838/10000 [12:20:30<3:18:33, 5.51s/it][2025-06-20 01:50:14,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:50:14,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.67 | bwd_microstep: 3310.53 | bwd_inner_microstep: 3309.72 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-20 01:50:14,897] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.67 | bwd: 3310.54 | bwd_inner: 3309.72 | bwd_allreduce: 0.77 | step: 6.79 78%|███████▊ | 7839/10000 [12:20:35<3:17:54, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002408405300229788, 'learning_rate': 4.70180377953314e-06, 'epoch': 7.84} 78%|███████▊ | 7839/10000 [12:20:35<3:17:54, 5.49s/it][2025-06-20 01:50:20,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:50:20,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.23 | bwd_microstep: 3367.94 | bwd_inner_microstep: 3366.93 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.22 [2025-06-20 01:50:20,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.23 | bwd: 3367.95 | bwd_inner: 3366.93 | bwd_allreduce: 0.98 | step: 7.22 78%|███████▊ | 7840/10000 [12:20:41<3:18:16, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.004094467032700777, 'learning_rate': 4.697632175153543e-06, 'epoch': 7.84} 78%|███████▊ | 7840/10000 [12:20:41<3:18:16, 5.51s/it][2025-06-20 01:50:25,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:50:25,894] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2108.12 | bwd_microstep: 3310.57 | bwd_inner_microstep: 3309.76 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-20 01:50:25,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.13 | bwd: 3310.58 | bwd_inner: 3309.76 | bwd_allreduce: 0.78 | step: 6.80 78%|███████▊ | 7841/10000 [12:20:46<3:17:39, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004252373240888119, 'learning_rate': 4.693462175921313e-06, 'epoch': 7.84} 78%|███████▊ | 7841/10000 [12:20:46<3:17:39, 5.49s/it][2025-06-20 01:50:31,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-20 01:50:31,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.54 | bwd_microstep: 3311.96 | bwd_inner_microstep: 3311.13 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.31 [2025-06-20 01:50:31,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.54 | bwd: 3311.97 | bwd_inner: 3311.13 | bwd_allreduce: 0.80 | step: 7.31 78%|███████▊ | 7842/10000 [12:20:52<3:17:14, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.012602357193827629, 'learning_rate': 4.689293782273867e-06, 'epoch': 7.84} 78%|███████▊ | 7842/10000 [12:20:52<3:17:14, 5.48s/it][2025-06-20 01:50:36,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.77 [2025-06-20 01:50:36,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.01 | bwd_microstep: 3312.35 | bwd_inner_microstep: 3311.50 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.74 [2025-06-20 01:50:36,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.01 | bwd: 3312.36 | bwd_inner: 3311.50 | bwd_allreduce: 0.82 | step: 6.75 78%|███████▊ | 7843/10000 [12:20:57<3:16:52, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00033993087708950043, 'learning_rate': 4.6851269946484365e-06, 'epoch': 7.84} 78%|███████▊ | 7843/10000 
[12:20:57<3:16:52, 5.48s/it][2025-06-20 01:50:42,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:50:42,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.20 | bwd_microstep: 3309.73 | bwd_inner_microstep: 3308.76 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.11 [2025-06-20 01:50:42,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.20 | bwd: 3309.74 | bwd_inner: 3308.76 | bwd_allreduce: 0.93 | step: 7.12 78%|███████▊ | 7844/10000 [12:21:03<3:16:33, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0016699339030310512, 'learning_rate': 4.680961813482106e-06, 'epoch': 7.84} 78%|███████▊ | 7844/10000 [12:21:03<3:16:33, 5.47s/it][2025-06-20 01:50:47,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:50:47,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.15 | bwd_microstep: 3319.42 | bwd_inner_microstep: 3318.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 01:50:47,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.15 | bwd: 3319.43 | bwd_inner: 3318.63 | bwd_allreduce: 0.76 | step: 6.81 78%|███████▊ | 7845/10000 [12:21:08<3:16:29, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.01690182276070118, 'learning_rate': 4.6767982392117836e-06, 'epoch': 7.84} 78%|███████▊ | 7845/10000 [12:21:08<3:16:29, 5.47s/it][2025-06-20 01:50:53,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:50:53,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.03 | bwd_microstep: 3369.24 | bwd_inner_microstep: 3368.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-20 01:50:53,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.03 | bwd: 
3369.25 | bwd_inner: 3368.43 | bwd_allreduce: 0.78 | step: 7.04 78%|███████▊ | 7846/10000 [12:21:14<3:17:13, 5.49s/it] {'loss': 0.0006, 'grad_norm': 0.21854378283023834, 'learning_rate': 4.672636272274209e-06, 'epoch': 7.85} 78%|███████▊ | 7846/10000 [12:21:14<3:17:13, 5.49s/it][2025-06-20 01:50:58,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:50:58,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.37 | bwd_microstep: 3314.97 | bwd_inner_microstep: 3314.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.29 [2025-06-20 01:50:58,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.37 | bwd: 3314.99 | bwd_inner: 3314.17 | bwd_allreduce: 0.77 | step: 7.30 78%|███████▊ | 7847/10000 [12:21:19<3:16:46, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0001477095065638423, 'learning_rate': 4.668475913105958e-06, 'epoch': 7.85} 78%|███████▊ | 7847/10000 [12:21:19<3:16:46, 5.48s/it][2025-06-20 01:51:04,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:51:04,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.65 | bwd_microstep: 3369.94 | bwd_inner_microstep: 3369.01 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.18 [2025-06-20 01:51:04,290] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.65 | bwd: 3369.96 | bwd_inner: 3369.01 | bwd_allreduce: 0.90 | step: 7.18 78%|███████▊ | 7848/10000 [12:21:25<3:17:16, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.03517480194568634, 'learning_rate': 4.664317162143424e-06, 'epoch': 7.85} 78%|███████▊ | 7848/10000 [12:21:25<3:17:16, 5.50s/it][2025-06-20 01:51:09,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:51:09,832] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2128.60 | bwd_microstep: 3374.52 | bwd_inner_microstep: 3373.71 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-20 01:51:09,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.60 | bwd: 3374.53 | bwd_inner: 3373.71 | bwd_allreduce: 0.78 | step: 7.06 78%|███████▊ | 7849/10000 [12:21:30<3:17:37, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00029591337079182267, 'learning_rate': 4.660160019822837e-06, 'epoch': 7.85} 78%|███████▊ | 7849/10000 [12:21:30<3:17:37, 5.51s/it][2025-06-20 01:51:15,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:51:15,302] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.88 | bwd_microstep: 3320.44 | bwd_inner_microstep: 3319.61 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.18 [2025-06-20 01:51:15,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.88 | bwd: 3320.45 | bwd_inner: 3319.61 | bwd_allreduce: 0.79 | step: 7.18 78%|███████▊ | 7850/10000 [12:21:36<3:17:05, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.502718985080719, 'learning_rate': 4.656004486580276e-06, 'epoch': 7.85} 78%|███████▊ | 7850/10000 [12:21:36<3:17:05, 5.50s/it][2025-06-20 01:51:20,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:51:20,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.71 | bwd_microstep: 3322.58 | bwd_inner_microstep: 3321.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-20 01:51:20,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.71 | bwd: 3322.60 | bwd_inner: 3321.80 | bwd_allreduce: 0.76 | step: 6.75 79%|███████▊ | 7851/10000 [12:21:41<3:16:37, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002466740319505334, 'learning_rate': 4.651850562851632e-06, 'epoch': 7.85} 79%|███████▊ | 7851/10000 
[12:21:41<3:16:37, 5.49s/it][2025-06-20 01:51:26,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:51:26,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2160.00 | bwd_microstep: 3308.52 | bwd_inner_microstep: 3307.58 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.13 [2025-06-20 01:51:26,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2160.00 | bwd: 3308.53 | bwd_inner: 3307.58 | bwd_allreduce: 0.90 | step: 7.13 79%|███████▊ | 7852/10000 [12:21:47<3:16:42, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.02566213719546795, 'learning_rate': 4.64769824907263e-06, 'epoch': 7.85} 79%|███████▊ | 7852/10000 [12:21:47<3:16:42, 5.49s/it][2025-06-20 01:51:31,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.88 [2025-06-20 01:51:31,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.16 | bwd_microstep: 3372.69 | bwd_inner_microstep: 3371.64 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.62 [2025-06-20 01:51:31,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.16 | bwd: 3372.71 | bwd_inner: 3371.64 | bwd_allreduce: 1.01 | step: 7.62 79%|███████▊ | 7853/10000 [12:21:52<3:17:09, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.000126872313558124, 'learning_rate': 4.643547545678832e-06, 'epoch': 7.85} 79%|███████▊ | 7853/10000 [12:21:52<3:17:09, 5.51s/it][2025-06-20 01:51:37,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:51:37,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.67 | bwd_microstep: 3319.00 | bwd_inner_microstep: 3318.18 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.14 [2025-06-20 01:51:37,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.67 | bwd: 
3319.01 | bwd_inner: 3318.18 | bwd_allreduce: 0.79 | step: 7.14 79%|███████▊ | 7854/10000 [12:21:58<3:16:36, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0025591396261006594, 'learning_rate': 4.6393984531056235e-06, 'epoch': 7.85} 79%|███████▊ | 7854/10000 [12:21:58<3:16:36, 5.50s/it][2025-06-20 01:51:42,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:51:42,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.10 | bwd_microstep: 3373.73 | bwd_inner_microstep: 3372.68 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.65 [2025-06-20 01:51:42,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.10 | bwd: 3373.75 | bwd_inner: 3372.68 | bwd_allreduce: 1.02 | step: 7.66 79%|███████▊ | 7855/10000 [12:22:03<3:17:06, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0002493089414201677, 'learning_rate': 4.635250971788226e-06, 'epoch': 7.86} 79%|███████▊ | 7855/10000 [12:22:03<3:17:06, 5.51s/it][2025-06-20 01:51:48,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:51:48,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.02 | bwd_microstep: 3312.88 | bwd_inner_microstep: 3311.77 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.80 [2025-06-20 01:51:48,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.02 | bwd: 3312.90 | bwd_inner: 3311.77 | bwd_allreduce: 1.08 | step: 7.81 79%|███████▊ | 7856/10000 [12:22:09<3:16:25, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.13630647957324982, 'learning_rate': 4.631105102161691e-06, 'epoch': 7.86} 79%|███████▊ | 7856/10000 [12:22:09<3:16:25, 5.50s/it][2025-06-20 01:51:53,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:51:53,743] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2096.44 | bwd_microstep: 3308.26 | bwd_inner_microstep: 3307.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-20 01:51:53,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.44 | bwd: 3308.27 | bwd_inner: 3307.47 | bwd_allreduce: 0.76 | step: 6.96 79%|███████▊ | 7857/10000 [12:22:14<3:15:46, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0001537541684228927, 'learning_rate': 4.6269608446609e-06, 'epoch': 7.86} 79%|███████▊ | 7857/10000 [12:22:14<3:15:46, 5.48s/it][2025-06-20 01:51:59,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:51:59,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.51 | bwd_microstep: 3318.24 | bwd_inner_microstep: 3317.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-20 01:51:59,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.51 | bwd: 3318.25 | bwd_inner: 3317.43 | bwd_allreduce: 0.78 | step: 7.12 79%|███████▊ | 7858/10000 [12:22:19<3:15:24, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004301492124795914, 'learning_rate': 4.622818199720576e-06, 'epoch': 7.86} 79%|███████▊ | 7858/10000 [12:22:19<3:15:24, 5.47s/it][2025-06-20 01:52:04,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:52:04,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.74 | bwd_microstep: 3320.96 | bwd_inner_microstep: 3320.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 01:52:04,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.74 | bwd: 3320.97 | bwd_inner: 3320.17 | bwd_allreduce: 0.76 | step: 6.80 79%|███████▊ | 7859/10000 [12:22:25<3:15:11, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0003991924459114671, 'learning_rate': 4.618677167775247e-06, 'epoch': 7.86} 79%|███████▊ | 7859/10000 
[12:22:25<3:15:11, 5.47s/it][2025-06-20 01:52:10,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 01:52:10,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.47 | bwd_microstep: 3324.69 | bwd_inner_microstep: 3323.73 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.67 [2025-06-20 01:52:10,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.47 | bwd: 3324.70 | bwd_inner: 3323.73 | bwd_allreduce: 0.93 | step: 7.67 79%|███████▊ | 7860/10000 [12:22:30<3:15:11, 5.47s/it] {'loss': 0.0004, 'grad_norm': 0.178672656416893, 'learning_rate': 4.614537749259298e-06, 'epoch': 7.86} 79%|███████▊ | 7860/10000 [12:22:30<3:15:11, 5.47s/it][2025-06-20 01:52:15,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:52:15,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.07 | bwd_microstep: 3362.05 | bwd_inner_microstep: 3361.22 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.22 [2025-06-20 01:52:15,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.07 | bwd: 3362.07 | bwd_inner: 3361.23 | bwd_allreduce: 0.80 | step: 7.22 79%|███████▊ | 7861/10000 [12:22:36<3:15:40, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00023365142988041043, 'learning_rate': 4.610399944606931e-06, 'epoch': 7.86} 79%|███████▊ | 7861/10000 [12:22:36<3:15:40, 5.49s/it][2025-06-20 01:52:21,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:52:21,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.90 | bwd_microstep: 3309.66 | bwd_inner_microstep: 3308.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 01:52:21,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.90 | bwd: 
3309.67 | bwd_inner: 3308.87 | bwd_allreduce: 0.76 | step: 6.60 79%|███████▊ | 7862/10000 [12:22:41<3:15:09, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0016615589847788215, 'learning_rate': 4.606263754252187e-06, 'epoch': 7.86} 79%|███████▊ | 7862/10000 [12:22:41<3:15:09, 5.48s/it][2025-06-20 01:52:26,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:52:26,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.24 | bwd_microstep: 3310.21 | bwd_inner_microstep: 3309.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 01:52:26,568] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.24 | bwd: 3310.22 | bwd_inner: 3309.42 | bwd_allreduce: 0.76 | step: 6.64 79%|███████▊ | 7863/10000 [12:22:47<3:14:49, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0009006761247292161, 'learning_rate': 4.602129178628933e-06, 'epoch': 7.86} 79%|███████▊ | 7863/10000 [12:22:47<3:14:49, 5.47s/it][2025-06-20 01:52:32,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 01:52:32,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.15 | bwd_microstep: 3361.93 | bwd_inner_microstep: 3360.82 | bwd_allreduce_microstep: 1.04 | step_microstep: 8.06 [2025-06-20 01:52:32,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.15 | bwd: 3361.95 | bwd_inner: 3360.82 | bwd_allreduce: 1.08 | step: 8.06 79%|███████▊ | 7864/10000 [12:22:52<3:15:22, 5.49s/it] {'loss': 0.0079, 'grad_norm': 3.1443347930908203, 'learning_rate': 4.597996218170861e-06, 'epoch': 7.86} 79%|███████▊ | 7864/10000 [12:22:52<3:15:22, 5.49s/it][2025-06-20 01:52:37,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:52:37,639] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2126.79 | bwd_microstep: 3368.71 | bwd_inner_microstep: 3367.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.25 [2025-06-20 01:52:37,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.79 | bwd: 3368.72 | bwd_inner: 3367.90 | bwd_allreduce: 0.78 | step: 7.25 79%|███████▊ | 7865/10000 [12:22:58<3:15:51, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.5731902718544006, 'learning_rate': 4.593864873311504e-06, 'epoch': 7.87} 79%|███████▊ | 7865/10000 [12:22:58<3:15:51, 5.50s/it][2025-06-20 01:52:43,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:52:43,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.65 | bwd_microstep: 3316.40 | bwd_inner_microstep: 3315.59 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.07 [2025-06-20 01:52:43,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.65 | bwd: 3316.42 | bwd_inner: 3315.59 | bwd_allreduce: 0.78 | step: 7.07 79%|███████▊ | 7866/10000 [12:23:03<3:15:14, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0004915675381198525, 'learning_rate': 4.589735144484217e-06, 'epoch': 7.87} 79%|███████▊ | 7866/10000 [12:23:03<3:15:14, 5.49s/it][2025-06-20 01:52:48,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:52:48,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.45 | bwd_microstep: 3365.31 | bwd_inner_microstep: 3364.34 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.40 [2025-06-20 01:52:48,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.45 | bwd: 3365.32 | bwd_inner: 3364.34 | bwd_allreduce: 0.94 | step: 7.40 79%|███████▊ | 7867/10000 [12:23:09<3:15:40, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.027780093252658844, 'learning_rate': 4.5856070321221945e-06, 'epoch': 7.87} 79%|███████▊ | 7867/10000 
[12:23:09<3:15:40, 5.50s/it][2025-06-20 01:52:54,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:52:54,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.27 | bwd_microstep: 3355.14 | bwd_inner_microstep: 3354.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 01:52:54,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.27 | bwd: 3355.15 | bwd_inner: 3354.35 | bwd_allreduce: 0.76 | step: 6.65 79%|███████▊ | 7868/10000 [12:23:14<3:15:49, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0010240285191684961, 'learning_rate': 4.5814805366584536e-06, 'epoch': 7.87} 79%|███████▊ | 7868/10000 [12:23:14<3:15:49, 5.51s/it][2025-06-20 01:52:59,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:52:59,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.31 | bwd_microstep: 3367.36 | bwd_inner_microstep: 3366.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 01:52:59,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.31 | bwd: 3367.37 | bwd_inner: 3366.57 | bwd_allreduce: 0.76 | step: 6.73 79%|███████▊ | 7869/10000 [12:23:20<3:16:00, 5.52s/it] {'loss': 0.0, 'grad_norm': 9.969189704861492e-05, 'learning_rate': 4.5773556585258434e-06, 'epoch': 7.87} 79%|███████▊ | 7869/10000 [12:23:20<3:16:00, 5.52s/it][2025-06-20 01:53:05,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:53:05,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.69 | bwd_microstep: 3309.63 | bwd_inner_microstep: 3308.79 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.39 [2025-06-20 01:53:05,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.69 | bwd: 
3309.66 | bwd_inner: 3308.79 | bwd_allreduce: 0.81 | step: 7.39 79%|███████▊ | 7870/10000 [12:23:25<3:15:12, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.027944523841142654, 'learning_rate': 4.5732323981570455e-06, 'epoch': 7.87} 79%|███████▊ | 7870/10000 [12:23:25<3:15:12, 5.50s/it][2025-06-20 01:53:10,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:53:10,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.42 | bwd_microstep: 3364.08 | bwd_inner_microstep: 3363.10 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.16 [2025-06-20 01:53:10,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.42 | bwd: 3364.09 | bwd_inner: 3363.10 | bwd_allreduce: 0.94 | step: 7.17 79%|███████▊ | 7871/10000 [12:23:31<3:15:24, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0011666568461805582, 'learning_rate': 4.5691107559845735e-06, 'epoch': 7.87} 79%|███████▊ | 7871/10000 [12:23:31<3:15:24, 5.51s/it][2025-06-20 01:53:16,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:53:16,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.76 | bwd_microstep: 3313.18 | bwd_inner_microstep: 3312.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-20 01:53:16,131] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.76 | bwd: 3313.20 | bwd_inner: 3312.39 | bwd_allreduce: 0.77 | step: 6.90 79%|███████▊ | 7872/10000 [12:23:36<3:14:46, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002222213661298156, 'learning_rate': 4.5649907324407665e-06, 'epoch': 7.87} 79%|███████▊ | 7872/10000 [12:23:36<3:14:46, 5.49s/it][2025-06-20 01:53:21,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 01:53:21,682] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2140.59 | bwd_microstep: 3371.33 | bwd_inner_microstep: 3370.34 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.53 [2025-06-20 01:53:21,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.59 | bwd: 3371.35 | bwd_inner: 3370.34 | bwd_allreduce: 0.94 | step: 7.53 79%|███████▊ | 7873/10000 [12:23:42<3:15:20, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0002912537020165473, 'learning_rate': 4.560872327957799e-06, 'epoch': 7.87} 79%|███████▊ | 7873/10000 [12:23:42<3:15:20, 5.51s/it][2025-06-20 01:53:27,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-20 01:53:27,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.57 | bwd_microstep: 3312.55 | bwd_inner_microstep: 3311.77 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.82 [2025-06-20 01:53:27,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.57 | bwd: 3312.56 | bwd_inner: 3311.77 | bwd_allreduce: 0.75 | step: 6.83 79%|███████▊ | 7874/10000 [12:23:47<3:14:40, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0014044344425201416, 'learning_rate': 4.556755542967663e-06, 'epoch': 7.87} 79%|███████▊ | 7874/10000 [12:23:47<3:14:40, 5.49s/it][2025-06-20 01:53:32,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:53:32,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.68 | bwd_microstep: 3316.92 | bwd_inner_microstep: 3316.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 01:53:32,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.68 | bwd: 3316.93 | bwd_inner: 3316.13 | bwd_allreduce: 0.76 | step: 6.64 79%|███████▉ | 7875/10000 [12:23:53<3:14:11, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.005796272307634354, 'learning_rate': 4.552640377902197e-06, 'epoch': 7.88} 79%|███████▉ | 
7875/10000 [12:23:53<3:14:11, 5.48s/it][2025-06-20 01:53:38,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 01:53:38,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.85 | bwd_microstep: 3374.91 | bwd_inner_microstep: 3374.12 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 01:53:38,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.84 | bwd: 3374.92 | bwd_inner: 3374.12 | bwd_allreduce: 0.75 | step: 6.55 79%|███████▉ | 7876/10000 [12:23:58<3:14:40, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00458662211894989, 'learning_rate': 4.548526833193063e-06, 'epoch': 7.88} 79%|███████▉ | 7876/10000 [12:23:58<3:14:40, 5.50s/it][2025-06-20 01:53:43,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:53:43,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.69 | bwd_microstep: 3359.47 | bwd_inner_microstep: 3358.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-20 01:53:43,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.70 | bwd: 3359.49 | bwd_inner: 3358.67 | bwd_allreduce: 0.78 | step: 7.01 79%|███████▉ | 7877/10000 [12:24:04<3:14:50, 5.51s/it] {'loss': 0.0, 'grad_norm': 8.635764061182272e-06, 'learning_rate': 4.5444149092717526e-06, 'epoch': 7.88} 79%|███████▉ | 7877/10000 [12:24:04<3:14:50, 5.51s/it][2025-06-20 01:53:49,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:53:49,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.39 | bwd_microstep: 3315.70 | bwd_inner_microstep: 3314.83 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.95 [2025-06-20 01:53:49,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.39 
| bwd: 3315.71 | bwd_inner: 3314.83 | bwd_allreduce: 0.84 | step: 6.96 79%|███████▉ | 7878/10000 [12:24:09<3:14:12, 5.49s/it] {'loss': 0.0032, 'grad_norm': 0.5921454429626465, 'learning_rate': 4.54030460656959e-06, 'epoch': 7.88} 79%|███████▉ | 7878/10000 [12:24:09<3:14:12, 5.49s/it][2025-06-20 01:53:54,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:53:54,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.42 | bwd_microstep: 3305.81 | bwd_inner_microstep: 3305.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 01:53:54,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.42 | bwd: 3305.82 | bwd_inner: 3305.02 | bwd_allreduce: 0.76 | step: 6.63 79%|███████▉ | 7879/10000 [12:24:15<3:13:40, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.000528796692378819, 'learning_rate': 4.536195925517719e-06, 'epoch': 7.88} 79%|███████▉ | 7879/10000 [12:24:15<3:13:40, 5.48s/it][2025-06-20 01:54:00,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:54:00,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.18 | bwd_microstep: 3317.55 | bwd_inner_microstep: 3316.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 01:54:00,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.18 | bwd: 3317.57 | bwd_inner: 3316.76 | bwd_allreduce: 0.76 | step: 6.75 79%|███████▉ | 7880/10000 [12:24:20<3:13:22, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00042057843529619277, 'learning_rate': 4.532088866547124e-06, 'epoch': 7.88} 79%|███████▉ | 7880/10000 [12:24:20<3:13:22, 5.47s/it][2025-06-20 01:54:05,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-20 01:54:05,484] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2105.34 | bwd_microstep: 3315.22 | bwd_inner_microstep: 3314.36 | bwd_allreduce_microstep: 0.81 | step_microstep: 8.09 [2025-06-20 01:54:05,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.34 | bwd: 3315.24 | bwd_inner: 3314.36 | bwd_allreduce: 0.83 | step: 8.10 79%|███████▉ | 7881/10000 [12:24:26<3:13:11, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0009463964379392564, 'learning_rate': 4.527983430088621e-06, 'epoch': 7.88} 79%|███████▉ | 7881/10000 [12:24:26<3:13:11, 5.47s/it][2025-06-20 01:54:11,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:54:11,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.53 | bwd_microstep: 3365.63 | bwd_inner_microstep: 3364.66 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.08 [2025-06-20 01:54:11,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.53 | bwd: 3365.64 | bwd_inner: 3364.66 | bwd_allreduce: 0.93 | step: 7.08 79%|███████▉ | 7882/10000 [12:24:31<3:13:43, 5.49s/it] {'loss': 0.2001, 'grad_norm': 12.430000305175781, 'learning_rate': 4.5238796165728464e-06, 'epoch': 7.88} 79%|███████▉ | 7882/10000 [12:24:31<3:13:43, 5.49s/it][2025-06-20 01:54:16,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:54:16,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.36 | bwd_microstep: 3366.18 | bwd_inner_microstep: 3365.20 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.97 [2025-06-20 01:54:16,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.36 | bwd: 3366.20 | bwd_inner: 3365.20 | bwd_allreduce: 0.95 | step: 7.97 79%|███████▉ | 7883/10000 [12:24:37<3:14:08, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01696874015033245, 'learning_rate': 4.519777426430279e-06, 'epoch': 7.88} 79%|███████▉ | 
7883/10000 [12:24:37<3:14:08, 5.50s/it][2025-06-20 01:54:22,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:54:22,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.04 | bwd_microstep: 3307.54 | bwd_inner_microstep: 3306.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.75 [2025-06-20 01:54:22,008] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.04 | bwd: 3307.55 | bwd_inner: 3306.76 | bwd_allreduce: 0.75 | step: 6.76 79%|███████▉ | 7884/10000 [12:24:42<3:13:32, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0017941524274647236, 'learning_rate': 4.515676860091203e-06, 'epoch': 7.88} 79%|███████▉ | 7884/10000 [12:24:42<3:13:32, 5.49s/it][2025-06-20 01:54:27,475] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 01:54:27,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.34 | bwd_microstep: 3315.65 | bwd_inner_microstep: 3314.53 | bwd_allreduce_microstep: 1.05 | step_microstep: 8.00 [2025-06-20 01:54:27,476] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.35 | bwd: 3315.67 | bwd_inner: 3314.53 | bwd_allreduce: 1.08 | step: 8.00 79%|███████▉ | 7885/10000 [12:24:48<3:13:18, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0001605805300641805, 'learning_rate': 4.511577917985763e-06, 'epoch': 7.88} 79%|███████▉ | 7885/10000 [12:24:48<3:13:18, 5.48s/it][2025-06-20 01:54:33,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:54:33,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.15 | bwd_microstep: 3364.78 | bwd_inner_microstep: 3363.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-20 01:54:33,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2127.15 | bwd: 3364.79 | bwd_inner: 3363.98 | bwd_allreduce: 0.77 | step: 6.92 79%|███████▉ | 7886/10000 [12:24:53<3:13:51, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.004825400188565254, 'learning_rate': 4.507480600543918e-06, 'epoch': 7.89} 79%|███████▉ | 7886/10000 [12:24:53<3:13:51, 5.50s/it][2025-06-20 01:54:38,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 01:54:38,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.21 | bwd_microstep: 3316.49 | bwd_inner_microstep: 3315.56 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.16 [2025-06-20 01:54:38,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.21 | bwd: 3316.51 | bwd_inner: 3315.56 | bwd_allreduce: 0.89 | step: 7.16 79%|███████▉ | 7887/10000 [12:24:59<3:13:21, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.09862257540225983, 'learning_rate': 4.5033849081954535e-06, 'epoch': 7.89} 79%|███████▉ | 7887/10000 [12:24:59<3:13:21, 5.49s/it][2025-06-20 01:54:43,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:54:43,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.86 | bwd_microstep: 3321.15 | bwd_inner_microstep: 3320.36 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-20 01:54:43,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.87 | bwd: 3321.17 | bwd_inner: 3320.36 | bwd_allreduce: 0.76 | step: 6.84 79%|███████▉ | 7888/10000 [12:25:04<3:13:05, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00047600403195247054, 'learning_rate': 4.499290841369996e-06, 'epoch': 7.89} 79%|███████▉ | 7888/10000 [12:25:04<3:13:05, 5.49s/it][2025-06-20 01:54:49,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:54:49,505] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.91 | bwd_microstep: 3369.98 | bwd_inner_microstep: 3369.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 01:54:49,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.91 | bwd: 3369.99 | bwd_inner: 3369.20 | bwd_allreduce: 0.75 | step: 6.61 79%|███████▉ | 7889/10000 [12:25:10<3:13:34, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.015966305509209633, 'learning_rate': 4.495198400496983e-06, 'epoch': 7.89} 79%|███████▉ | 7889/10000 [12:25:10<3:13:34, 5.50s/it][2025-06-20 01:54:54,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:54:54,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.37 | bwd_microstep: 3310.53 | bwd_inner_microstep: 3309.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.10 [2025-06-20 01:54:54,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.37 | bwd: 3310.54 | bwd_inner: 3309.74 | bwd_allreduce: 0.76 | step: 7.10 79%|███████▉ | 7890/10000 [12:25:15<3:12:55, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0010752883972600102, 'learning_rate': 4.491107586005698e-06, 'epoch': 7.89} 79%|███████▉ | 7890/10000 [12:25:15<3:12:55, 5.49s/it][2025-06-20 01:55:00,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-20 01:55:00,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.44 | bwd_microstep: 3320.65 | bwd_inner_microstep: 3319.66 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.81 [2025-06-20 01:55:00,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.44 | bwd: 3320.67 | bwd_inner: 3319.66 | bwd_allreduce: 0.95 | step: 7.82 79%|███████▉ | 7891/10000 [12:25:21<3:12:41, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0008093701326288283, 'learning_rate': 4.487018398325247e-06, 'epoch': 
7.89} 79%|███████▉ | 7891/10000 [12:25:21<3:12:41, 5.48s/it][2025-06-20 01:55:05,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:55:05,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.90 | bwd_microstep: 3326.07 | bwd_inner_microstep: 3325.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 01:55:05,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.90 | bwd: 3326.08 | bwd_inner: 3325.27 | bwd_allreduce: 0.77 | step: 6.81 79%|███████▉ | 7892/10000 [12:25:26<3:12:33, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0009508470539003611, 'learning_rate': 4.482930837884569e-06, 'epoch': 7.89} 79%|███████▉ | 7892/10000 [12:25:26<3:12:33, 5.48s/it][2025-06-20 01:55:11,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:55:11,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.32 | bwd_microstep: 3321.77 | bwd_inner_microstep: 3320.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 01:55:11,376] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.32 | bwd: 3321.78 | bwd_inner: 3320.98 | bwd_allreduce: 0.76 | step: 6.73 79%|███████▉ | 7893/10000 [12:25:32<3:12:22, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.006718575954437256, 'learning_rate': 4.478844905112436e-06, 'epoch': 7.89} 79%|███████▉ | 7893/10000 [12:25:32<3:12:22, 5.48s/it][2025-06-20 01:55:16,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:55:16,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.41 | bwd_microstep: 3312.93 | bwd_inner_microstep: 3312.07 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.46 [2025-06-20 01:55:16,833] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2105.41 | bwd: 3312.95 | bwd_inner: 3312.07 | bwd_allreduce: 0.82 | step: 7.47 79%|███████▉ | 7894/10000 [12:25:37<3:12:05, 5.47s/it] {'loss': 0.0328, 'grad_norm': 5.691018104553223, 'learning_rate': 4.474760600437429e-06, 'epoch': 7.89} 79%|███████▉ | 7894/10000 [12:25:37<3:12:05, 5.47s/it][2025-06-20 01:55:22,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 01:55:22,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.84 | bwd_microstep: 3326.39 | bwd_inner_microstep: 3325.42 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.16 [2025-06-20 01:55:22,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.84 | bwd: 3326.40 | bwd_inner: 3325.42 | bwd_allreduce: 0.94 | step: 7.16 79%|███████▉ | 7895/10000 [12:25:43<3:11:57, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.019426288083195686, 'learning_rate': 4.47067792428798e-06, 'epoch': 7.89} 79%|███████▉ | 7895/10000 [12:25:43<3:11:57, 5.47s/it][2025-06-20 01:55:27,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:55:27,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.62 | bwd_microstep: 3324.31 | bwd_inner_microstep: 3323.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-20 01:55:27,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.62 | bwd: 3324.33 | bwd_inner: 3323.52 | bwd_allreduce: 0.77 | step: 6.66 79%|███████▉ | 7896/10000 [12:25:48<3:11:53, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00021325495617929846, 'learning_rate': 4.466596877092344e-06, 'epoch': 7.9} 79%|███████▉ | 7896/10000 [12:25:48<3:11:53, 5.47s/it][2025-06-20 01:55:33,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:55:33,309] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.67 | bwd_microstep: 3365.12 | bwd_inner_microstep: 3364.26 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.31 [2025-06-20 01:55:33,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.67 | bwd: 3365.14 | bwd_inner: 3364.26 | bwd_allreduce: 0.82 | step: 7.32 79%|███████▉ | 7897/10000 [12:25:54<3:12:26, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006847850512713194, 'learning_rate': 4.462517459278601e-06, 'epoch': 7.9} 79%|███████▉ | 7897/10000 [12:25:54<3:12:26, 5.49s/it][2025-06-20 01:55:38,783] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 01:55:38,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.37 | bwd_microstep: 3328.94 | bwd_inner_microstep: 3327.88 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.48 [2025-06-20 01:55:38,784] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.37 | bwd: 3328.96 | bwd_inner: 3327.88 | bwd_allreduce: 1.03 | step: 7.48 79%|███████▉ | 7898/10000 [12:25:59<3:12:12, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00018424754671286792, 'learning_rate': 4.4584396712746724e-06, 'epoch': 7.9} 79%|███████▉ | 7898/10000 [12:25:59<3:12:12, 5.49s/it][2025-06-20 01:55:44,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:55:44,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.78 | bwd_microstep: 3317.79 | bwd_inner_microstep: 3316.99 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-20 01:55:44,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.78 | bwd: 3317.81 | bwd_inner: 3316.99 | bwd_allreduce: 0.78 | step: 6.95 79%|███████▉ | 7899/10000 [12:26:05<3:11:56, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.01425243355333805, 'learning_rate': 4.454363513508286e-06, 'epoch': 
7.9} 79%|███████▉ | 7899/10000 [12:26:05<3:11:56, 5.48s/it][2025-06-20 01:55:49,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:55:49,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.61 | bwd_microstep: 3318.95 | bwd_inner_microstep: 3318.09 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.92 [2025-06-20 01:55:49,726] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.61 | bwd: 3318.96 | bwd_inner: 3318.09 | bwd_allreduce: 0.83 | step: 6.92 79%|███████▉ | 7900/10000 [12:26:10<3:11:41, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0004961953382007778, 'learning_rate': 4.450288986407019e-06, 'epoch': 7.9} 79%|███████▉ | 7900/10000 [12:26:10<3:11:41, 5.48s/it][2025-06-20 01:55:55,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:55:55,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.35 | bwd_microstep: 3325.67 | bwd_inner_microstep: 3324.87 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-20 01:55:55,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.35 | bwd: 3325.68 | bwd_inner: 3324.87 | bwd_allreduce: 0.77 | step: 6.84 79%|███████▉ | 7901/10000 [12:26:15<3:11:35, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0018448029877617955, 'learning_rate': 4.446216090398265e-06, 'epoch': 7.9} 79%|███████▉ | 7901/10000 [12:26:16<3:11:35, 5.48s/it][2025-06-20 01:56:00,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:56:00,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.51 | bwd_microstep: 3331.30 | bwd_inner_microstep: 3330.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-20 01:56:00,678] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2106.51 | bwd: 3331.32 | bwd_inner: 3330.51 | bwd_allreduce: 0.77 | step: 7.09 79%|███████▉ | 7902/10000 [12:26:21<3:11:31, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.024330077692866325, 'learning_rate': 4.442144825909258e-06, 'epoch': 7.9} 79%|███████▉ | 7902/10000 [12:26:21<3:11:31, 5.48s/it][2025-06-20 01:56:06,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:56:06,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.80 | bwd_microstep: 3333.96 | bwd_inner_microstep: 3333.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 01:56:06,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.80 | bwd: 3333.98 | bwd_inner: 3333.18 | bwd_allreduce: 0.76 | step: 6.72 79%|███████▉ | 7903/10000 [12:26:26<3:11:32, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0008798729977570474, 'learning_rate': 4.438075193367053e-06, 'epoch': 7.9} 79%|███████▉ | 7903/10000 [12:26:26<3:11:32, 5.48s/it][2025-06-20 01:56:11,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:56:11,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.61 | bwd_microstep: 3380.22 | bwd_inner_microstep: 3379.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 01:56:11,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.61 | bwd: 3380.24 | bwd_inner: 3379.43 | bwd_allreduce: 0.76 | step: 6.69 79%|███████▉ | 7904/10000 [12:26:32<3:12:07, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.03717602416872978, 'learning_rate': 4.434007193198535e-06, 'epoch': 7.9} 79%|███████▉ | 7904/10000 [12:26:32<3:12:07, 5.50s/it][2025-06-20 01:56:17,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:56:17,188] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.76 | bwd_microstep: 3328.92 | bwd_inner_microstep: 3328.12 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-20 01:56:17,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.76 | bwd: 3328.94 | bwd_inner: 3328.12 | bwd_allreduce: 0.78 | step: 6.77 79%|███████▉ | 7905/10000 [12:26:37<3:11:46, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00027860162663273513, 'learning_rate': 4.429940825830417e-06, 'epoch': 7.91} 79%|███████▉ | 7905/10000 [12:26:37<3:11:46, 5.49s/it][2025-06-20 01:56:22,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 01:56:22,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.74 | bwd_microstep: 3328.60 | bwd_inner_microstep: 3327.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-20 01:56:22,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.74 | bwd: 3328.62 | bwd_inner: 3327.81 | bwd_allreduce: 0.77 | step: 6.86 79%|███████▉ | 7906/10000 [12:26:43<3:11:29, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0007096796180121601, 'learning_rate': 4.425876091689245e-06, 'epoch': 7.91} 79%|███████▉ | 7906/10000 [12:26:43<3:11:29, 5.49s/it][2025-06-20 01:56:28,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:56:28,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.39 | bwd_microstep: 3325.83 | bwd_inner_microstep: 3325.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 01:56:28,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.39 | bwd: 3325.85 | bwd_inner: 3325.04 | bwd_allreduce: 0.76 | step: 6.69 79%|███████▉ | 7907/10000 [12:26:48<3:11:17, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0038528821896761656, 'learning_rate': 4.421812991201389e-06, 'epoch': 
7.91} 79%|███████▉ | 7907/10000 [12:26:48<3:11:17, 5.48s/it][2025-06-20 01:56:33,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:56:33,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.34 | bwd_microstep: 3327.40 | bwd_inner_microstep: 3326.47 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.55 [2025-06-20 01:56:33,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.34 | bwd: 3327.42 | bwd_inner: 3326.47 | bwd_allreduce: 0.90 | step: 7.56 79%|███████▉ | 7908/10000 [12:26:54<3:11:04, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0008708901586942375, 'learning_rate': 4.417751524793053e-06, 'epoch': 7.91} 79%|███████▉ | 7908/10000 [12:26:54<3:11:04, 5.48s/it][2025-06-20 01:56:39,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-20 01:56:39,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.06 | bwd_microstep: 3327.01 | bwd_inner_microstep: 3326.03 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.31 [2025-06-20 01:56:39,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.06 | bwd: 3327.03 | bwd_inner: 3326.03 | bwd_allreduce: 0.95 | step: 7.31 79%|███████▉ | 7909/10000 [12:26:59<3:10:59, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.02164766564965248, 'learning_rate': 4.4136916928902584e-06, 'epoch': 7.91} 79%|███████▉ | 7909/10000 [12:26:59<3:10:59, 5.48s/it][2025-06-20 01:56:44,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:56:44,579] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.52 | bwd_microstep: 3339.00 | bwd_inner_microstep: 3338.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.82 [2025-06-20 01:56:44,579] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2108.52 | bwd: 3339.01 | bwd_inner: 3338.21 | bwd_allreduce: 0.76 | step: 6.84 79%|███████▉ | 7910/10000 [12:27:05<3:10:58, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001074823085218668, 'learning_rate': 4.409633495918868e-06, 'epoch': 7.91} 79%|███████▉ | 7910/10000 [12:27:05<3:10:58, 5.48s/it][2025-06-20 01:56:50,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.80 [2025-06-20 01:56:50,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.22 | bwd_microstep: 3324.22 | bwd_inner_microstep: 3323.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 01:56:50,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.22 | bwd: 3324.23 | bwd_inner: 3323.43 | bwd_allreduce: 0.76 | step: 6.76 79%|███████▉ | 7911/10000 [12:27:10<3:10:42, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0005569685017690063, 'learning_rate': 4.405576934304567e-06, 'epoch': 7.91} 79%|███████▉ | 7911/10000 [12:27:10<3:10:42, 5.48s/it][2025-06-20 01:56:55,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:56:55,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.75 | bwd_microstep: 3320.03 | bwd_inner_microstep: 3319.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 01:56:55,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.75 | bwd: 3320.04 | bwd_inner: 3319.24 | bwd_allreduce: 0.76 | step: 6.68 79%|███████▉ | 7912/10000 [12:27:16<3:10:28, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.0337209589779377, 'learning_rate': 4.4015220084728675e-06, 'epoch': 7.91} 79%|███████▉ | 7912/10000 [12:27:16<3:10:28, 5.47s/it][2025-06-20 01:57:01,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:57:01,098] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.50 | bwd_microstep: 3406.40 | bwd_inner_microstep: 3405.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.73 [2025-06-20 01:57:01,099] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.50 | bwd: 3406.41 | bwd_inner: 3405.60 | bwd_allreduce: 0.77 | step: 6.73 79%|███████▉ | 7913/10000 [12:27:21<3:11:35, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.0770149677991867, 'learning_rate': 4.397468718849116e-06, 'epoch': 7.91} 79%|███████▉ | 7913/10000 [12:27:21<3:11:35, 5.51s/it][2025-06-20 01:57:06,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.80 [2025-06-20 01:57:06,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.62 | bwd_microstep: 3372.74 | bwd_inner_microstep: 3371.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 01:57:06,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.62 | bwd: 3372.75 | bwd_inner: 3371.95 | bwd_allreduce: 0.76 | step: 6.79 79%|███████▉ | 7914/10000 [12:27:27<3:11:52, 5.52s/it] {'loss': 0.0, 'grad_norm': 3.0173763661878183e-05, 'learning_rate': 4.393417065858487e-06, 'epoch': 7.91} 79%|███████▉ | 7914/10000 [12:27:27<3:11:52, 5.52s/it][2025-06-20 01:57:12,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:57:12,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.46 | bwd_microstep: 3371.89 | bwd_inner_microstep: 3371.08 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-20 01:57:12,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.46 | bwd: 3371.91 | bwd_inner: 3371.08 | bwd_allreduce: 0.78 | step: 7.05 79%|███████▉ | 7915/10000 [12:27:32<3:12:01, 5.53s/it] {'loss': 0.0003, 'grad_norm': 0.05941436067223549, 'learning_rate': 4.389367049925972e-06, 'epoch': 
7.92} 79%|███████▉ | 7915/10000 [12:27:32<3:12:01, 5.53s/it][2025-06-20 01:57:17,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:57:17,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.89 | bwd_microstep: 3329.85 | bwd_inner_microstep: 3329.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-20 01:57:17,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.89 | bwd: 3329.87 | bwd_inner: 3329.05 | bwd_allreduce: 0.78 | step: 6.89 79%|███████▉ | 7916/10000 [12:27:38<3:11:27, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00218234327621758, 'learning_rate': 4.3853186714764e-06, 'epoch': 7.92} 79%|███████▉ | 7916/10000 [12:27:38<3:11:27, 5.51s/it][2025-06-20 01:57:23,144] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:57:23,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.47 | bwd_microstep: 3331.51 | bwd_inner_microstep: 3330.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-20 01:57:23,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.47 | bwd: 3331.52 | bwd_inner: 3330.71 | bwd_allreduce: 0.76 | step: 6.65 79%|███████▉ | 7917/10000 [12:27:43<3:11:01, 5.50s/it] {'loss': 0.0025, 'grad_norm': 0.4208451211452484, 'learning_rate': 4.38127193093443e-06, 'epoch': 7.92} 79%|███████▉ | 7917/10000 [12:27:43<3:11:01, 5.50s/it][2025-06-20 01:57:28,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:57:28,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.04 | bwd_microstep: 3330.62 | bwd_inner_microstep: 3329.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 01:57:28,621] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd: 2109.04 | bwd: 3330.63 | bwd_inner: 3329.82 | bwd_allreduce: 0.77 | step: 6.69 79%|███████▉ | 7918/10000 [12:27:49<3:10:39, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.04857122153043747, 'learning_rate': 4.377226828724546e-06, 'epoch': 7.92} 79%|███████▉ | 7918/10000 [12:27:49<3:10:39, 5.49s/it][2025-06-20 01:57:34,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:57:34,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.52 | bwd_microstep: 3331.45 | bwd_inner_microstep: 3330.32 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.54 [2025-06-20 01:57:34,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.52 | bwd: 3331.47 | bwd_inner: 3330.32 | bwd_allreduce: 1.09 | step: 7.55 79%|███████▉ | 7919/10000 [12:27:54<3:10:29, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004135288298130035, 'learning_rate': 4.373183365271059e-06, 'epoch': 7.92} 79%|███████▉ | 7919/10000 [12:27:54<3:10:29, 5.49s/it][2025-06-20 01:57:39,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:57:39,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.97 | bwd_microstep: 3323.38 | bwd_inner_microstep: 3322.53 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.96 [2025-06-20 01:57:39,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.97 | bwd: 3323.40 | bwd_inner: 3322.53 | bwd_allreduce: 0.83 | step: 6.96 79%|███████▉ | 7920/10000 [12:28:00<3:10:12, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.003003641264513135, 'learning_rate': 4.369141540998112e-06, 'epoch': 7.92} 79%|███████▉ | 7920/10000 [12:28:00<3:10:12, 5.49s/it][2025-06-20 01:57:45,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:57:45,060] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.84 | bwd_microstep: 3332.56 | bwd_inner_microstep: 3331.60 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.35 [2025-06-20 01:57:45,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.84 | bwd: 3332.58 | bwd_inner: 3331.60 | bwd_allreduce: 0.93 | step: 7.36 79%|███████▉ | 7921/10000 [12:28:05<3:10:02, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.008093413896858692, 'learning_rate': 4.365101356329673e-06, 'epoch': 7.92} 79%|███████▉ | 7921/10000 [12:28:05<3:10:02, 5.48s/it][2025-06-20 01:57:50,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:57:50,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.42 | bwd_microstep: 3329.14 | bwd_inner_microstep: 3328.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-20 01:57:50,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.42 | bwd: 3329.16 | bwd_inner: 3328.35 | bwd_allreduce: 0.77 | step: 6.85 79%|███████▉ | 7922/10000 [12:28:11<3:09:53, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.011408714577555656, 'learning_rate': 4.361062811689536e-06, 'epoch': 7.92} 79%|███████▉ | 7922/10000 [12:28:11<3:09:53, 5.48s/it][2025-06-20 01:57:56,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 01:57:56,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.80 | bwd_microstep: 3325.84 | bwd_inner_microstep: 3324.79 | bwd_allreduce_microstep: 1.00 | step_microstep: 6.93 [2025-06-20 01:57:56,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.80 | bwd: 3325.86 | bwd_inner: 3324.79 | bwd_allreduce: 1.02 | step: 6.94 79%|███████▉ | 7923/10000 [12:28:16<3:09:46, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0022072370629757643, 'learning_rate': 4.357025907501329e-06, 'epoch': 
7.92} 79%|███████▉ | 7923/10000 [12:28:16<3:09:46, 5.48s/it][2025-06-20 01:58:01,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:58:01,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.54 | bwd_microstep: 3328.45 | bwd_inner_microstep: 3327.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 01:58:01,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.54 | bwd: 3328.46 | bwd_inner: 3327.66 | bwd_allreduce: 0.76 | step: 6.64 79%|███████▉ | 7924/10000 [12:28:22<3:09:35, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.19917121529579163, 'learning_rate': 4.352990644188508e-06, 'epoch': 7.92} 79%|███████▉ | 7924/10000 [12:28:22<3:09:35, 5.48s/it][2025-06-20 01:58:06,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:58:06,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.32 | bwd_microstep: 3318.13 | bwd_inner_microstep: 3317.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 01:58:06,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.32 | bwd: 3318.14 | bwd_inner: 3317.34 | bwd_allreduce: 0.76 | step: 6.65 79%|███████▉ | 7925/10000 [12:28:27<3:09:25, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0030467212200164795, 'learning_rate': 4.348957022174343e-06, 'epoch': 7.92} 79%|███████▉ | 7925/10000 [12:28:27<3:09:25, 5.48s/it][2025-06-20 01:58:12,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:58:12,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.39 | bwd_microstep: 3319.08 | bwd_inner_microstep: 3318.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 01:58:12,434] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2110.39 | bwd: 3319.09 | bwd_inner: 3318.30 | bwd_allreduce: 0.75 | step: 6.60 79%|███████▉ | 7926/10000 [12:28:33<3:09:13, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0001227028842549771, 'learning_rate': 4.344925041881947e-06, 'epoch': 7.93} 79%|███████▉ | 7926/10000 [12:28:33<3:09:13, 5.47s/it][2025-06-20 01:58:17,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:58:17,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.69 | bwd_microstep: 3374.61 | bwd_inner_microstep: 3373.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-20 01:58:17,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.69 | bwd: 3374.63 | bwd_inner: 3373.82 | bwd_allreduce: 0.77 | step: 6.86 79%|███████▉ | 7927/10000 [12:28:38<3:09:53, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.012729701586067677, 'learning_rate': 4.3408947037342575e-06, 'epoch': 7.93} 79%|███████▉ | 7927/10000 [12:28:38<3:09:53, 5.50s/it][2025-06-20 01:58:23,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:58:23,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.48 | bwd_microstep: 3331.81 | bwd_inner_microstep: 3330.97 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.68 [2025-06-20 01:58:23,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.48 | bwd: 3331.82 | bwd_inner: 3330.97 | bwd_allreduce: 0.80 | step: 6.69 79%|███████▉ | 7928/10000 [12:28:44<3:09:40, 5.49s/it] {'loss': 0.0144, 'grad_norm': 4.253222465515137, 'learning_rate': 4.336866008154037e-06, 'epoch': 7.93} 79%|███████▉ | 7928/10000 [12:28:44<3:09:40, 5.49s/it][2025-06-20 01:58:28,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:58:28,927] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.32 | bwd_microstep: 3322.50 | bwd_inner_microstep: 3321.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.97 [2025-06-20 01:58:28,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.32 | bwd: 3322.52 | bwd_inner: 3321.72 | bwd_allreduce: 0.76 | step: 6.97 79%|███████▉ | 7929/10000 [12:28:49<3:09:16, 5.48s/it] {'loss': 0.0006, 'grad_norm': 0.13168886303901672, 'learning_rate': 4.332838955563883e-06, 'epoch': 7.93} 79%|███████▉ | 7929/10000 [12:28:49<3:09:16, 5.48s/it][2025-06-20 01:58:34,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:58:34,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.14 | bwd_microstep: 3366.81 | bwd_inner_microstep: 3366.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 01:58:34,459] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.14 | bwd: 3366.83 | bwd_inner: 3366.03 | bwd_allreduce: 0.75 | step: 6.61 79%|███████▉ | 7930/10000 [12:28:55<3:09:40, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.008552392944693565, 'learning_rate': 4.328813546386201e-06, 'epoch': 7.93} 79%|███████▉ | 7930/10000 [12:28:55<3:09:40, 5.50s/it][2025-06-20 01:58:39,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:58:39,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.58 | bwd_microstep: 3371.00 | bwd_inner_microstep: 3370.21 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 01:58:39,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.58 | bwd: 3371.02 | bwd_inner: 3370.21 | bwd_allreduce: 0.76 | step: 6.71 79%|███████▉ | 7931/10000 [12:29:00<3:09:57, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.21099784970283508, 'learning_rate': 4.32478978104325e-06, 'epoch': 
7.93} 79%|███████▉ | 7931/10000 [12:29:00<3:09:57, 5.51s/it][2025-06-20 01:58:45,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 01:58:45,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.23 | bwd_microstep: 3329.40 | bwd_inner_microstep: 3328.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 01:58:45,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.23 | bwd: 3329.41 | bwd_inner: 3328.60 | bwd_allreduce: 0.76 | step: 6.69 79%|███████▉ | 7932/10000 [12:29:06<3:09:27, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005164066795259714, 'learning_rate': 4.320767659957097e-06, 'epoch': 7.93} 79%|███████▉ | 7932/10000 [12:29:06<3:09:27, 5.50s/it][2025-06-20 01:58:51,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 01:58:51,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.24 | bwd_microstep: 3366.16 | bwd_inner_microstep: 3365.23 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.06 [2025-06-20 01:58:51,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.24 | bwd: 3366.17 | bwd_inner: 3365.23 | bwd_allreduce: 0.90 | step: 7.06 79%|███████▉ | 7933/10000 [12:29:11<3:09:48, 5.51s/it] {'loss': 0.0032, 'grad_norm': 1.401419758796692, 'learning_rate': 4.3167471835496476e-06, 'epoch': 7.93} 79%|███████▉ | 7933/10000 [12:29:11<3:09:48, 5.51s/it][2025-06-20 01:58:56,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 01:58:56,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.81 | bwd_microstep: 3378.82 | bwd_inner_microstep: 3378.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-20 01:58:56,557] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2138.81 | bwd: 3378.83 | bwd_inner: 3378.02 | bwd_allreduce: 0.77 | step: 6.86 79%|███████▉ | 7934/10000 [12:29:17<3:10:11, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.015412945300340652, 'learning_rate': 4.3127283522426366e-06, 'epoch': 7.93} 79%|███████▉ | 7934/10000 [12:29:17<3:10:11, 5.52s/it][2025-06-20 01:59:02,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:59:02,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.24 | bwd_microstep: 3316.86 | bwd_inner_microstep: 3316.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-20 01:59:02,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.24 | bwd: 3316.88 | bwd_inner: 3316.07 | bwd_allreduce: 0.77 | step: 6.95 79%|███████▉ | 7935/10000 [12:29:22<3:09:29, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.001607707585208118, 'learning_rate': 4.308711166457611e-06, 'epoch': 7.94} 79%|███████▉ | 7935/10000 [12:29:22<3:09:29, 5.51s/it][2025-06-20 01:59:07,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:59:07,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.89 | bwd_microstep: 3373.79 | bwd_inner_microstep: 3373.00 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.61 [2025-06-20 01:59:07,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.89 | bwd: 3373.81 | bwd_inner: 3373.00 | bwd_allreduce: 0.77 | step: 6.61 79%|███████▉ | 7936/10000 [12:29:28<3:09:45, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.004119403660297394, 'learning_rate': 4.304695626615955e-06, 'epoch': 7.94} 79%|███████▉ | 7936/10000 [12:29:28<3:09:45, 5.52s/it][2025-06-20 01:59:13,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 01:59:13,093] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.68 | bwd_microstep: 3364.59 | bwd_inner_microstep: 3363.48 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.35 [2025-06-20 01:59:13,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.68 | bwd: 3364.61 | bwd_inner: 3363.48 | bwd_allreduce: 1.07 | step: 7.36 79%|███████▉ | 7937/10000 [12:29:33<3:09:51, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.012109583243727684, 'learning_rate': 4.300681733138885e-06, 'epoch': 7.94} 79%|███████▉ | 7937/10000 [12:29:33<3:09:51, 5.52s/it][2025-06-20 01:59:18,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:59:18,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.47 | bwd_microstep: 3313.61 | bwd_inner_microstep: 3312.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 01:59:18,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.47 | bwd: 3313.62 | bwd_inner: 3312.82 | bwd_allreduce: 0.76 | step: 6.72 79%|███████▉ | 7938/10000 [12:29:39<3:09:10, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.10446228832006454, 'learning_rate': 4.296669486447438e-06, 'epoch': 7.94} 79%|███████▉ | 7938/10000 [12:29:39<3:09:10, 5.50s/it][2025-06-20 01:59:24,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:59:24,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.61 | bwd_microstep: 3364.90 | bwd_inner_microstep: 3364.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-20 01:59:24,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.61 | bwd: 3364.91 | bwd_inner: 3364.10 | bwd_allreduce: 0.77 | step: 6.98 79%|███████▉ | 7939/10000 [12:29:44<3:09:22, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.01484930794686079, 'learning_rate': 4.2926588869624816e-06, 
'epoch': 7.94} 79%|███████▉ | 7939/10000 [12:29:44<3:09:22, 5.51s/it][2025-06-20 01:59:29,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 01:59:29,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.20 | bwd_microstep: 3315.91 | bwd_inner_microstep: 3315.08 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.89 [2025-06-20 01:59:29,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.20 | bwd: 3315.92 | bwd_inner: 3315.08 | bwd_allreduce: 0.80 | step: 6.89 79%|███████▉ | 7940/10000 [12:29:50<3:08:42, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.04456056281924248, 'learning_rate': 4.288649935104707e-06, 'epoch': 7.94} 79%|███████▉ | 7940/10000 [12:29:50<3:08:42, 5.50s/it][2025-06-20 01:59:35,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:59:35,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.29 | bwd_microstep: 3323.49 | bwd_inner_microstep: 3322.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 01:59:35,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.29 | bwd: 3323.50 | bwd_inner: 3322.70 | bwd_allreduce: 0.76 | step: 6.63 79%|███████▉ | 7941/10000 [12:29:55<3:08:18, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.014415131881833076, 'learning_rate': 4.284642631294635e-06, 'epoch': 7.94} 79%|███████▉ | 7941/10000 [12:29:55<3:08:18, 5.49s/it][2025-06-20 01:59:40,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 01:59:40,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.32 | bwd_microstep: 3320.45 | bwd_inner_microstep: 3319.59 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.98 [2025-06-20 01:59:40,489] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2112.32 | bwd: 3320.47 | bwd_inner: 3319.59 | bwd_allreduce: 0.82 | step: 6.98 79%|███████▉ | 7942/10000 [12:30:01<3:08:05, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0012634667800739408, 'learning_rate': 4.280636975952617e-06, 'epoch': 7.94} 79%|███████▉ | 7942/10000 [12:30:01<3:08:05, 5.48s/it][2025-06-20 01:59:46,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:59:46,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.71 | bwd_microstep: 3368.74 | bwd_inner_microstep: 3367.95 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-20 01:59:46,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.71 | bwd: 3368.75 | bwd_inner: 3367.95 | bwd_allreduce: 0.76 | step: 6.66 79%|███████▉ | 7943/10000 [12:30:06<3:08:30, 5.50s/it] {'loss': 0.0, 'grad_norm': 1.736345438985154e-05, 'learning_rate': 4.276632969498821e-06, 'epoch': 7.94} 79%|███████▉ | 7943/10000 [12:30:06<3:08:30, 5.50s/it][2025-06-20 01:59:51,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 01:59:51,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.69 | bwd_microstep: 3319.23 | bwd_inner_microstep: 3318.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 01:59:51,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.69 | bwd: 3319.25 | bwd_inner: 3318.44 | bwd_allreduce: 0.77 | step: 6.85 79%|███████▉ | 7944/10000 [12:30:12<3:08:07, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0003997002204414457, 'learning_rate': 4.272630612353259e-06, 'epoch': 7.94} 79%|███████▉ | 7944/10000 [12:30:12<3:08:07, 5.49s/it][2025-06-20 01:59:57,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 01:59:57,045] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.77 | bwd_microstep: 3378.81 | bwd_inner_microstep: 3377.97 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.79 [2025-06-20 01:59:57,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.77 | bwd: 3378.82 | bwd_inner: 3377.97 | bwd_allreduce: 0.81 | step: 6.80 79%|███████▉ | 7945/10000 [12:30:17<3:08:38, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0018835047958418727, 'learning_rate': 4.268629904935746e-06, 'epoch': 7.95} 79%|███████▉ | 7945/10000 [12:30:17<3:08:38, 5.51s/it][2025-06-20 02:00:02,518] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:00:02,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.02 | bwd_microstep: 3322.30 | bwd_inner_microstep: 3321.33 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.36 [2025-06-20 02:00:02,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.02 | bwd: 3322.31 | bwd_inner: 3321.33 | bwd_allreduce: 0.94 | step: 7.36 79%|███████▉ | 7946/10000 [12:30:23<3:08:12, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.01318568829447031, 'learning_rate': 4.2646308476659445e-06, 'epoch': 7.95} 79%|███████▉ | 7946/10000 [12:30:23<3:08:12, 5.50s/it][2025-06-20 02:00:07,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:00:07,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.77 | bwd_microstep: 3327.69 | bwd_inner_microstep: 3326.70 | bwd_allreduce_microstep: 0.94 | step_microstep: 6.97 [2025-06-20 02:00:07,999] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.77 | bwd: 3327.70 | bwd_inner: 3326.70 | bwd_allreduce: 0.96 | step: 6.97 79%|███████▉ | 7947/10000 [12:30:28<3:07:55, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00016427398077212274, 'learning_rate': 4.2606334409633396e-06, 
'epoch': 7.95} 79%|███████▉ | 7947/10000 [12:30:28<3:07:55, 5.49s/it][2025-06-20 02:00:13,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:00:13,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.95 | bwd_microstep: 3315.97 | bwd_inner_microstep: 3315.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-20 02:00:13,467] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.95 | bwd: 3315.98 | bwd_inner: 3315.18 | bwd_allreduce: 0.75 | step: 6.57 79%|███████▉ | 7948/10000 [12:30:34<3:07:34, 5.48s/it] {'loss': 0.0078, 'grad_norm': 2.827730417251587, 'learning_rate': 4.256637685247236e-06, 'epoch': 7.95} 79%|███████▉ | 7948/10000 [12:30:34<3:07:34, 5.48s/it][2025-06-20 02:00:18,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:00:18,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.41 | bwd_microstep: 3321.31 | bwd_inner_microstep: 3320.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-20 02:00:18,928] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.41 | bwd: 3321.32 | bwd_inner: 3320.51 | bwd_allreduce: 0.76 | step: 6.84 79%|███████▉ | 7949/10000 [12:30:39<3:07:15, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.04901481792330742, 'learning_rate': 4.252643580936777e-06, 'epoch': 7.95} 79%|███████▉ | 7949/10000 [12:30:39<3:07:15, 5.48s/it][2025-06-20 02:00:24,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 02:00:24,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.37 | bwd_microstep: 3316.22 | bwd_inner_microstep: 3315.17 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.61 [2025-06-20 02:00:24,389] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2107.37 | bwd: 3316.24 | bwd_inner: 3315.17 | bwd_allreduce: 1.01 | step: 7.61 80%|███████▉ | 7950/10000 [12:30:45<3:06:59, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.017164118587970734, 'learning_rate': 4.248651128450916e-06, 'epoch': 7.95} 80%|███████▉ | 7950/10000 [12:30:45<3:06:59, 5.47s/it][2025-06-20 02:00:29,941] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 02:00:29,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.80 | bwd_microstep: 3369.66 | bwd_inner_microstep: 3368.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 02:00:29,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.80 | bwd: 3369.67 | bwd_inner: 3368.87 | bwd_allreduce: 0.76 | step: 6.64 80%|███████▉ | 7951/10000 [12:30:50<3:07:42, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01611739583313465, 'learning_rate': 4.2446603282084475e-06, 'epoch': 7.95} 80%|███████▉ | 7951/10000 [12:30:50<3:07:42, 5.50s/it][2025-06-20 02:00:35,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:00:35,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.33 | bwd_microstep: 3324.84 | bwd_inner_microstep: 3324.01 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.91 [2025-06-20 02:00:35,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.33 | bwd: 3324.86 | bwd_inner: 3324.01 | bwd_allreduce: 0.79 | step: 6.92 80%|███████▉ | 7952/10000 [12:30:56<3:07:22, 5.49s/it] {'loss': 0.0, 'grad_norm': 1.892307227535639e-05, 'learning_rate': 4.240671180627987e-06, 'epoch': 7.95} 80%|███████▉ | 7952/10000 [12:30:56<3:07:22, 5.49s/it][2025-06-20 02:00:40,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 
02:00:40,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.64 | bwd_microstep: 3371.44 | bwd_inner_microstep: 3370.50 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.10 [2025-06-20 02:00:40,959] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.64 | bwd: 3371.46 | bwd_inner: 3370.50 | bwd_allreduce: 0.91 | step: 7.10 80%|███████▉ | 7953/10000 [12:31:01<3:07:52, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0009689006255939603, 'learning_rate': 4.236683686127981e-06, 'epoch': 7.95} 80%|███████▉ | 7953/10000 [12:31:01<3:07:52, 5.51s/it][2025-06-20 02:00:46,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:00:46,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.35 | bwd_microstep: 3367.44 | bwd_inner_microstep: 3366.52 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.42 [2025-06-20 02:00:46,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.35 | bwd: 3367.46 | bwd_inner: 3366.52 | bwd_allreduce: 0.89 | step: 7.42 80%|███████▉ | 7954/10000 [12:31:07<3:08:09, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.004828463774174452, 'learning_rate': 4.232697845126692e-06, 'epoch': 7.95} 80%|███████▉ | 7954/10000 [12:31:07<3:08:09, 5.52s/it][2025-06-20 02:00:52,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:00:52,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.11 | bwd_microstep: 3371.21 | bwd_inner_microstep: 3370.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.90 [2025-06-20 02:00:52,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.11 | bwd: 3371.23 | bwd_inner: 3370.38 | bwd_allreduce: 0.79 | step: 6.90 80%|███████▉ | 7955/10000 [12:31:12<3:08:21, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.004891437478363514, 'learning_rate': 
4.228713658042223e-06, 'epoch': 7.96} 80%|███████▉ | 7955/10000 [12:31:12<3:08:21, 5.53s/it][2025-06-20 02:00:57,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:00:57,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.95 | bwd_microstep: 3315.02 | bwd_inner_microstep: 3314.17 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.04 [2025-06-20 02:00:57,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.95 | bwd: 3315.04 | bwd_inner: 3314.17 | bwd_allreduce: 0.81 | step: 7.04 80%|███████▉ | 7956/10000 [12:31:18<3:07:39, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0015961171593517065, 'learning_rate': 4.224731125292496e-06, 'epoch': 7.96} 80%|███████▉ | 7956/10000 [12:31:18<3:07:39, 5.51s/it][2025-06-20 02:01:02,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:01:02,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.85 | bwd_microstep: 3312.08 | bwd_inner_microstep: 3311.00 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.58 [2025-06-20 02:01:02,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.85 | bwd: 3312.11 | bwd_inner: 3311.00 | bwd_allreduce: 1.05 | step: 7.59 80%|███████▉ | 7957/10000 [12:31:23<3:07:01, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.07534560561180115, 'learning_rate': 4.220750247295258e-06, 'epoch': 7.96} 80%|███████▉ | 7957/10000 [12:31:23<3:07:01, 5.49s/it][2025-06-20 02:01:08,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:01:08,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.03 | bwd_microstep: 3362.23 | bwd_inner_microstep: 3361.34 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.89 [2025-06-20 02:01:08,505] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.03 | bwd: 3362.25 | bwd_inner: 3361.34 | bwd_allreduce: 0.87 | step: 6.90 80%|███████▉ | 7958/10000 [12:31:29<3:07:19, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.019328195601701736, 'learning_rate': 4.216771024468082e-06, 'epoch': 7.96} 80%|███████▉ | 7958/10000 [12:31:29<3:07:19, 5.50s/it][2025-06-20 02:01:14,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:01:14,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.79 | bwd_microstep: 3367.73 | bwd_inner_microstep: 3366.65 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.35 [2025-06-20 02:01:14,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.79 | bwd: 3367.75 | bwd_inner: 3366.65 | bwd_allreduce: 1.04 | step: 7.34 80%|███████▉ | 7959/10000 [12:31:34<3:07:37, 5.52s/it] {'loss': 0.002, 'grad_norm': 0.8102616667747498, 'learning_rate': 4.212793457228384e-06, 'epoch': 7.96} 80%|███████▉ | 7959/10000 [12:31:34<3:07:37, 5.52s/it][2025-06-20 02:01:19,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.84 [2025-06-20 02:01:19,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.71 | bwd_microstep: 3366.54 | bwd_inner_microstep: 3365.68 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.45 [2025-06-20 02:01:19,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.71 | bwd: 3366.56 | bwd_inner: 3365.68 | bwd_allreduce: 0.82 | step: 7.45 80%|███████▉ | 7960/10000 [12:31:40<3:07:50, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00014633101818617433, 'learning_rate': 4.208817545993371e-06, 'epoch': 7.96} 80%|███████▉ | 7960/10000 [12:31:40<3:07:50, 5.52s/it][2025-06-20 02:01:25,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 
2.72 [2025-06-20 02:01:25,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.07 | bwd_microstep: 3372.36 | bwd_inner_microstep: 3371.51 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.93 [2025-06-20 02:01:25,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.07 | bwd: 3372.38 | bwd_inner: 3371.51 | bwd_allreduce: 0.81 | step: 6.93 80%|███████▉ | 7961/10000 [12:31:45<3:07:50, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0008815736509859562, 'learning_rate': 4.204843291180112e-06, 'epoch': 7.96} 80%|███████▉ | 7961/10000 [12:31:45<3:07:50, 5.53s/it][2025-06-20 02:01:30,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:01:30,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.39 | bwd_microstep: 3398.52 | bwd_inner_microstep: 3397.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-20 02:01:30,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.39 | bwd: 3398.53 | bwd_inner: 3397.73 | bwd_allreduce: 0.76 | step: 6.69 80%|███████▉ | 7962/10000 [12:31:51<3:08:09, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.006435139570385218, 'learning_rate': 4.200870693205484e-06, 'epoch': 7.96} 80%|███████▉ | 7962/10000 [12:31:51<3:08:09, 5.54s/it][2025-06-20 02:01:36,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:01:36,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.87 | bwd_microstep: 3368.38 | bwd_inner_microstep: 3367.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 02:01:36,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.88 | bwd: 3368.40 | bwd_inner: 3367.59 | bwd_allreduce: 0.76 | step: 6.69 80%|███████▉ | 7963/10000 [12:31:57<3:07:59, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0019871701952069998, 
'learning_rate': 4.196899752486192e-06, 'epoch': 7.96} 80%|███████▉ | 7963/10000 [12:31:57<3:07:59, 5.54s/it][2025-06-20 02:01:41,677] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:01:41,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.73 | bwd_microstep: 3314.49 | bwd_inner_microstep: 3313.61 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.99 [2025-06-20 02:01:41,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.73 | bwd: 3314.51 | bwd_inner: 3313.61 | bwd_allreduce: 0.84 | step: 6.99 80%|███████▉ | 7964/10000 [12:32:02<3:07:01, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0007386959623545408, 'learning_rate': 4.192930469438779e-06, 'epoch': 7.96} 80%|███████▉ | 7964/10000 [12:32:02<3:07:01, 5.51s/it][2025-06-20 02:01:47,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:01:47,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.25 | bwd_microstep: 3376.33 | bwd_inner_microstep: 3375.37 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.35 [2025-06-20 02:01:47,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.25 | bwd: 3376.34 | bwd_inner: 3375.37 | bwd_allreduce: 0.93 | step: 7.35 80%|███████▉ | 7965/10000 [12:32:08<3:07:16, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.01622355729341507, 'learning_rate': 4.1889628444795896e-06, 'epoch': 7.96} 80%|███████▉ | 7965/10000 [12:32:08<3:07:16, 5.52s/it][2025-06-20 02:01:52,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:01:52,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.91 | bwd_microstep: 3310.85 | bwd_inner_microstep: 3310.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 
02:01:52,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.91 | bwd: 3310.86 | bwd_inner: 3310.06 | bwd_allreduce: 0.76 | step: 6.61 80%|███████▉ | 7966/10000 [12:32:13<3:06:27, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0006496271234937012, 'learning_rate': 4.1849968780248186e-06, 'epoch': 7.97} 80%|███████▉ | 7966/10000 [12:32:13<3:06:27, 5.50s/it][2025-06-20 02:01:58,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:01:58,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.21 | bwd_microstep: 3369.22 | bwd_inner_microstep: 3368.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-20 02:01:58,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.21 | bwd: 3369.23 | bwd_inner: 3368.41 | bwd_allreduce: 0.78 | step: 7.06 80%|███████▉ | 7967/10000 [12:32:19<3:06:41, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.1014152318239212, 'learning_rate': 4.181032570490473e-06, 'epoch': 7.97} 80%|███████▉ | 7967/10000 [12:32:19<3:06:41, 5.51s/it][2025-06-20 02:02:03,661] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:02:03,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.11 | bwd_microstep: 3315.25 | bwd_inner_microstep: 3314.18 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.07 [2025-06-20 02:02:03,662] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.11 | bwd: 3315.27 | bwd_inner: 3314.18 | bwd_allreduce: 1.04 | step: 7.08 80%|███████▉ | 7968/10000 [12:32:24<3:06:04, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.000979037955403328, 'learning_rate': 4.177069922292394e-06, 'epoch': 7.97} 80%|███████▉ | 7968/10000 [12:32:24<3:06:04, 5.49s/it][2025-06-20 02:02:09,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.72 [2025-06-20 02:02:09,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.32 | bwd_microstep: 3319.28 | bwd_inner_microstep: 3318.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 02:02:09,120] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.32 | bwd: 3319.29 | bwd_inner: 3318.48 | bwd_allreduce: 0.77 | step: 6.73 80%|███████▉ | 7969/10000 [12:32:29<3:05:36, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.01807267777621746, 'learning_rate': 4.17310893384625e-06, 'epoch': 7.97} 80%|███████▉ | 7969/10000 [12:32:29<3:05:36, 5.48s/it][2025-06-20 02:02:14,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:02:14,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.52 | bwd_microstep: 3376.59 | bwd_inner_microstep: 3375.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 02:02:14,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.52 | bwd: 3376.61 | bwd_inner: 3375.81 | bwd_allreduce: 0.76 | step: 6.63 80%|███████▉ | 7970/10000 [12:32:35<3:06:07, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.04328148439526558, 'learning_rate': 4.169149605567517e-06, 'epoch': 7.97} 80%|███████▉ | 7970/10000 [12:32:35<3:06:07, 5.50s/it][2025-06-20 02:02:20,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:02:20,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.28 | bwd_microstep: 3326.39 | bwd_inner_microstep: 3325.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 02:02:20,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.28 | bwd: 3326.40 | bwd_inner: 3325.60 | bwd_allreduce: 0.76 | step: 6.62 80%|███████▉ | 7971/10000 [12:32:40<3:05:41, 5.49s/it] {'loss': 0.0, 'grad_norm': 
0.0011166305048391223, 'learning_rate': 4.1651919378715175e-06, 'epoch': 7.97} 80%|███████▉ | 7971/10000 [12:32:40<3:05:41, 5.49s/it][2025-06-20 02:02:25,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:02:25,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.44 | bwd_microstep: 3306.78 | bwd_inner_microstep: 3305.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 02:02:25,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.44 | bwd: 3306.79 | bwd_inner: 3305.99 | bwd_allreduce: 0.76 | step: 6.65 80%|███████▉ | 7972/10000 [12:32:46<3:05:05, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.020685793831944466, 'learning_rate': 4.161235931173393e-06, 'epoch': 7.97} 80%|███████▉ | 7972/10000 [12:32:46<3:05:05, 5.48s/it][2025-06-20 02:02:31,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:02:31,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.69 | bwd_microstep: 3371.96 | bwd_inner_microstep: 3371.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 02:02:31,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.69 | bwd: 3371.97 | bwd_inner: 3371.17 | bwd_allreduce: 0.76 | step: 6.72 80%|███████▉ | 7973/10000 [12:32:51<3:05:34, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0030105039477348328, 'learning_rate': 4.1572815858881085e-06, 'epoch': 7.97} 80%|███████▉ | 7973/10000 [12:32:51<3:05:34, 5.49s/it][2025-06-20 02:02:36,624] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:02:36,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.78 | bwd_microstep: 3360.34 | bwd_inner_microstep: 3359.54 | bwd_allreduce_microstep: 0.76 | step_microstep: 
6.98 [2025-06-20 02:02:36,625] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.78 | bwd: 3360.35 | bwd_inner: 3359.54 | bwd_allreduce: 0.77 | step: 6.98 80%|███████▉ | 7974/10000 [12:32:57<3:05:45, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001034750253893435, 'learning_rate': 4.153328902430458e-06, 'epoch': 7.97} 80%|███████▉ | 7974/10000 [12:32:57<3:05:45, 5.50s/it][2025-06-20 02:02:42,080] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:02:42,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.84 | bwd_microstep: 3314.86 | bwd_inner_microstep: 3314.08 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 02:02:42,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.84 | bwd: 3314.88 | bwd_inner: 3314.08 | bwd_allreduce: 0.76 | step: 6.61 80%|███████▉ | 7975/10000 [12:33:02<3:05:12, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0002964339219033718, 'learning_rate': 4.149377881215058e-06, 'epoch': 7.97} 80%|███████▉ | 7975/10000 [12:33:02<3:05:12, 5.49s/it][2025-06-20 02:02:47,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:02:47,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.88 | bwd_microstep: 3324.69 | bwd_inner_microstep: 3323.91 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 02:02:47,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.88 | bwd: 3324.71 | bwd_inner: 3323.91 | bwd_allreduce: 0.75 | step: 6.58 80%|███████▉ | 7976/10000 [12:33:08<3:04:51, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0012045949697494507, 'learning_rate': 4.145428522656354e-06, 'epoch': 7.98} 80%|███████▉ | 7976/10000 [12:33:08<3:04:51, 5.48s/it][2025-06-20 02:02:52,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.50 | optimizer_step: 2.88 [2025-06-20 02:02:52,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.54 | bwd_microstep: 3314.78 | bwd_inner_microstep: 3313.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.21 [2025-06-20 02:02:52,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.54 | bwd: 3314.79 | bwd_inner: 3313.98 | bwd_allreduce: 0.77 | step: 7.22 80%|███████▉ | 7977/10000 [12:33:13<3:04:26, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.038266621530056, 'learning_rate': 4.141480827168614e-06, 'epoch': 7.98} 80%|███████▉ | 7977/10000 [12:33:13<3:04:26, 5.47s/it][2025-06-20 02:02:58,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:02:58,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.99 | bwd_microstep: 3366.99 | bwd_inner_microstep: 3366.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 02:02:58,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.99 | bwd: 3367.01 | bwd_inner: 3366.21 | bwd_allreduce: 0.76 | step: 6.60 80%|███████▉ | 7978/10000 [12:33:19<3:04:56, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0010320127476006746, 'learning_rate': 4.137534795165934e-06, 'epoch': 7.98} 80%|███████▉ | 7978/10000 [12:33:19<3:04:56, 5.49s/it][2025-06-20 02:03:04,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:03:04,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.89 | bwd_microstep: 3358.26 | bwd_inner_microstep: 3357.39 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.85 [2025-06-20 02:03:04,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.89 | bwd: 3358.27 | bwd_inner: 3357.39 | bwd_allreduce: 0.84 | step: 6.85 80%|███████▉ | 7979/10000 [12:33:24<3:05:06, 5.50s/it] {'loss': 0.0, 
'grad_norm': 0.0001100252557080239, 'learning_rate': 4.1335904270622355e-06, 'epoch': 7.98} 80%|███████▉ | 7979/10000 [12:33:24<3:05:06, 5.50s/it][2025-06-20 02:03:09,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:03:09,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.84 | bwd_microstep: 3314.28 | bwd_inner_microstep: 3313.50 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 02:03:09,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.84 | bwd: 3314.30 | bwd_inner: 3313.50 | bwd_allreduce: 0.75 | step: 6.55 80%|███████▉ | 7980/10000 [12:33:30<3:04:36, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0008545885211788118, 'learning_rate': 4.129647723271266e-06, 'epoch': 7.98} 80%|███████▉ | 7980/10000 [12:33:30<3:04:36, 5.48s/it][2025-06-20 02:03:15,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:03:15,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.07 | bwd_microstep: 3360.18 | bwd_inner_microstep: 3359.34 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.82 [2025-06-20 02:03:15,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.07 | bwd: 3360.20 | bwd_inner: 3359.34 | bwd_allreduce: 0.80 | step: 6.82 80%|███████▉ | 7981/10000 [12:33:35<3:04:52, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.017365673556923866, 'learning_rate': 4.125706684206587e-06, 'epoch': 7.98} 80%|███████▉ | 7981/10000 [12:33:35<3:04:52, 5.49s/it][2025-06-20 02:03:20,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:03:20,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.90 | bwd_microstep: 3307.08 | bwd_inner_microstep: 3306.28 | bwd_allreduce_microstep: 0.76 | 
step_microstep: 6.65 [2025-06-20 02:03:20,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.90 | bwd: 3307.09 | bwd_inner: 3306.28 | bwd_allreduce: 0.78 | step: 6.65 80%|███████▉ | 7982/10000 [12:33:41<3:04:19, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00016123396926559508, 'learning_rate': 4.121767310281606e-06, 'epoch': 7.98} 80%|███████▉ | 7982/10000 [12:33:41<3:04:19, 5.48s/it][2025-06-20 02:03:26,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:03:26,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.08 | bwd_microstep: 3385.67 | bwd_inner_microstep: 3384.89 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.52 [2025-06-20 02:03:26,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.08 | bwd: 3385.68 | bwd_inner: 3384.89 | bwd_allreduce: 0.75 | step: 6.52 80%|███████▉ | 7983/10000 [12:33:46<3:04:56, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005640143062919378, 'learning_rate': 4.117829601909538e-06, 'epoch': 7.98} 80%|███████▉ | 7983/10000 [12:33:46<3:04:56, 5.50s/it][2025-06-20 02:03:31,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:03:31,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.60 | bwd_microstep: 3315.73 | bwd_inner_microstep: 3314.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 02:03:31,461] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.60 | bwd: 3315.74 | bwd_inner: 3314.94 | bwd_allreduce: 0.76 | step: 6.68 80%|███████▉ | 7984/10000 [12:33:52<3:04:22, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00026735110441222787, 'learning_rate': 4.113893559503435e-06, 'epoch': 7.98} 80%|███████▉ | 7984/10000 [12:33:52<3:04:22, 5.49s/it][2025-06-20 02:03:36,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 
0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:03:36,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.48 | bwd_microstep: 3311.76 | bwd_inner_microstep: 3310.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-20 02:03:36,912] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.48 | bwd: 3311.77 | bwd_inner: 3310.98 | bwd_allreduce: 0.75 | step: 6.71 80%|███████▉ | 7985/10000 [12:33:57<3:03:54, 5.48s/it] {'loss': 0.0, 'grad_norm': 6.214553286554292e-05, 'learning_rate': 4.1099591834761734e-06, 'epoch': 7.99} 80%|███████▉ | 7985/10000 [12:33:57<3:03:54, 5.48s/it][2025-06-20 02:03:42,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:03:42,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.17 | bwd_microstep: 3314.28 | bwd_inner_microstep: 3313.49 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-20 02:03:42,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.17 | bwd: 3314.29 | bwd_inner: 3313.49 | bwd_allreduce: 0.76 | step: 6.96 80%|███████▉ | 7986/10000 [12:34:03<3:03:33, 5.47s/it] {'loss': 0.0004, 'grad_norm': 0.12466340512037277, 'learning_rate': 4.106026474240439e-06, 'epoch': 7.99} 80%|███████▉ | 7986/10000 [12:34:03<3:03:33, 5.47s/it][2025-06-20 02:03:47,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:03:47,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.41 | bwd_microstep: 3357.07 | bwd_inner_microstep: 3356.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 02:03:47,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.41 | bwd: 3357.08 | bwd_inner: 3356.27 | bwd_allreduce: 0.76 | step: 6.78 80%|███████▉ | 7987/10000 [12:34:08<3:03:59, 5.48s/it] {'loss': 
0.0, 'grad_norm': 0.005128487013280392, 'learning_rate': 4.102095432208761e-06, 'epoch': 7.99} 80%|███████▉ | 7987/10000 [12:34:08<3:03:59, 5.48s/it][2025-06-20 02:03:53,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:03:53,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.55 | bwd_microstep: 3355.51 | bwd_inner_microstep: 3354.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 02:03:53,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.55 | bwd: 3355.52 | bwd_inner: 3354.71 | bwd_allreduce: 0.76 | step: 6.72 80%|███████▉ | 7988/10000 [12:34:14<3:04:14, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.023440780118107796, 'learning_rate': 4.098166057793489e-06, 'epoch': 7.99} 80%|███████▉ | 7988/10000 [12:34:14<3:04:14, 5.49s/it][2025-06-20 02:03:58,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:03:58,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.25 | bwd_microstep: 3306.11 | bwd_inner_microstep: 3305.23 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.32 [2025-06-20 02:03:58,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.25 | bwd: 3306.13 | bwd_inner: 3305.23 | bwd_allreduce: 0.84 | step: 7.32 80%|███████▉ | 7989/10000 [12:34:19<3:03:43, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0006395751843228936, 'learning_rate': 4.094238351406794e-06, 'epoch': 7.99} 80%|███████▉ | 7989/10000 [12:34:19<3:03:43, 5.48s/it][2025-06-20 02:04:04,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:04:04,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.00 | bwd_microstep: 3390.71 | bwd_inner_microstep: 3389.91 | bwd_allreduce_microstep: 0.75 | 
step_microstep: 6.82 [2025-06-20 02:04:04,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.00 | bwd: 3390.73 | bwd_inner: 3389.91 | bwd_allreduce: 0.77 | step: 6.83 80%|███████▉ | 7990/10000 [12:34:25<3:04:24, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00041426109964959323, 'learning_rate': 4.0903123134606755e-06, 'epoch': 7.99} 80%|███████▉ | 7990/10000 [12:34:25<3:04:24, 5.50s/it][2025-06-20 02:04:09,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:04:09,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.78 | bwd_microstep: 3314.96 | bwd_inner_microstep: 3313.84 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.72 [2025-06-20 02:04:09,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.78 | bwd: 3314.98 | bwd_inner: 3313.84 | bwd_allreduce: 1.09 | step: 7.73 80%|███████▉ | 7991/10000 [12:34:30<3:03:52, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.009534643962979317, 'learning_rate': 4.0863879443669565e-06, 'epoch': 7.99} 80%|███████▉ | 7991/10000 [12:34:30<3:03:52, 5.49s/it][2025-06-20 02:04:15,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 02:04:15,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.54 | bwd_microstep: 3311.71 | bwd_inner_microstep: 3310.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 02:04:15,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.54 | bwd: 3311.73 | bwd_inner: 3310.93 | bwd_allreduce: 0.75 | step: 6.56 80%|███████▉ | 7992/10000 [12:34:36<3:03:21, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.013236502185463905, 'learning_rate': 4.082465244537284e-06, 'epoch': 7.99} 80%|███████▉ | 7992/10000 [12:34:36<3:03:21, 5.48s/it][2025-06-20 02:04:20,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:04:20,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.02 | bwd_microstep: 3311.99 | bwd_inner_microstep: 3311.11 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.03 [2025-06-20 02:04:20,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.02 | bwd: 3312.00 | bwd_inner: 3311.12 | bwd_allreduce: 0.85 | step: 7.04 80%|███████▉ | 7993/10000 [12:34:41<3:02:56, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.009452839381992817, 'learning_rate': 4.078544214383131e-06, 'epoch': 7.99} 80%|███████▉ | 7993/10000 [12:34:41<3:02:56, 5.47s/it][2025-06-20 02:04:26,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:04:26,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.05 | bwd_microstep: 3323.01 | bwd_inner_microstep: 3322.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-20 02:04:26,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.05 | bwd: 3323.03 | bwd_inner: 3322.19 | bwd_allreduce: 0.78 | step: 7.01 80%|███████▉ | 7994/10000 [12:34:47<3:02:54, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.04111415892839432, 'learning_rate': 4.074624854315796e-06, 'epoch': 7.99} 80%|███████▉ | 7994/10000 [12:34:47<3:02:54, 5.47s/it][2025-06-20 02:04:31,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:04:31,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.09 | bwd_microstep: 3374.97 | bwd_inner_microstep: 3374.12 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.13 [2025-06-20 02:04:31,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.09 | bwd: 3374.98 | bwd_inner: 3374.12 | bwd_allreduce: 0.82 | step: 7.13 80%|███████▉ | 7995/10000 [12:34:52<3:03:29, 
5.49s/it] {'loss': 0.0, 'grad_norm': 0.003348710248246789, 'learning_rate': 4.0707071647464105e-06, 'epoch': 8.0} 80%|███████▉ | 7995/10000 [12:34:52<3:03:29, 5.49s/it][2025-06-20 02:04:37,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:04:37,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.14 | bwd_microstep: 3321.71 | bwd_inner_microstep: 3320.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.60 [2025-06-20 02:04:37,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.14 | bwd: 3321.73 | bwd_inner: 3320.92 | bwd_allreduce: 0.76 | step: 6.60 80%|███████▉ | 7996/10000 [12:34:58<3:03:06, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0012900519650429487, 'learning_rate': 4.066791146085904e-06, 'epoch': 8.0} 80%|███████▉ | 7996/10000 [12:34:58<3:03:06, 5.48s/it][2025-06-20 02:04:42,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:04:42,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.76 | bwd_microstep: 3311.56 | bwd_inner_microstep: 3310.77 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 02:04:42,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.76 | bwd: 3311.57 | bwd_inner: 3310.77 | bwd_allreduce: 0.76 | step: 6.68 80%|███████▉ | 7997/10000 [12:35:03<3:02:40, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0007842408958822489, 'learning_rate': 4.062876798745062e-06, 'epoch': 8.0} 80%|███████▉ | 7997/10000 [12:35:03<3:02:40, 5.47s/it][2025-06-20 02:04:48,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:04:48,223] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.61 | bwd_microstep: 3368.02 | bwd_inner_microstep: 3367.19 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 6.90 [2025-06-20 02:04:48,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.61 | bwd: 3368.04 | bwd_inner: 3367.19 | bwd_allreduce: 0.80 | step: 6.90 80%|███████▉ | 7998/10000 [12:35:09<3:03:13, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.010711858049035072, 'learning_rate': 4.058964123134474e-06, 'epoch': 8.0} 80%|███████▉ | 7998/10000 [12:35:09<3:03:13, 5.49s/it][2025-06-20 02:04:53,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:04:53,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.78 | bwd_microstep: 3321.71 | bwd_inner_microstep: 3320.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-20 02:04:53,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.78 | bwd: 3321.73 | bwd_inner: 3320.92 | bwd_allreduce: 0.77 | step: 6.90 80%|███████▉ | 7999/10000 [12:35:14<3:02:51, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.008946266025304794, 'learning_rate': 4.0550531196645645e-06, 'epoch': 8.0} 80%|███████▉ | 7999/10000 [12:35:14<3:02:51, 5.48s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
[2025-06-20 02:05:01,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.72 [2025-06-20 02:05:01,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.22 | bwd_microstep: 3370.88 | bwd_inner_microstep: 3370.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.09 [2025-06-20 02:05:01,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.22 | bwd: 3370.90 | bwd_inner: 3370.10 | bwd_allreduce: 0.76 | step: 7.10 80%|████████ | 8000/10000 [12:35:21<3:21:20, 6.04s/it] {'loss': 0.0, 'grad_norm': 0.0011487607844173908, 'learning_rate': 4.051143788745588e-06, 'epoch': 8.0} 80%|████████ | 8000/10000 [12:35:21<3:21:20, 6.04s/it]evaluate! [INFO|trainer.py:3910] 2025-06-20 02:05:11,132 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-20 02:05:11,136 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-20 02:05:11,137 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-20 02:05:58,256 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-20 02:05:58,258 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-20 02:05:58,259 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-20 02:05:58,259 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json evaluate! 
[INFO|trainer.py:3910] 2025-06-20 02:06:10,920 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-20 02:06:10,924 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-20 02:06:10,925 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-20 02:07:00,247 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-20 02:07:00,249 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-20 02:07:00,249 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-20 02:07:00,249 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json [2025-06-20 02:07:04,686] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 02:07:10,684] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 02:07:16,597] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto 
detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 02:07:22,777] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 02:07:40,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:07:40,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2061.68 | bwd_microstep: 3270.97 | bwd_inner_microstep: 3269.94 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.30 [2025-06-20 02:07:40,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2061.66 | bwd: 3270.98 | bwd_inner: 3269.94 | bwd_allreduce: 0.99 | step: 7.30 80%|████████ | 8001/10000 [12:38:01<28:58:06, 52.17s/it] {'loss': 0.0, 'grad_norm': 0.0015936634736135602, 'learning_rate': 4.0472361307875996e-06, 'epoch': 8.0} 80%|████████ | 8001/10000 [12:38:01<28:58:06, 52.17s/it][2025-06-20 02:07:46,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:07:46,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2093.44 | bwd_microstep: 3284.53 | bwd_inner_microstep: 3283.71 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-20 02:07:46,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.44 | bwd: 3284.55 | bwd_inner: 3283.71 | bwd_allreduce: 
0.79 | step: 7.22 80%|████████ | 8002/10000 [12:38:07<21:10:12, 38.14s/it] {'loss': 0.0001, 'grad_norm': 0.03775737062096596, 'learning_rate': 4.0433301462005035e-06, 'epoch': 8.0} 80%|████████ | 8002/10000 [12:38:07<21:10:12, 38.14s/it][2025-06-20 02:07:51,668] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 2.59 | optimizer_step: 2.73 [2025-06-20 02:07:51,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2084.35 | bwd_microstep: 3285.14 | bwd_inner_microstep: 3283.99 | bwd_allreduce_microstep: 1.08 | step_microstep: 8.79 [2025-06-20 02:07:51,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2084.35 | bwd: 3285.16 | bwd_inner: 3283.99 | bwd_allreduce: 1.11 | step: 8.79 80%|████████ | 8003/10000 [12:38:12<15:42:46, 28.33s/it] {'loss': 0.0, 'grad_norm': 0.007298486307263374, 'learning_rate': 4.039425835394017e-06, 'epoch': 8.0} 80%|████████ | 8003/10000 [12:38:12<15:42:46, 28.33s/it][2025-06-20 02:07:57,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:07:57,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2093.73 | bwd_microstep: 3298.89 | bwd_inner_microstep: 3297.89 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.71 [2025-06-20 02:07:57,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.73 | bwd: 3298.91 | bwd_inner: 3297.89 | bwd_allreduce: 0.97 | step: 7.71 80%|████████ | 8004/10000 [12:38:17<11:53:53, 21.46s/it] {'loss': 0.0, 'grad_norm': 0.0005660078604705632, 'learning_rate': 4.035523198777684e-06, 'epoch': 8.0} 80%|████████ | 8004/10000 [12:38:17<11:53:53, 21.46s/it][2025-06-20 02:08:02,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:08:02,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.57 
| bwd_microstep: 3303.02 | bwd_inner_microstep: 3302.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 02:08:02,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.57 | bwd: 3303.04 | bwd_inner: 3302.22 | bwd_allreduce: 0.77 | step: 6.76 80%|████████ | 8005/10000 [12:38:23<9:13:42, 16.65s/it] {'loss': 0.0, 'grad_norm': 0.007550687994807959, 'learning_rate': 4.0316222367608815e-06, 'epoch': 8.01} 80%|████████ | 8005/10000 [12:38:23<9:13:42, 16.65s/it][2025-06-20 02:08:07,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:08:07,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2091.64 | bwd_microstep: 3306.53 | bwd_inner_microstep: 3305.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-20 02:08:07,981] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2091.64 | bwd: 3306.55 | bwd_inner: 3305.73 | bwd_allreduce: 0.78 | step: 7.27 80%|████████ | 8006/10000 [12:38:28<7:21:36, 13.29s/it] {'loss': 0.002, 'grad_norm': 0.7277768850326538, 'learning_rate': 4.02772294975279e-06, 'epoch': 8.01} 80%|████████ | 8006/10000 [12:38:28<7:21:36, 13.29s/it][2025-06-20 02:08:13,488] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:08:13,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.71 | bwd_microstep: 3344.84 | bwd_inner_microstep: 3344.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-20 02:08:13,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.71 | bwd: 3344.86 | bwd_inner: 3344.04 | bwd_allreduce: 0.77 | step: 7.01 80%|████████ | 8007/10000 [12:38:34<6:03:51, 10.95s/it] {'loss': 0.0, 'grad_norm': 0.00015493134560529143, 'learning_rate': 4.02382533816243e-06, 'epoch': 8.01} 80%|████████ | 8007/10000 [12:38:34<6:03:51, 
10.95s/it][2025-06-20 02:08:18,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:08:18,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2084.42 | bwd_microstep: 3285.75 | bwd_inner_microstep: 3284.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 02:08:18,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2084.42 | bwd: 3285.77 | bwd_inner: 3284.96 | bwd_allreduce: 0.77 | step: 7.01 80%|████████ | 8008/10000 [12:38:39<5:08:27, 9.29s/it] {'loss': 0.0, 'grad_norm': 0.0011173911625519395, 'learning_rate': 4.019929402398643e-06, 'epoch': 8.01} 80%|████████ | 8008/10000 [12:38:39<5:08:27, 9.29s/it][2025-06-20 02:08:24,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:08:24,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2083.13 | bwd_microstep: 3282.85 | bwd_inner_microstep: 3282.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-20 02:08:24,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2083.13 | bwd: 3282.87 | bwd_inner: 3282.06 | bwd_allreduce: 0.77 | step: 6.74 80%|████████ | 8009/10000 [12:38:45<4:29:36, 8.12s/it] {'loss': 0.0, 'grad_norm': 0.0004721582226920873, 'learning_rate': 4.0160351428700985e-06, 'epoch': 8.01} 80%|████████ | 8009/10000 [12:38:45<4:29:36, 8.12s/it][2025-06-20 02:08:29,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:08:29,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2086.84 | bwd_microstep: 3289.98 | bwd_inner_microstep: 3289.00 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.66 [2025-06-20 02:08:29,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2086.84 | bwd: 3290.00 | 
bwd_inner: 3289.00 | bwd_allreduce: 0.95 | step: 7.67 80%|████████ | 8010/10000 [12:38:50<4:02:31, 7.31s/it] {'loss': 0.0, 'grad_norm': 0.0010494368616491556, 'learning_rate': 4.0121425599852814e-06, 'epoch': 8.01} 80%|████████ | 8010/10000 [12:38:50<4:02:31, 7.31s/it][2025-06-20 02:08:35,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:08:35,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.15 | bwd_microstep: 3333.38 | bwd_inner_microstep: 3332.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 02:08:35,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.15 | bwd: 3333.39 | bwd_inner: 3332.58 | bwd_allreduce: 0.77 | step: 6.72 80%|████████ | 8011/10000 [12:38:55<3:44:11, 6.76s/it] {'loss': 0.0, 'grad_norm': 0.0014473388437181711, 'learning_rate': 4.00825165415251e-06, 'epoch': 8.01} 80%|████████ | 8011/10000 [12:38:56<3:44:11, 6.76s/it][2025-06-20 02:08:40,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:08:40,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.68 | bwd_microstep: 3300.90 | bwd_inner_microstep: 3299.90 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.12 [2025-06-20 02:08:40,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.68 | bwd: 3300.92 | bwd_inner: 3299.90 | bwd_allreduce: 0.97 | step: 7.13 80%|████████ | 8012/10000 [12:39:01<3:30:54, 6.37s/it] {'loss': 0.0001, 'grad_norm': 0.017372585833072662, 'learning_rate': 4.00436242577992e-06, 'epoch': 8.01} 80%|████████ | 8012/10000 [12:39:01<3:30:54, 6.37s/it][2025-06-20 02:08:46,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:08:46,141] [INFO] [logging.py:128:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 2113.37 | bwd_microstep: 3350.13 | bwd_inner_microstep: 3349.32 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.09 [2025-06-20 02:08:46,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.36 | bwd: 3350.14 | bwd_inner: 3349.32 | bwd_allreduce: 0.77 | step: 7.09 80%|████████ | 8013/10000 [12:39:06<3:22:13, 6.11s/it] {'loss': 0.0001, 'grad_norm': 0.009686574339866638, 'learning_rate': 4.000474875275471e-06, 'epoch': 8.01} 80%|████████ | 8013/10000 [12:39:06<3:22:13, 6.11s/it][2025-06-20 02:08:51,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:08:51,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2089.04 | bwd_microstep: 3301.48 | bwd_inner_microstep: 3300.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 02:08:51,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2089.04 | bwd: 3301.50 | bwd_inner: 3300.69 | bwd_allreduce: 0.76 | step: 6.77 80%|████████ | 8014/10000 [12:39:12<3:15:24, 5.90s/it] {'loss': 0.0001, 'grad_norm': 0.026234125718474388, 'learning_rate': 3.996589003046954e-06, 'epoch': 8.01} 80%|████████ | 8014/10000 [12:39:12<3:15:24, 5.90s/it][2025-06-20 02:08:57,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:08:57,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.60 | bwd_microstep: 3375.37 | bwd_inner_microstep: 3374.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 02:08:57,106] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.60 | bwd: 3375.38 | bwd_inner: 3374.58 | bwd_allreduce: 0.76 | step: 6.65 80%|████████ | 8015/10000 [12:39:17<3:11:38, 5.79s/it] {'loss': 0.0008, 'grad_norm': 0.18658219277858734, 'learning_rate': 3.9927048095019796e-06, 'epoch': 8.02} 80%|████████ | 8015/10000 
[12:39:17<3:11:38, 5.79s/it][2025-06-20 02:09:02,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:09:02,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2093.76 | bwd_microstep: 3311.53 | bwd_inner_microstep: 3310.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-20 02:09:02,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.76 | bwd: 3311.55 | bwd_inner: 3310.73 | bwd_allreduce: 0.77 | step: 7.24 80%|████████ | 8016/10000 [12:39:23<3:08:04, 5.69s/it] {'loss': 0.0, 'grad_norm': 0.010769384913146496, 'learning_rate': 3.988822295047974e-06, 'epoch': 8.02} 80%|████████ | 8016/10000 [12:39:23<3:08:04, 5.69s/it][2025-06-20 02:09:07,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:09:07,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.36 | bwd_microstep: 3300.98 | bwd_inner_microstep: 3300.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 02:09:07,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.36 | bwd: 3300.99 | bwd_inner: 3300.19 | bwd_allreduce: 0.76 | step: 6.76 80%|████████ | 8017/10000 [12:39:28<3:05:33, 5.61s/it] {'loss': 0.0, 'grad_norm': 0.001524232910014689, 'learning_rate': 3.9849414600922e-06, 'epoch': 8.02} 80%|████████ | 8017/10000 [12:39:28<3:05:33, 5.61s/it][2025-06-20 02:09:13,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:09:13,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.84 | bwd_microstep: 3349.63 | bwd_inner_microstep: 3348.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 02:09:13,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.84 | bwd: 
3349.64 | bwd_inner: 3348.83 | bwd_allreduce: 0.77 | step: 6.70 80%|████████ | 8018/10000 [12:39:34<3:04:20, 5.58s/it] {'loss': 0.0, 'grad_norm': 0.0009832227369770408, 'learning_rate': 3.981062305041738e-06, 'epoch': 8.02} 80%|████████ | 8018/10000 [12:39:34<3:04:20, 5.58s/it][2025-06-20 02:09:18,923] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:09:18,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2093.15 | bwd_microstep: 3297.98 | bwd_inner_microstep: 3297.18 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-20 02:09:18,924] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.15 | bwd: 3297.99 | bwd_inner: 3297.18 | bwd_allreduce: 0.77 | step: 7.14 80%|████████ | 8019/10000 [12:39:39<3:02:46, 5.54s/it] {'loss': 0.0001, 'grad_norm': 0.02515975944697857, 'learning_rate': 3.977184830303495e-06, 'epoch': 8.02} 80%|████████ | 8019/10000 [12:39:39<3:02:46, 5.54s/it][2025-06-20 02:09:24,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:09:24,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2094.31 | bwd_microstep: 3309.19 | bwd_inner_microstep: 3308.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 02:09:24,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2094.31 | bwd: 3309.20 | bwd_inner: 3308.40 | bwd_allreduce: 0.76 | step: 6.81 80%|████████ | 8020/10000 [12:39:45<3:01:45, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.022020071744918823, 'learning_rate': 3.973309036284203e-06, 'epoch': 8.02} 80%|████████ | 8020/10000 [12:39:45<3:01:45, 5.51s/it][2025-06-20 02:09:29,876] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:09:29,877] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2121.89 | bwd_microstep: 3351.97 | bwd_inner_microstep: 3351.17 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-20 02:09:29,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.89 | bwd: 3351.98 | bwd_inner: 3351.17 | bwd_allreduce: 0.77 | step: 7.14 80%|████████ | 8021/10000 [12:39:50<3:01:41, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00018175964942201972, 'learning_rate': 3.969434923390407e-06, 'epoch': 8.02} 80%|████████ | 8021/10000 [12:39:50<3:01:41, 5.51s/it][2025-06-20 02:09:35,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:09:35,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.78 | bwd_microstep: 3353.73 | bwd_inner_microstep: 3352.88 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.93 [2025-06-20 02:09:35,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.78 | bwd: 3353.75 | bwd_inner: 3352.88 | bwd_allreduce: 0.82 | step: 6.93 80%|████████ | 8022/10000 [12:39:56<3:01:35, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.00791772361844778, 'learning_rate': 3.965562492028487e-06, 'epoch': 8.02} 80%|████████ | 8022/10000 [12:39:56<3:01:35, 5.51s/it][2025-06-20 02:09:40,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:09:40,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.58 | bwd_microstep: 3343.33 | bwd_inner_microstep: 3342.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 02:09:40,887] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.58 | bwd: 3343.35 | bwd_inner: 3342.54 | bwd_allreduce: 0.76 | step: 6.68 80%|████████ | 8023/10000 [12:40:01<3:01:25, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0020899788942188025, 'learning_rate': 3.961691742604643e-06, 'epoch': 8.02} 80%|████████ | 
8023/10000 [12:40:01<3:01:25, 5.51s/it][2025-06-20 02:09:46,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:09:46,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.05 | bwd_microstep: 3354.78 | bwd_inner_microstep: 3353.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 02:09:46,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.05 | bwd: 3354.79 | bwd_inner: 3353.98 | bwd_allreduce: 0.77 | step: 6.85 80%|████████ | 8024/10000 [12:40:07<3:01:24, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0075508649460971355, 'learning_rate': 3.9578226755249e-06, 'epoch': 8.02} 80%|████████ | 8024/10000 [12:40:07<3:01:24, 5.51s/it][2025-06-20 02:09:51,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:09:51,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.43 | bwd_microstep: 3350.59 | bwd_inner_microstep: 3349.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-20 02:09:51,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.43 | bwd: 3350.61 | bwd_inner: 3349.80 | bwd_allreduce: 0.77 | step: 7.00 80%|████████ | 8025/10000 [12:40:12<3:01:18, 5.51s/it] {'loss': 0.0012, 'grad_norm': 0.3516295254230499, 'learning_rate': 3.953955291195104e-06, 'epoch': 8.03} 80%|████████ | 8025/10000 [12:40:12<3:01:18, 5.51s/it][2025-06-20 02:09:57,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:09:57,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2091.21 | bwd_microstep: 3301.03 | bwd_inner_microstep: 3300.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-20 02:09:57,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2091.21 
| bwd: 3301.05 | bwd_inner: 3300.22 | bwd_allreduce: 0.79 | step: 6.77 80%|████████ | 8026/10000 [12:40:18<3:00:26, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.025683578103780746, 'learning_rate': 3.950089590020927e-06, 'epoch': 8.03} 80%|████████ | 8026/10000 [12:40:18<3:00:26, 5.48s/it][2025-06-20 02:10:02,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 02:10:02,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.54 | bwd_microstep: 3368.17 | bwd_inner_microstep: 3367.28 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.49 [2025-06-20 02:10:02,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.54 | bwd: 3368.20 | bwd_inner: 3367.28 | bwd_allreduce: 0.85 | step: 7.49 80%|████████ | 8027/10000 [12:40:23<3:00:48, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0076106819324195385, 'learning_rate': 3.94622557240786e-06, 'epoch': 8.03} 80%|████████ | 8027/10000 [12:40:23<3:00:48, 5.50s/it][2025-06-20 02:10:08,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 02:10:08,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2160.83 | bwd_microstep: 3368.43 | bwd_inner_microstep: 3367.52 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.22 [2025-06-20 02:10:08,443] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2160.83 | bwd: 3368.46 | bwd_inner: 3367.52 | bwd_allreduce: 0.87 | step: 8.23 80%|████████ | 8028/10000 [12:40:29<3:01:31, 5.52s/it] {'loss': 0.0025, 'grad_norm': 0.5729918479919434, 'learning_rate': 3.942363238761222e-06, 'epoch': 8.03} 80%|████████ | 8028/10000 [12:40:29<3:01:31, 5.52s/it][2025-06-20 02:10:13,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:10:13,943] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2135.69 | bwd_microstep: 3316.49 | bwd_inner_microstep: 3315.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 02:10:13,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.69 | bwd: 3316.51 | bwd_inner: 3315.70 | bwd_allreduce: 0.76 | step: 6.67 80%|████████ | 8029/10000 [12:40:34<3:01:08, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0009210991556756198, 'learning_rate': 3.9385025894861576e-06, 'epoch': 8.03} 80%|████████ | 8029/10000 [12:40:34<3:01:08, 5.51s/it][2025-06-20 02:10:19,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:10:19,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.72 | bwd_microstep: 3370.12 | bwd_inner_microstep: 3369.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 02:10:19,470] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.72 | bwd: 3370.14 | bwd_inner: 3369.34 | bwd_allreduce: 0.75 | step: 6.62 80%|████████ | 8030/10000 [12:40:40<3:01:10, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0008359832572750747, 'learning_rate': 3.934643624987626e-06, 'epoch': 8.03} 80%|████████ | 8030/10000 [12:40:40<3:01:10, 5.52s/it][2025-06-20 02:10:24,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:10:24,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.36 | bwd_microstep: 3300.44 | bwd_inner_microstep: 3299.64 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-20 02:10:24,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.36 | bwd: 3300.46 | bwd_inner: 3299.64 | bwd_allreduce: 0.77 | step: 7.20 80%|████████ | 8031/10000 [12:40:45<3:00:17, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0020480393432080746, 'learning_rate': 3.930786345670421e-06, 'epoch': 8.03} 80%|████████ | 
8031/10000 [12:40:45<3:00:17, 5.49s/it][2025-06-20 02:10:30,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:10:30,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.81 | bwd_microstep: 3354.94 | bwd_inner_microstep: 3354.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 02:10:30,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.81 | bwd: 3354.95 | bwd_inner: 3354.14 | bwd_allreduce: 0.77 | step: 6.82 80%|████████ | 8032/10000 [12:40:51<3:00:20, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00810269732028246, 'learning_rate': 3.926930751939144e-06, 'epoch': 8.03} 80%|████████ | 8032/10000 [12:40:51<3:00:20, 5.50s/it][2025-06-20 02:10:35,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:10:35,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2090.36 | bwd_microstep: 3299.20 | bwd_inner_microstep: 3298.39 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.11 [2025-06-20 02:10:35,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2090.36 | bwd: 3299.21 | bwd_inner: 3298.39 | bwd_allreduce: 0.78 | step: 7.11 80%|████████ | 8033/10000 [12:40:56<2:59:33, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.01913708634674549, 'learning_rate': 3.9230768441982345e-06, 'epoch': 8.03} 80%|████████ | 8033/10000 [12:40:56<2:59:33, 5.48s/it][2025-06-20 02:10:41,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:10:41,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.51 | bwd_microstep: 3381.75 | bwd_inner_microstep: 3380.88 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.50 [2025-06-20 02:10:41,401] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2134.51 | bwd: 3381.78 | bwd_inner: 3380.88 | bwd_allreduce: 0.82 | step: 7.49 80%|████████ | 8034/10000 [12:41:02<3:00:17, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0001580275420565158, 'learning_rate': 3.919224622851949e-06, 'epoch': 8.03} 80%|████████ | 8034/10000 [12:41:02<3:00:17, 5.50s/it][2025-06-20 02:10:46,942] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:10:46,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.17 | bwd_microstep: 3368.48 | bwd_inner_microstep: 3367.60 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.37 [2025-06-20 02:10:46,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.18 | bwd: 3368.51 | bwd_inner: 3367.60 | bwd_allreduce: 0.84 | step: 7.38 80%|████████ | 8035/10000 [12:41:07<3:00:35, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.03056212328374386, 'learning_rate': 3.915374088304367e-06, 'epoch': 8.04} 80%|████████ | 8035/10000 [12:41:07<3:00:35, 5.51s/it][2025-06-20 02:10:52,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 02:10:52,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2149.85 | bwd_microstep: 3341.93 | bwd_inner_microstep: 3341.03 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.48 [2025-06-20 02:10:52,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2149.85 | bwd: 3341.96 | bwd_inner: 3341.03 | bwd_allreduce: 0.85 | step: 7.48 80%|████████ | 8036/10000 [12:41:13<3:00:43, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0005916264490224421, 'learning_rate': 3.911525240959398e-06, 'epoch': 8.04} 80%|████████ | 8036/10000 [12:41:13<3:00:43, 5.52s/it][2025-06-20 02:10:58,105] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 02:10:58,107] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2178.09 | bwd_microstep: 3400.46 | bwd_inner_microstep: 3399.56 | bwd_allreduce_microstep: 0.82 | step_microstep: 8.16 [2025-06-20 02:10:58,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2178.09 | bwd: 3400.49 | bwd_inner: 3399.56 | bwd_allreduce: 0.86 | step: 8.17 80%|████████ | 8037/10000 [12:41:18<3:01:41, 5.55s/it] {'loss': 0.0001, 'grad_norm': 0.027097169309854507, 'learning_rate': 3.907678081220754e-06, 'epoch': 8.04} 80%|████████ | 8037/10000 [12:41:18<3:01:41, 5.55s/it][2025-06-20 02:11:03,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:11:03,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2150.19 | bwd_microstep: 3314.38 | bwd_inner_microstep: 3313.49 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.40 [2025-06-20 02:11:03,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2150.19 | bwd: 3314.40 | bwd_inner: 3313.49 | bwd_allreduce: 0.84 | step: 7.40 80%|████████ | 8038/10000 [12:41:24<3:01:10, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.006092550233006477, 'learning_rate': 3.9038326094919955e-06, 'epoch': 8.04} 80%|████████ | 8038/10000 [12:41:24<3:01:10, 5.54s/it][2025-06-20 02:11:09,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:11:09,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2160.13 | bwd_microstep: 3372.70 | bwd_inner_microstep: 3371.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 02:11:09,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2160.13 | bwd: 3372.72 | bwd_inner: 3371.91 | bwd_allreduce: 0.76 | step: 6.71 80%|████████ | 8039/10000 [12:41:29<3:01:24, 5.55s/it] {'loss': 0.0, 'grad_norm': 0.006685248110443354, 'learning_rate': 3.899988826176491e-06, 'epoch': 
8.04} 80%|████████ | 8039/10000 [12:41:29<3:01:24, 5.55s/it][2025-06-20 02:11:14,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:11:14,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.30 | bwd_microstep: 3317.80 | bwd_inner_microstep: 3316.78 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.67 [2025-06-20 02:11:14,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.30 | bwd: 3317.82 | bwd_inner: 3316.79 | bwd_allreduce: 0.98 | step: 7.67 80%|████████ | 8040/10000 [12:41:35<3:00:35, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.025718754157423973, 'learning_rate': 3.896146731677435e-06, 'epoch': 8.04} 80%|████████ | 8040/10000 [12:41:35<3:00:35, 5.53s/it][2025-06-20 02:11:20,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 02:11:20,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.85 | bwd_microstep: 3362.62 | bwd_inner_microstep: 3361.72 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.60 [2025-06-20 02:11:20,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.85 | bwd: 3362.64 | bwd_inner: 3361.72 | bwd_allreduce: 0.87 | step: 7.61 80%|████████ | 8041/10000 [12:41:41<3:00:41, 5.53s/it] {'loss': 0.0003, 'grad_norm': 0.06922080367803574, 'learning_rate': 3.8923063263978525e-06, 'epoch': 8.04} 80%|████████ | 8041/10000 [12:41:41<3:00:41, 5.53s/it][2025-06-20 02:11:25,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 02:11:25,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.87 | bwd_microstep: 3309.93 | bwd_inner_microstep: 3308.77 | bwd_allreduce_microstep: 1.08 | step_microstep: 8.43 [2025-06-20 02:11:25,710] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd: 2117.87 | bwd: 3309.96 | bwd_inner: 3308.77 | bwd_allreduce: 1.11 | step: 8.46 80%|████████ | 8042/10000 [12:41:46<3:00:12, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0019888568203896284, 'learning_rate': 3.888467610740574e-06, 'epoch': 8.04} 80%|████████ | 8042/10000 [12:41:46<3:00:12, 5.52s/it][2025-06-20 02:11:31,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:11:31,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.53 | bwd_microstep: 3309.16 | bwd_inner_microstep: 3308.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 02:11:31,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.53 | bwd: 3309.18 | bwd_inner: 3308.37 | bwd_allreduce: 0.77 | step: 7.01 80%|████████ | 8043/10000 [12:41:51<2:59:39, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0008934731595218182, 'learning_rate': 3.884630585108266e-06, 'epoch': 8.04} 80%|████████ | 8043/10000 [12:41:51<2:59:39, 5.51s/it][2025-06-20 02:11:36,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:11:36,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.30 | bwd_microstep: 3364.46 | bwd_inner_microstep: 3363.32 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.55 [2025-06-20 02:11:36,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.30 | bwd: 3364.48 | bwd_inner: 3363.32 | bwd_allreduce: 1.10 | step: 7.56 80%|████████ | 8044/10000 [12:41:57<2:59:54, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.030806446447968483, 'learning_rate': 3.880795249903418e-06, 'epoch': 8.04} 80%|████████ | 8044/10000 [12:41:57<2:59:54, 5.52s/it][2025-06-20 02:11:42,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:11:42,202] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.90 | bwd_microstep: 3313.23 | bwd_inner_microstep: 3312.36 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.01 [2025-06-20 02:11:42,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.90 | bwd: 3313.26 | bwd_inner: 3312.37 | bwd_allreduce: 0.83 | step: 7.01 80%|████████ | 8045/10000 [12:42:03<2:59:19, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.003631208324804902, 'learning_rate': 3.876961605528333e-06, 'epoch': 8.04} 80%|████████ | 8045/10000 [12:42:03<2:59:19, 5.50s/it][2025-06-20 02:11:47,731] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:11:47,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.01 | bwd_microstep: 3357.63 | bwd_inner_microstep: 3356.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-20 02:11:47,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.01 | bwd: 3357.64 | bwd_inner: 3356.82 | bwd_allreduce: 0.77 | step: 6.99 80%|████████ | 8046/10000 [12:42:08<2:59:29, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.002635366516187787, 'learning_rate': 3.873129652385148e-06, 'epoch': 8.05} 80%|████████ | 8046/10000 [12:42:08<2:59:29, 5.51s/it][2025-06-20 02:11:53,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:11:53,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.78 | bwd_microstep: 3317.28 | bwd_inner_microstep: 3316.48 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-20 02:11:53,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.78 | bwd: 3317.30 | bwd_inner: 3316.48 | bwd_allreduce: 0.77 | step: 6.97 80%|████████ | 8047/10000 [12:42:14<2:59:02, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0001533495815237984, 'learning_rate': 3.869299390875816e-06, 
'epoch': 8.05} 80%|████████ | 8047/10000 [12:42:14<2:59:02, 5.50s/it][2025-06-20 02:11:58,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:11:58,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.51 | bwd_microstep: 3312.60 | bwd_inner_microstep: 3311.67 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.97 [2025-06-20 02:11:58,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.51 | bwd: 3312.61 | bwd_inner: 3311.67 | bwd_allreduce: 0.90 | step: 6.97 80%|████████ | 8048/10000 [12:42:19<2:58:29, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.011351997964084148, 'learning_rate': 3.865470821402113e-06, 'epoch': 8.05} 80%|████████ | 8048/10000 [12:42:19<2:58:29, 5.49s/it][2025-06-20 02:12:04,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:12:04,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.15 | bwd_microstep: 3367.09 | bwd_inner_microstep: 3366.09 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.80 [2025-06-20 02:12:04,198] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.15 | bwd: 3367.11 | bwd_inner: 3366.09 | bwd_allreduce: 0.96 | step: 7.81 80%|████████ | 8049/10000 [12:42:25<2:58:57, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0002461187541484833, 'learning_rate': 3.8616439443656385e-06, 'epoch': 8.05} 80%|████████ | 8049/10000 [12:42:25<2:58:57, 5.50s/it][2025-06-20 02:12:09,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:12:09,679] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.10 | bwd_microstep: 3324.13 | bwd_inner_microstep: 3323.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 02:12:09,679] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2107.10 | bwd: 3324.15 | bwd_inner: 3323.34 | bwd_allreduce: 0.76 | step: 6.76 80%|████████ | 8050/10000 [12:42:30<2:58:34, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00011337146133882925, 'learning_rate': 3.857818760167813e-06, 'epoch': 8.05} 80%|████████ | 8050/10000 [12:42:30<2:58:34, 5.49s/it][2025-06-20 02:12:15,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:12:15,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.97 | bwd_microstep: 3312.83 | bwd_inner_microstep: 3312.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-20 02:12:15,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.98 | bwd: 3312.85 | bwd_inner: 3312.04 | bwd_allreduce: 0.76 | step: 6.68 81%|████████ | 8051/10000 [12:42:35<2:58:02, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.014441304840147495, 'learning_rate': 3.853995269209887e-06, 'epoch': 8.05} 81%|████████ | 8051/10000 [12:42:35<2:58:02, 5.48s/it][2025-06-20 02:12:20,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:12:20,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.40 | bwd_microstep: 3323.59 | bwd_inner_microstep: 3322.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-20 02:12:20,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.40 | bwd: 3323.60 | bwd_inner: 3322.80 | bwd_allreduce: 0.76 | step: 6.68 81%|████████ | 8052/10000 [12:42:41<2:57:46, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004448779858648777, 'learning_rate': 3.850173471892915e-06, 'epoch': 8.05} 81%|████████ | 8052/10000 [12:42:41<2:57:46, 5.48s/it][2025-06-20 02:12:26,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 
02:12:26,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.75 | bwd_microstep: 3369.73 | bwd_inner_microstep: 3368.84 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.11 [2025-06-20 02:12:26,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.75 | bwd: 3369.74 | bwd_inner: 3368.84 | bwd_allreduce: 0.85 | step: 7.12 81%|████████ | 8053/10000 [12:42:46<2:58:19, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0004895002348348498, 'learning_rate': 3.846353368617795e-06, 'epoch': 8.05} 81%|████████ | 8053/10000 [12:42:46<2:58:19, 5.50s/it][2025-06-20 02:12:31,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:12:31,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.65 | bwd_microstep: 3330.92 | bwd_inner_microstep: 3330.03 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.39 [2025-06-20 02:12:31,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.65 | bwd: 3330.94 | bwd_inner: 3330.03 | bwd_allreduce: 0.84 | step: 7.40 81%|████████ | 8054/10000 [12:42:52<2:58:01, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0015426515601575375, 'learning_rate': 3.842534959785231e-06, 'epoch': 8.05} 81%|████████ | 8054/10000 [12:42:52<2:58:01, 5.49s/it][2025-06-20 02:12:37,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:12:37,151] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.78 | bwd_microstep: 3364.73 | bwd_inner_microstep: 3363.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-20 02:12:37,152] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.78 | bwd: 3364.74 | bwd_inner: 3363.93 | bwd_allreduce: 0.77 | step: 6.90 81%|████████ | 8055/10000 [12:42:57<2:58:29, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0031039423774927855, 'learning_rate': 
3.838718245795763e-06, 'epoch': 8.05} 81%|████████ | 8055/10000 [12:42:57<2:58:29, 5.51s/it][2025-06-20 02:12:42,608] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:12:42,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.92 | bwd_microstep: 3311.17 | bwd_inner_microstep: 3310.01 | bwd_allreduce_microstep: 1.10 | step_microstep: 7.22 [2025-06-20 02:12:42,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.92 | bwd: 3311.18 | bwd_inner: 3310.01 | bwd_allreduce: 1.13 | step: 7.22 81%|████████ | 8056/10000 [12:43:03<2:57:55, 5.49s/it] {'loss': 0.0, 'grad_norm': 2.8849352020188235e-05, 'learning_rate': 3.834903227049749e-06, 'epoch': 8.06} 81%|████████ | 8056/10000 [12:43:03<2:57:55, 5.49s/it][2025-06-20 02:12:48,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:12:48,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.56 | bwd_microstep: 3321.22 | bwd_inner_microstep: 3320.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 02:12:48,078] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.56 | bwd: 3321.24 | bwd_inner: 3320.43 | bwd_allreduce: 0.77 | step: 6.68 81%|████████ | 8057/10000 [12:43:08<2:57:35, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.0635499358177185, 'learning_rate': 3.831089903947358e-06, 'epoch': 8.06} 81%|████████ | 8057/10000 [12:43:08<2:57:35, 5.48s/it][2025-06-20 02:12:53,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:12:53,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.30 | bwd_microstep: 3314.20 | bwd_inner_microstep: 3313.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.90 [2025-06-20 02:12:53,539] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.30 | bwd: 3314.22 | bwd_inner: 3313.38 | bwd_allreduce: 0.79 | step: 6.91 81%|████████ | 8058/10000 [12:43:14<2:57:17, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0002844417467713356, 'learning_rate': 3.827278276888593e-06, 'epoch': 8.06} 81%|████████ | 8058/10000 [12:43:14<2:57:17, 5.48s/it][2025-06-20 02:12:59,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:12:59,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.67 | bwd_microstep: 3392.99 | bwd_inner_microstep: 3392.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-20 02:12:59,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.67 | bwd: 3393.00 | bwd_inner: 3392.20 | bwd_allreduce: 0.76 | step: 6.71 81%|████████ | 8059/10000 [12:43:19<2:58:05, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01787145808339119, 'learning_rate': 3.823468346273276e-06, 'epoch': 8.06} 81%|████████ | 8059/10000 [12:43:19<2:58:05, 5.50s/it][2025-06-20 02:13:04,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:13:04,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.47 | bwd_microstep: 3378.90 | bwd_inner_microstep: 3378.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 02:13:04,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.47 | bwd: 3378.91 | bwd_inner: 3378.11 | bwd_allreduce: 0.76 | step: 6.68 81%|████████ | 8060/10000 [12:43:25<2:58:26, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0008467864245176315, 'learning_rate': 3.819660112501053e-06, 'epoch': 8.06} 81%|████████ | 8060/10000 [12:43:25<2:58:26, 5.52s/it][2025-06-20 02:13:10,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 
2.73 [2025-06-20 02:13:10,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.76 | bwd_microstep: 3371.86 | bwd_inner_microstep: 3371.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 02:13:10,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.76 | bwd: 3371.87 | bwd_inner: 3371.07 | bwd_allreduce: 0.75 | step: 6.63 81%|████████ | 8061/10000 [12:43:30<2:58:31, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.054570604115724564, 'learning_rate': 3.815853575971389e-06, 'epoch': 8.06} 81%|████████ | 8061/10000 [12:43:30<2:58:31, 5.52s/it][2025-06-20 02:13:15,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:13:15,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.81 | bwd_microstep: 3318.24 | bwd_inner_microstep: 3317.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-20 02:13:15,658] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.81 | bwd: 3318.25 | bwd_inner: 3317.45 | bwd_allreduce: 0.76 | step: 6.81 81%|████████ | 8062/10000 [12:43:36<2:57:49, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.04466810077428818, 'learning_rate': 3.8120487370835714e-06, 'epoch': 8.06} 81%|████████ | 8062/10000 [12:43:36<2:57:49, 5.51s/it][2025-06-20 02:13:21,128] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:13:21,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.56 | bwd_microstep: 3324.47 | bwd_inner_microstep: 3323.66 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-20 02:13:21,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.56 | bwd: 3324.48 | bwd_inner: 3323.66 | bwd_allreduce: 0.77 | step: 7.15 81%|████████ | 8063/10000 [12:43:41<2:57:24, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.008365632966160774, 
'learning_rate': 3.8082455962367106e-06, 'epoch': 8.06} 81%|████████ | 8063/10000 [12:43:41<2:57:24, 5.50s/it][2025-06-20 02:13:26,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:13:26,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.79 | bwd_microstep: 3318.22 | bwd_inner_microstep: 3317.43 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 02:13:26,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.79 | bwd: 3318.23 | bwd_inner: 3317.43 | bwd_allreduce: 0.76 | step: 6.69 81%|████████ | 8064/10000 [12:43:47<2:57:02, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.04757903888821602, 'learning_rate': 3.804444153829738e-06, 'epoch': 8.06} 81%|████████ | 8064/10000 [12:43:47<2:57:02, 5.49s/it][2025-06-20 02:13:32,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:13:32,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.70 | bwd_microstep: 3326.35 | bwd_inner_microstep: 3325.52 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.47 [2025-06-20 02:13:32,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.70 | bwd: 3326.37 | bwd_inner: 3325.52 | bwd_allreduce: 0.80 | step: 7.47 81%|████████ | 8065/10000 [12:43:52<2:56:50, 5.48s/it] {'loss': 0.0005, 'grad_norm': 0.2852950096130371, 'learning_rate': 3.8006444102614092e-06, 'epoch': 8.06} 81%|████████ | 8065/10000 [12:43:52<2:56:50, 5.48s/it][2025-06-20 02:13:37,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:13:37,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.50 | bwd_microstep: 3368.30 | bwd_inner_microstep: 3367.50 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.83 [2025-06-20 
02:13:37,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.50 | bwd: 3368.32 | bwd_inner: 3367.50 | bwd_allreduce: 0.78 | step: 6.83 81%|████████ | 8066/10000 [12:43:58<2:57:17, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0017890991875901818, 'learning_rate': 3.7968463659303024e-06, 'epoch': 8.07} 81%|████████ | 8066/10000 [12:43:58<2:57:17, 5.50s/it][2025-06-20 02:13:43,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:13:43,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.41 | bwd_microstep: 3321.72 | bwd_inner_microstep: 3320.78 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.51 [2025-06-20 02:13:43,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.41 | bwd: 3321.74 | bwd_inner: 3320.78 | bwd_allreduce: 0.92 | step: 7.52 81%|████████ | 8067/10000 [12:44:03<2:56:52, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.009242614731192589, 'learning_rate': 3.793050021234805e-06, 'epoch': 8.07} 81%|████████ | 8067/10000 [12:44:03<2:56:52, 5.49s/it][2025-06-20 02:13:48,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:13:48,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.25 | bwd_microstep: 3317.97 | bwd_inner_microstep: 3316.89 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.29 [2025-06-20 02:13:48,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.25 | bwd: 3317.98 | bwd_inner: 3316.89 | bwd_allreduce: 1.04 | step: 7.29 81%|████████ | 8068/10000 [12:44:09<2:56:36, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0032205614261329174, 'learning_rate': 3.789255376573142e-06, 'epoch': 8.07} 81%|████████ | 8068/10000 [12:44:09<2:56:36, 5.48s/it][2025-06-20 02:13:54,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | 
optimizer_step: 2.72 [2025-06-20 02:13:54,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.53 | bwd_microstep: 3322.67 | bwd_inner_microstep: 3321.81 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.43 [2025-06-20 02:13:54,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.53 | bwd: 3322.69 | bwd_inner: 3321.81 | bwd_allreduce: 0.82 | step: 7.43 81%|████████ | 8069/10000 [12:44:14<2:56:28, 5.48s/it] {'loss': 0.0, 'grad_norm': 2.888184826588258e-05, 'learning_rate': 3.7854624323433542e-06, 'epoch': 8.07} 81%|████████ | 8069/10000 [12:44:14<2:56:28, 5.48s/it][2025-06-20 02:13:59,499] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-20 02:13:59,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.70 | bwd_microstep: 3324.52 | bwd_inner_microstep: 3323.51 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.76 [2025-06-20 02:13:59,500] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.70 | bwd: 3324.54 | bwd_inner: 3323.51 | bwd_allreduce: 0.98 | step: 7.76 81%|████████ | 8070/10000 [12:44:20<2:56:16, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00010831464169314131, 'learning_rate': 3.781671188943303e-06, 'epoch': 8.07} 81%|████████ | 8070/10000 [12:44:20<2:56:16, 5.48s/it][2025-06-20 02:14:05,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:14:05,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.75 | bwd_microstep: 3379.88 | bwd_inner_microstep: 3379.10 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.85 [2025-06-20 02:14:05,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.75 | bwd: 3379.90 | bwd_inner: 3379.10 | bwd_allreduce: 0.76 | step: 6.86 81%|████████ | 8071/10000 [12:44:25<2:56:50, 5.50s/it] {'loss': 0.0, 'grad_norm': 
0.0006486084894277155, 'learning_rate': 3.777881646770678e-06, 'epoch': 8.07} 81%|████████ | 8071/10000 [12:44:25<2:56:50, 5.50s/it][2025-06-20 02:14:10,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:14:10,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.79 | bwd_microstep: 3331.70 | bwd_inner_microstep: 3330.87 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.01 [2025-06-20 02:14:10,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.79 | bwd: 3331.72 | bwd_inner: 3330.87 | bwd_allreduce: 0.80 | step: 7.01 81%|████████ | 8072/10000 [12:44:31<2:56:36, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.011719397269189358, 'learning_rate': 3.7740938062229736e-06, 'epoch': 8.07} 81%|████████ | 8072/10000 [12:44:31<2:56:36, 5.50s/it][2025-06-20 02:14:16,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:14:16,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.47 | bwd_microstep: 3406.94 | bwd_inner_microstep: 3406.11 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.82 [2025-06-20 02:14:16,136] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.47 | bwd: 3406.96 | bwd_inner: 3406.11 | bwd_allreduce: 0.80 | step: 6.83 81%|████████ | 8073/10000 [12:44:36<2:57:31, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.002202700823545456, 'learning_rate': 3.7703076676975216e-06, 'epoch': 8.07} 81%|████████ | 8073/10000 [12:44:36<2:57:31, 5.53s/it][2025-06-20 02:14:21,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.72 [2025-06-20 02:14:21,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.01 | bwd_microstep: 3329.64 | bwd_inner_microstep: 3328.37 | bwd_allreduce_microstep: 1.20 | step_microstep: 
8.17 [2025-06-20 02:14:21,619] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.01 | bwd: 3329.67 | bwd_inner: 3328.37 | bwd_allreduce: 1.22 | step: 8.18 81%|████████ | 8074/10000 [12:44:42<2:57:01, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0008955325465649366, 'learning_rate': 3.7665232315914704e-06, 'epoch': 8.07} 81%|████████ | 8074/10000 [12:44:42<2:57:01, 5.51s/it][2025-06-20 02:14:27,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:14:27,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.62 | bwd_microstep: 3373.76 | bwd_inner_microstep: 3372.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 02:14:27,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.62 | bwd: 3373.77 | bwd_inner: 3372.97 | bwd_allreduce: 0.76 | step: 6.72 81%|████████ | 8075/10000 [12:44:47<2:57:21, 5.53s/it] {'loss': 0.0, 'grad_norm': 6.564700743183494e-05, 'learning_rate': 3.762740498301791e-06, 'epoch': 8.07} 81%|████████ | 8075/10000 [12:44:47<2:57:21, 5.53s/it][2025-06-20 02:14:32,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:14:32,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.01 | bwd_microstep: 3327.28 | bwd_inner_microstep: 3326.46 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.33 [2025-06-20 02:14:32,657] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.01 | bwd: 3327.29 | bwd_inner: 3326.46 | bwd_allreduce: 0.79 | step: 7.33 81%|████████ | 8076/10000 [12:44:53<2:56:45, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.011632503010332584, 'learning_rate': 3.7589594682252805e-06, 'epoch': 8.08} 81%|████████ | 8076/10000 [12:44:53<2:56:45, 5.51s/it][2025-06-20 02:14:38,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:14:38,214] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.17 | bwd_microstep: 3382.58 | bwd_inner_microstep: 3381.74 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.89 [2025-06-20 02:14:38,215] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.17 | bwd: 3382.59 | bwd_inner: 3381.74 | bwd_allreduce: 0.81 | step: 6.90 81%|████████ | 8077/10000 [12:44:59<2:57:05, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0008812348823994398, 'learning_rate': 3.7551801417585387e-06, 'epoch': 8.08} 81%|████████ | 8077/10000 [12:44:59<2:57:05, 5.53s/it][2025-06-20 02:14:43,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:14:43,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.26 | bwd_microstep: 3330.62 | bwd_inner_microstep: 3329.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.30 [2025-06-20 02:14:43,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.26 | bwd: 3330.64 | bwd_inner: 3329.81 | bwd_allreduce: 0.78 | step: 7.30 81%|████████ | 8078/10000 [12:45:04<2:56:33, 5.51s/it] {'loss': 0.0, 'grad_norm': 2.9913231628597714e-05, 'learning_rate': 3.751402519298004e-06, 'epoch': 8.08} 81%|████████ | 8078/10000 [12:45:04<2:56:33, 5.51s/it][2025-06-20 02:14:49,171] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:14:49,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.64 | bwd_microstep: 3320.86 | bwd_inner_microstep: 3320.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 02:14:49,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.64 | bwd: 3320.87 | bwd_inner: 3320.06 | bwd_allreduce: 0.77 | step: 6.69 81%|████████ | 8079/10000 [12:45:09<2:56:08, 5.50s/it] {'loss': 0.0001, 
'grad_norm': 0.030632618814706802, 'learning_rate': 3.7476266012399354e-06, 'epoch': 8.08} 81%|████████ | 8079/10000 [12:45:09<2:56:08, 5.50s/it][2025-06-20 02:14:54,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:14:54,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.69 | bwd_microstep: 3322.98 | bwd_inner_microstep: 3322.01 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.36 [2025-06-20 02:14:54,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.69 | bwd: 3323.00 | bwd_inner: 3322.01 | bwd_allreduce: 0.94 | step: 7.37 81%|████████ | 8080/10000 [12:45:15<2:55:49, 5.49s/it] {'loss': 0.0, 'grad_norm': 1.298078495892696e-05, 'learning_rate': 3.7438523879804047e-06, 'epoch': 8.08} 81%|████████ | 8080/10000 [12:45:15<2:55:49, 5.49s/it][2025-06-20 02:15:00,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:15:00,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.87 | bwd_microstep: 3375.70 | bwd_inner_microstep: 3374.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-20 02:15:00,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.87 | bwd: 3375.71 | bwd_inner: 3374.89 | bwd_allreduce: 0.78 | step: 7.18 81%|████████ | 8081/10000 [12:45:21<2:56:18, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0015269938157871366, 'learning_rate': 3.7400798799153126e-06, 'epoch': 8.08} 81%|████████ | 8081/10000 [12:45:21<2:56:18, 5.51s/it][2025-06-20 02:15:05,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:15:05,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.05 | bwd_microstep: 3403.05 | bwd_inner_microstep: 3402.26 | bwd_allreduce_microstep: 0.74 | 
step_microstep: 6.67 [2025-06-20 02:15:05,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.05 | bwd: 3403.06 | bwd_inner: 3402.26 | bwd_allreduce: 0.76 | step: 6.67 81%|████████ | 8082/10000 [12:45:26<2:56:53, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.003965172450989485, 'learning_rate': 3.7363090774403766e-06, 'epoch': 8.08} 81%|████████ | 8082/10000 [12:45:26<2:56:53, 5.53s/it][2025-06-20 02:15:11,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:15:11,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.37 | bwd_microstep: 3332.82 | bwd_inner_microstep: 3331.65 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.00 [2025-06-20 02:15:11,267] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.37 | bwd: 3332.84 | bwd_inner: 3331.65 | bwd_allreduce: 1.13 | step: 8.01 81%|████████ | 8083/10000 [12:45:32<2:56:19, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0002753448789007962, 'learning_rate': 3.7325399809511354e-06, 'epoch': 8.08} 81%|████████ | 8083/10000 [12:45:32<2:56:19, 5.52s/it][2025-06-20 02:15:16,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:15:16,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.37 | bwd_microstep: 3321.45 | bwd_inner_microstep: 3320.65 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-20 02:15:16,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.37 | bwd: 3321.46 | bwd_inner: 3320.65 | bwd_allreduce: 0.77 | step: 6.98 81%|████████ | 8084/10000 [12:45:37<2:55:52, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.008834202773869038, 'learning_rate': 3.7287725908429508e-06, 'epoch': 8.08} 81%|████████ | 8084/10000 [12:45:37<2:55:52, 5.51s/it][2025-06-20 02:15:22,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:15:22,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.21 | bwd_microstep: 3323.70 | bwd_inner_microstep: 3322.90 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-20 02:15:22,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.21 | bwd: 3323.71 | bwd_inner: 3322.90 | bwd_allreduce: 0.77 | step: 6.95 81%|████████ | 8085/10000 [12:45:43<2:55:23, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0008893065387383103, 'learning_rate': 3.725006907511004e-06, 'epoch': 8.09} 81%|████████ | 8085/10000 [12:45:43<2:55:23, 5.50s/it][2025-06-20 02:15:27,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:15:27,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.14 | bwd_microstep: 3321.72 | bwd_inner_microstep: 3320.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-20 02:15:27,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.14 | bwd: 3321.74 | bwd_inner: 3320.92 | bwd_allreduce: 0.77 | step: 7.18 81%|████████ | 8086/10000 [12:45:48<2:55:10, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0031366024632006884, 'learning_rate': 3.721242931350304e-06, 'epoch': 8.09} 81%|████████ | 8086/10000 [12:45:48<2:55:10, 5.49s/it][2025-06-20 02:15:33,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 02:15:33,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.68 | bwd_microstep: 3328.93 | bwd_inner_microstep: 3327.89 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.39 [2025-06-20 02:15:33,180] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.68 | bwd: 3328.95 | bwd_inner: 3327.89 | bwd_allreduce: 1.01 | step: 7.40 81%|████████ | 8087/10000 [12:45:53<2:54:58, 
5.49s/it] {'loss': 0.0, 'grad_norm': 0.0030116296838968992, 'learning_rate': 3.717480662755664e-06, 'epoch': 8.09} 81%|████████ | 8087/10000 [12:45:53<2:54:58, 5.49s/it][2025-06-20 02:15:38,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:15:38,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.27 | bwd_microstep: 3324.12 | bwd_inner_microstep: 3323.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:15:38,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.27 | bwd: 3324.14 | bwd_inner: 3323.34 | bwd_allreduce: 0.76 | step: 6.62 81%|████████ | 8088/10000 [12:45:59<2:54:50, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.007429101504385471, 'learning_rate': 3.7137201021217317e-06, 'epoch': 8.09} 81%|████████ | 8088/10000 [12:45:59<2:54:50, 5.49s/it][2025-06-20 02:15:44,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:15:44,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.03 | bwd_microstep: 3333.09 | bwd_inner_microstep: 3332.10 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.35 [2025-06-20 02:15:44,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.03 | bwd: 3333.10 | bwd_inner: 3332.10 | bwd_allreduce: 0.95 | step: 7.35 81%|████████ | 8089/10000 [12:46:04<2:54:43, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0004321090818848461, 'learning_rate': 3.7099612498429727e-06, 'epoch': 8.09} 81%|████████ | 8089/10000 [12:46:04<2:54:43, 5.49s/it][2025-06-20 02:15:49,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:15:49,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.07 | bwd_microstep: 3336.29 | bwd_inner_microstep: 3335.46 | 
bwd_allreduce_microstep: 0.78 | step_microstep: 7.12 [2025-06-20 02:15:49,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.08 | bwd: 3336.30 | bwd_inner: 3335.46 | bwd_allreduce: 0.80 | step: 7.13 81%|████████ | 8090/10000 [12:46:10<2:54:42, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0009253194439224899, 'learning_rate': 3.7062041063136754e-06, 'epoch': 8.09} 81%|████████ | 8090/10000 [12:46:10<2:54:42, 5.49s/it][2025-06-20 02:15:55,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:15:55,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.44 | bwd_microstep: 3327.14 | bwd_inner_microstep: 3326.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 02:15:55,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.44 | bwd: 3327.15 | bwd_inner: 3326.34 | bwd_allreduce: 0.77 | step: 6.73 81%|████████ | 8091/10000 [12:46:15<2:54:31, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0031371801160275936, 'learning_rate': 3.7024486719279453e-06, 'epoch': 8.09} 81%|████████ | 8091/10000 [12:46:15<2:54:31, 5.49s/it][2025-06-20 02:16:00,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:16:00,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.49 | bwd_microstep: 3314.50 | bwd_inner_microstep: 3313.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-20 02:16:00,584] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.49 | bwd: 3314.51 | bwd_inner: 3313.71 | bwd_allreduce: 0.76 | step: 6.65 81%|████████ | 8092/10000 [12:46:21<2:54:12, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0002930333139374852, 'learning_rate': 3.698694947079715e-06, 'epoch': 8.09} 81%|████████ | 8092/10000 [12:46:21<2:54:12, 5.48s/it][2025-06-20 02:16:06,050] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.78 [2025-06-20 02:16:06,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.30 | bwd_microstep: 3323.47 | bwd_inner_microstep: 3322.68 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 02:16:06,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.30 | bwd: 3323.49 | bwd_inner: 3322.68 | bwd_allreduce: 0.76 | step: 6.77 81%|████████ | 8093/10000 [12:46:26<2:54:00, 5.47s/it] {'loss': 0.0008, 'grad_norm': 0.26361894607543945, 'learning_rate': 3.6949429321627238e-06, 'epoch': 8.09} 81%|████████ | 8093/10000 [12:46:26<2:54:00, 5.47s/it][2025-06-20 02:16:11,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:16:11,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.40 | bwd_microstep: 3323.53 | bwd_inner_microstep: 3322.72 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.11 [2025-06-20 02:16:11,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.40 | bwd: 3323.55 | bwd_inner: 3322.72 | bwd_allreduce: 0.79 | step: 7.11 81%|████████ | 8094/10000 [12:46:32<2:53:52, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.023035654798150063, 'learning_rate': 3.6911926275705457e-06, 'epoch': 8.09} 81%|████████ | 8094/10000 [12:46:32<2:53:52, 5.47s/it][2025-06-20 02:16:16,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:16:16,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.50 | bwd_microstep: 3333.83 | bwd_inner_microstep: 3332.88 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.04 [2025-06-20 02:16:16,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.50 | bwd: 3333.84 | bwd_inner: 3332.88 | bwd_allreduce: 0.91 | step: 7.05 81%|████████ | 8095/10000 
[12:46:37<2:53:48, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004040430299937725, 'learning_rate': 3.6874440336965677e-06, 'epoch': 8.1} 81%|████████ | 8095/10000 [12:46:37<2:53:48, 5.47s/it][2025-06-20 02:16:22,473] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:16:22,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.66 | bwd_microstep: 3327.57 | bwd_inner_microstep: 3326.60 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.16 [2025-06-20 02:16:22,474] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.66 | bwd: 3327.58 | bwd_inner: 3326.60 | bwd_allreduce: 0.93 | step: 7.16 81%|████████ | 8096/10000 [12:46:43<2:53:45, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0009879301069304347, 'learning_rate': 3.6836971509340025e-06, 'epoch': 8.1} 81%|████████ | 8096/10000 [12:46:43<2:53:45, 5.48s/it][2025-06-20 02:16:27,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:16:27,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.29 | bwd_microstep: 3322.16 | bwd_inner_microstep: 3321.37 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 02:16:27,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.29 | bwd: 3322.17 | bwd_inner: 3321.37 | bwd_allreduce: 0.76 | step: 6.63 81%|████████ | 8097/10000 [12:46:48<2:53:34, 5.47s/it] {'loss': 0.0, 'grad_norm': 3.158671461278573e-05, 'learning_rate': 3.6799519796758797e-06, 'epoch': 8.1} 81%|████████ | 8097/10000 [12:46:48<2:53:34, 5.47s/it][2025-06-20 02:16:33,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:16:33,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.12 | bwd_microstep: 3369.94 | bwd_inner_microstep: 
3369.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-20 02:16:33,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.12 | bwd: 3369.95 | bwd_inner: 3369.13 | bwd_allreduce: 0.77 | step: 6.74 81%|████████ | 8098/10000 [12:46:54<2:54:09, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.000284783513052389, 'learning_rate': 3.67620852031505e-06, 'epoch': 8.1} 81%|████████ | 8098/10000 [12:46:54<2:54:09, 5.49s/it][2025-06-20 02:16:38,952] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:16:38,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.61 | bwd_microstep: 3323.51 | bwd_inner_microstep: 3322.71 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 02:16:38,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.61 | bwd: 3323.52 | bwd_inner: 3322.71 | bwd_allreduce: 0.77 | step: 6.72 81%|████████ | 8099/10000 [12:46:59<2:53:49, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.13378450274467468, 'learning_rate': 3.672466773244188e-06, 'epoch': 8.1} 81%|████████ | 8099/10000 [12:46:59<2:53:49, 5.49s/it][2025-06-20 02:16:44,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:16:44,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.31 | bwd_microstep: 3386.62 | bwd_inner_microstep: 3385.62 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.52 [2025-06-20 02:16:44,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.31 | bwd: 3386.64 | bwd_inner: 3385.62 | bwd_allreduce: 0.97 | step: 7.52 81%|████████ | 8100/10000 [12:47:05<2:54:24, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00610134843736887, 'learning_rate': 3.668726738855779e-06, 'epoch': 8.1} 81%|████████ | 8100/10000 [12:47:05<2:54:24, 5.51s/it][2025-06-20 02:16:49,982] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:16:49,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.90 | bwd_microstep: 3320.23 | bwd_inner_microstep: 3319.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 02:16:49,982] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.90 | bwd: 3320.24 | bwd_inner: 3319.45 | bwd_allreduce: 0.75 | step: 6.65 81%|████████ | 8101/10000 [12:47:10<2:53:58, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.05207153782248497, 'learning_rate': 3.664988417542141e-06, 'epoch': 8.1} 81%|████████ | 8101/10000 [12:47:10<2:53:58, 5.50s/it][2025-06-20 02:16:55,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:16:55,447] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.74 | bwd_microstep: 3317.13 | bwd_inner_microstep: 3316.20 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.05 [2025-06-20 02:16:55,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.74 | bwd: 3317.14 | bwd_inner: 3316.20 | bwd_allreduce: 0.90 | step: 7.05 81%|████████ | 8102/10000 [12:47:16<2:53:35, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0049021984450519085, 'learning_rate': 3.661251809695412e-06, 'epoch': 8.1} 81%|████████ | 8102/10000 [12:47:16<2:53:35, 5.49s/it][2025-06-20 02:17:01,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:17:01,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.65 | bwd_microstep: 3381.84 | bwd_inner_microstep: 3381.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:17:01,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.65 | bwd: 3381.86 | bwd_inner: 3381.06 | bwd_allreduce: 0.75 | step: 6.62 81%|████████ | 8103/10000 
[12:47:21<2:54:10, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0009266284760087729, 'learning_rate': 3.6575169157075306e-06, 'epoch': 8.1} 81%|████████ | 8103/10000 [12:47:21<2:54:10, 5.51s/it][2025-06-20 02:17:06,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:17:06,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.32 | bwd_microstep: 3323.58 | bwd_inner_microstep: 3322.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-20 02:17:06,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.32 | bwd: 3323.60 | bwd_inner: 3322.78 | bwd_allreduce: 0.77 | step: 6.83 81%|████████ | 8104/10000 [12:47:27<2:53:45, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00318143074400723, 'learning_rate': 3.6537837359702776e-06, 'epoch': 8.1} 81%|████████ | 8104/10000 [12:47:27<2:53:45, 5.50s/it][2025-06-20 02:17:11,944] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:17:11,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.42 | bwd_microstep: 3320.44 | bwd_inner_microstep: 3319.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.82 [2025-06-20 02:17:11,945] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.42 | bwd: 3320.46 | bwd_inner: 3319.65 | bwd_allreduce: 0.76 | step: 6.82 81%|████████ | 8105/10000 [12:47:32<2:53:19, 5.49s/it] {'loss': 0.0032, 'grad_norm': 1.5226150751113892, 'learning_rate': 3.650052270875244e-06, 'epoch': 8.11} 81%|████████ | 8105/10000 [12:47:32<2:53:19, 5.49s/it][2025-06-20 02:17:17,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:17:17,482] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.98 | bwd_microstep: 3371.30 | bwd_inner_microstep: 3370.52 
| bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 02:17:17,483] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.98 | bwd: 3371.31 | bwd_inner: 3370.52 | bwd_allreduce: 0.75 | step: 6.68 81%|████████ | 8106/10000 [12:47:38<2:53:42, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0019642964471131563, 'learning_rate': 3.6463225208138476e-06, 'epoch': 8.11} 81%|████████ | 8106/10000 [12:47:38<2:53:42, 5.50s/it][2025-06-20 02:17:22,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:17:22,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.08 | bwd_microstep: 3319.41 | bwd_inner_microstep: 3318.47 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.11 [2025-06-20 02:17:22,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.08 | bwd: 3319.42 | bwd_inner: 3318.47 | bwd_allreduce: 0.91 | step: 7.12 81%|████████ | 8107/10000 [12:47:43<2:53:17, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.008638587780296803, 'learning_rate': 3.6425944861773223e-06, 'epoch': 8.11} 81%|████████ | 8107/10000 [12:47:43<2:53:17, 5.49s/it][2025-06-20 02:17:28,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:17:28,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.20 | bwd_microstep: 3309.68 | bwd_inner_microstep: 3308.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:17:28,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.20 | bwd: 3309.70 | bwd_inner: 3308.90 | bwd_allreduce: 0.75 | step: 6.62 81%|████████ | 8108/10000 [12:47:49<2:52:55, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00021979364100843668, 'learning_rate': 3.638868167356713e-06, 'epoch': 8.11} 81%|████████ | 8108/10000 [12:47:49<2:52:55, 5.48s/it][2025-06-20 02:17:33,880] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:17:33,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.27 | bwd_microstep: 3320.77 | bwd_inner_microstep: 3319.86 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.03 [2025-06-20 02:17:33,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.27 | bwd: 3320.78 | bwd_inner: 3319.86 | bwd_allreduce: 0.88 | step: 7.03 81%|████████ | 8109/10000 [12:47:54<2:52:40, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0017614285461604595, 'learning_rate': 3.635143564742898e-06, 'epoch': 8.11} 81%|████████ | 8109/10000 [12:47:54<2:52:40, 5.48s/it][2025-06-20 02:17:39,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:17:39,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.63 | bwd_microstep: 3405.53 | bwd_inner_microstep: 3404.73 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-20 02:17:39,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.63 | bwd: 3405.55 | bwd_inner: 3404.73 | bwd_allreduce: 0.77 | step: 6.87 81%|████████ | 8110/10000 [12:48:00<2:53:35, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.025636589154601097, 'learning_rate': 3.6314206787265738e-06, 'epoch': 8.11} 81%|████████ | 8110/10000 [12:48:00<2:53:35, 5.51s/it][2025-06-20 02:17:44,935] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:17:44,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.28 | bwd_microstep: 3324.84 | bwd_inner_microstep: 3324.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-20 02:17:44,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.28 | bwd: 3324.85 | bwd_inner: 3324.04 | bwd_allreduce: 0.77 | step: 6.89 81%|████████ | 8111/10000 
[12:48:05<2:53:06, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.010854699648916721, 'learning_rate': 3.62769950969825e-06, 'epoch': 8.11} 81%|████████ | 8111/10000 [12:48:05<2:53:06, 5.50s/it][2025-06-20 02:17:50,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:17:50,471] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.89 | bwd_microstep: 3370.62 | bwd_inner_microstep: 3369.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 02:17:50,472] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.89 | bwd: 3370.64 | bwd_inner: 3369.82 | bwd_allreduce: 0.77 | step: 6.81 81%|████████ | 8112/10000 [12:48:11<2:53:21, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.028496019542217255, 'learning_rate': 3.6239800580482663e-06, 'epoch': 8.11} 81%|████████ | 8112/10000 [12:48:11<2:53:21, 5.51s/it][2025-06-20 02:17:55,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:17:55,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.91 | bwd_microstep: 3309.59 | bwd_inner_microstep: 3308.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-20 02:17:55,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.91 | bwd: 3309.60 | bwd_inner: 3308.79 | bwd_allreduce: 0.76 | step: 6.97 81%|████████ | 8113/10000 [12:48:16<2:52:47, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0016943294322118163, 'learning_rate': 3.6202623241667655e-06, 'epoch': 8.11} 81%|████████ | 8113/10000 [12:48:16<2:52:47, 5.49s/it][2025-06-20 02:18:01,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:18:01,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.46 | bwd_microstep: 3318.25 | bwd_inner_microstep: 
3317.26 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.34 [2025-06-20 02:18:01,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.46 | bwd: 3318.27 | bwd_inner: 3317.26 | bwd_allreduce: 0.95 | step: 7.35 81%|████████ | 8114/10000 [12:48:22<2:52:22, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0011052449699491262, 'learning_rate': 3.616546308443727e-06, 'epoch': 8.11} 81%|████████ | 8114/10000 [12:48:22<2:52:22, 5.48s/it][2025-06-20 02:18:06,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:18:06,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.40 | bwd_microstep: 3322.40 | bwd_inner_microstep: 3321.62 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-20 02:18:06,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.40 | bwd: 3322.41 | bwd_inner: 3321.62 | bwd_allreduce: 0.75 | step: 6.53 81%|████████ | 8115/10000 [12:48:27<2:52:17, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00021147524239495397, 'learning_rate': 3.612832011268943e-06, 'epoch': 8.12} 81%|████████ | 8115/10000 [12:48:27<2:52:17, 5.48s/it][2025-06-20 02:18:12,334] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:18:12,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.00 | bwd_microstep: 3316.83 | bwd_inner_microstep: 3316.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 02:18:12,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.00 | bwd: 3316.85 | bwd_inner: 3316.05 | bwd_allreduce: 0.76 | step: 6.60 81%|████████ | 8116/10000 [12:48:33<2:51:58, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.006352395284920931, 'learning_rate': 3.6091194330320267e-06, 'epoch': 8.12} 81%|████████ | 8116/10000 [12:48:33<2:51:58, 5.48s/it][2025-06-20 02:18:17,796] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:18:17,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.16 | bwd_microstep: 3325.76 | bwd_inner_microstep: 3324.76 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.23 [2025-06-20 02:18:17,797] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.16 | bwd: 3325.77 | bwd_inner: 3324.76 | bwd_allreduce: 0.95 | step: 7.23 81%|████████ | 8117/10000 [12:48:38<2:51:45, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0022283068392425776, 'learning_rate': 3.6054085741224066e-06, 'epoch': 8.12} 81%|████████ | 8117/10000 [12:48:38<2:51:45, 5.47s/it][2025-06-20 02:18:23,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 02:18:23,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.92 | bwd_microstep: 3322.15 | bwd_inner_microstep: 3320.91 | bwd_allreduce_microstep: 1.17 | step_microstep: 7.92 [2025-06-20 02:18:23,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.92 | bwd: 3322.17 | bwd_inner: 3320.91 | bwd_allreduce: 1.20 | step: 7.92 81%|████████ | 8118/10000 [12:48:44<2:51:38, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.02985939383506775, 'learning_rate': 3.60169943492934e-06, 'epoch': 8.12} 81%|████████ | 8118/10000 [12:48:44<2:51:38, 5.47s/it][2025-06-20 02:18:28,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:18:28,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.39 | bwd_microstep: 3375.13 | bwd_inner_microstep: 3374.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 02:18:28,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.39 | bwd: 3375.15 | bwd_inner: 3374.35 | bwd_allreduce: 0.76 | step: 6.65 81%|████████ | 
8119/10000 [12:48:49<2:52:17, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00021686033869627863, 'learning_rate': 3.5979920158418957e-06, 'epoch': 8.12} 81%|████████ | 8119/10000 [12:48:49<2:52:17, 5.50s/it][2025-06-20 02:18:34,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:18:34,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.62 | bwd_microstep: 3371.33 | bwd_inner_microstep: 3370.53 | bwd_allreduce_microstep: 0.75 | step_microstep: 9.07 [2025-06-20 02:18:34,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.62 | bwd: 3371.34 | bwd_inner: 3370.53 | bwd_allreduce: 0.77 | step: 9.11 81%|████████ | 8120/10000 [12:48:55<2:52:35, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00833780039101839, 'learning_rate': 3.5942863172489627e-06, 'epoch': 8.12} 81%|████████ | 8120/10000 [12:48:55<2:52:35, 5.51s/it][2025-06-20 02:18:39,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:18:39,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.58 | bwd_microstep: 3320.69 | bwd_inner_microstep: 3319.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-20 02:18:39,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.58 | bwd: 3320.70 | bwd_inner: 3319.89 | bwd_allreduce: 0.77 | step: 6.72 81%|████████ | 8121/10000 [12:49:00<2:52:01, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0004033090372104198, 'learning_rate': 3.5905823395392546e-06, 'epoch': 8.12} 81%|████████ | 8121/10000 [12:49:00<2:52:01, 5.49s/it][2025-06-20 02:18:45,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:18:45,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.83 | bwd_microstep: 3325.26 | 
bwd_inner_microstep: 3324.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 02:18:45,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.83 | bwd: 3325.27 | bwd_inner: 3324.47 | bwd_allreduce: 0.76 | step: 6.72 81%|████████ | 8122/10000 [12:49:06<2:51:40, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0030873038340359926, 'learning_rate': 3.5868800831013074e-06, 'epoch': 8.12} 81%|████████ | 8122/10000 [12:49:06<2:51:40, 5.48s/it][2025-06-20 02:18:50,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:18:50,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.00 | bwd_microstep: 3320.62 | bwd_inner_microstep: 3319.65 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.14 [2025-06-20 02:18:50,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.00 | bwd: 3320.64 | bwd_inner: 3319.65 | bwd_allreduce: 0.94 | step: 7.14 81%|████████ | 8123/10000 [12:49:11<2:51:23, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0025511966086924076, 'learning_rate': 3.5831795483234567e-06, 'epoch': 8.12} 81%|████████ | 8123/10000 [12:49:11<2:51:23, 5.48s/it][2025-06-20 02:18:56,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:18:56,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.94 | bwd_microstep: 3367.57 | bwd_inner_microstep: 3366.78 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-20 02:18:56,277] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.94 | bwd: 3367.58 | bwd_inner: 3366.78 | bwd_allreduce: 0.76 | step: 7.08 81%|████████ | 8124/10000 [12:49:17<2:51:48, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.7258502244949341, 'learning_rate': 3.579480735593881e-06, 'epoch': 8.12} 81%|████████ | 8124/10000 [12:49:17<2:51:48, 5.50s/it][2025-06-20 02:19:01,799] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:19:01,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.66 | bwd_microstep: 3358.14 | bwd_inner_microstep: 3357.34 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.16 [2025-06-20 02:19:01,800] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.66 | bwd: 3358.15 | bwd_inner: 3357.34 | bwd_allreduce: 0.77 | step: 7.16 81%|████████▏ | 8125/10000 [12:49:22<2:51:59, 5.50s/it] {'loss': 0.0, 'grad_norm': 6.644464156124741e-05, 'learning_rate': 3.5757836453005633e-06, 'epoch': 8.12} 81%|████████▏ | 8125/10000 [12:49:22<2:51:59, 5.50s/it][2025-06-20 02:19:07,326] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:19:07,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.37 | bwd_microstep: 3363.26 | bwd_inner_microstep: 3362.43 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.71 [2025-06-20 02:19:07,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.37 | bwd: 3363.27 | bwd_inner: 3362.43 | bwd_allreduce: 0.80 | step: 6.72 81%|████████▏ | 8126/10000 [12:49:28<2:52:06, 5.51s/it] {'loss': 0.0003, 'grad_norm': 0.06823138147592545, 'learning_rate': 3.572088277831316e-06, 'epoch': 8.13} 81%|████████▏ | 8126/10000 [12:49:28<2:52:06, 5.51s/it][2025-06-20 02:19:12,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:19:12,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.08 | bwd_microstep: 3360.03 | bwd_inner_microstep: 3358.98 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.78 [2025-06-20 02:19:12,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.08 | bwd: 3360.04 | bwd_inner: 3358.98 | bwd_allreduce: 1.01 | 
step: 7.79 81%|████████▏ | 8127/10000 [12:49:33<2:52:12, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0005970283527858555, 'learning_rate': 3.5683946335737686e-06, 'epoch': 8.13} 81%|████████▏ | 8127/10000 [12:49:33<2:52:12, 5.52s/it][2025-06-20 02:19:18,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:19:18,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.71 | bwd_microstep: 3365.69 | bwd_inner_microstep: 3364.91 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 02:19:18,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.71 | bwd: 3365.71 | bwd_inner: 3364.91 | bwd_allreduce: 0.76 | step: 6.65 81%|████████▏ | 8128/10000 [12:49:39<2:52:16, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0017109304899349809, 'learning_rate': 3.5647027129153576e-06, 'epoch': 8.13} 81%|████████▏ | 8128/10000 [12:49:39<2:52:16, 5.52s/it][2025-06-20 02:19:23,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:19:23,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.88 | bwd_microstep: 3319.49 | bwd_inner_microstep: 3318.68 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-20 02:19:23,859] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.88 | bwd: 3319.50 | bwd_inner: 3318.68 | bwd_allreduce: 0.78 | step: 7.15 81%|████████▏ | 8129/10000 [12:49:44<2:51:41, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0008453885675407946, 'learning_rate': 3.561012516243354e-06, 'epoch': 8.13} 81%|████████▏ | 8129/10000 [12:49:44<2:51:41, 5.51s/it][2025-06-20 02:19:29,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:19:29,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.67 | 
bwd_microstep: 3314.69 | bwd_inner_microstep: 3313.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 02:19:29,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.67 | bwd: 3314.70 | bwd_inner: 3313.90 | bwd_allreduce: 0.76 | step: 6.57 81%|████████▏ | 8130/10000 [12:49:50<2:51:10, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00014354668383020908, 'learning_rate': 3.557324043944841e-06, 'epoch': 8.13} 81%|████████▏ | 8130/10000 [12:49:50<2:51:10, 5.49s/it][2025-06-20 02:19:34,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:19:34,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.85 | bwd_microstep: 3310.25 | bwd_inner_microstep: 3309.31 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.13 [2025-06-20 02:19:34,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.85 | bwd: 3310.26 | bwd_inner: 3309.31 | bwd_allreduce: 0.91 | step: 7.14 81%|████████▏ | 8131/10000 [12:49:55<2:50:38, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0065203201957046986, 'learning_rate': 3.5536372964067223e-06, 'epoch': 8.13} 81%|████████▏ | 8131/10000 [12:49:55<2:50:38, 5.48s/it][2025-06-20 02:19:40,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:19:40,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.69 | bwd_microstep: 3313.99 | bwd_inner_microstep: 3313.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 02:19:40,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.69 | bwd: 3314.00 | bwd_inner: 3313.19 | bwd_allreduce: 0.77 | step: 6.76 81%|████████▏ | 8132/10000 [12:50:01<2:50:22, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004746125545352697, 'learning_rate': 3.549952274015722e-06, 'epoch': 8.13} 81%|████████▏ | 8132/10000 [12:50:01<2:50:22, 
5.47s/it][2025-06-20 02:19:45,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:19:45,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.41 | bwd_microstep: 3312.63 | bwd_inner_microstep: 3311.81 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-20 02:19:45,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.41 | bwd: 3312.65 | bwd_inner: 3311.81 | bwd_allreduce: 0.79 | step: 7.24 81%|████████▏ | 8133/10000 [12:50:06<2:50:04, 5.47s/it] {'loss': 0.0013, 'grad_norm': 0.21482014656066895, 'learning_rate': 3.5462689771583823e-06, 'epoch': 8.13} 81%|████████▏ | 8133/10000 [12:50:06<2:50:04, 5.47s/it][2025-06-20 02:19:51,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:19:51,129] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.36 | bwd_microstep: 3314.21 | bwd_inner_microstep: 3313.39 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.91 [2025-06-20 02:19:51,130] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.36 | bwd: 3314.23 | bwd_inner: 3313.39 | bwd_allreduce: 0.79 | step: 6.91 81%|████████▏ | 8134/10000 [12:50:11<2:49:53, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.00578108849003911, 'learning_rate': 3.542587406221061e-06, 'epoch': 8.13} 81%|████████▏ | 8134/10000 [12:50:11<2:49:53, 5.46s/it][2025-06-20 02:19:56,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:19:56,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.85 | bwd_microstep: 3302.84 | bwd_inner_microstep: 3301.88 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.24 [2025-06-20 02:19:56,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.85 | bwd: 3302.86 | 
bwd_inner: 3301.88 | bwd_allreduce: 0.93 | step: 7.25 81%|████████▏ | 8135/10000 [12:50:17<2:49:37, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.016522638499736786, 'learning_rate': 3.5389075615899414e-06, 'epoch': 8.13} 81%|████████▏ | 8135/10000 [12:50:17<2:49:37, 5.46s/it][2025-06-20 02:20:02,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:20:02,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.55 | bwd_microstep: 3312.24 | bwd_inner_microstep: 3311.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 02:20:02,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.56 | bwd: 3312.25 | bwd_inner: 3311.44 | bwd_allreduce: 0.76 | step: 6.73 81%|████████▏ | 8136/10000 [12:50:22<2:49:33, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0001963026588782668, 'learning_rate': 3.5352294436510183e-06, 'epoch': 8.14} 81%|████████▏ | 8136/10000 [12:50:22<2:49:33, 5.46s/it][2025-06-20 02:20:07,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:20:07,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.15 | bwd_microstep: 3331.39 | bwd_inner_microstep: 3330.48 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.27 [2025-06-20 02:20:07,506] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.16 | bwd: 3331.43 | bwd_inner: 3330.48 | bwd_allreduce: 0.87 | step: 8.28 81%|████████▏ | 8137/10000 [12:50:28<2:49:37, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.00040139484917744994, 'learning_rate': 3.5315530527901176e-06, 'epoch': 8.14} 81%|████████▏ | 8137/10000 [12:50:28<2:49:37, 5.46s/it][2025-06-20 02:20:13,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-20 02:20:13,057] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2126.68 | bwd_microstep: 3373.14 | bwd_inner_microstep: 3372.15 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.78 [2025-06-20 02:20:13,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.68 | bwd: 3373.17 | bwd_inner: 3372.15 | bwd_allreduce: 0.94 | step: 7.78 81%|████████▏ | 8138/10000 [12:50:33<2:50:21, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0063789295963943005, 'learning_rate': 3.5278783893928623e-06, 'epoch': 8.14} 81%|████████▏ | 8138/10000 [12:50:33<2:50:21, 5.49s/it][2025-06-20 02:20:18,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:20:18,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2181.81 | bwd_microstep: 3400.28 | bwd_inner_microstep: 3399.40 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.38 [2025-06-20 02:20:18,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2181.81 | bwd: 3400.31 | bwd_inner: 3399.40 | bwd_allreduce: 0.83 | step: 7.39 81%|████████▏ | 8139/10000 [12:50:39<2:51:34, 5.53s/it] {'loss': 0.0008, 'grad_norm': 0.17775416374206543, 'learning_rate': 3.524205453844718e-06, 'epoch': 8.14} 81%|████████▏ | 8139/10000 [12:50:39<2:51:34, 5.53s/it][2025-06-20 02:20:24,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:20:24,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.84 | bwd_microstep: 3375.35 | bwd_inner_microstep: 3374.55 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.15 [2025-06-20 02:20:24,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.84 | bwd: 3375.37 | bwd_inner: 3374.55 | bwd_allreduce: 0.78 | step: 7.15 81%|████████▏ | 8140/10000 [12:50:45<2:51:53, 5.54s/it] {'loss': 0.0, 'grad_norm': 7.47669764677994e-05, 'learning_rate': 3.520534246530951e-06, 'epoch': 8.14} 81%|████████▏ | 
8140/10000 [12:50:45<2:51:53, 5.54s/it][2025-06-20 02:20:29,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:20:29,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.41 | bwd_microstep: 3311.84 | bwd_inner_microstep: 3311.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-20 02:20:29,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.41 | bwd: 3311.85 | bwd_inner: 3311.06 | bwd_allreduce: 0.75 | step: 6.73 81%|████████▏ | 8141/10000 [12:50:50<2:50:56, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0005543335573747754, 'learning_rate': 3.516864767836661e-06, 'epoch': 8.14} 81%|████████▏ | 8141/10000 [12:50:50<2:50:56, 5.52s/it][2025-06-20 02:20:35,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:20:35,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.19 | bwd_microstep: 3324.78 | bwd_inner_microstep: 3323.92 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.49 [2025-06-20 02:20:35,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.19 | bwd: 3324.79 | bwd_inner: 3323.92 | bwd_allreduce: 0.83 | step: 7.51 81%|████████▏ | 8142/10000 [12:50:55<2:50:26, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01774408668279648, 'learning_rate': 3.5131970181467612e-06, 'epoch': 8.14} 81%|████████▏ | 8142/10000 [12:50:55<2:50:26, 5.50s/it][2025-06-20 02:20:40,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.73 [2025-06-20 02:20:40,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2167.87 | bwd_microstep: 3398.84 | bwd_inner_microstep: 3397.89 | bwd_allreduce_microstep: 0.86 | step_microstep: 8.68 [2025-06-20 02:20:40,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2167.87 | bwd: 3398.87 | bwd_inner: 3397.89 | bwd_allreduce: 0.90 | step: 8.68 81%|████████▏ | 8143/10000 [12:51:01<2:51:24, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.00033558576251380146, 'learning_rate': 3.509530997845969e-06, 'epoch': 8.14} 81%|████████▏ | 8143/10000 [12:51:01<2:51:24, 5.54s/it][2025-06-20 02:20:46,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:20:46,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.46 | bwd_microstep: 3314.68 | bwd_inner_microstep: 3313.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 02:20:46,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.46 | bwd: 3314.70 | bwd_inner: 3313.89 | bwd_allreduce: 0.76 | step: 6.73 81%|████████▏ | 8144/10000 [12:51:07<2:50:58, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.000490349717438221, 'learning_rate': 3.5058667073188433e-06, 'epoch': 8.14} 81%|████████▏ | 8144/10000 [12:51:07<2:50:58, 5.53s/it][2025-06-20 02:20:51,839] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-20 02:20:51,840] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.94 | bwd_microstep: 3366.75 | bwd_inner_microstep: 3365.63 | bwd_allreduce_microstep: 1.02 | step_microstep: 8.45 [2025-06-20 02:20:51,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.94 | bwd: 3366.79 | bwd_inner: 3365.63 | bwd_allreduce: 1.07 | step: 8.45 81%|████████▏ | 8145/10000 [12:51:12<2:50:57, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0018952478421851993, 'learning_rate': 3.502204146949746e-06, 'epoch': 8.14} 81%|████████▏ | 8145/10000 [12:51:12<2:50:57, 5.53s/it][2025-06-20 02:20:57,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:20:57,321] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.57 | bwd_microstep: 3319.05 | bwd_inner_microstep: 3318.25 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.02 [2025-06-20 02:20:57,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.57 | bwd: 3319.07 | bwd_inner: 3318.25 | bwd_allreduce: 0.77 | step: 7.02 81%|████████▏ | 8146/10000 [12:51:18<2:50:21, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.25074702501296997, 'learning_rate': 3.4985433171228667e-06, 'epoch': 8.15} 81%|████████▏ | 8146/10000 [12:51:18<2:50:21, 5.51s/it][2025-06-20 02:21:02,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:21:02,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.41 | bwd_microstep: 3368.14 | bwd_inner_microstep: 3367.36 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 02:21:02,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.41 | bwd: 3368.15 | bwd_inner: 3367.36 | bwd_allreduce: 0.75 | step: 6.60 81%|████████▏ | 8147/10000 [12:51:23<2:50:26, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0007953550084494054, 'learning_rate': 3.4948842182222122e-06, 'epoch': 8.15} 81%|████████▏ | 8147/10000 [12:51:23<2:50:26, 5.52s/it][2025-06-20 02:21:08,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 02:21:08,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.92 | bwd_microstep: 3322.00 | bwd_inner_microstep: 3321.11 | bwd_allreduce_microstep: 0.83 | step_microstep: 8.28 [2025-06-20 02:21:08,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.92 | bwd: 3322.02 | bwd_inner: 3321.11 | bwd_allreduce: 0.86 | step: 8.29 81%|████████▏ | 8148/10000 [12:51:29<2:50:10, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0013020546175539494, 'learning_rate': 3.491226850631595e-06, 
'epoch': 8.15} 81%|████████▏ | 8148/10000 [12:51:29<2:50:10, 5.51s/it][2025-06-20 02:21:13,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:21:13,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2161.46 | bwd_microstep: 3358.26 | bwd_inner_microstep: 3357.42 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.34 [2025-06-20 02:21:13,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2161.47 | bwd: 3358.28 | bwd_inner: 3357.42 | bwd_allreduce: 0.80 | step: 7.34 81%|████████▏ | 8149/10000 [12:51:34<2:50:35, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.009979619644582272, 'learning_rate': 3.487571214734664e-06, 'epoch': 8.15} 81%|████████▏ | 8149/10000 [12:51:34<2:50:35, 5.53s/it][2025-06-20 02:21:19,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:21:19,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.22 | bwd_microstep: 3318.74 | bwd_inner_microstep: 3317.88 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.21 [2025-06-20 02:21:19,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.22 | bwd: 3318.77 | bwd_inner: 3317.88 | bwd_allreduce: 0.82 | step: 7.21 82%|████████▏ | 8150/10000 [12:51:40<2:49:53, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0004108869470655918, 'learning_rate': 3.4839173109148728e-06, 'epoch': 8.15} 82%|████████▏ | 8150/10000 [12:51:40<2:49:53, 5.51s/it][2025-06-20 02:21:24,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:21:24,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2155.63 | bwd_microstep: 3364.77 | bwd_inner_microstep: 3363.89 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.17 [2025-06-20 02:21:24,947] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2155.63 | bwd: 3364.79 | bwd_inner: 3363.89 | bwd_allreduce: 0.84 | step: 7.17 82%|████████▏ | 8151/10000 [12:51:45<2:50:17, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.001365208183415234, 'learning_rate': 3.480265139555503e-06, 'epoch': 8.15} 82%|████████▏ | 8151/10000 [12:51:45<2:50:17, 5.53s/it][2025-06-20 02:21:30,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 02:21:30,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.86 | bwd_microstep: 3324.67 | bwd_inner_microstep: 3323.77 | bwd_allreduce_microstep: 0.83 | step_microstep: 8.26 [2025-06-20 02:21:30,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.86 | bwd: 3324.70 | bwd_inner: 3323.77 | bwd_allreduce: 0.86 | step: 8.26 82%|████████▏ | 8152/10000 [12:51:51<2:50:04, 5.52s/it] {'loss': 0.0, 'grad_norm': 3.536889562383294e-05, 'learning_rate': 3.476614701039651e-06, 'epoch': 8.15} 82%|████████▏ | 8152/10000 [12:51:51<2:50:04, 5.52s/it][2025-06-20 02:21:36,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 02:21:36,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2162.12 | bwd_microstep: 3373.58 | bwd_inner_microstep: 3372.67 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.54 [2025-06-20 02:21:36,042] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2162.12 | bwd: 3373.61 | bwd_inner: 3372.67 | bwd_allreduce: 0.87 | step: 7.55 82%|████████▏ | 8153/10000 [12:51:56<2:50:33, 5.54s/it] {'loss': 0.0001, 'grad_norm': 0.024536026641726494, 'learning_rate': 3.472965995750228e-06, 'epoch': 8.15} 82%|████████▏ | 8153/10000 [12:51:56<2:50:33, 5.54s/it][2025-06-20 02:21:41,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.72 [2025-06-20 02:21:41,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.79 | bwd_microstep: 3328.04 | bwd_inner_microstep: 3327.23 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.70 [2025-06-20 02:21:41,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.79 | bwd: 3328.06 | bwd_inner: 3327.23 | bwd_allreduce: 0.78 | step: 7.72 82%|████████▏ | 8154/10000 [12:52:02<2:50:15, 5.53s/it] {'loss': 0.0005, 'grad_norm': 0.086124949157238, 'learning_rate': 3.4693190240699704e-06, 'epoch': 8.15} 82%|████████▏ | 8154/10000 [12:52:02<2:50:15, 5.53s/it][2025-06-20 02:21:47,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:21:47,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.33 | bwd_microstep: 3377.93 | bwd_inner_microstep: 3377.06 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.26 [2025-06-20 02:21:47,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.33 | bwd: 3377.95 | bwd_inner: 3377.06 | bwd_allreduce: 0.83 | step: 7.27 82%|████████▏ | 8155/10000 [12:52:07<2:50:25, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.00015568900562357157, 'learning_rate': 3.465673786381423e-06, 'epoch': 8.15} 82%|████████▏ | 8155/10000 [12:52:07<2:50:25, 5.54s/it][2025-06-20 02:21:52,609] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:21:52,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.77 | bwd_microstep: 3336.61 | bwd_inner_microstep: 3335.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 02:21:52,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.77 | bwd: 3336.63 | bwd_inner: 3335.82 | bwd_allreduce: 0.77 | step: 6.70 82%|████████▏ | 8156/10000 [12:52:13<2:49:47, 5.52s/it] {'loss': 0.0, 'grad_norm': 
0.005906625185161829, 'learning_rate': 3.4620302830669595e-06, 'epoch': 8.16} 82%|████████▏ | 8156/10000 [12:52:13<2:49:47, 5.52s/it][2025-06-20 02:21:58,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:21:58,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.00 | bwd_microstep: 3374.34 | bwd_inner_microstep: 3373.53 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-20 02:21:58,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.00 | bwd: 3374.36 | bwd_inner: 3373.53 | bwd_allreduce: 0.78 | step: 7.15 82%|████████▏ | 8157/10000 [12:52:18<2:49:50, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.00930821429938078, 'learning_rate': 3.4583885145087613e-06, 'epoch': 8.16} 82%|████████▏ | 8157/10000 [12:52:18<2:49:50, 5.53s/it][2025-06-20 02:22:03,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:22:03,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.63 | bwd_microstep: 3370.51 | bwd_inner_microstep: 3369.37 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.52 [2025-06-20 02:22:03,709] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.63 | bwd: 3370.53 | bwd_inner: 3369.37 | bwd_allreduce: 1.10 | step: 7.52 82%|████████▏ | 8158/10000 [12:52:24<2:50:02, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0022050305269658566, 'learning_rate': 3.4547484810888433e-06, 'epoch': 8.16} 82%|████████▏ | 8158/10000 [12:52:24<2:50:02, 5.54s/it][2025-06-20 02:22:09,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:22:09,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.32 | bwd_microstep: 3332.29 | bwd_inner_microstep: 3331.49 | bwd_allreduce_microstep: 0.75 | 
step_microstep: 6.71 [2025-06-20 02:22:09,208] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.32 | bwd: 3332.30 | bwd_inner: 3331.49 | bwd_allreduce: 0.77 | step: 6.71 82%|████████▏ | 8159/10000 [12:52:30<2:49:33, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.02701469324529171, 'learning_rate': 3.4511101831890146e-06, 'epoch': 8.16} 82%|████████▏ | 8159/10000 [12:52:30<2:49:33, 5.53s/it][2025-06-20 02:22:14,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:22:14,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.64 | bwd_microstep: 3321.53 | bwd_inner_microstep: 3320.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:22:14,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.64 | bwd: 3321.54 | bwd_inner: 3320.75 | bwd_allreduce: 0.75 | step: 6.62 82%|████████▏ | 8160/10000 [12:52:35<2:49:05, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.005043108481913805, 'learning_rate': 3.447473621190922e-06, 'epoch': 8.16} 82%|████████▏ | 8160/10000 [12:52:35<2:49:05, 5.51s/it][2025-06-20 02:22:20,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:22:20,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.49 | bwd_microstep: 3325.51 | bwd_inner_microstep: 3324.63 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.10 [2025-06-20 02:22:20,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.49 | bwd: 3325.52 | bwd_inner: 3324.63 | bwd_allreduce: 0.85 | step: 7.11 82%|████████▏ | 8161/10000 [12:52:40<2:48:44, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.001761189429089427, 'learning_rate': 3.4438387954760242e-06, 'epoch': 8.16} 82%|████████▏ | 8161/10000 [12:52:40<2:48:44, 5.51s/it][2025-06-20 02:22:25,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:22:25,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.37 | bwd_microstep: 3319.90 | bwd_inner_microstep: 3319.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-20 02:22:25,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.37 | bwd: 3319.91 | bwd_inner: 3319.09 | bwd_allreduce: 0.77 | step: 7.14 82%|████████▏ | 8162/10000 [12:52:46<2:48:22, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0018812085036188364, 'learning_rate': 3.4402057064255988e-06, 'epoch': 8.16} 82%|████████▏ | 8162/10000 [12:52:46<2:48:22, 5.50s/it][2025-06-20 02:22:31,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:22:31,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.32 | bwd_microstep: 3372.39 | bwd_inner_microstep: 3371.37 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.32 [2025-06-20 02:22:31,207] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.32 | bwd: 3372.43 | bwd_inner: 3371.37 | bwd_allreduce: 0.95 | step: 7.31 82%|████████▏ | 8163/10000 [12:52:52<2:48:48, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.026504822075366974, 'learning_rate': 3.436574354420741e-06, 'epoch': 8.16} 82%|████████▏ | 8163/10000 [12:52:52<2:48:48, 5.51s/it][2025-06-20 02:22:36,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:22:36,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.40 | bwd_microstep: 3329.57 | bwd_inner_microstep: 3328.70 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.63 [2025-06-20 02:22:36,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.40 | bwd: 3329.60 | bwd_inner: 3328.70 | bwd_allreduce: 0.83 | step: 7.64 82%|████████▏ | 8164/10000 
[12:52:57<2:48:28, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.018862459808588028, 'learning_rate': 3.4329447398423564e-06, 'epoch': 8.16} 82%|████████▏ | 8164/10000 [12:52:57<2:48:28, 5.51s/it][2025-06-20 02:22:42,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:22:42,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.19 | bwd_microstep: 3320.50 | bwd_inner_microstep: 3319.71 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-20 02:22:42,162] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.19 | bwd: 3320.52 | bwd_inner: 3319.71 | bwd_allreduce: 0.76 | step: 6.81 82%|████████▏ | 8165/10000 [12:53:02<2:48:01, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004379332065582275, 'learning_rate': 3.4293168630711772e-06, 'epoch': 8.16} 82%|████████▏ | 8165/10000 [12:53:02<2:48:01, 5.49s/it][2025-06-20 02:22:47,653] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:22:47,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.84 | bwd_microstep: 3332.79 | bwd_inner_microstep: 3331.92 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.88 [2025-06-20 02:22:47,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.84 | bwd: 3332.82 | bwd_inner: 3331.92 | bwd_allreduce: 0.83 | step: 7.89 82%|████████▏ | 8166/10000 [12:53:08<2:47:57, 5.50s/it] {'loss': 0.0, 'grad_norm': 6.72034511808306e-05, 'learning_rate': 3.4256907244877537e-06, 'epoch': 8.17} 82%|████████▏ | 8166/10000 [12:53:08<2:47:57, 5.50s/it][2025-06-20 02:22:53,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:22:53,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.88 | bwd_microstep: 3402.60 | 
bwd_inner_microstep: 3401.81 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 02:22:53,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.88 | bwd: 3402.61 | bwd_inner: 3401.81 | bwd_allreduce: 0.76 | step: 6.58 82%|████████▏ | 8167/10000 [12:53:14<2:48:49, 5.53s/it] {'loss': 0.0032, 'grad_norm': 2.370586395263672, 'learning_rate': 3.4220663244724483e-06, 'epoch': 8.17} 82%|████████▏ | 8167/10000 [12:53:14<2:48:49, 5.53s/it][2025-06-20 02:22:58,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:22:58,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.73 | bwd_microstep: 3324.64 | bwd_inner_microstep: 3323.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-20 02:22:58,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.73 | bwd: 3324.65 | bwd_inner: 3323.86 | bwd_allreduce: 0.75 | step: 6.54 82%|████████▏ | 8168/10000 [12:53:19<2:48:13, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0014242122415453196, 'learning_rate': 3.418443663405446e-06, 'epoch': 8.17} 82%|████████▏ | 8168/10000 [12:53:19<2:48:13, 5.51s/it][2025-06-20 02:23:04,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:23:04,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.04 | bwd_microstep: 3385.89 | bwd_inner_microstep: 3385.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-20 02:23:04,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.04 | bwd: 3385.91 | bwd_inner: 3385.11 | bwd_allreduce: 0.75 | step: 6.57 82%|████████▏ | 8169/10000 [12:53:25<2:48:47, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0017549553886055946, 'learning_rate': 3.4148227416667455e-06, 'epoch': 8.17} 82%|████████▏ | 8169/10000 [12:53:25<2:48:47, 5.53s/it][2025-06-20 02:23:09,881] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.80 [2025-06-20 02:23:09,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2148.39 | bwd_microstep: 3385.64 | bwd_inner_microstep: 3384.73 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.04 [2025-06-20 02:23:09,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2148.39 | bwd: 3385.65 | bwd_inner: 3384.73 | bwd_allreduce: 0.88 | step: 7.04 82%|████████▏ | 8170/10000 [12:53:30<2:49:03, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0004445973609108478, 'learning_rate': 3.411203559636165e-06, 'epoch': 8.17} 82%|████████▏ | 8170/10000 [12:53:30<2:49:03, 5.54s/it][2025-06-20 02:23:15,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:23:15,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.07 | bwd_microstep: 3327.04 | bwd_inner_microstep: 3326.22 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.33 [2025-06-20 02:23:15,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.07 | bwd: 3327.05 | bwd_inner: 3326.22 | bwd_allreduce: 0.79 | step: 7.34 82%|████████▏ | 8171/10000 [12:53:36<2:48:25, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.003927523270249367, 'learning_rate': 3.4075861176933402e-06, 'epoch': 8.17} 82%|████████▏ | 8171/10000 [12:53:36<2:48:25, 5.53s/it][2025-06-20 02:23:20,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 02:23:20,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.58 | bwd_microstep: 3323.74 | bwd_inner_microstep: 3322.67 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.59 [2025-06-20 02:23:20,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.58 | bwd: 3323.75 | bwd_inner: 3322.67 | bwd_allreduce: 1.03 | step: 
7.59 82%|████████▏ | 8172/10000 [12:53:41<2:47:53, 5.51s/it] {'loss': 0.0, 'grad_norm': 8.97977442946285e-05, 'learning_rate': 3.4039704162177232e-06, 'epoch': 8.17} 82%|████████▏ | 8172/10000 [12:53:41<2:47:53, 5.51s/it][2025-06-20 02:23:26,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:23:26,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.60 | bwd_microstep: 3401.92 | bwd_inner_microstep: 3401.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-20 02:23:26,426] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.60 | bwd: 3401.93 | bwd_inner: 3401.13 | bwd_allreduce: 0.76 | step: 6.81 82%|████████▏ | 8173/10000 [12:53:47<2:48:28, 5.53s/it] {'loss': 0.0003, 'grad_norm': 0.07712904363870621, 'learning_rate': 3.4003564555885915e-06, 'epoch': 8.17} 82%|████████▏ | 8173/10000 [12:53:47<2:48:28, 5.53s/it][2025-06-20 02:23:31,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:23:31,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.84 | bwd_microstep: 3319.91 | bwd_inner_microstep: 3319.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 02:23:31,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.84 | bwd: 3319.93 | bwd_inner: 3319.11 | bwd_allreduce: 0.77 | step: 7.02 82%|████████▏ | 8174/10000 [12:53:52<2:47:45, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.009406544268131256, 'learning_rate': 3.396744236185021e-06, 'epoch': 8.17} 82%|████████▏ | 8174/10000 [12:53:52<2:47:45, 5.51s/it][2025-06-20 02:23:37,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 02:23:37,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.61 | 
bwd_microstep: 3323.77 | bwd_inner_microstep: 3322.99 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-20 02:23:37,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.61 | bwd: 3323.78 | bwd_inner: 3322.99 | bwd_allreduce: 0.75 | step: 6.53 82%|████████▏ | 8175/10000 [12:53:58<2:47:18, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.07235690206289291, 'learning_rate': 3.39313375838592e-06, 'epoch': 8.18} 82%|████████▏ | 8175/10000 [12:53:58<2:47:18, 5.50s/it][2025-06-20 02:23:42,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:23:42,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.32 | bwd_microstep: 3369.92 | bwd_inner_microstep: 3369.10 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.15 [2025-06-20 02:23:42,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.32 | bwd: 3369.94 | bwd_inner: 3369.10 | bwd_allreduce: 0.79 | step: 7.16 82%|████████▏ | 8176/10000 [12:54:03<2:47:35, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.003928241319954395, 'learning_rate': 3.389525022570015e-06, 'epoch': 8.18} 82%|████████▏ | 8176/10000 [12:54:03<2:47:35, 5.51s/it][2025-06-20 02:23:48,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:23:48,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.51 | bwd_microstep: 3338.40 | bwd_inner_microstep: 3337.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-20 02:23:48,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.51 | bwd: 3338.42 | bwd_inner: 3337.61 | bwd_allreduce: 0.76 | step: 6.88 82%|████████▏ | 8177/10000 [12:54:09<2:47:16, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0008155657560564578, 'learning_rate': 3.3859180291158446e-06, 'epoch': 8.18} 82%|████████▏ | 8177/10000 [12:54:09<2:47:16, 
5.51s/it][2025-06-20 02:23:53,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:23:53,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.04 | bwd_microstep: 3333.33 | bwd_inner_microstep: 3332.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 02:23:53,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.04 | bwd: 3333.34 | bwd_inner: 3332.54 | bwd_allreduce: 0.76 | step: 6.81 82%|████████▏ | 8178/10000 [12:54:14<2:46:55, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.006411944981664419, 'learning_rate': 3.382312778401766e-06, 'epoch': 8.18} 82%|████████▏ | 8178/10000 [12:54:14<2:46:55, 5.50s/it][2025-06-20 02:23:59,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:23:59,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.41 | bwd_microstep: 3401.37 | bwd_inner_microstep: 3400.26 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.29 [2025-06-20 02:23:59,454] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.41 | bwd: 3401.39 | bwd_inner: 3400.26 | bwd_allreduce: 1.07 | step: 7.29 82%|████████▏ | 8179/10000 [12:54:20<2:47:38, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.000439795374404639, 'learning_rate': 3.378709270805951e-06, 'epoch': 8.18} 82%|████████▏ | 8179/10000 [12:54:20<2:47:38, 5.52s/it][2025-06-20 02:24:04,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:24:04,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.26 | bwd_microstep: 3332.88 | bwd_inner_microstep: 3332.05 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.10 [2025-06-20 02:24:04,948] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.26 | bwd: 3332.90 | 
bwd_inner: 3332.05 | bwd_allreduce: 0.80 | step: 7.11 82%|████████▏ | 8180/10000 [12:54:25<2:47:15, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0056914216838777065, 'learning_rate': 3.37510750670639e-06, 'epoch': 8.18} 82%|████████▏ | 8180/10000 [12:54:25<2:47:15, 5.51s/it][2025-06-20 02:24:10,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:24:10,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.76 | bwd_microstep: 3327.58 | bwd_inner_microstep: 3326.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 02:24:10,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.76 | bwd: 3327.59 | bwd_inner: 3326.79 | bwd_allreduce: 0.77 | step: 6.78 82%|████████▏ | 8181/10000 [12:54:31<2:46:52, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00012505798076745123, 'learning_rate': 3.3715074864808916e-06, 'epoch': 8.18} 82%|████████▏ | 8181/10000 [12:54:31<2:46:52, 5.50s/it][2025-06-20 02:24:15,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:24:15,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.59 | bwd_microstep: 3314.99 | bwd_inner_microstep: 3313.96 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.28 [2025-06-20 02:24:15,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.59 | bwd: 3315.01 | bwd_inner: 3313.96 | bwd_allreduce: 0.99 | step: 7.28 82%|████████▏ | 8182/10000 [12:54:36<2:46:26, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004338169936090708, 'learning_rate': 3.367909210507083e-06, 'epoch': 8.18} 82%|████████▏ | 8182/10000 [12:54:36<2:46:26, 5.49s/it][2025-06-20 02:24:21,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:24:21,371] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2113.84 | bwd_microstep: 3321.69 | bwd_inner_microstep: 3320.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 02:24:21,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.84 | bwd: 3321.71 | bwd_inner: 3320.90 | bwd_allreduce: 0.77 | step: 6.72 82%|████████▏ | 8183/10000 [12:54:42<2:46:10, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.004273703321814537, 'learning_rate': 3.3643126791624114e-06, 'epoch': 8.18} 82%|████████▏ | 8183/10000 [12:54:42<2:46:10, 5.49s/it][2025-06-20 02:24:26,916] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 02:24:26,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.78 | bwd_microstep: 3374.55 | bwd_inner_microstep: 3373.53 | bwd_allreduce_microstep: 0.96 | step_microstep: 8.02 [2025-06-20 02:24:26,917] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.78 | bwd: 3374.57 | bwd_inner: 3373.53 | bwd_allreduce: 0.99 | step: 8.03 82%|████████▏ | 8184/10000 [12:54:47<2:46:39, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0036319149658083916, 'learning_rate': 3.360717892824128e-06, 'epoch': 8.18} 82%|████████▏ | 8184/10000 [12:54:47<2:46:39, 5.51s/it][2025-06-20 02:24:32,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:24:32,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.50 | bwd_microstep: 3319.78 | bwd_inner_microstep: 3318.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-20 02:24:32,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.50 | bwd: 3319.79 | bwd_inner: 3318.98 | bwd_allreduce: 0.77 | step: 7.07 82%|████████▏ | 8185/10000 [12:54:53<2:46:14, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.08302083611488342, 'learning_rate': 3.3571248518693112e-06, 'epoch': 8.19} 82%|████████▏ | 
8185/10000 [12:54:53<2:46:14, 5.50s/it][2025-06-20 02:24:37,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:24:37,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.92 | bwd_microstep: 3329.11 | bwd_inner_microstep: 3328.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-20 02:24:37,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.92 | bwd: 3329.12 | bwd_inner: 3328.32 | bwd_allreduce: 0.76 | step: 6.74 82%|████████▏ | 8186/10000 [12:54:58<2:45:58, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00028461404144763947, 'learning_rate': 3.353533556674855e-06, 'epoch': 8.19} 82%|████████▏ | 8186/10000 [12:54:58<2:45:58, 5.49s/it][2025-06-20 02:24:43,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:24:43,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.47 | bwd_microstep: 3334.08 | bwd_inner_microstep: 3333.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 02:24:43,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.47 | bwd: 3334.09 | bwd_inner: 3333.29 | bwd_allreduce: 0.76 | step: 6.73 82%|████████▏ | 8187/10000 [12:55:04<2:45:47, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00041269962093792856, 'learning_rate': 3.3499440076174717e-06, 'epoch': 8.19} 82%|████████▏ | 8187/10000 [12:55:04<2:45:47, 5.49s/it][2025-06-20 02:24:48,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:24:48,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.16 | bwd_microstep: 3325.97 | bwd_inner_microstep: 3324.99 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.65 [2025-06-20 02:24:48,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2108.16 | bwd: 3325.98 | bwd_inner: 3324.99 | bwd_allreduce: 0.95 | step: 7.65 82%|████████▏ | 8188/10000 [12:55:09<2:45:36, 5.48s/it] {'loss': 0.0008, 'grad_norm': 0.3146677315235138, 'learning_rate': 3.3463562050736844e-06, 'epoch': 8.19} 82%|████████▏ | 8188/10000 [12:55:09<2:45:36, 5.48s/it][2025-06-20 02:24:54,377] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:24:54,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.74 | bwd_microstep: 3380.90 | bwd_inner_microstep: 3380.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-20 02:24:54,378] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.74 | bwd: 3380.92 | bwd_inner: 3380.10 | bwd_allreduce: 0.77 | step: 6.76 82%|████████▏ | 8189/10000 [12:55:15<2:46:09, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.003590172156691551, 'learning_rate': 3.3427701494198404e-06, 'epoch': 8.19} 82%|████████▏ | 8189/10000 [12:55:15<2:46:09, 5.50s/it][2025-06-20 02:24:59,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:24:59,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.76 | bwd_microstep: 3331.82 | bwd_inner_microstep: 3331.04 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-20 02:24:59,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.76 | bwd: 3331.83 | bwd_inner: 3331.04 | bwd_allreduce: 0.75 | step: 6.53 82%|████████▏ | 8190/10000 [12:55:20<2:45:52, 5.50s/it] {'loss': 0.0, 'grad_norm': 6.365553417708725e-05, 'learning_rate': 3.339185841032098e-06, 'epoch': 8.19} 82%|████████▏ | 8190/10000 [12:55:20<2:45:52, 5.50s/it][2025-06-20 02:25:05,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:25:05,338] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.02 | bwd_microstep: 3330.35 | bwd_inner_microstep: 3329.55 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-20 02:25:05,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.02 | bwd: 3330.36 | bwd_inner: 3329.55 | bwd_allreduce: 0.77 | step: 7.00 82%|████████▏ | 8191/10000 [12:55:26<2:45:35, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0049330443143844604, 'learning_rate': 3.335603280286437e-06, 'epoch': 8.19} 82%|████████▏ | 8191/10000 [12:55:26<2:45:35, 5.49s/it][2025-06-20 02:25:10,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:25:10,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.90 | bwd_microstep: 3383.12 | bwd_inner_microstep: 3382.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-20 02:25:10,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.90 | bwd: 3383.14 | bwd_inner: 3382.34 | bwd_allreduce: 0.76 | step: 6.77 82%|████████▏ | 8192/10000 [12:55:31<2:46:08, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0013011402916163206, 'learning_rate': 3.332022467558651e-06, 'epoch': 8.19} 82%|████████▏ | 8192/10000 [12:55:31<2:46:08, 5.51s/it][2025-06-20 02:25:16,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:25:16,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.26 | bwd_microstep: 3327.46 | bwd_inner_microstep: 3326.57 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.83 [2025-06-20 02:25:16,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.26 | bwd: 3327.48 | bwd_inner: 3326.57 | bwd_allreduce: 0.87 | step: 6.83 82%|████████▏ | 8193/10000 [12:55:37<2:45:39, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.1337808072566986, 'learning_rate': 3.328443403224353e-06, 
'epoch': 8.19} 82%|████████▏ | 8193/10000 [12:55:37<2:45:39, 5.50s/it][2025-06-20 02:25:21,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:25:21,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.59 | bwd_microstep: 3327.56 | bwd_inner_microstep: 3326.54 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.58 [2025-06-20 02:25:21,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.59 | bwd: 3327.57 | bwd_inner: 3326.54 | bwd_allreduce: 0.99 | step: 7.59 82%|████████▏ | 8194/10000 [12:55:42<2:45:29, 5.50s/it] {'loss': 0.0007, 'grad_norm': 0.21566742658615112, 'learning_rate': 3.324866087658962e-06, 'epoch': 8.19} 82%|████████▏ | 8194/10000 [12:55:42<2:45:29, 5.50s/it][2025-06-20 02:25:27,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:25:27,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.75 | bwd_microstep: 3368.33 | bwd_inner_microstep: 3367.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-20 02:25:27,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.75 | bwd: 3368.35 | bwd_inner: 3367.52 | bwd_allreduce: 0.78 | step: 7.18 82%|████████▏ | 8195/10000 [12:55:48<2:46:01, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.04348526895046234, 'learning_rate': 3.321290521237728e-06, 'epoch': 8.2} 82%|████████▏ | 8195/10000 [12:55:48<2:46:01, 5.52s/it][2025-06-20 02:25:32,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:25:32,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.40 | bwd_microstep: 3320.42 | bwd_inner_microstep: 3319.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-20 02:25:32,893] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.40 | bwd: 3320.43 | bwd_inner: 3319.63 | bwd_allreduce: 0.76 | step: 6.78 82%|████████▏ | 8196/10000 [12:55:53<2:45:24, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0006713305483572185, 'learning_rate': 3.317716704335707e-06, 'epoch': 8.2} 82%|████████▏ | 8196/10000 [12:55:53<2:45:24, 5.50s/it][2025-06-20 02:25:38,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:25:38,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.88 | bwd_microstep: 3309.65 | bwd_inner_microstep: 3308.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-20 02:25:38,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.88 | bwd: 3309.67 | bwd_inner: 3308.85 | bwd_allreduce: 0.77 | step: 6.68 82%|████████▏ | 8197/10000 [12:55:59<2:44:52, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0004980482626706362, 'learning_rate': 3.3141446373277808e-06, 'epoch': 8.2} 82%|████████▏ | 8197/10000 [12:55:59<2:44:52, 5.49s/it][2025-06-20 02:25:43,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:25:43,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.15 | bwd_microstep: 3310.93 | bwd_inner_microstep: 3310.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 02:25:43,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.15 | bwd: 3310.94 | bwd_inner: 3310.15 | bwd_allreduce: 0.76 | step: 6.68 82%|████████▏ | 8198/10000 [12:56:04<2:44:30, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.006627582013607025, 'learning_rate': 3.3105743205886465e-06, 'epoch': 8.2} 82%|████████▏ | 8198/10000 [12:56:04<2:44:30, 5.48s/it][2025-06-20 02:25:49,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 
2.73 [2025-06-20 02:25:49,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.18 | bwd_microstep: 3316.82 | bwd_inner_microstep: 3315.92 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.98 [2025-06-20 02:25:49,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.18 | bwd: 3316.83 | bwd_inner: 3315.92 | bwd_allreduce: 0.87 | step: 6.98 82%|████████▏ | 8199/10000 [12:56:10<2:44:17, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00638318108394742, 'learning_rate': 3.3070057544928027e-06, 'epoch': 8.2} 82%|████████▏ | 8199/10000 [12:56:10<2:44:17, 5.47s/it][2025-06-20 02:25:54,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 02:25:54,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.09 | bwd_microstep: 3330.46 | bwd_inner_microstep: 3329.37 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.65 [2025-06-20 02:25:54,748] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.09 | bwd: 3330.48 | bwd_inner: 3329.37 | bwd_allreduce: 1.06 | step: 7.66 82%|████████▏ | 8200/10000 [12:56:15<2:44:18, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0002759566705208272, 'learning_rate': 3.3034389394145806e-06, 'epoch': 8.2} 82%|████████▏ | 8200/10000 [12:56:15<2:44:18, 5.48s/it][2025-06-20 02:26:00,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:26:00,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.50 | bwd_microstep: 3365.66 | bwd_inner_microstep: 3364.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 02:26:00,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.50 | bwd: 3365.67 | bwd_inner: 3364.86 | bwd_allreduce: 0.76 | step: 6.72 82%|████████▏ | 8201/10000 [12:56:21<2:44:47, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.022757187485694885, 
'learning_rate': 3.299873875728121e-06, 'epoch': 8.2} 82%|████████▏ | 8201/10000 [12:56:21<2:44:47, 5.50s/it][2025-06-20 02:26:05,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:26:05,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.72 | bwd_microstep: 3372.26 | bwd_inner_microstep: 3371.43 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.88 [2025-06-20 02:26:05,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.72 | bwd: 3372.28 | bwd_inner: 3371.43 | bwd_allreduce: 0.79 | step: 6.89 82%|████████▏ | 8202/10000 [12:56:26<2:45:04, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0033227598760277033, 'learning_rate': 3.2963105638073854e-06, 'epoch': 8.2} 82%|████████▏ | 8202/10000 [12:56:26<2:45:04, 5.51s/it][2025-06-20 02:26:11,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:26:11,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.89 | bwd_microstep: 3366.36 | bwd_inner_microstep: 3365.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 02:26:11,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.89 | bwd: 3366.37 | bwd_inner: 3365.56 | bwd_allreduce: 0.76 | step: 6.82 82%|████████▏ | 8203/10000 [12:56:32<2:45:13, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.01975165121257305, 'learning_rate': 3.2927490040261476e-06, 'epoch': 8.2} 82%|████████▏ | 8203/10000 [12:56:32<2:45:13, 5.52s/it][2025-06-20 02:26:16,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.65 | optimizer_step: 2.73 [2025-06-20 02:26:16,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.49 | bwd_microstep: 3312.63 | bwd_inner_microstep: 3311.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.30 [2025-06-20 
02:26:16,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.49 | bwd: 3312.64 | bwd_inner: 3311.83 | bwd_allreduce: 0.77 | step: 7.30 82%|████████▏ | 8204/10000 [12:56:37<2:44:37, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.07845397293567657, 'learning_rate': 3.289189196757998e-06, 'epoch': 8.2} 82%|████████▏ | 8204/10000 [12:56:37<2:44:37, 5.50s/it][2025-06-20 02:26:22,292] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:26:22,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.58 | bwd_microstep: 3326.00 | bwd_inner_microstep: 3324.79 | bwd_allreduce_microstep: 1.14 | step_microstep: 7.69 [2025-06-20 02:26:22,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.58 | bwd: 3326.02 | bwd_inner: 3324.79 | bwd_allreduce: 1.17 | step: 7.70 82%|████████▏ | 8205/10000 [12:56:43<2:44:16, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.037893857806921005, 'learning_rate': 3.2856311423763444e-06, 'epoch': 8.21} 82%|████████▏ | 8205/10000 [12:56:43<2:44:16, 5.49s/it][2025-06-20 02:26:27,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:26:27,825] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.92 | bwd_microstep: 3362.55 | bwd_inner_microstep: 3361.59 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.22 [2025-06-20 02:26:27,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.92 | bwd: 3362.57 | bwd_inner: 3361.59 | bwd_allreduce: 0.93 | step: 7.23 82%|████████▏ | 8206/10000 [12:56:48<2:44:32, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00647105323150754, 'learning_rate': 3.2820748412544123e-06, 'epoch': 8.21} 82%|████████▏ | 8206/10000 [12:56:48<2:44:32, 5.50s/it][2025-06-20 02:26:33,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:26:33,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.99 | bwd_microstep: 3371.30 | bwd_inner_microstep: 3370.48 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.90 [2025-06-20 02:26:33,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.99 | bwd: 3371.31 | bwd_inner: 3370.48 | bwd_allreduce: 0.78 | step: 6.90 82%|████████▏ | 8207/10000 [12:56:54<2:44:50, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0011589251225814223, 'learning_rate': 3.2785202937652374e-06, 'epoch': 8.21} 82%|████████▏ | 8207/10000 [12:56:54<2:44:50, 5.52s/it][2025-06-20 02:26:38,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:26:38,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.97 | bwd_microstep: 3312.45 | bwd_inner_microstep: 3311.64 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-20 02:26:38,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.97 | bwd: 3312.46 | bwd_inner: 3311.64 | bwd_allreduce: 0.78 | step: 6.88 82%|████████▏ | 8208/10000 [12:56:59<2:44:18, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.08567021042108536, 'learning_rate': 3.2749675002816806e-06, 'epoch': 8.21} 82%|████████▏ | 8208/10000 [12:56:59<2:44:18, 5.50s/it][2025-06-20 02:26:44,324] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 02:26:44,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.45 | bwd_microstep: 3321.23 | bwd_inner_microstep: 3320.20 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.61 [2025-06-20 02:26:44,325] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.45 | bwd: 3321.25 | bwd_inner: 3320.20 | bwd_allreduce: 1.00 | step: 7.62 82%|████████▏ | 8209/10000 [12:57:05<2:44:04, 5.50s/it] {'loss': 
0.0, 'grad_norm': 0.00689691212028265, 'learning_rate': 3.2714164611764065e-06, 'epoch': 8.21} 82%|████████▏ | 8209/10000 [12:57:05<2:44:04, 5.50s/it][2025-06-20 02:26:49,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:26:49,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.08 | bwd_microstep: 3362.20 | bwd_inner_microstep: 3361.38 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.00 [2025-06-20 02:26:49,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.08 | bwd: 3362.21 | bwd_inner: 3361.38 | bwd_allreduce: 0.79 | step: 7.00 82%|████████▏ | 8210/10000 [12:57:10<2:44:17, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00031106395181268454, 'learning_rate': 3.267867176821904e-06, 'epoch': 8.21} 82%|████████▏ | 8210/10000 [12:57:10<2:44:17, 5.51s/it][2025-06-20 02:26:55,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:26:55,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.51 | bwd_microstep: 3377.86 | bwd_inner_microstep: 3377.05 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.08 [2025-06-20 02:26:55,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.51 | bwd: 3377.88 | bwd_inner: 3377.05 | bwd_allreduce: 0.79 | step: 7.08 82%|████████▏ | 8211/10000 [12:57:16<2:44:41, 5.52s/it] {'loss': 0.0078, 'grad_norm': 2.1320738792419434, 'learning_rate': 3.2643196475904815e-06, 'epoch': 8.21} 82%|████████▏ | 8211/10000 [12:57:16<2:44:41, 5.52s/it][2025-06-20 02:27:00,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:27:00,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.33 | bwd_microstep: 3319.39 | bwd_inner_microstep: 3318.55 | bwd_allreduce_microstep: 
0.79 | step_microstep: 7.20 [2025-06-20 02:27:00,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.33 | bwd: 3319.40 | bwd_inner: 3318.55 | bwd_allreduce: 0.81 | step: 7.20 82%|████████▏ | 8212/10000 [12:57:21<2:44:03, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.12610915303230286, 'learning_rate': 3.260773873854253e-06, 'epoch': 8.21} 82%|████████▏ | 8212/10000 [12:57:21<2:44:03, 5.51s/it][2025-06-20 02:27:06,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:27:06,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.78 | bwd_microstep: 3370.16 | bwd_inner_microstep: 3369.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 02:27:06,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.78 | bwd: 3370.18 | bwd_inner: 3369.38 | bwd_allreduce: 0.76 | step: 6.64 82%|████████▏ | 8213/10000 [12:57:27<2:44:13, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00016680255066603422, 'learning_rate': 3.2572298559851555e-06, 'epoch': 8.21} 82%|████████▏ | 8213/10000 [12:57:27<2:44:13, 5.51s/it][2025-06-20 02:27:11,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:27:11,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.16 | bwd_microstep: 3309.82 | bwd_inner_microstep: 3309.00 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.02 [2025-06-20 02:27:11,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.16 | bwd: 3309.83 | bwd_inner: 3309.00 | bwd_allreduce: 0.79 | step: 7.03 82%|████████▏ | 8214/10000 [12:57:32<2:43:37, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.01829046569764614, 'learning_rate': 3.253687594354946e-06, 'epoch': 8.21} 82%|████████▏ | 8214/10000 [12:57:32<2:43:37, 5.50s/it][2025-06-20 02:27:17,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:27:17,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.45 | bwd_microstep: 3372.34 | bwd_inner_microstep: 3371.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 02:27:17,411] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.45 | bwd: 3372.35 | bwd_inner: 3371.54 | bwd_allreduce: 0.77 | step: 6.79 82%|████████▏ | 8215/10000 [12:57:38<2:43:53, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.006288055796176195, 'learning_rate': 3.250147089335183e-06, 'epoch': 8.21} 82%|████████▏ | 8215/10000 [12:57:38<2:43:53, 5.51s/it][2025-06-20 02:27:22,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:27:22,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.47 | bwd_microstep: 3316.22 | bwd_inner_microstep: 3315.37 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.19 [2025-06-20 02:27:22,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.47 | bwd: 3316.24 | bwd_inner: 3315.37 | bwd_allreduce: 0.82 | step: 7.20 82%|████████▏ | 8216/10000 [12:57:43<2:43:19, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0052448599599301815, 'learning_rate': 3.2466083412972506e-06, 'epoch': 8.22} 82%|████████▏ | 8216/10000 [12:57:43<2:43:19, 5.49s/it][2025-06-20 02:27:28,405] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:27:28,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.70 | bwd_microstep: 3368.71 | bwd_inner_microstep: 3367.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-20 02:27:28,406] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.70 | bwd: 3368.74 | bwd_inner: 3367.92 | bwd_allreduce: 0.76 | step: 6.78 82%|████████▏ | 8217/10000 
[12:57:49<2:43:38, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.036616019904613495, 'learning_rate': 3.243071350612348e-06, 'epoch': 8.22} 82%|████████▏ | 8217/10000 [12:57:49<2:43:38, 5.51s/it][2025-06-20 02:27:33,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:27:33,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.91 | bwd_microstep: 3387.98 | bwd_inner_microstep: 3387.20 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 02:27:33,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.91 | bwd: 3388.00 | bwd_inner: 3387.20 | bwd_allreduce: 0.76 | step: 6.67 82%|████████▏ | 8218/10000 [12:57:54<2:44:03, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0067914570681750774, 'learning_rate': 3.239536117651487e-06, 'epoch': 8.22} 82%|████████▏ | 8218/10000 [12:57:54<2:44:03, 5.52s/it][2025-06-20 02:27:39,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:27:39,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.70 | bwd_microstep: 3370.21 | bwd_inner_microstep: 3369.19 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.23 [2025-06-20 02:27:39,505] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.70 | bwd: 3370.23 | bwd_inner: 3369.19 | bwd_allreduce: 0.99 | step: 7.23 82%|████████▏ | 8219/10000 [12:58:00<2:44:06, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0026883536484092474, 'learning_rate': 3.2360026427855075e-06, 'epoch': 8.22} 82%|████████▏ | 8219/10000 [12:58:00<2:44:06, 5.53s/it][2025-06-20 02:27:44,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:27:44,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.43 | bwd_microstep: 3316.38 | 
bwd_inner_microstep: 3315.57 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.14 [2025-06-20 02:27:44,964] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.43 | bwd: 3316.40 | bwd_inner: 3315.57 | bwd_allreduce: 0.79 | step: 7.14 82%|████████▏ | 8220/10000 [12:58:05<2:43:22, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.005682797636836767, 'learning_rate': 3.2324709263850406e-06, 'epoch': 8.22} 82%|████████▏ | 8220/10000 [12:58:05<2:43:22, 5.51s/it][2025-06-20 02:27:50,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:27:50,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.95 | bwd_microstep: 3308.87 | bwd_inner_microstep: 3308.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 02:27:50,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.95 | bwd: 3308.89 | bwd_inner: 3308.08 | bwd_allreduce: 0.76 | step: 6.79 82%|████████▏ | 8221/10000 [12:58:11<2:42:48, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.059440866112709045, 'learning_rate': 3.228940968820555e-06, 'epoch': 8.22} 82%|████████▏ | 8221/10000 [12:58:11<2:42:48, 5.49s/it][2025-06-20 02:27:55,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:27:55,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.84 | bwd_microstep: 3360.02 | bwd_inner_microstep: 3359.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.73 [2025-06-20 02:27:55,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.84 | bwd: 3360.03 | bwd_inner: 3359.20 | bwd_allreduce: 0.78 | step: 6.73 82%|████████▏ | 8222/10000 [12:58:16<2:43:02, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002653045579791069, 'learning_rate': 3.2254127704623216e-06, 'epoch': 8.22} 82%|████████▏ | 8222/10000 [12:58:16<2:43:02, 5.50s/it][2025-06-20 02:28:01,415] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.72 [2025-06-20 02:28:01,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.99 | bwd_microstep: 3318.30 | bwd_inner_microstep: 3317.14 | bwd_allreduce_microstep: 1.09 | step_microstep: 8.30 [2025-06-20 02:28:01,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.99 | bwd: 3318.33 | bwd_inner: 3317.14 | bwd_allreduce: 1.12 | step: 8.31 82%|████████▏ | 8223/10000 [12:58:22<2:42:43, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00012887160119134933, 'learning_rate': 3.2218863316804327e-06, 'epoch': 8.22} 82%|████████▏ | 8223/10000 [12:58:22<2:42:43, 5.49s/it][2025-06-20 02:28:06,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:28:06,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.55 | bwd_microstep: 3319.38 | bwd_inner_microstep: 3318.49 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.09 [2025-06-20 02:28:06,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.55 | bwd: 3319.40 | bwd_inner: 3318.49 | bwd_allreduce: 0.86 | step: 7.10 82%|████████▏ | 8224/10000 [12:58:27<2:42:28, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.01199460867792368, 'learning_rate': 3.218361652844808e-06, 'epoch': 8.22} 82%|████████▏ | 8224/10000 [12:58:27<2:42:28, 5.49s/it][2025-06-20 02:28:12,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:28:12,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.16 | bwd_microstep: 3312.84 | bwd_inner_microstep: 3311.71 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.92 [2025-06-20 02:28:12,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.17 | bwd: 3312.87 | bwd_inner: 3311.71 | bwd_allreduce: 1.09 | 
step: 7.93 82%|████████▏ | 8225/10000 [12:58:33<2:42:05, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0005359470378607512, 'learning_rate': 3.214838734325154e-06, 'epoch': 8.22} 82%|████████▏ | 8225/10000 [12:58:33<2:42:05, 5.48s/it][2025-06-20 02:28:17,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:28:17,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.69 | bwd_microstep: 3361.00 | bwd_inner_microstep: 3360.19 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.28 [2025-06-20 02:28:17,900] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.69 | bwd: 3361.02 | bwd_inner: 3360.19 | bwd_allreduce: 0.78 | step: 7.28 82%|████████▏ | 8226/10000 [12:58:38<2:42:37, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002202163450419903, 'learning_rate': 3.2113175764910156e-06, 'epoch': 8.23} 82%|████████▏ | 8226/10000 [12:58:38<2:42:37, 5.50s/it][2025-06-20 02:28:23,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:28:23,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.28 | bwd_microstep: 3306.31 | bwd_inner_microstep: 3305.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 02:28:23,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.28 | bwd: 3306.32 | bwd_inner: 3305.51 | bwd_allreduce: 0.77 | step: 6.86 82%|████████▏ | 8227/10000 [12:58:44<2:42:06, 5.49s/it] {'loss': 0.0007, 'grad_norm': 0.10002154111862183, 'learning_rate': 3.207798179711743e-06, 'epoch': 8.23} 82%|████████▏ | 8227/10000 [12:58:44<2:42:06, 5.49s/it][2025-06-20 02:28:28,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:28:28,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.80 | 
bwd_microstep: 3314.70 | bwd_inner_microstep: 3313.91 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-20 02:28:28,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.80 | bwd: 3314.71 | bwd_inner: 3313.91 | bwd_allreduce: 0.76 | step: 6.66 82%|████████▏ | 8228/10000 [12:58:49<2:41:44, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.273247092962265, 'learning_rate': 3.204280544356506e-06, 'epoch': 8.23} 82%|████████▏ | 8228/10000 [12:58:49<2:41:44, 5.48s/it][2025-06-20 02:28:34,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:28:34,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.17 | bwd_microstep: 3386.41 | bwd_inner_microstep: 3385.37 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.31 [2025-06-20 02:28:34,365] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.17 | bwd: 3386.43 | bwd_inner: 3385.37 | bwd_allreduce: 1.01 | step: 7.32 82%|████████▏ | 8229/10000 [12:58:55<2:42:20, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00010577314242254943, 'learning_rate': 3.200764670794294e-06, 'epoch': 8.23} 82%|████████▏ | 8229/10000 [12:58:55<2:42:20, 5.50s/it][2025-06-20 02:28:39,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:28:39,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.74 | bwd_microstep: 3315.21 | bwd_inner_microstep: 3314.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.28 [2025-06-20 02:28:39,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.74 | bwd: 3315.23 | bwd_inner: 3314.40 | bwd_allreduce: 0.78 | step: 7.28 82%|████████▏ | 8230/10000 [12:59:00<2:41:52, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.026893578469753265, 'learning_rate': 3.1972505593938963e-06, 'epoch': 8.23} 82%|████████▏ | 8230/10000 [12:59:00<2:41:52, 
5.49s/it][2025-06-20 02:28:45,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:28:45,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.01 | bwd_microstep: 3313.41 | bwd_inner_microstep: 3312.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-20 02:28:45,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.01 | bwd: 3313.43 | bwd_inner: 3312.62 | bwd_allreduce: 0.76 | step: 6.74 82%|████████▏ | 8231/10000 [12:59:06<2:41:34, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.003342735581099987, 'learning_rate': 3.1937382105239313e-06, 'epoch': 8.23} 82%|████████▏ | 8231/10000 [12:59:06<2:41:34, 5.48s/it][2025-06-20 02:28:50,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:28:50,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.07 | bwd_microstep: 3307.25 | bwd_inner_microstep: 3306.43 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.77 [2025-06-20 02:28:50,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.07 | bwd: 3307.26 | bwd_inner: 3306.43 | bwd_allreduce: 0.79 | step: 6.78 82%|████████▏ | 8232/10000 [12:59:11<2:41:10, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00015356074436567724, 'learning_rate': 3.1902276245528264e-06, 'epoch': 8.23} 82%|████████▏ | 8232/10000 [12:59:11<2:41:10, 5.47s/it][h264 @ 0x3a3be180] Reference 5 >= 5 [h264 @ 0x3a3be180] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x3a3b5480] left block unavailable for requested intra mode [h264 @ 0x3a3b5480] error while decoding MB 0 25, bytestream 45493 [h264 @ 0x3a4a3840] Reference 5 >= 5 [h264 @ 0x3a4a3840] error while decoding MB 15 42, bytestream 9292 [2025-06-20 02:28:56,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-20 02:28:56,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.80 | bwd_microstep: 3307.91 | bwd_inner_microstep: 3307.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-20 02:28:56,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.80 | bwd: 3307.93 | bwd_inner: 3307.11 | bwd_allreduce: 0.77 | step: 6.91 82%|████████▏ | 8233/10000 [12:59:16<2:40:57, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.08018995821475983, 'learning_rate': 3.1867188018488293e-06, 'epoch': 8.23} 82%|████████▏ | 8233/10000 [12:59:16<2:40:57, 5.47s/it][h264 @ 0x3a4a3840] left block unavailable for requested intra mode [h264 @ 0x3a4a3840] error while decoding MB 0 25, bytestream 45493 [2025-06-20 02:29:01,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:29:01,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.68 | bwd_microstep: 3362.92 | bwd_inner_microstep: 3362.11 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.30 [2025-06-20 02:29:01,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.68 | bwd: 3362.94 | bwd_inner: 3362.11 | bwd_allreduce: 0.78 | step: 7.30 82%|████████▏ | 8234/10000 [12:59:22<2:41:25, 5.48s/it] {'loss': 0.0024, 'grad_norm': 0.9171738028526306, 'learning_rate': 3.183211742779999e-06, 'epoch': 8.23} 82%|████████▏ | 8234/10000 [12:59:22<2:41:25, 5.48s/it][2025-06-20 02:29:07,172] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:29:07,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.86 | bwd_microstep: 3314.63 | bwd_inner_microstep: 3313.83 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-20 02:29:07,173] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.86 | bwd: 3314.65 | bwd_inner: 
3313.83 | bwd_allreduce: 0.77 | step: 6.82 82%|████████▏ | 8235/10000 [12:59:27<2:41:05, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0006333351484499872, 'learning_rate': 3.179706447714206e-06, 'epoch': 8.23} 82%|████████▏ | 8235/10000 [12:59:27<2:41:05, 5.48s/it][2025-06-20 02:29:12,630] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:29:12,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.94 | bwd_microstep: 3317.17 | bwd_inner_microstep: 3316.23 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.98 [2025-06-20 02:29:12,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.94 | bwd: 3317.18 | bwd_inner: 3316.23 | bwd_allreduce: 0.90 | step: 7.99 82%|████████▏ | 8236/10000 [12:59:33<2:40:52, 5.47s/it] {'loss': 0.0012, 'grad_norm': 0.4763699471950531, 'learning_rate': 3.17620291701914e-06, 'epoch': 8.24} 82%|████████▏ | 8236/10000 [12:59:33<2:40:52, 5.47s/it][2025-06-20 02:29:18,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:29:18,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.35 | bwd_microstep: 3310.63 | bwd_inner_microstep: 3309.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 02:29:18,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.35 | bwd: 3310.64 | bwd_inner: 3309.83 | bwd_allreduce: 0.77 | step: 6.68 82%|████████▏ | 8237/10000 [12:59:38<2:40:36, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.07171043753623962, 'learning_rate': 3.172701151062305e-06, 'epoch': 8.24} 82%|████████▏ | 8237/10000 [12:59:38<2:40:36, 5.47s/it][2025-06-20 02:29:23,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:29:23,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2100.82 | bwd_microstep: 3320.97 | bwd_inner_microstep: 3319.95 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.29 [2025-06-20 02:29:23,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.82 | bwd: 3320.98 | bwd_inner: 3319.95 | bwd_allreduce: 0.98 | step: 7.29 82%|████████▏ | 8238/10000 [12:59:44<2:40:28, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.012161263264715672, 'learning_rate': 3.1692011502110232e-06, 'epoch': 8.24} 82%|████████▏ | 8238/10000 [12:59:44<2:40:28, 5.46s/it][2025-06-20 02:29:29,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:29:29,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.49 | bwd_microstep: 3315.90 | bwd_inner_microstep: 3314.99 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.06 [2025-06-20 02:29:29,004] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.49 | bwd: 3315.91 | bwd_inner: 3314.99 | bwd_allreduce: 0.87 | step: 7.06 82%|████████▏ | 8239/10000 [12:59:49<2:40:18, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0016501955687999725, 'learning_rate': 3.165702914832425e-06, 'epoch': 8.24} 82%|████████▏ | 8239/10000 [12:59:49<2:40:18, 5.46s/it][2025-06-20 02:29:34,465] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:29:34,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.14 | bwd_microstep: 3317.36 | bwd_inner_microstep: 3316.55 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.20 [2025-06-20 02:29:34,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.14 | bwd: 3317.38 | bwd_inner: 3316.55 | bwd_allreduce: 0.78 | step: 7.20 82%|████████▏ | 8240/10000 [12:59:55<2:40:13, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0016712474171072245, 'learning_rate': 3.162206445293461e-06, 'epoch': 8.24} 82%|████████▏ | 8240/10000 
[12:59:55<2:40:13, 5.46s/it][2025-06-20 02:29:39,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:29:39,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.93 | bwd_microstep: 3358.55 | bwd_inner_microstep: 3357.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 02:29:39,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.93 | bwd: 3358.57 | bwd_inner: 3357.76 | bwd_allreduce: 0.76 | step: 6.78 82%|████████▏ | 8241/10000 [13:00:00<2:40:40, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0007583262049593031, 'learning_rate': 3.1587117419608937e-06, 'epoch': 8.24} 82%|████████▏ | 8241/10000 [13:00:00<2:40:40, 5.48s/it][2025-06-20 02:29:45,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:29:45,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.63 | bwd_microstep: 3357.76 | bwd_inner_microstep: 3356.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.73 [2025-06-20 02:29:45,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.63 | bwd: 3357.77 | bwd_inner: 3356.95 | bwd_allreduce: 0.78 | step: 6.73 82%|████████▏ | 8242/10000 [13:00:06<2:40:55, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.045177336782217026, 'learning_rate': 3.155218805201303e-06, 'epoch': 8.24} 82%|████████▏ | 8242/10000 [13:00:06<2:40:55, 5.49s/it][2025-06-20 02:29:50,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:29:50,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.11 | bwd_microstep: 3323.21 | bwd_inner_microstep: 3322.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-20 02:29:50,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.11 | 
bwd: 3323.23 | bwd_inner: 3322.41 | bwd_allreduce: 0.77 | step: 6.88 82%|████████▏ | 8243/10000 [13:00:11<2:40:33, 5.48s/it] {'loss': 0.0, 'grad_norm': 5.390553269535303e-05, 'learning_rate': 3.151727635381081e-06, 'epoch': 8.24} 82%|████████▏ | 8243/10000 [13:00:11<2:40:33, 5.48s/it][2025-06-20 02:29:56,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:29:56,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.32 | bwd_microstep: 3311.34 | bwd_inner_microstep: 3310.54 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-20 02:29:56,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.32 | bwd: 3311.35 | bwd_inner: 3310.54 | bwd_allreduce: 0.77 | step: 6.89 82%|████████▏ | 8244/10000 [13:00:17<2:40:12, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0016775463009253144, 'learning_rate': 3.1482382328664384e-06, 'epoch': 8.24} 82%|████████▏ | 8244/10000 [13:00:17<2:40:12, 5.47s/it][2025-06-20 02:30:01,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:30:01,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.45 | bwd_microstep: 3314.24 | bwd_inner_microstep: 3313.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 02:30:01,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.45 | bwd: 3314.26 | bwd_inner: 3313.44 | bwd_allreduce: 0.77 | step: 6.68 82%|████████▏ | 8245/10000 [13:00:22<2:39:57, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0017820048378780484, 'learning_rate': 3.1447505980233896e-06, 'epoch': 8.24} 82%|████████▏ | 8245/10000 [13:00:22<2:39:57, 5.47s/it][2025-06-20 02:30:07,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:30:07,419] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.01 | bwd_microstep: 3366.91 | bwd_inner_microstep: 3366.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 02:30:07,419] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.01 | bwd: 3366.92 | bwd_inner: 3366.11 | bwd_allreduce: 0.77 | step: 6.76 82%|████████▏ | 8246/10000 [13:00:28<2:40:29, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0003587743849493563, 'learning_rate': 3.1412647312177747e-06, 'epoch': 8.25} 82%|████████▏ | 8246/10000 [13:00:28<2:40:29, 5.49s/it][2025-06-20 02:30:12,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:30:12,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.25 | bwd_microstep: 3361.42 | bwd_inner_microstep: 3360.57 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.90 [2025-06-20 02:30:12,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.25 | bwd: 3361.43 | bwd_inner: 3360.57 | bwd_allreduce: 0.82 | step: 6.91 82%|████████▏ | 8247/10000 [13:00:33<2:40:44, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005570593057200313, 'learning_rate': 3.1377806328152483e-06, 'epoch': 8.25} 82%|████████▏ | 8247/10000 [13:00:33<2:40:44, 5.50s/it][2025-06-20 02:30:18,409] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:30:18,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.74 | bwd_microstep: 3319.55 | bwd_inner_microstep: 3318.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-20 02:30:18,410] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.74 | bwd: 3319.57 | bwd_inner: 3318.75 | bwd_allreduce: 0.77 | step: 6.94 82%|████████▏ | 8248/10000 [13:00:39<2:40:17, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.031464722007513046, 'learning_rate': 3.134298303181271e-06, 
'epoch': 8.25} 82%|████████▏ | 8248/10000 [13:00:39<2:40:17, 5.49s/it][2025-06-20 02:30:23,949] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 02:30:23,950] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.12 | bwd_microstep: 3372.69 | bwd_inner_microstep: 3371.66 | bwd_allreduce_microstep: 0.95 | step_microstep: 8.05 [2025-06-20 02:30:23,951] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.12 | bwd: 3372.72 | bwd_inner: 3371.66 | bwd_allreduce: 0.98 | step: 8.05 82%|████████▏ | 8249/10000 [13:00:44<2:40:41, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0016490601701661944, 'learning_rate': 3.130817742681129e-06, 'epoch': 8.25} 82%|████████▏ | 8249/10000 [13:00:44<2:40:41, 5.51s/it][2025-06-20 02:30:29,453] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:30:29,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.39 | bwd_microstep: 3329.42 | bwd_inner_microstep: 3328.36 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.82 [2025-06-20 02:30:29,455] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.39 | bwd: 3329.45 | bwd_inner: 3328.36 | bwd_allreduce: 1.01 | step: 7.85 82%|████████▎ | 8250/10000 [13:00:50<2:40:34, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.006371337454766035, 'learning_rate': 3.1273389516799126e-06, 'epoch': 8.25} 82%|████████▎ | 8250/10000 [13:00:50<2:40:34, 5.51s/it][2025-06-20 02:30:34,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:30:34,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2153.59 | bwd_microstep: 3311.39 | bwd_inner_microstep: 3310.55 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.73 [2025-06-20 02:30:34,969] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2153.59 | bwd: 3311.40 | bwd_inner: 3310.55 | bwd_allreduce: 0.80 | step: 6.73 83%|████████▎ | 8251/10000 [13:00:55<2:40:31, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.053488947451114655, 'learning_rate': 3.123861930542529e-06, 'epoch': 8.25} 83%|████████▎ | 8251/10000 [13:00:55<2:40:31, 5.51s/it][2025-06-20 02:30:40,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:30:40,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.83 | bwd_microstep: 3362.73 | bwd_inner_microstep: 3361.73 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.25 [2025-06-20 02:30:40,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.83 | bwd: 3362.75 | bwd_inner: 3361.73 | bwd_allreduce: 0.95 | step: 7.26 83%|████████▎ | 8252/10000 [13:01:01<2:40:36, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.02055533230304718, 'learning_rate': 3.1203866796337046e-06, 'epoch': 8.25} 83%|████████▎ | 8252/10000 [13:01:01<2:40:36, 5.51s/it][2025-06-20 02:30:45,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:30:45,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.23 | bwd_microstep: 3318.40 | bwd_inner_microstep: 3317.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 02:30:45,960] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.23 | bwd: 3318.41 | bwd_inner: 3317.60 | bwd_allreduce: 0.77 | step: 6.71 83%|████████▎ | 8253/10000 [13:01:06<2:40:05, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.02452286332845688, 'learning_rate': 3.116913199317977e-06, 'epoch': 8.25} 83%|████████▎ | 8253/10000 [13:01:06<2:40:05, 5.50s/it][2025-06-20 02:30:51,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.72 [2025-06-20 02:30:51,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.46 | bwd_microstep: 3321.42 | bwd_inner_microstep: 3320.62 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-20 02:30:51,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.46 | bwd: 3321.44 | bwd_inner: 3320.62 | bwd_allreduce: 0.78 | step: 7.16 83%|████████▎ | 8254/10000 [13:01:12<2:39:41, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00010174649651162326, 'learning_rate': 3.1134414899597033e-06, 'epoch': 8.25} 83%|████████▎ | 8254/10000 [13:01:12<2:39:41, 5.49s/it][2025-06-20 02:30:56,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:30:56,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.16 | bwd_microstep: 3310.95 | bwd_inner_microstep: 3310.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 02:30:56,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.16 | bwd: 3310.97 | bwd_inner: 3310.16 | bwd_allreduce: 0.76 | step: 6.77 83%|████████▎ | 8255/10000 [13:01:17<2:39:20, 5.48s/it] {'loss': 0.0007, 'grad_norm': 0.4050438702106476, 'learning_rate': 3.109971551923039e-06, 'epoch': 8.26} 83%|████████▎ | 8255/10000 [13:01:17<2:39:20, 5.48s/it][2025-06-20 02:31:02,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:31:02,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.68 | bwd_microstep: 3317.54 | bwd_inner_microstep: 3316.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 02:31:02,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.68 | bwd: 3317.56 | bwd_inner: 3316.74 | bwd_allreduce: 0.77 | step: 6.69 83%|████████▎ | 8256/10000 [13:01:23<2:39:03, 5.47s/it] {'loss': 0.0, 'grad_norm': 
0.0008063245913945138, 'learning_rate': 3.1065033855719706e-06, 'epoch': 8.26} 83%|████████▎ | 8256/10000 [13:01:23<2:39:03, 5.47s/it][2025-06-20 02:31:07,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:31:07,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.99 | bwd_microstep: 3319.74 | bwd_inner_microstep: 3318.72 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.11 [2025-06-20 02:31:07,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.99 | bwd: 3319.75 | bwd_inner: 3318.72 | bwd_allreduce: 0.99 | step: 7.11 83%|████████▎ | 8257/10000 [13:01:28<2:38:52, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0009403639123775065, 'learning_rate': 3.103036991270292e-06, 'epoch': 8.26} 83%|████████▎ | 8257/10000 [13:01:28<2:38:52, 5.47s/it][2025-06-20 02:31:13,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:31:13,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.33 | bwd_microstep: 3313.65 | bwd_inner_microstep: 3312.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 02:31:13,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.33 | bwd: 3313.67 | bwd_inner: 3312.85 | bwd_allreduce: 0.77 | step: 6.73 83%|████████▎ | 8258/10000 [13:01:34<2:38:40, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0009707615827210248, 'learning_rate': 3.099572369381605e-06, 'epoch': 8.26} 83%|████████▎ | 8258/10000 [13:01:34<2:38:40, 5.47s/it][2025-06-20 02:31:18,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 02:31:18,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.65 | bwd_microstep: 3403.70 | bwd_inner_microstep: 3402.72 | bwd_allreduce_microstep: 0.89 | 
step_microstep: 7.79 [2025-06-20 02:31:18,835] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.65 | bwd: 3403.74 | bwd_inner: 3402.72 | bwd_allreduce: 0.93 | step: 7.79 83%|████████▎ | 8259/10000 [13:01:39<2:39:36, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0007074050954543054, 'learning_rate': 3.0961095202693502e-06, 'epoch': 8.26} 83%|████████▎ | 8259/10000 [13:01:39<2:39:36, 5.50s/it][2025-06-20 02:31:24,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 02:31:24,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2151.73 | bwd_microstep: 3333.33 | bwd_inner_microstep: 3332.42 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.97 [2025-06-20 02:31:24,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2151.73 | bwd: 3333.35 | bwd_inner: 3332.42 | bwd_allreduce: 0.86 | step: 7.97 83%|████████▎ | 8260/10000 [13:01:45<2:39:47, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.029007870703935623, 'learning_rate': 3.0926484442967485e-06, 'epoch': 8.26} 83%|████████▎ | 8260/10000 [13:01:45<2:39:47, 5.51s/it][2025-06-20 02:31:29,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:31:29,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.39 | bwd_microstep: 3326.69 | bwd_inner_microstep: 3325.81 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.56 [2025-06-20 02:31:29,881] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.39 | bwd: 3326.73 | bwd_inner: 3325.81 | bwd_allreduce: 0.84 | step: 7.56 83%|████████▎ | 8261/10000 [13:01:50<2:39:43, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.2009274959564209, 'learning_rate': 3.089189141826856e-06, 'epoch': 8.26} 83%|████████▎ | 8261/10000 [13:01:50<2:39:43, 5.51s/it][2025-06-20 02:31:35,479] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 02:31:35,480] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2158.84 | bwd_microstep: 3395.69 | bwd_inner_microstep: 3394.59 | bwd_allreduce_microstep: 1.03 | step_microstep: 8.22 [2025-06-20 02:31:35,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2158.84 | bwd: 3395.72 | bwd_inner: 3394.59 | bwd_allreduce: 1.06 | step: 8.25 83%|████████▎ | 8262/10000 [13:01:56<2:40:26, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.005205485038459301, 'learning_rate': 3.0857316132225377e-06, 'epoch': 8.26} 83%|████████▎ | 8262/10000 [13:01:56<2:40:26, 5.54s/it][2025-06-20 02:31:41,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:31:41,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2169.96 | bwd_microstep: 3378.21 | bwd_inner_microstep: 3377.42 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 02:31:41,079] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2169.97 | bwd: 3378.23 | bwd_inner: 3377.42 | bwd_allreduce: 0.76 | step: 6.69 83%|████████▎ | 8263/10000 [13:02:01<2:40:48, 5.55s/it] {'loss': 0.0003, 'grad_norm': 0.06087986007332802, 'learning_rate': 3.082275858846471e-06, 'epoch': 8.26} 83%|████████▎ | 8263/10000 [13:02:01<2:40:48, 5.55s/it][2025-06-20 02:31:46,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:31:46,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.37 | bwd_microstep: 3315.24 | bwd_inner_microstep: 3314.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 02:31:46,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.37 | bwd: 3315.25 | bwd_inner: 3314.45 | bwd_allreduce: 0.76 | step: 6.71 83%|████████▎ | 8264/10000 
[13:02:07<2:40:06, 5.53s/it] {'loss': 0.0126, 'grad_norm': 3.090325117111206, 'learning_rate': 3.0788218790611515e-06, 'epoch': 8.26} 83%|████████▎ | 8264/10000 [13:02:07<2:40:06, 5.53s/it][2025-06-20 02:31:52,043] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:31:52,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.51 | bwd_microstep: 3327.81 | bwd_inner_microstep: 3327.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 02:31:52,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.51 | bwd: 3327.82 | bwd_inner: 3327.01 | bwd_allreduce: 0.76 | step: 6.72 83%|████████▎ | 8265/10000 [13:02:12<2:39:32, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0011351905995979905, 'learning_rate': 3.0753696742288807e-06, 'epoch': 8.27} 83%|████████▎ | 8265/10000 [13:02:12<2:39:32, 5.52s/it][2025-06-20 02:31:57,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:31:57,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.87 | bwd_microstep: 3333.12 | bwd_inner_microstep: 3332.30 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.99 [2025-06-20 02:31:57,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.87 | bwd: 3333.14 | bwd_inner: 3332.30 | bwd_allreduce: 0.79 | step: 6.99 83%|████████▎ | 8266/10000 [13:02:18<2:39:20, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.004397676791995764, 'learning_rate': 3.0719192447117805e-06, 'epoch': 8.27} 83%|████████▎ | 8266/10000 [13:02:18<2:39:20, 5.51s/it][2025-06-20 02:32:03,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 02:32:03,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.41 | bwd_microstep: 3336.02 | bwd_inner_microstep: 
3334.97 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.64 [2025-06-20 02:32:03,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.41 | bwd: 3336.04 | bwd_inner: 3334.97 | bwd_allreduce: 1.01 | step: 7.67 83%|████████▎ | 8267/10000 [13:02:23<2:39:15, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.02456466294825077, 'learning_rate': 3.068470590871788e-06, 'epoch': 8.27} 83%|████████▎ | 8267/10000 [13:02:23<2:39:15, 5.51s/it][2025-06-20 02:32:08,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.78 [2025-06-20 02:32:08,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.65 | bwd_microstep: 3323.88 | bwd_inner_microstep: 3323.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 02:32:08,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.66 | bwd: 3323.89 | bwd_inner: 3323.09 | bwd_allreduce: 0.76 | step: 6.79 83%|████████▎ | 8268/10000 [13:02:29<2:38:59, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0005982681177556515, 'learning_rate': 3.0650237130706474e-06, 'epoch': 8.27} 83%|████████▎ | 8268/10000 [13:02:29<2:38:59, 5.51s/it][2025-06-20 02:32:14,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:32:14,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.65 | bwd_microstep: 3322.91 | bwd_inner_microstep: 3322.04 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.21 [2025-06-20 02:32:14,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.65 | bwd: 3322.94 | bwd_inner: 3322.04 | bwd_allreduce: 0.83 | step: 7.21 83%|████████▎ | 8269/10000 [13:02:34<2:38:46, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.0399177223443985, 'learning_rate': 3.0615786116699265e-06, 'epoch': 8.27} 83%|████████▎ | 8269/10000 [13:02:34<2:38:46, 5.50s/it][2025-06-20 02:32:19,533] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 02:32:19,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.96 | bwd_microstep: 3336.72 | bwd_inner_microstep: 3335.78 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.57 [2025-06-20 02:32:19,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.96 | bwd: 3336.73 | bwd_inner: 3335.78 | bwd_allreduce: 0.91 | step: 7.59 83%|████████▎ | 8270/10000 [13:02:40<2:38:33, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0009242097148671746, 'learning_rate': 3.0581352870309897e-06, 'epoch': 8.27} 83%|████████▎ | 8270/10000 [13:02:40<2:38:33, 5.50s/it][2025-06-20 02:32:25,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-20 02:32:25,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.03 | bwd_microstep: 3321.02 | bwd_inner_microstep: 3320.22 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.58 [2025-06-20 02:32:25,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.03 | bwd: 3321.03 | bwd_inner: 3320.22 | bwd_allreduce: 0.77 | step: 7.58 83%|████████▎ | 8271/10000 [13:02:45<2:38:22, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005307343089953065, 'learning_rate': 3.0546937395150333e-06, 'epoch': 8.27} 83%|████████▎ | 8271/10000 [13:02:45<2:38:22, 5.50s/it][2025-06-20 02:32:30,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-20 02:32:30,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2163.01 | bwd_microstep: 3371.30 | bwd_inner_microstep: 3370.20 | bwd_allreduce_microstep: 1.02 | step_microstep: 8.12 [2025-06-20 02:32:30,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2163.01 | bwd: 3371.32 | bwd_inner: 3370.20 | bwd_allreduce: 1.05 | 
step: 8.12 83%|████████▎ | 8272/10000 [13:02:51<2:38:59, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0045370375737547874, 'learning_rate': 3.0512539694830566e-06, 'epoch': 8.27} 83%|████████▎ | 8272/10000 [13:02:51<2:38:59, 5.52s/it][2025-06-20 02:32:36,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:32:36,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.64 | bwd_microstep: 3330.82 | bwd_inner_microstep: 3329.76 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.13 [2025-06-20 02:32:36,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.64 | bwd: 3330.84 | bwd_inner: 3329.76 | bwd_allreduce: 1.02 | step: 7.13 83%|████████▎ | 8273/10000 [13:02:56<2:38:38, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.02752089872956276, 'learning_rate': 3.047815977295878e-06, 'epoch': 8.27} 83%|████████▎ | 8273/10000 [13:02:56<2:38:38, 5.51s/it][2025-06-20 02:32:41,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:32:41,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.66 | bwd_microstep: 3331.46 | bwd_inner_microstep: 3330.57 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.67 [2025-06-20 02:32:41,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.66 | bwd: 3331.48 | bwd_inner: 3330.57 | bwd_allreduce: 0.86 | step: 6.67 83%|████████▎ | 8274/10000 [13:03:02<2:38:18, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.003275861032307148, 'learning_rate': 3.0443797633141247e-06, 'epoch': 8.27} 83%|████████▎ | 8274/10000 [13:03:02<2:38:18, 5.50s/it][2025-06-20 02:32:47,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:32:47,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.00 | 
bwd_microstep: 3321.69 | bwd_inner_microstep: 3320.67 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.18 [2025-06-20 02:32:47,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.00 | bwd: 3321.70 | bwd_inner: 3320.67 | bwd_allreduce: 0.98 | step: 7.19 83%|████████▎ | 8275/10000 [13:03:07<2:37:55, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.009074239991605282, 'learning_rate': 3.040945327898244e-06, 'epoch': 8.28} 83%|████████▎ | 8275/10000 [13:03:07<2:37:55, 5.49s/it][2025-06-20 02:32:52,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.72 [2025-06-20 02:32:52,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.84 | bwd_microstep: 3328.37 | bwd_inner_microstep: 3327.05 | bwd_allreduce_microstep: 1.24 | step_microstep: 8.36 [2025-06-20 02:32:52,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.85 | bwd: 3328.39 | bwd_inner: 3327.05 | bwd_allreduce: 1.29 | step: 8.35 83%|████████▎ | 8276/10000 [13:03:13<2:37:51, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.03402676060795784, 'learning_rate': 3.0375126714084847e-06, 'epoch': 8.28} 83%|████████▎ | 8276/10000 [13:03:13<2:37:51, 5.49s/it][2025-06-20 02:32:58,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 02:32:58,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.55 | bwd_microstep: 3322.06 | bwd_inner_microstep: 3320.72 | bwd_allreduce_microstep: 1.28 | step_microstep: 7.32 [2025-06-20 02:32:58,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.55 | bwd: 3322.08 | bwd_inner: 3320.72 | bwd_allreduce: 1.31 | step: 7.32 83%|████████▎ | 8277/10000 [13:03:18<2:37:41, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0039021766278892756, 'learning_rate': 3.034081794204924e-06, 'epoch': 8.28} 83%|████████▎ | 8277/10000 [13:03:18<2:37:41, 
5.49s/it][2025-06-20 02:33:03,508] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:33:03,509] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.59 | bwd_microstep: 3318.32 | bwd_inner_microstep: 3317.31 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.75 [2025-06-20 02:33:03,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.59 | bwd: 3318.34 | bwd_inner: 3317.31 | bwd_allreduce: 0.97 | step: 7.75 83%|████████▎ | 8278/10000 [13:03:24<2:37:31, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0003820061683654785, 'learning_rate': 3.03065269664744e-06, 'epoch': 8.28} 83%|████████▎ | 8278/10000 [13:03:24<2:37:31, 5.49s/it][2025-06-20 02:33:09,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:33:09,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.88 | bwd_microstep: 3385.83 | bwd_inner_microstep: 3384.90 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.10 [2025-06-20 02:33:09,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.88 | bwd: 3385.84 | bwd_inner: 3384.90 | bwd_allreduce: 0.89 | step: 7.11 83%|████████▎ | 8279/10000 [13:03:29<2:38:09, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.010154187679290771, 'learning_rate': 3.0272253790957307e-06, 'epoch': 8.28} 83%|████████▎ | 8279/10000 [13:03:29<2:38:09, 5.51s/it][2025-06-20 02:33:14,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:33:14,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.66 | bwd_microstep: 3324.23 | bwd_inner_microstep: 3323.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-20 02:33:14,587] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.66 | bwd: 3324.24 | 
bwd_inner: 3323.43 | bwd_allreduce: 0.78 | step: 6.74 83%|████████▎ | 8280/10000 [13:03:35<2:37:58, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0009070278611034155, 'learning_rate': 3.023799841909312e-06, 'epoch': 8.28} 83%|████████▎ | 8280/10000 [13:03:35<2:37:58, 5.51s/it][2025-06-20 02:33:20,058] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:33:20,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.94 | bwd_microstep: 3324.87 | bwd_inner_microstep: 3323.93 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.09 [2025-06-20 02:33:20,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.94 | bwd: 3324.89 | bwd_inner: 3323.93 | bwd_allreduce: 0.91 | step: 7.09 83%|████████▎ | 8281/10000 [13:03:40<2:37:32, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0006145923398435116, 'learning_rate': 3.0203760854474963e-06, 'epoch': 8.28} 83%|████████▎ | 8281/10000 [13:03:40<2:37:32, 5.50s/it][2025-06-20 02:33:25,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:33:25,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.12 | bwd_microstep: 3326.04 | bwd_inner_microstep: 3325.22 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.17 [2025-06-20 02:33:25,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.12 | bwd: 3326.06 | bwd_inner: 3325.22 | bwd_allreduce: 0.78 | step: 7.17 83%|████████▎ | 8282/10000 [13:03:46<2:37:18, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002837482141330838, 'learning_rate': 3.016954110069423e-06, 'epoch': 8.28} 83%|████████▎ | 8282/10000 [13:03:46<2:37:18, 5.49s/it][2025-06-20 02:33:31,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:33:31,082] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2136.59 | bwd_microstep: 3366.91 | bwd_inner_microstep: 3366.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 02:33:31,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.59 | bwd: 3366.93 | bwd_inner: 3366.12 | bwd_allreduce: 0.76 | step: 6.86 83%|████████▎ | 8283/10000 [13:03:51<2:37:37, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.001113531063310802, 'learning_rate': 3.0135339161340436e-06, 'epoch': 8.28} 83%|████████▎ | 8283/10000 [13:03:51<2:37:37, 5.51s/it][2025-06-20 02:33:36,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:33:36,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.61 | bwd_microstep: 3325.47 | bwd_inner_microstep: 3324.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 02:33:36,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.61 | bwd: 3325.49 | bwd_inner: 3324.69 | bwd_allreduce: 0.76 | step: 6.72 83%|████████▎ | 8284/10000 [13:03:57<2:37:10, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0018788104644045234, 'learning_rate': 3.0101155040001217e-06, 'epoch': 8.28} 83%|████████▎ | 8284/10000 [13:03:57<2:37:10, 5.50s/it][2025-06-20 02:33:42,092] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:33:42,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.89 | bwd_microstep: 3370.38 | bwd_inner_microstep: 3369.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-20 02:33:42,093] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.89 | bwd: 3370.40 | bwd_inner: 3369.57 | bwd_allreduce: 0.78 | step: 7.15 83%|████████▎ | 8285/10000 [13:04:02<2:37:30, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.055300094187259674, 'learning_rate': 3.0066988740262325e-06, 'epoch': 8.29} 83%|████████▎ | 
8285/10000 [13:04:02<2:37:30, 5.51s/it][2025-06-20 02:33:47,554] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:33:47,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.20 | bwd_microstep: 3317.27 | bwd_inner_microstep: 3316.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 02:33:47,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.20 | bwd: 3317.28 | bwd_inner: 3316.48 | bwd_allreduce: 0.76 | step: 6.80 83%|████████▎ | 8286/10000 [13:04:08<2:36:59, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0004208917962387204, 'learning_rate': 3.003284026570761e-06, 'epoch': 8.29} 83%|████████▎ | 8286/10000 [13:04:08<2:36:59, 5.50s/it][2025-06-20 02:33:53,023] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:33:53,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.20 | bwd_microstep: 3325.05 | bwd_inner_microstep: 3324.12 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.07 [2025-06-20 02:33:53,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.20 | bwd: 3325.06 | bwd_inner: 3324.12 | bwd_allreduce: 0.89 | step: 7.07 83%|████████▎ | 8287/10000 [13:04:13<2:36:40, 5.49s/it] {'loss': 0.0012, 'grad_norm': 0.3846215009689331, 'learning_rate': 2.9998709619919085e-06, 'epoch': 8.29} 83%|████████▎ | 8287/10000 [13:04:13<2:36:40, 5.49s/it][2025-06-20 02:33:58,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:33:58,492] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.73 | bwd_microstep: 3323.18 | bwd_inner_microstep: 3322.38 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.14 [2025-06-20 02:33:58,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2104.73 | bwd: 3323.20 | bwd_inner: 3322.38 | bwd_allreduce: 0.77 | step: 7.14 83%|████████▎ | 8288/10000 [13:04:19<2:36:25, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0003008315688930452, 'learning_rate': 2.996459680647694e-06, 'epoch': 8.29} 83%|████████▎ | 8288/10000 [13:04:19<2:36:25, 5.48s/it][2025-06-20 02:34:03,961] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 02:34:03,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.37 | bwd_microstep: 3324.94 | bwd_inner_microstep: 3323.92 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.61 [2025-06-20 02:34:03,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.37 | bwd: 3324.96 | bwd_inner: 3323.92 | bwd_allreduce: 0.99 | step: 7.61 83%|████████▎ | 8289/10000 [13:04:24<2:36:13, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.06221486255526543, 'learning_rate': 2.9930501828959424e-06, 'epoch': 8.29} 83%|████████▎ | 8289/10000 [13:04:24<2:36:13, 5.48s/it][2025-06-20 02:34:09,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:34:09,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.74 | bwd_microstep: 3320.71 | bwd_inner_microstep: 3319.83 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.09 [2025-06-20 02:34:09,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.74 | bwd: 3320.73 | bwd_inner: 3319.83 | bwd_allreduce: 0.85 | step: 7.09 83%|████████▎ | 8290/10000 [13:04:30<2:36:07, 5.48s/it] {'loss': 0.0, 'grad_norm': 8.530924969818443e-05, 'learning_rate': 2.9896424690942984e-06, 'epoch': 8.29} 83%|████████▎ | 8290/10000 [13:04:30<2:36:07, 5.48s/it][2025-06-20 02:34:14,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:34:14,905] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.49 | bwd_microstep: 3317.17 | bwd_inner_microstep: 3316.27 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.85 [2025-06-20 02:34:14,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.49 | bwd: 3317.19 | bwd_inner: 3316.27 | bwd_allreduce: 0.88 | step: 6.85 83%|████████▎ | 8291/10000 [13:04:35<2:35:55, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0007307421183213592, 'learning_rate': 2.9862365396002092e-06, 'epoch': 8.29} 83%|████████▎ | 8291/10000 [13:04:35<2:35:55, 5.47s/it][2025-06-20 02:34:20,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:34:20,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.19 | bwd_microstep: 3324.08 | bwd_inner_microstep: 3323.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-20 02:34:20,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.19 | bwd: 3324.10 | bwd_inner: 3323.28 | bwd_allreduce: 0.77 | step: 6.79 83%|████████▎ | 8292/10000 [13:04:41<2:35:47, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.03753369301557541, 'learning_rate': 2.9828323947709427e-06, 'epoch': 8.29} 83%|████████▎ | 8292/10000 [13:04:41<2:35:47, 5.47s/it][2025-06-20 02:34:25,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:34:25,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.95 | bwd_microstep: 3319.83 | bwd_inner_microstep: 3319.01 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.08 [2025-06-20 02:34:25,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.95 | bwd: 3319.84 | bwd_inner: 3319.01 | bwd_allreduce: 0.78 | step: 7.09 83%|████████▎ | 8293/10000 [13:04:46<2:35:39, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0005609446670860052, 'learning_rate': 2.979430034963575e-06, 
'epoch': 8.29} 83%|████████▎ | 8293/10000 [13:04:46<2:35:39, 5.47s/it][2025-06-20 02:34:31,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:34:31,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.83 | bwd_microstep: 3323.38 | bwd_inner_microstep: 3322.56 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.03 [2025-06-20 02:34:31,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.83 | bwd: 3323.39 | bwd_inner: 3322.56 | bwd_allreduce: 0.78 | step: 7.03 83%|████████▎ | 8294/10000 [13:04:52<2:35:36, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0006715742056258023, 'learning_rate': 2.976029460535004e-06, 'epoch': 8.29} 83%|████████▎ | 8294/10000 [13:04:52<2:35:36, 5.47s/it][2025-06-20 02:34:36,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:34:36,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.26 | bwd_microstep: 3317.96 | bwd_inner_microstep: 3317.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 02:34:36,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.26 | bwd: 3317.97 | bwd_inner: 3317.16 | bwd_allreduce: 0.76 | step: 6.70 83%|████████▎ | 8295/10000 [13:04:57<2:35:26, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.001152237062342465, 'learning_rate': 2.9726306718419386e-06, 'epoch': 8.29} 83%|████████▎ | 8295/10000 [13:04:57<2:35:26, 5.47s/it][2025-06-20 02:34:42,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:34:42,246] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.50 | bwd_microstep: 3322.14 | bwd_inner_microstep: 3321.27 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.18 [2025-06-20 02:34:42,246] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.50 | bwd: 3322.16 | bwd_inner: 3321.27 | bwd_allreduce: 0.83 | step: 7.19 83%|████████▎ | 8296/10000 [13:05:03<2:35:19, 5.47s/it] {'loss': 0.0005, 'grad_norm': 0.31205520033836365, 'learning_rate': 2.969233669240885e-06, 'epoch': 8.3} 83%|████████▎ | 8296/10000 [13:05:03<2:35:19, 5.47s/it][2025-06-20 02:34:47,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:34:47,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.04 | bwd_microstep: 3371.30 | bwd_inner_microstep: 3370.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 02:34:47,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.04 | bwd: 3371.32 | bwd_inner: 3370.52 | bwd_allreduce: 0.76 | step: 6.69 83%|████████▎ | 8297/10000 [13:05:08<2:35:51, 5.49s/it] {'loss': 0.0021, 'grad_norm': 1.0392297506332397, 'learning_rate': 2.9658384530881766e-06, 'epoch': 8.3} 83%|████████▎ | 8297/10000 [13:05:08<2:35:51, 5.49s/it][2025-06-20 02:34:53,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:34:53,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.20 | bwd_microstep: 3325.61 | bwd_inner_microstep: 3324.66 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.10 [2025-06-20 02:34:53,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.20 | bwd: 3325.62 | bwd_inner: 3324.66 | bwd_allreduce: 0.92 | step: 7.11 83%|████████▎ | 8298/10000 [13:05:14<2:35:37, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.023381616920232773, 'learning_rate': 2.962445023739955e-06, 'epoch': 8.3} 83%|████████▎ | 8298/10000 [13:05:14<2:35:37, 5.49s/it][2025-06-20 02:34:58,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.72 [2025-06-20 02:34:58,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.75 | bwd_microstep: 3370.23 | bwd_inner_microstep: 3369.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-20 02:34:58,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.75 | bwd: 3370.25 | bwd_inner: 3369.45 | bwd_allreduce: 0.76 | step: 6.79 83%|████████▎ | 8299/10000 [13:05:19<2:36:00, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0012802784331142902, 'learning_rate': 2.9590533815521795e-06, 'epoch': 8.3} 83%|████████▎ | 8299/10000 [13:05:19<2:36:00, 5.50s/it][2025-06-20 02:35:04,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:35:04,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.03 | bwd_microstep: 3324.82 | bwd_inner_microstep: 3324.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.75 [2025-06-20 02:35:04,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.03 | bwd: 3324.84 | bwd_inner: 3324.00 | bwd_allreduce: 0.78 | step: 6.75 83%|████████▎ | 8300/10000 [13:05:25<2:35:43, 5.50s/it] {'loss': 0.0012, 'grad_norm': 0.3402007818222046, 'learning_rate': 2.9556635268806165e-06, 'epoch': 8.3} 83%|████████▎ | 8300/10000 [13:05:25<2:35:43, 5.50s/it][2025-06-20 02:35:09,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:35:09,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.74 | bwd_microstep: 3313.65 | bwd_inner_microstep: 3312.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 02:35:09,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.74 | bwd: 3313.66 | bwd_inner: 3312.87 | bwd_allreduce: 0.75 | step: 6.57 83%|████████▎ | 8301/10000 [13:05:30<2:35:16, 5.48s/it] {'loss': 0.0, 'grad_norm': 
0.0010042507201433182, 'learning_rate': 2.952275460080842e-06, 'epoch': 8.3} 83%|████████▎ | 8301/10000 [13:05:30<2:35:16, 5.48s/it][2025-06-20 02:35:15,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:35:15,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2208.41 | bwd_microstep: 3374.08 | bwd_inner_microstep: 3373.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-20 02:35:15,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2208.41 | bwd: 3374.09 | bwd_inner: 3373.28 | bwd_allreduce: 0.77 | step: 6.96 83%|████████▎ | 8302/10000 [13:05:36<2:36:20, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.01210970152169466, 'learning_rate': 2.9488891815082497e-06, 'epoch': 8.3} 83%|████████▎ | 8302/10000 [13:05:36<2:36:20, 5.52s/it][2025-06-20 02:35:20,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:35:20,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.58 | bwd_microstep: 3371.17 | bwd_inner_microstep: 3370.16 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.45 [2025-06-20 02:35:20,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.58 | bwd: 3371.20 | bwd_inner: 3370.16 | bwd_allreduce: 0.98 | step: 7.45 83%|████████▎ | 8303/10000 [13:05:41<2:36:27, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.011347906664013863, 'learning_rate': 2.9455046915180464e-06, 'epoch': 8.3} 83%|████████▎ | 8303/10000 [13:05:41<2:36:27, 5.53s/it][2025-06-20 02:35:26,373] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:35:26,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.80 | bwd_microstep: 3309.81 | bwd_inner_microstep: 3308.97 | bwd_allreduce_microstep: 0.78 | 
step_microstep: 6.92 [2025-06-20 02:35:26,374] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.80 | bwd: 3309.83 | bwd_inner: 3308.97 | bwd_allreduce: 0.80 | step: 6.92 83%|████████▎ | 8304/10000 [13:05:47<2:35:48, 5.51s/it] {'loss': 0.0004, 'grad_norm': 0.06643881648778915, 'learning_rate': 2.9421219904652478e-06, 'epoch': 8.3} 83%|████████▎ | 8304/10000 [13:05:47<2:35:48, 5.51s/it][2025-06-20 02:35:31,841] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:35:31,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.89 | bwd_microstep: 3322.38 | bwd_inner_microstep: 3321.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-20 02:35:31,842] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.89 | bwd: 3322.39 | bwd_inner: 3321.57 | bwd_allreduce: 0.77 | step: 7.16 83%|████████▎ | 8305/10000 [13:05:52<2:35:19, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005305241793394089, 'learning_rate': 2.93874107870469e-06, 'epoch': 8.3} 83%|████████▎ | 8305/10000 [13:05:52<2:35:19, 5.50s/it][2025-06-20 02:35:37,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:35:37,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.28 | bwd_microstep: 3313.93 | bwd_inner_microstep: 3313.12 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 02:35:37,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.28 | bwd: 3313.94 | bwd_inner: 3313.12 | bwd_allreduce: 0.78 | step: 6.81 83%|████████▎ | 8306/10000 [13:05:58<2:34:55, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.017998971045017242, 'learning_rate': 2.9353619565910028e-06, 'epoch': 8.31} 83%|████████▎ | 8306/10000 [13:05:58<2:34:55, 5.49s/it][2025-06-20 02:35:42,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.85 [2025-06-20 02:35:42,752] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.25 | bwd_microstep: 3306.83 | bwd_inner_microstep: 3305.98 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.71 [2025-06-20 02:35:42,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.25 | bwd: 3306.85 | bwd_inner: 3305.98 | bwd_allreduce: 0.82 | step: 7.72 83%|████████▎ | 8307/10000 [13:06:03<2:34:31, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00023364581284113228, 'learning_rate': 2.93198462447865e-06, 'epoch': 8.31} 83%|████████▎ | 8307/10000 [13:06:03<2:34:31, 5.48s/it][2025-06-20 02:35:48,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:35:48,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.15 | bwd_microstep: 3369.90 | bwd_inner_microstep: 3369.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-20 02:35:48,287] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.15 | bwd: 3369.92 | bwd_inner: 3369.10 | bwd_allreduce: 0.78 | step: 6.90 83%|████████▎ | 8308/10000 [13:06:09<2:34:54, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.00915160309523344, 'learning_rate': 2.928609082721892e-06, 'epoch': 8.31} 83%|████████▎ | 8308/10000 [13:06:09<2:34:54, 5.49s/it][2025-06-20 02:35:53,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:35:53,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.51 | bwd_microstep: 3311.18 | bwd_inner_microstep: 3310.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-20 02:35:53,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.51 | bwd: 3311.20 | bwd_inner: 3310.37 | bwd_allreduce: 0.78 | step: 7.24 83%|████████▎ | 8309/10000 
[13:06:14<2:34:33, 5.48s/it] {'loss': 0.0, 'grad_norm': 5.9090485592605546e-05, 'learning_rate': 2.9252353316748116e-06, 'epoch': 8.31} 83%|████████▎ | 8309/10000 [13:06:14<2:34:33, 5.48s/it][2025-06-20 02:35:59,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:35:59,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.17 | bwd_microstep: 3315.84 | bwd_inner_microstep: 3314.84 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.26 [2025-06-20 02:35:59,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.17 | bwd: 3315.85 | bwd_inner: 3314.84 | bwd_allreduce: 0.97 | step: 7.26 83%|████████▎ | 8310/10000 [13:06:20<2:34:23, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00021246133837848902, 'learning_rate': 2.921863371691298e-06, 'epoch': 8.31} 83%|████████▎ | 8310/10000 [13:06:20<2:34:23, 5.48s/it][2025-06-20 02:36:04,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:36:04,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.96 | bwd_microstep: 3315.82 | bwd_inner_microstep: 3314.83 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.61 [2025-06-20 02:36:04,691] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.96 | bwd: 3315.83 | bwd_inner: 3314.83 | bwd_allreduce: 0.96 | step: 7.62 83%|████████▎ | 8311/10000 [13:06:25<2:34:10, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0008046084549278021, 'learning_rate': 2.9184932031250548e-06, 'epoch': 8.31} 83%|████████▎ | 8311/10000 [13:06:25<2:34:10, 5.48s/it][2025-06-20 02:36:10,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.70 | optimizer_step: 2.73 [2025-06-20 02:36:10,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.87 | bwd_microstep: 3320.20 | 
bwd_inner_microstep: 3318.93 | bwd_allreduce_microstep: 1.20 | step_microstep: 8.14 [2025-06-20 02:36:10,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.87 | bwd: 3320.22 | bwd_inner: 3318.93 | bwd_allreduce: 1.23 | step: 8.14 83%|████████▎ | 8312/10000 [13:06:30<2:34:01, 5.47s/it] {'loss': 0.0, 'grad_norm': 3.302927143522538e-05, 'learning_rate': 2.915124826329598e-06, 'epoch': 8.31} 83%|████████▎ | 8312/10000 [13:06:30<2:34:01, 5.47s/it][2025-06-20 02:36:15,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:36:15,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.09 | bwd_microstep: 3320.05 | bwd_inner_microstep: 3319.22 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.46 [2025-06-20 02:36:15,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.09 | bwd: 3320.07 | bwd_inner: 3319.22 | bwd_allreduce: 0.79 | step: 7.46 83%|████████▎ | 8313/10000 [13:06:36<2:33:54, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.007319462951272726, 'learning_rate': 2.911758241658251e-06, 'epoch': 8.31} 83%|████████▎ | 8313/10000 [13:06:36<2:33:54, 5.47s/it][2025-06-20 02:36:21,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:36:21,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.37 | bwd_microstep: 3310.95 | bwd_inner_microstep: 3310.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-20 02:36:21,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.37 | bwd: 3310.96 | bwd_inner: 3310.15 | bwd_allreduce: 0.77 | step: 6.82 83%|████████▎ | 8314/10000 [13:06:41<2:33:38, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00011499734682729468, 'learning_rate': 2.9083934494641574e-06, 'epoch': 8.31} 83%|████████▎ | 8314/10000 [13:06:41<2:33:38, 5.47s/it][2025-06-20 02:36:26,540] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:36:26,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.95 | bwd_microstep: 3314.64 | bwd_inner_microstep: 3313.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 02:36:26,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.95 | bwd: 3314.66 | bwd_inner: 3313.85 | bwd_allreduce: 0.76 | step: 6.70 83%|████████▎ | 8315/10000 [13:06:47<2:33:25, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0062038227915763855, 'learning_rate': 2.9050304501002725e-06, 'epoch': 8.31} 83%|████████▎ | 8315/10000 [13:06:47<2:33:25, 5.46s/it][2025-06-20 02:36:31,996] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:36:31,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.58 | bwd_microstep: 3316.84 | bwd_inner_microstep: 3316.03 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.31 [2025-06-20 02:36:31,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.58 | bwd: 3316.85 | bwd_inner: 3316.03 | bwd_allreduce: 0.78 | step: 7.31 83%|████████▎ | 8316/10000 [13:06:52<2:33:17, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.03957073390483856, 'learning_rate': 2.901669243919345e-06, 'epoch': 8.32} 83%|████████▎ | 8316/10000 [13:06:52<2:33:17, 5.46s/it][2025-06-20 02:36:37,539] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:36:37,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.78 | bwd_microstep: 3374.44 | bwd_inner_microstep: 3373.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.86 [2025-06-20 02:36:37,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.78 | bwd: 3374.45 | bwd_inner: 3373.65 | bwd_allreduce: 0.76 | 
step: 6.86 83%|████████▎ | 8317/10000 [13:06:58<2:33:52, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.000640922284219414, 'learning_rate': 2.8983098312739597e-06, 'epoch': 8.32} 83%|████████▎ | 8317/10000 [13:06:58<2:33:52, 5.49s/it][2025-06-20 02:36:43,074] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:36:43,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.27 | bwd_microstep: 3364.24 | bwd_inner_microstep: 3363.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-20 02:36:43,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.27 | bwd: 3364.25 | bwd_inner: 3363.43 | bwd_allreduce: 0.78 | step: 6.82 83%|████████▎ | 8318/10000 [13:07:03<2:34:11, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.009676213376224041, 'learning_rate': 2.8949522125165017e-06, 'epoch': 8.32} 83%|████████▎ | 8318/10000 [13:07:03<2:34:11, 5.50s/it][2025-06-20 02:36:48,526] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:36:48,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.91 | bwd_microstep: 3312.54 | bwd_inner_microstep: 3311.75 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 02:36:48,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.91 | bwd: 3312.56 | bwd_inner: 3311.75 | bwd_allreduce: 0.76 | step: 6.70 83%|████████▎ | 8319/10000 [13:07:09<2:33:41, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0035482156090438366, 'learning_rate': 2.8915963879991695e-06, 'epoch': 8.32} 83%|████████▎ | 8319/10000 [13:07:09<2:33:41, 5.49s/it][2025-06-20 02:36:53,971] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:36:53,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.17 | 
bwd_microstep: 3306.43 | bwd_inner_microstep: 3305.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-20 02:36:53,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.17 | bwd: 3306.44 | bwd_inner: 3305.63 | bwd_allreduce: 0.77 | step: 7.07 83%|████████▎ | 8320/10000 [13:07:14<2:33:15, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004968433640897274, 'learning_rate': 2.8882423580739784e-06, 'epoch': 8.32} 83%|████████▎ | 8320/10000 [13:07:14<2:33:15, 5.47s/it][2025-06-20 02:36:59,431] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:36:59,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.20 | bwd_microstep: 3320.59 | bwd_inner_microstep: 3319.57 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.16 [2025-06-20 02:36:59,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.20 | bwd: 3320.61 | bwd_inner: 3319.57 | bwd_allreduce: 0.97 | step: 7.16 83%|████████▎ | 8321/10000 [13:07:20<2:33:04, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0008329858537763357, 'learning_rate': 2.884890123092743e-06, 'epoch': 8.32} 83%|████████▎ | 8321/10000 [13:07:20<2:33:04, 5.47s/it][2025-06-20 02:37:04,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:37:04,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.42 | bwd_microstep: 3358.58 | bwd_inner_microstep: 3357.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 02:37:04,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.42 | bwd: 3358.60 | bwd_inner: 3357.79 | bwd_allreduce: 0.77 | step: 6.81 83%|████████▎ | 8322/10000 [13:07:25<2:33:32, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.11142726987600327, 'learning_rate': 2.8815396834070975e-06, 'epoch': 8.32} 83%|████████▎ | 8322/10000 [13:07:25<2:33:32, 
5.49s/it][2025-06-20 02:37:10,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:37:10,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.43 | bwd_microstep: 3313.33 | bwd_inner_microstep: 3312.36 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.11 [2025-06-20 02:37:10,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.43 | bwd: 3313.34 | bwd_inner: 3312.36 | bwd_allreduce: 0.76 | step: 7.12 83%|████████▎ | 8323/10000 [13:07:31<2:33:07, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.03909871354699135, 'learning_rate': 2.87819103936849e-06, 'epoch': 8.32} 83%|████████▎ | 8323/10000 [13:07:31<2:33:07, 5.48s/it][2025-06-20 02:37:15,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:37:15,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.98 | bwd_microstep: 3308.48 | bwd_inner_microstep: 3307.70 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 02:37:15,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.98 | bwd: 3308.50 | bwd_inner: 3307.70 | bwd_allreduce: 0.75 | step: 6.61 83%|████████▎ | 8324/10000 [13:07:36<2:32:47, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00010168291919399053, 'learning_rate': 2.874844191328179e-06, 'epoch': 8.32} 83%|████████▎ | 8324/10000 [13:07:36<2:32:47, 5.47s/it][2025-06-20 02:37:21,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:37:21,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.61 | bwd_microstep: 3355.84 | bwd_inner_microstep: 3355.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 02:37:21,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.61 | bwd: 3355.85 | 
bwd_inner: 3355.06 | bwd_allreduce: 0.75 | step: 6.59 83%|████████▎ | 8325/10000 [13:07:42<2:33:06, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.003959374502301216, 'learning_rate': 2.871499139637237e-06, 'epoch': 8.32} 83%|████████▎ | 8325/10000 [13:07:42<2:33:06, 5.48s/it][2025-06-20 02:37:26,845] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:37:26,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.89 | bwd_microstep: 3314.62 | bwd_inner_microstep: 3313.83 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-20 02:37:26,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.89 | bwd: 3314.63 | bwd_inner: 3313.83 | bwd_allreduce: 0.76 | step: 6.76 83%|████████▎ | 8326/10000 [13:07:47<2:32:46, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.006953381933271885, 'learning_rate': 2.868155884646533e-06, 'epoch': 8.33} 83%|████████▎ | 8326/10000 [13:07:47<2:32:46, 5.48s/it][2025-06-20 02:37:32,294] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.79 [2025-06-20 02:37:32,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.16 | bwd_microstep: 3313.35 | bwd_inner_microstep: 3312.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.84 [2025-06-20 02:37:32,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.16 | bwd: 3313.36 | bwd_inner: 3312.56 | bwd_allreduce: 0.76 | step: 6.84 83%|████████▎ | 8327/10000 [13:07:53<2:32:27, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0018100114539265633, 'learning_rate': 2.8648144267067614e-06, 'epoch': 8.33} 83%|████████▎ | 8327/10000 [13:07:53<2:32:27, 5.47s/it][2025-06-20 02:37:37,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:37:37,747] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2104.81 | bwd_microstep: 3310.55 | bwd_inner_microstep: 3309.77 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:37:37,747] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.81 | bwd: 3310.57 | bwd_inner: 3309.77 | bwd_allreduce: 0.75 | step: 6.63 83%|████████▎ | 8328/10000 [13:07:58<2:32:14, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.01045044232159853, 'learning_rate': 2.861474766168435e-06, 'epoch': 8.33} 83%|████████▎ | 8328/10000 [13:07:58<2:32:14, 5.46s/it][2025-06-20 02:37:43,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:37:43,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.02 | bwd_microstep: 3360.32 | bwd_inner_microstep: 3359.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 02:37:43,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.02 | bwd: 3360.33 | bwd_inner: 3359.53 | bwd_allreduce: 0.76 | step: 6.58 83%|████████▎ | 8329/10000 [13:08:04<2:32:35, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.004293050616979599, 'learning_rate': 2.8581369033818627e-06, 'epoch': 8.33} 83%|████████▎ | 8329/10000 [13:08:04<2:32:35, 5.48s/it][2025-06-20 02:37:48,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:37:48,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.47 | bwd_microstep: 3316.49 | bwd_inner_microstep: 3315.64 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.92 [2025-06-20 02:37:48,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.47 | bwd: 3316.51 | bwd_inner: 3315.64 | bwd_allreduce: 0.81 | step: 6.92 83%|████████▎ | 8330/10000 [13:08:09<2:32:20, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0010497092735022306, 'learning_rate': 2.854800838697176e-06, 'epoch': 8.33} 83%|████████▎ | 
8330/10000 [13:08:09<2:32:20, 5.47s/it][2025-06-20 02:37:54,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:37:54,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.09 | bwd_microstep: 3311.94 | bwd_inner_microstep: 3311.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 02:37:54,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.09 | bwd: 3311.95 | bwd_inner: 3311.16 | bwd_allreduce: 0.75 | step: 6.61 83%|████████▎ | 8331/10000 [13:08:14<2:32:04, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004556190222501755, 'learning_rate': 2.8514665724643032e-06, 'epoch': 8.33} 83%|████████▎ | 8331/10000 [13:08:14<2:32:04, 5.47s/it][2025-06-20 02:37:59,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:37:59,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.72 | bwd_microstep: 3322.10 | bwd_inner_microstep: 3321.28 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.94 [2025-06-20 02:37:59,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.72 | bwd: 3322.11 | bwd_inner: 3321.28 | bwd_allreduce: 0.79 | step: 6.94 83%|████████▎ | 8332/10000 [13:08:20<2:31:59, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.046075642108917236, 'learning_rate': 2.848134105032998e-06, 'epoch': 8.33} 83%|████████▎ | 8332/10000 [13:08:20<2:31:59, 5.47s/it][2025-06-20 02:38:05,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.90 [2025-06-20 02:38:05,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.83 | bwd_microstep: 3309.92 | bwd_inner_microstep: 3308.85 | bwd_allreduce_microstep: 1.00 | step_microstep: 8.15 [2025-06-20 02:38:05,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2098.83 | bwd: 3309.94 | bwd_inner: 3308.85 | bwd_allreduce: 1.03 | step: 8.16 83%|████████▎ | 8333/10000 [13:08:25<2:31:45, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0005035193171352148, 'learning_rate': 2.844803436752821e-06, 'epoch': 8.33} 83%|████████▎ | 8333/10000 [13:08:25<2:31:45, 5.46s/it][2025-06-20 02:38:10,557] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:38:10,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.82 | bwd_microstep: 3321.42 | bwd_inner_microstep: 3320.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.91 [2025-06-20 02:38:10,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.82 | bwd: 3321.44 | bwd_inner: 3320.63 | bwd_allreduce: 0.76 | step: 6.91 83%|████████▎ | 8334/10000 [13:08:31<2:31:41, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.02072485163807869, 'learning_rate': 2.841474567973144e-06, 'epoch': 8.33} 83%|████████▎ | 8334/10000 [13:08:31<2:31:41, 5.46s/it][2025-06-20 02:38:16,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:38:16,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.80 | bwd_microstep: 3310.15 | bwd_inner_microstep: 3309.37 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 02:38:16,005] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.80 | bwd: 3310.16 | bwd_inner: 3309.37 | bwd_allreduce: 0.76 | step: 6.65 83%|████████▎ | 8335/10000 [13:08:36<2:31:27, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.008059253916144371, 'learning_rate': 2.838147499043149e-06, 'epoch': 8.34} 83%|████████▎ | 8335/10000 [13:08:36<2:31:27, 5.46s/it][2025-06-20 02:38:21,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:38:21,527] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.62 | bwd_microstep: 3362.74 | bwd_inner_microstep: 3361.91 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.82 [2025-06-20 02:38:21,527] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.62 | bwd: 3362.75 | bwd_inner: 3361.91 | bwd_allreduce: 0.79 | step: 6.83 83%|████████▎ | 8336/10000 [13:08:42<2:31:55, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0013277976540848613, 'learning_rate': 2.8348222303118376e-06, 'epoch': 8.34} 83%|████████▎ | 8336/10000 [13:08:42<2:31:55, 5.48s/it][2025-06-20 02:38:26,994] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:38:26,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.53 | bwd_microstep: 3323.50 | bwd_inner_microstep: 3322.57 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.15 [2025-06-20 02:38:26,995] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.53 | bwd: 3323.52 | bwd_inner: 3322.57 | bwd_allreduce: 0.90 | step: 7.16 83%|████████▎ | 8337/10000 [13:08:47<2:31:44, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.053566284477710724, 'learning_rate': 2.8314987621280022e-06, 'epoch': 8.34} 83%|████████▎ | 8337/10000 [13:08:47<2:31:44, 5.47s/it][2025-06-20 02:38:32,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:38:32,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.99 | bwd_microstep: 3309.40 | bwd_inner_microstep: 3308.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 02:38:32,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.99 | bwd: 3309.42 | bwd_inner: 3308.61 | bwd_allreduce: 0.77 | step: 6.76 83%|████████▎ | 8338/10000 [13:08:53<2:31:29, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0011027802247554064, 'learning_rate': 2.828177094840263e-06, 
'epoch': 8.34} 83%|████████▎ | 8338/10000 [13:08:53<2:31:29, 5.47s/it][2025-06-20 02:38:37,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:38:37,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.64 | bwd_microstep: 3309.26 | bwd_inner_microstep: 3308.45 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.03 [2025-06-20 02:38:37,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.64 | bwd: 3309.27 | bwd_inner: 3308.45 | bwd_allreduce: 0.77 | step: 7.03 83%|████████▎ | 8339/10000 [13:08:58<2:31:16, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.024404259398579597, 'learning_rate': 2.8248572287970535e-06, 'epoch': 8.34} 83%|████████▎ | 8339/10000 [13:08:58<2:31:16, 5.46s/it][2025-06-20 02:38:43,351] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:38:43,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.07 | bwd_microstep: 3306.06 | bwd_inner_microstep: 3305.16 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.14 [2025-06-20 02:38:43,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.07 | bwd: 3306.07 | bwd_inner: 3305.16 | bwd_allreduce: 0.86 | step: 7.14 83%|████████▎ | 8340/10000 [13:09:04<2:31:03, 5.46s/it] {'loss': 0.0005, 'grad_norm': 0.14665400981903076, 'learning_rate': 2.8215391643466052e-06, 'epoch': 8.34} 83%|████████▎ | 8340/10000 [13:09:04<2:31:03, 5.46s/it][2025-06-20 02:38:48,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:38:48,810] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.85 | bwd_microstep: 3310.85 | bwd_inner_microstep: 3310.06 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:38:48,811] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.85 | bwd: 3310.86 | bwd_inner: 3310.06 | bwd_allreduce: 0.76 | step: 6.63 83%|████████▎ | 8341/10000 [13:09:09<2:30:55, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.001631587860174477, 'learning_rate': 2.8182229018369777e-06, 'epoch': 8.34} 83%|████████▎ | 8341/10000 [13:09:09<2:30:55, 5.46s/it][2025-06-20 02:38:54,265] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:38:54,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.46 | bwd_microstep: 3316.29 | bwd_inner_microstep: 3315.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.59 [2025-06-20 02:38:54,266] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.46 | bwd: 3316.31 | bwd_inner: 3315.50 | bwd_allreduce: 0.76 | step: 6.59 83%|████████▎ | 8342/10000 [13:09:15<2:30:48, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.009176382794976234, 'learning_rate': 2.814908441616018e-06, 'epoch': 8.34} 83%|████████▎ | 8342/10000 [13:09:15<2:30:48, 5.46s/it][2025-06-20 02:38:59,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:38:59,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.18 | bwd_microstep: 3364.25 | bwd_inner_microstep: 3363.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-20 02:38:59,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.18 | bwd: 3364.26 | bwd_inner: 3363.46 | bwd_allreduce: 0.76 | step: 6.57 83%|████████▎ | 8343/10000 [13:09:20<2:31:23, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0011432593455538154, 'learning_rate': 2.8115957840314066e-06, 'epoch': 8.34} 83%|████████▎ | 8343/10000 [13:09:20<2:31:23, 5.48s/it][2025-06-20 02:39:05,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | 
optimizer_step: 2.72 [2025-06-20 02:39:05,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.48 | bwd_microstep: 3353.18 | bwd_inner_microstep: 3352.24 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.94 [2025-06-20 02:39:05,323] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.48 | bwd: 3353.19 | bwd_inner: 3352.24 | bwd_allreduce: 0.90 | step: 6.95 83%|████████▎ | 8344/10000 [13:09:26<2:31:36, 5.49s/it] {'loss': 0.0015, 'grad_norm': 0.31387078762054443, 'learning_rate': 2.8082849294306224e-06, 'epoch': 8.34} 83%|████████▎ | 8344/10000 [13:09:26<2:31:36, 5.49s/it][2025-06-20 02:39:10,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:39:10,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.85 | bwd_microstep: 3367.81 | bwd_inner_microstep: 3366.85 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.20 [2025-06-20 02:39:10,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.85 | bwd: 3367.83 | bwd_inner: 3366.85 | bwd_allreduce: 0.94 | step: 7.21 83%|████████▎ | 8345/10000 [13:09:31<2:31:51, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.014412042684853077, 'learning_rate': 2.804975878160958e-06, 'epoch': 8.35} 83%|████████▎ | 8345/10000 [13:09:31<2:31:51, 5.51s/it][2025-06-20 02:39:16,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:39:16,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.81 | bwd_microstep: 3310.86 | bwd_inner_microstep: 3309.99 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.00 [2025-06-20 02:39:16,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.81 | bwd: 3310.88 | bwd_inner: 3309.99 | bwd_allreduce: 0.82 | step: 7.00 83%|████████▎ | 8346/10000 [13:09:37<2:31:20, 5.49s/it] {'loss': 0.002, 'grad_norm': 
0.5035959482192993, 'learning_rate': 2.801668630569523e-06, 'epoch': 8.35} 83%|████████▎ | 8346/10000 [13:09:37<2:31:20, 5.49s/it][2025-06-20 02:39:21,843] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:39:21,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.31 | bwd_microstep: 3373.29 | bwd_inner_microstep: 3372.40 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.04 [2025-06-20 02:39:21,844] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.31 | bwd: 3373.31 | bwd_inner: 3372.40 | bwd_allreduce: 0.85 | step: 7.05 83%|████████▎ | 8347/10000 [13:09:42<2:31:37, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.011117218062281609, 'learning_rate': 2.7983631870032257e-06, 'epoch': 8.35} 83%|████████▎ | 8347/10000 [13:09:42<2:31:37, 5.50s/it][2025-06-20 02:39:27,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.74 [2025-06-20 02:39:27,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.23 | bwd_microstep: 3312.87 | bwd_inner_microstep: 3312.07 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 02:39:27,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.23 | bwd: 3312.88 | bwd_inner: 3312.07 | bwd_allreduce: 0.76 | step: 6.79 83%|████████▎ | 8348/10000 [13:09:48<2:31:09, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0006971433758735657, 'learning_rate': 2.7950595478087983e-06, 'epoch': 8.35} 83%|████████▎ | 8348/10000 [13:09:48<2:31:09, 5.49s/it][2025-06-20 02:39:32,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 02:39:32,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.55 | bwd_microstep: 3318.33 | bwd_inner_microstep: 3317.23 | bwd_allreduce_microstep: 1.04 | step_microstep: 
7.48 [2025-06-20 02:39:32,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.55 | bwd: 3318.35 | bwd_inner: 3317.23 | bwd_allreduce: 1.07 | step: 7.49 83%|████████▎ | 8349/10000 [13:09:53<2:30:49, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00810092594474554, 'learning_rate': 2.79175771333277e-06, 'epoch': 8.35} 83%|████████▎ | 8349/10000 [13:09:53<2:30:49, 5.48s/it][2025-06-20 02:39:38,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:39:38,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.27 | bwd_microstep: 3313.41 | bwd_inner_microstep: 3312.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 02:39:38,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.27 | bwd: 3313.43 | bwd_inner: 3312.62 | bwd_allreduce: 0.76 | step: 6.60 84%|████████▎ | 8350/10000 [13:09:59<2:30:29, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.010564595460891724, 'learning_rate': 2.788457683921495e-06, 'epoch': 8.35} 84%|████████▎ | 8350/10000 [13:09:59<2:30:29, 5.47s/it][2025-06-20 02:39:43,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:39:43,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.69 | bwd_microstep: 3363.96 | bwd_inner_microstep: 3363.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 02:39:43,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.69 | bwd: 3363.98 | bwd_inner: 3363.18 | bwd_allreduce: 0.76 | step: 6.58 84%|████████▎ | 8351/10000 [13:10:04<2:30:49, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0004376537399366498, 'learning_rate': 2.7851594599211297e-06, 'epoch': 8.35} 84%|████████▎ | 8351/10000 [13:10:04<2:30:49, 5.49s/it][2025-06-20 02:39:49,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:39:49,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.35 | bwd_microstep: 3327.63 | bwd_inner_microstep: 3326.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-20 02:39:49,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.35 | bwd: 3327.65 | bwd_inner: 3326.81 | bwd_allreduce: 0.79 | step: 6.85 84%|████████▎ | 8352/10000 [13:10:10<2:30:39, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0020084651187062263, 'learning_rate': 2.781863041677637e-06, 'epoch': 8.35} 84%|████████▎ | 8352/10000 [13:10:10<2:30:39, 5.49s/it][2025-06-20 02:39:54,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.72 [2025-06-20 02:39:54,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.40 | bwd_microstep: 3371.11 | bwd_inner_microstep: 3370.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 02:39:54,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.40 | bwd: 3371.12 | bwd_inner: 3370.33 | bwd_allreduce: 0.76 | step: 6.70 84%|████████▎ | 8353/10000 [13:10:15<2:30:57, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001505595981143415, 'learning_rate': 2.7785684295367988e-06, 'epoch': 8.35} 84%|████████▎ | 8353/10000 [13:10:15<2:30:57, 5.50s/it][2025-06-20 02:40:00,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-20 02:40:00,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.13 | bwd_microstep: 3375.67 | bwd_inner_microstep: 3374.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-20 02:40:00,291] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.13 | bwd: 3375.69 | bwd_inner: 3374.89 | bwd_allreduce: 0.76 | step: 6.57 84%|████████▎ | 8354/10000 [13:10:21<2:31:10, 5.51s/it] {'loss': 0.0, 
'grad_norm': 0.0014134341618046165, 'learning_rate': 2.775275623844205e-06, 'epoch': 8.35} 84%|████████▎ | 8354/10000 [13:10:21<2:31:10, 5.51s/it][2025-06-20 02:40:05,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:40:05,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.04 | bwd_microstep: 3378.22 | bwd_inner_microstep: 3377.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 02:40:05,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.05 | bwd: 3378.24 | bwd_inner: 3377.43 | bwd_allreduce: 0.76 | step: 6.81 84%|████████▎ | 8355/10000 [13:10:26<2:31:22, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0035667710471898317, 'learning_rate': 2.7719846249452566e-06, 'epoch': 8.36} 84%|████████▎ | 8355/10000 [13:10:26<2:31:22, 5.52s/it][2025-06-20 02:40:11,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:40:11,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.61 | bwd_microstep: 3326.04 | bwd_inner_microstep: 3325.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 02:40:11,303] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.61 | bwd: 3326.05 | bwd_inner: 3325.25 | bwd_allreduce: 0.75 | step: 6.62 84%|████████▎ | 8356/10000 [13:10:32<2:30:49, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0012314675841480494, 'learning_rate': 2.768695433185169e-06, 'epoch': 8.36} 84%|████████▎ | 8356/10000 [13:10:32<2:30:49, 5.50s/it][2025-06-20 02:40:16,763] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:40:16,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.76 | bwd_microstep: 3321.14 | bwd_inner_microstep: 3320.24 | bwd_allreduce_microstep: 0.84 | 
step_microstep: 6.94 [2025-06-20 02:40:16,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.76 | bwd: 3321.15 | bwd_inner: 3320.24 | bwd_allreduce: 0.86 | step: 6.95 84%|████████▎ | 8357/10000 [13:10:37<2:30:22, 5.49s/it] {'loss': 0.0, 'grad_norm': 6.830978963989764e-05, 'learning_rate': 2.765408048908951e-06, 'epoch': 8.36} 84%|████████▎ | 8357/10000 [13:10:37<2:30:22, 5.49s/it][2025-06-20 02:40:22,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:40:22,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.41 | bwd_microstep: 3394.75 | bwd_inner_microstep: 3393.74 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.09 [2025-06-20 02:40:22,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.41 | bwd: 3394.77 | bwd_inner: 3393.74 | bwd_allreduce: 0.97 | step: 7.09 84%|████████▎ | 8358/10000 [13:10:43<2:30:57, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.000277872895821929, 'learning_rate': 2.7621224724614415e-06, 'epoch': 8.36} 84%|████████▎ | 8358/10000 [13:10:43<2:30:57, 5.52s/it][2025-06-20 02:40:27,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:40:27,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.08 | bwd_microstep: 3373.35 | bwd_inner_microstep: 3372.49 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.69 [2025-06-20 02:40:27,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.08 | bwd: 3373.37 | bwd_inner: 3372.48 | bwd_allreduce: 0.84 | step: 6.69 84%|████████▎ | 8359/10000 [13:10:48<2:31:07, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0007573224138468504, 'learning_rate': 2.758838704187281e-06, 'epoch': 8.36} 84%|████████▎ | 8359/10000 [13:10:48<2:31:07, 5.53s/it][2025-06-20 02:40:33,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:40:33,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.87 | bwd_microstep: 3318.09 | bwd_inner_microstep: 3317.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:40:33,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.87 | bwd: 3318.11 | bwd_inner: 3317.31 | bwd_allreduce: 0.75 | step: 6.63 84%|████████▎ | 8360/10000 [13:10:54<2:30:30, 5.51s/it] {'loss': 0.0173, 'grad_norm': 5.938935279846191, 'learning_rate': 2.7555567444309205e-06, 'epoch': 8.36} 84%|████████▎ | 8360/10000 [13:10:54<2:30:30, 5.51s/it][2025-06-20 02:40:38,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:40:38,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.32 | bwd_microstep: 3370.85 | bwd_inner_microstep: 3370.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-20 02:40:38,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.32 | bwd: 3370.86 | bwd_inner: 3370.05 | bwd_allreduce: 0.77 | step: 7.01 84%|████████▎ | 8361/10000 [13:10:59<2:30:38, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.002537434920668602, 'learning_rate': 2.7522765935366293e-06, 'epoch': 8.36} 84%|████████▎ | 8361/10000 [13:10:59<2:30:38, 5.51s/it][2025-06-20 02:40:44,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:40:44,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.21 | bwd_microstep: 3328.65 | bwd_inner_microstep: 3327.86 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-20 02:40:44,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.21 | bwd: 3328.66 | bwd_inner: 3327.86 | bwd_allreduce: 0.76 | step: 6.73 84%|████████▎ | 8362/10000 
[13:11:05<2:30:10, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0029677299316972494, 'learning_rate': 2.748998251848465e-06, 'epoch': 8.36} 84%|████████▎ | 8362/10000 [13:11:05<2:30:10, 5.50s/it][2025-06-20 02:40:49,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:40:49,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.54 | bwd_microstep: 3320.34 | bwd_inner_microstep: 3319.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 02:40:49,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.54 | bwd: 3320.36 | bwd_inner: 3319.55 | bwd_allreduce: 0.77 | step: 6.64 84%|████████▎ | 8363/10000 [13:11:10<2:29:50, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.007923690602183342, 'learning_rate': 2.745721719710326e-06, 'epoch': 8.36} 84%|████████▎ | 8363/10000 [13:11:10<2:29:50, 5.49s/it][2025-06-20 02:40:55,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:40:55,391] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.17 | bwd_microstep: 3396.94 | bwd_inner_microstep: 3396.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 02:40:55,392] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.17 | bwd: 3396.95 | bwd_inner: 3396.16 | bwd_allreduce: 0.76 | step: 6.67 84%|████████▎ | 8364/10000 [13:11:16<2:30:23, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00019864617206621915, 'learning_rate': 2.7424469974658972e-06, 'epoch': 8.36} 84%|████████▎ | 8364/10000 [13:11:16<2:30:23, 5.52s/it][2025-06-20 02:41:00,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:41:00,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.62 | bwd_microstep: 3379.58 | 
bwd_inner_microstep: 3378.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 02:41:00,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.62 | bwd: 3379.59 | bwd_inner: 3378.80 | bwd_allreduce: 0.75 | step: 6.60 84%|████████▎ | 8365/10000 [13:11:21<2:30:35, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.001123338588513434, 'learning_rate': 2.739174085458682e-06, 'epoch': 8.37} 84%|████████▎ | 8365/10000 [13:11:21<2:30:35, 5.53s/it][2025-06-20 02:41:06,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:41:06,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.37 | bwd_microstep: 3373.31 | bwd_inner_microstep: 3372.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 02:41:06,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.37 | bwd: 3373.33 | bwd_inner: 3372.53 | bwd_allreduce: 0.76 | step: 6.63 84%|████████▎ | 8366/10000 [13:11:27<2:30:37, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0006737775402143598, 'learning_rate': 2.735902984032002e-06, 'epoch': 8.37} 84%|████████▎ | 8366/10000 [13:11:27<2:30:37, 5.53s/it][2025-06-20 02:41:11,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:41:11,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.06 | bwd_microstep: 3318.31 | bwd_inner_microstep: 3317.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.51 [2025-06-20 02:41:11,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.06 | bwd: 3318.32 | bwd_inner: 3317.53 | bwd_allreduce: 0.75 | step: 6.51 84%|████████▎ | 8367/10000 [13:11:32<2:29:57, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.010571005754172802, 'learning_rate': 2.73263369352897e-06, 'epoch': 8.37} 84%|████████▎ | 8367/10000 [13:11:32<2:29:57, 5.51s/it][2025-06-20 02:41:17,424] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:41:17,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.17 | bwd_microstep: 3336.09 | bwd_inner_microstep: 3335.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.63 [2025-06-20 02:41:17,425] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.17 | bwd: 3336.10 | bwd_inner: 3335.28 | bwd_allreduce: 0.78 | step: 6.64 84%|████████▎ | 8368/10000 [13:11:38<2:29:36, 5.50s/it] {'loss': 0.0079, 'grad_norm': 3.64326810836792, 'learning_rate': 2.7293662142925213e-06, 'epoch': 8.37} 84%|████████▎ | 8368/10000 [13:11:38<2:29:36, 5.50s/it][2025-06-20 02:41:22,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:41:22,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.50 | bwd_microstep: 3340.13 | bwd_inner_microstep: 3339.27 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.84 [2025-06-20 02:41:22,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.50 | bwd: 3340.15 | bwd_inner: 3339.27 | bwd_allreduce: 0.83 | step: 6.84 84%|████████▎ | 8369/10000 [13:11:43<2:29:28, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0008631825912743807, 'learning_rate': 2.726100546665402e-06, 'epoch': 8.37} 84%|████████▎ | 8369/10000 [13:11:43<2:29:28, 5.50s/it][2025-06-20 02:41:28,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:41:28,413] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.43 | bwd_microstep: 3335.25 | bwd_inner_microstep: 3334.44 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-20 02:41:28,414] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.43 | bwd: 3335.26 | bwd_inner: 3334.44 | bwd_allreduce: 0.77 | step: 
7.08 84%|████████▎ | 8370/10000 [13:11:49<2:29:21, 5.50s/it] {'loss': 0.004, 'grad_norm': 1.018272042274475, 'learning_rate': 2.722836690990165e-06, 'epoch': 8.37} 84%|████████▎ | 8370/10000 [13:11:49<2:29:21, 5.50s/it][2025-06-20 02:41:33,910] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:41:33,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.68 | bwd_microstep: 3329.11 | bwd_inner_microstep: 3328.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-20 02:41:33,911] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.68 | bwd: 3329.13 | bwd_inner: 3328.29 | bwd_allreduce: 0.79 | step: 6.85 84%|████████▎ | 8371/10000 [13:11:54<2:29:16, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.004721909761428833, 'learning_rate': 2.719574647609178e-06, 'epoch': 8.37} 84%|████████▎ | 8371/10000 [13:11:54<2:29:16, 5.50s/it][2025-06-20 02:41:39,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:41:39,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.98 | bwd_microstep: 3335.22 | bwd_inner_microstep: 3334.37 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.93 [2025-06-20 02:41:39,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.98 | bwd: 3335.23 | bwd_inner: 3334.37 | bwd_allreduce: 0.82 | step: 6.93 84%|████████▎ | 8372/10000 [13:12:00<2:29:02, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.007428489625453949, 'learning_rate': 2.716314416864603e-06, 'epoch': 8.37} 84%|████████▎ | 8372/10000 [13:12:00<2:29:02, 5.49s/it][2025-06-20 02:41:44,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:41:44,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.72 | bwd_microstep: 
3343.87 | bwd_inner_microstep: 3342.96 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.99 [2025-06-20 02:41:44,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.72 | bwd: 3343.89 | bwd_inner: 3342.96 | bwd_allreduce: 0.88 | step: 7.00 84%|████████▎ | 8373/10000 [13:12:05<2:29:01, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.044487979263067245, 'learning_rate': 2.713055999098433e-06, 'epoch': 8.37} 84%|████████▎ | 8373/10000 [13:12:05<2:29:01, 5.50s/it][2025-06-20 02:41:50,387] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:41:50,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.38 | bwd_microstep: 3336.71 | bwd_inner_microstep: 3335.79 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.15 [2025-06-20 02:41:50,388] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.38 | bwd: 3336.73 | bwd_inner: 3335.79 | bwd_allreduce: 0.89 | step: 7.16 84%|████████▎ | 8374/10000 [13:12:11<2:28:54, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.011308042332530022, 'learning_rate': 2.7097993946524547e-06, 'epoch': 8.37} 84%|████████▎ | 8374/10000 [13:12:11<2:28:54, 5.49s/it][2025-06-20 02:41:55,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:41:55,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.52 | bwd_microstep: 3378.65 | bwd_inner_microstep: 3377.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-20 02:41:55,943] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.52 | bwd: 3378.66 | bwd_inner: 3377.86 | bwd_allreduce: 0.76 | step: 6.91 84%|████████▍ | 8375/10000 [13:12:16<2:29:18, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00033267849357798696, 'learning_rate': 2.7065446038682752e-06, 'epoch': 8.38} 84%|████████▍ | 8375/10000 [13:12:16<2:29:18, 5.51s/it][2025-06-20 
02:42:01,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:42:01,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.46 | bwd_microstep: 3323.22 | bwd_inner_microstep: 3322.43 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-20 02:42:01,415] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.46 | bwd: 3323.24 | bwd_inner: 3322.43 | bwd_allreduce: 0.76 | step: 6.62 84%|████████▍ | 8376/10000 [13:12:22<2:28:52, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0009743165574036539, 'learning_rate': 2.7032916270873077e-06, 'epoch': 8.38} 84%|████████▍ | 8376/10000 [13:12:22<2:28:52, 5.50s/it][2025-06-20 02:42:06,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:42:06,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.49 | bwd_microstep: 3382.19 | bwd_inner_microstep: 3381.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 02:42:06,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.49 | bwd: 3382.20 | bwd_inner: 3381.39 | bwd_allreduce: 0.76 | step: 6.73 84%|████████▍ | 8377/10000 [13:12:27<2:29:15, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00020506202417891473, 'learning_rate': 2.7000404646507706e-06, 'epoch': 8.38} 84%|████████▍ | 8377/10000 [13:12:27<2:29:15, 5.52s/it][2025-06-20 02:42:12,516] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:42:12,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.48 | bwd_microstep: 3377.79 | bwd_inner_microstep: 3376.79 | bwd_allreduce_microstep: 0.96 | step_microstep: 6.80 [2025-06-20 02:42:12,517] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.48 | bwd: 3377.81 | bwd_inner: 3376.79 | 
bwd_allreduce: 0.98 | step: 6.80 84%|████████▍ | 8378/10000 [13:12:33<2:29:22, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.04315967857837677, 'learning_rate': 2.6967911168996952e-06, 'epoch': 8.38} 84%|████████▍ | 8378/10000 [13:12:33<2:29:22, 5.53s/it][2025-06-20 02:42:18,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:42:18,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.27 | bwd_microstep: 3376.88 | bwd_inner_microstep: 3376.09 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-20 02:42:18,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.27 | bwd: 3376.89 | bwd_inner: 3376.09 | bwd_allreduce: 0.76 | step: 6.75 84%|████████▍ | 8379/10000 [13:12:38<2:29:28, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.006604746915400028, 'learning_rate': 2.693543584174925e-06, 'epoch': 8.38} 84%|████████▍ | 8379/10000 [13:12:38<2:29:28, 5.53s/it][2025-06-20 02:42:23,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:42:23,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.88 | bwd_microstep: 3328.80 | bwd_inner_microstep: 3327.94 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.30 [2025-06-20 02:42:23,550] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.88 | bwd: 3328.83 | bwd_inner: 3327.94 | bwd_allreduce: 0.82 | step: 7.30 84%|████████▍ | 8380/10000 [13:12:44<2:29:00, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.026041965931653976, 'learning_rate': 2.690297866817111e-06, 'epoch': 8.38} 84%|████████▍ | 8380/10000 [13:12:44<2:29:00, 5.52s/it][2025-06-20 02:42:29,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:42:29,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2140.54 | bwd_microstep: 3377.57 | bwd_inner_microstep: 3376.79 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:42:29,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.54 | bwd: 3377.59 | bwd_inner: 3376.79 | bwd_allreduce: 0.76 | step: 6.62 84%|████████▍ | 8381/10000 [13:12:49<2:29:13, 5.53s/it] {'loss': 0.0005, 'grad_norm': 0.10237177461385727, 'learning_rate': 2.687053965166715e-06, 'epoch': 8.38} 84%|████████▍ | 8381/10000 [13:12:49<2:29:13, 5.53s/it][2025-06-20 02:42:34,588] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:42:34,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.16 | bwd_microstep: 3325.98 | bwd_inner_microstep: 3325.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 02:42:34,589] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.16 | bwd: 3326.00 | bwd_inner: 3325.19 | bwd_allreduce: 0.77 | step: 6.79 84%|████████▍ | 8382/10000 [13:12:55<2:28:43, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.006961312610656023, 'learning_rate': 2.6838118795640067e-06, 'epoch': 8.38} 84%|████████▍ | 8382/10000 [13:12:55<2:28:43, 5.52s/it][2025-06-20 02:42:40,145] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:42:40,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.34 | bwd_microstep: 3373.71 | bwd_inner_microstep: 3372.89 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.01 [2025-06-20 02:42:40,146] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.34 | bwd: 3373.72 | bwd_inner: 3372.89 | bwd_allreduce: 0.79 | step: 7.01 84%|████████▍ | 8383/10000 [13:13:00<2:28:58, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0005927826859988272, 'learning_rate': 2.680571610349063e-06, 'epoch': 8.38} 84%|████████▍ | 8383/10000 
[13:13:00<2:28:58, 5.53s/it][2025-06-20 02:42:45,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:42:45,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.32 | bwd_microstep: 3370.38 | bwd_inner_microstep: 3369.48 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.36 [2025-06-20 02:42:45,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.32 | bwd: 3370.40 | bwd_inner: 3369.48 | bwd_allreduce: 0.87 | step: 7.37 84%|████████▍ | 8384/10000 [13:13:06<2:29:05, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.003869637381285429, 'learning_rate': 2.677333157861777e-06, 'epoch': 8.38} 84%|████████▍ | 8384/10000 [13:13:06<2:29:05, 5.54s/it][2025-06-20 02:42:51,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:42:51,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.32 | bwd_microstep: 3375.08 | bwd_inner_microstep: 3374.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.80 [2025-06-20 02:42:51,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.32 | bwd: 3375.09 | bwd_inner: 3374.28 | bwd_allreduce: 0.77 | step: 6.80 84%|████████▍ | 8385/10000 [13:13:12<2:29:08, 5.54s/it] {'loss': 0.0005, 'grad_norm': 0.1516016721725464, 'learning_rate': 2.674096522441845e-06, 'epoch': 8.38} 84%|████████▍ | 8385/10000 [13:13:12<2:29:08, 5.54s/it][2025-06-20 02:42:56,798] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 02:42:56,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.16 | bwd_microstep: 3378.88 | bwd_inner_microstep: 3377.85 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.30 [2025-06-20 02:42:56,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.16 | bwd: 
3378.89 | bwd_inner: 3377.85 | bwd_allreduce: 0.99 | step: 7.30 84%|████████▍ | 8386/10000 [13:13:17<2:29:05, 5.54s/it] {'loss': 0.0005, 'grad_norm': 0.16601897776126862, 'learning_rate': 2.67086170442878e-06, 'epoch': 8.39} 84%|████████▍ | 8386/10000 [13:13:17<2:29:05, 5.54s/it][2025-06-20 02:43:02,274] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.61 | optimizer_step: 2.73 [2025-06-20 02:43:02,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.32 | bwd_microstep: 3324.40 | bwd_inner_microstep: 3323.60 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-20 02:43:02,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.32 | bwd: 3324.42 | bwd_inner: 3323.60 | bwd_allreduce: 0.77 | step: 6.92 84%|████████▍ | 8387/10000 [13:13:23<2:28:27, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0017610099166631699, 'learning_rate': 2.667628704161893e-06, 'epoch': 8.39} 84%|████████▍ | 8387/10000 [13:13:23<2:28:27, 5.52s/it][2025-06-20 02:43:07,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:43:07,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.49 | bwd_microstep: 3398.34 | bwd_inner_microstep: 3397.52 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.07 [2025-06-20 02:43:07,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.49 | bwd: 3398.35 | bwd_inner: 3397.52 | bwd_allreduce: 0.78 | step: 7.07 84%|████████▍ | 8388/10000 [13:13:28<2:28:51, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0007959983195178211, 'learning_rate': 2.6643975219803107e-06, 'epoch': 8.39} 84%|████████▍ | 8388/10000 [13:13:28<2:28:51, 5.54s/it][2025-06-20 02:43:13,327] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:43:13,328] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 2109.04 | bwd_microstep: 3323.38 | bwd_inner_microstep: 3322.42 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.28 [2025-06-20 02:43:13,328] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.04 | bwd: 3323.40 | bwd_inner: 3322.42 | bwd_allreduce: 0.93 | step: 7.28 84%|████████▍ | 8389/10000 [13:13:34<2:28:11, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0009762324043549597, 'learning_rate': 2.6611681582229708e-06, 'epoch': 8.39} 84%|████████▍ | 8389/10000 [13:13:34<2:28:11, 5.52s/it][2025-06-20 02:43:18,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:43:18,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.58 | bwd_microstep: 3313.75 | bwd_inner_microstep: 3312.97 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.53 [2025-06-20 02:43:18,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.58 | bwd: 3313.76 | bwd_inner: 3312.97 | bwd_allreduce: 0.75 | step: 6.54 84%|████████▍ | 8390/10000 [13:13:39<2:27:39, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.05749126523733139, 'learning_rate': 2.6579406132286225e-06, 'epoch': 8.39} 84%|████████▍ | 8390/10000 [13:13:39<2:27:39, 5.50s/it][2025-06-20 02:43:24,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:43:24,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.57 | bwd_microstep: 3318.40 | bwd_inner_microstep: 3317.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.69 [2025-06-20 02:43:24,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.57 | bwd: 3318.41 | bwd_inner: 3317.60 | bwd_allreduce: 0.77 | step: 6.69 84%|████████▍ | 8391/10000 [13:13:45<2:27:15, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00011779987107729539, 'learning_rate': 2.6547148873358186e-06, 'epoch': 8.39} 
84%|████████▍ | 8391/10000 [13:13:45<2:27:15, 5.49s/it][2025-06-20 02:43:29,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:43:29,807] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.80 | bwd_microstep: 3381.07 | bwd_inner_microstep: 3380.25 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.44 [2025-06-20 02:43:29,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.80 | bwd: 3381.09 | bwd_inner: 3380.25 | bwd_allreduce: 0.79 | step: 7.44 84%|████████▍ | 8392/10000 [13:13:50<2:27:39, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.008714811876416206, 'learning_rate': 2.651490980882916e-06, 'epoch': 8.39} 84%|████████▍ | 8392/10000 [13:13:50<2:27:39, 5.51s/it][2025-06-20 02:43:35,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:43:35,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.91 | bwd_microstep: 3327.90 | bwd_inner_microstep: 3327.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-20 02:43:35,286] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.91 | bwd: 3327.92 | bwd_inner: 3327.10 | bwd_allreduce: 0.77 | step: 6.92 84%|████████▍ | 8393/10000 [13:13:56<2:27:18, 5.50s/it] {'loss': 0.0005, 'grad_norm': 0.14367926120758057, 'learning_rate': 2.6482688942080945e-06, 'epoch': 8.39} 84%|████████▍ | 8393/10000 [13:13:56<2:27:18, 5.50s/it][2025-06-20 02:43:40,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:43:40,751] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.05 | bwd_microstep: 3320.05 | bwd_inner_microstep: 3319.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 02:43:40,751] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd: 2104.05 | bwd: 3320.06 | bwd_inner: 3319.27 | bwd_allreduce: 0.75 | step: 6.55 84%|████████▍ | 8394/10000 [13:14:01<2:26:55, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00015741500828880817, 'learning_rate': 2.6450486276493337e-06, 'epoch': 8.39} 84%|████████▍ | 8394/10000 [13:14:01<2:26:55, 5.49s/it][2025-06-20 02:43:46,284] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:43:46,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.68 | bwd_microstep: 3369.06 | bwd_inner_microstep: 3368.13 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.97 [2025-06-20 02:43:46,285] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.68 | bwd: 3369.07 | bwd_inner: 3368.13 | bwd_allreduce: 0.90 | step: 6.98 84%|████████▍ | 8395/10000 [13:14:07<2:27:12, 5.50s/it] {'loss': 0.0032, 'grad_norm': 2.47007417678833, 'learning_rate': 2.641830181544425e-06, 'epoch': 8.39} 84%|████████▍ | 8395/10000 [13:14:07<2:27:12, 5.50s/it][2025-06-20 02:43:51,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:43:51,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.03 | bwd_microstep: 3313.39 | bwd_inner_microstep: 3312.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 02:43:51,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.03 | bwd: 3313.40 | bwd_inner: 3312.61 | bwd_allreduce: 0.75 | step: 6.58 84%|████████▍ | 8396/10000 [13:14:12<2:26:42, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.0781266987323761, 'learning_rate': 2.638613556230967e-06, 'epoch': 8.4} 84%|████████▍ | 8396/10000 [13:14:12<2:26:42, 5.49s/it][2025-06-20 02:43:57,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:43:57,264] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.02 | bwd_microstep: 3363.17 | bwd_inner_microstep: 3362.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 02:43:57,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.02 | bwd: 3363.18 | bwd_inner: 3362.38 | bwd_allreduce: 0.76 | step: 6.64 84%|████████▍ | 8397/10000 [13:14:18<2:26:55, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0010209832107648253, 'learning_rate': 2.635398752046372e-06, 'epoch': 8.4} 84%|████████▍ | 8397/10000 [13:14:18<2:26:55, 5.50s/it][2025-06-20 02:44:02,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:44:02,802] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.50 | bwd_microstep: 3372.22 | bwd_inner_microstep: 3371.42 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-20 02:44:02,803] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.50 | bwd: 3372.24 | bwd_inner: 3371.42 | bwd_allreduce: 0.78 | step: 6.92 84%|████████▍ | 8398/10000 [13:14:23<2:27:09, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0037497133016586304, 'learning_rate': 2.632185769327855e-06, 'epoch': 8.4} 84%|████████▍ | 8398/10000 [13:14:23<2:27:09, 5.51s/it][2025-06-20 02:44:08,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:44:08,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.12 | bwd_microstep: 3363.67 | bwd_inner_microstep: 3362.83 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.39 [2025-06-20 02:44:08,338] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.12 | bwd: 3363.68 | bwd_inner: 3362.83 | bwd_allreduce: 0.81 | step: 7.40 84%|████████▍ | 8399/10000 [13:14:29<2:27:16, 5.52s/it] {'loss': 0.0, 'grad_norm': 2.1211548300925642e-05, 'learning_rate': 2.6289746084124444e-06, 
'epoch': 8.4} 84%|████████▍ | 8399/10000 [13:14:29<2:27:16, 5.52s/it][2025-06-20 02:44:13,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:44:13,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.76 | bwd_microstep: 3312.29 | bwd_inner_microstep: 3311.44 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.93 [2025-06-20 02:44:13,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.76 | bwd: 3312.30 | bwd_inner: 3311.44 | bwd_allreduce: 0.81 | step: 6.93 84%|████████▍ | 8400/10000 [13:14:34<2:26:37, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0006922634202055633, 'learning_rate': 2.6257652696369773e-06, 'epoch': 8.4} 84%|████████▍ | 8400/10000 [13:14:34<2:26:37, 5.50s/it][2025-06-20 02:44:19,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:44:19,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.59 | bwd_microstep: 3365.43 | bwd_inner_microstep: 3364.63 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 02:44:19,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.59 | bwd: 3365.44 | bwd_inner: 3364.63 | bwd_allreduce: 0.76 | step: 6.72 84%|████████▍ | 8401/10000 [13:14:40<2:26:46, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.04657415300607681, 'learning_rate': 2.6225577533380954e-06, 'epoch': 8.4} 84%|████████▍ | 8401/10000 [13:14:40<2:26:46, 5.51s/it][2025-06-20 02:44:24,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:44:24,868] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.80 | bwd_microstep: 3376.37 | bwd_inner_microstep: 3375.57 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 02:44:24,868] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.80 | bwd: 3376.38 | bwd_inner: 3375.57 | bwd_allreduce: 0.77 | step: 6.86 84%|████████▍ | 8402/10000 [13:14:45<2:27:00, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.005984584800899029, 'learning_rate': 2.6193520598522605e-06, 'epoch': 8.4} 84%|████████▍ | 8402/10000 [13:14:45<2:27:00, 5.52s/it][2025-06-20 02:44:30,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:44:30,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.93 | bwd_microstep: 3366.42 | bwd_inner_microstep: 3365.59 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.08 [2025-06-20 02:44:30,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.93 | bwd: 3366.43 | bwd_inner: 3365.59 | bwd_allreduce: 0.80 | step: 7.09 84%|████████▍ | 8403/10000 [13:14:51<2:26:59, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0025714761577546597, 'learning_rate': 2.6161481895157237e-06, 'epoch': 8.4} 84%|████████▍ | 8403/10000 [13:14:51<2:26:59, 5.52s/it][2025-06-20 02:44:35,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:44:35,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.16 | bwd_microstep: 3361.27 | bwd_inner_microstep: 3360.48 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 02:44:35,925] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.16 | bwd: 3361.29 | bwd_inner: 3360.48 | bwd_allreduce: 0.77 | step: 6.85 84%|████████▍ | 8404/10000 [13:14:56<2:26:56, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.04860146343708038, 'learning_rate': 2.6129461426645586e-06, 'epoch': 8.4} 84%|████████▍ | 8404/10000 [13:14:56<2:26:56, 5.52s/it][2025-06-20 02:44:41,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | 
optimizer_step: 2.73 [2025-06-20 02:44:41,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.04 | bwd_microstep: 3317.81 | bwd_inner_microstep: 3316.69 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.92 [2025-06-20 02:44:41,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.04 | bwd: 3317.83 | bwd_inner: 3316.69 | bwd_allreduce: 1.08 | step: 7.92 84%|████████▍ | 8405/10000 [13:15:02<2:26:19, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0001531200687168166, 'learning_rate': 2.6097459196346496e-06, 'epoch': 8.4} 84%|████████▍ | 8405/10000 [13:15:02<2:26:19, 5.50s/it][2025-06-20 02:44:46,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:44:46,920] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.16 | bwd_microstep: 3370.11 | bwd_inner_microstep: 3369.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 02:44:46,921] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.16 | bwd: 3370.13 | bwd_inner: 3369.32 | bwd_allreduce: 0.76 | step: 6.58 84%|████████▍ | 8406/10000 [13:15:07<2:26:29, 5.51s/it] {'loss': 0.0008, 'grad_norm': 0.38213589787483215, 'learning_rate': 2.606547520761682e-06, 'epoch': 8.41} 84%|████████▍ | 8406/10000 [13:15:07<2:26:29, 5.51s/it][2025-06-20 02:44:52,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:44:52,385] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.26 | bwd_microstep: 3327.38 | bwd_inner_microstep: 3326.42 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.06 [2025-06-20 02:44:52,386] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.26 | bwd: 3327.39 | bwd_inner: 3326.42 | bwd_allreduce: 0.93 | step: 7.06 84%|████████▍ | 8407/10000 [13:15:13<2:26:00, 5.50s/it] {'loss': 0.0001, 'grad_norm': 
0.019947776570916176, 'learning_rate': 2.603350946381158e-06, 'epoch': 8.41} 84%|████████▍ | 8407/10000 [13:15:13<2:26:00, 5.50s/it][2025-06-20 02:44:57,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:44:57,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.68 | bwd_microstep: 3307.39 | bwd_inner_microstep: 3306.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-20 02:44:57,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.68 | bwd: 3307.41 | bwd_inner: 3306.59 | bwd_allreduce: 0.77 | step: 7.00 84%|████████▍ | 8408/10000 [13:15:18<2:25:30, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0002970476052723825, 'learning_rate': 2.6001561968283762e-06, 'epoch': 8.41} 84%|████████▍ | 8408/10000 [13:15:18<2:25:30, 5.48s/it][2025-06-20 02:45:03,349] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:45:03,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.92 | bwd_microstep: 3356.26 | bwd_inner_microstep: 3355.20 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.16 [2025-06-20 02:45:03,350] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.92 | bwd: 3356.28 | bwd_inner: 3355.20 | bwd_allreduce: 1.02 | step: 7.16 84%|████████▍ | 8409/10000 [13:15:24<2:25:40, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.003986665513366461, 'learning_rate': 2.596963272438453e-06, 'epoch': 8.41} 84%|████████▍ | 8409/10000 [13:15:24<2:25:40, 5.49s/it][2025-06-20 02:45:08,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:45:08,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.19 | bwd_microstep: 3317.49 | bwd_inner_microstep: 3316.52 | bwd_allreduce_microstep: 0.90 | step_microstep: 
6.96 [2025-06-20 02:45:08,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.19 | bwd: 3317.51 | bwd_inner: 3316.52 | bwd_allreduce: 0.93 | step: 6.96 84%|████████▍ | 8410/10000 [13:15:29<2:25:20, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0005642186151817441, 'learning_rate': 2.593772173546314e-06, 'epoch': 8.41} 84%|████████▍ | 8410/10000 [13:15:29<2:25:20, 5.48s/it][2025-06-20 02:45:14,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:45:14,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.45 | bwd_microstep: 3363.08 | bwd_inner_microstep: 3362.28 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 02:45:14,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.45 | bwd: 3363.10 | bwd_inner: 3362.28 | bwd_allreduce: 0.78 | step: 6.81 84%|████████▍ | 8411/10000 [13:15:35<2:25:41, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0014110348420217633, 'learning_rate': 2.5905829004866865e-06, 'epoch': 8.41} 84%|████████▍ | 8411/10000 [13:15:35<2:25:41, 5.50s/it][2025-06-20 02:45:19,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:45:19,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.54 | bwd_microstep: 3328.54 | bwd_inner_microstep: 3327.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:45:19,837] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.54 | bwd: 3328.55 | bwd_inner: 3327.76 | bwd_allreduce: 0.75 | step: 6.62 84%|████████▍ | 8412/10000 [13:15:40<2:25:27, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.042623039335012436, 'learning_rate': 2.5873954535941194e-06, 'epoch': 8.41} 84%|████████▍ | 8412/10000 [13:15:40<2:25:27, 5.50s/it][2025-06-20 02:45:25,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:45:25,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.31 | bwd_microstep: 3357.77 | bwd_inner_microstep: 3356.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-20 02:45:25,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.31 | bwd: 3357.79 | bwd_inner: 3356.98 | bwd_allreduce: 0.77 | step: 7.06 84%|████████▍ | 8413/10000 [13:15:46<2:25:35, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005370600149035454, 'learning_rate': 2.584209833202951e-06, 'epoch': 8.41} 84%|████████▍ | 8413/10000 [13:15:46<2:25:35, 5.50s/it][2025-06-20 02:45:30,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:45:30,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.57 | bwd_microstep: 3374.81 | bwd_inner_microstep: 3374.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:45:30,896] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.57 | bwd: 3374.83 | bwd_inner: 3374.03 | bwd_allreduce: 0.76 | step: 6.63 84%|████████▍ | 8414/10000 [13:15:51<2:25:44, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00038310018135234714, 'learning_rate': 2.5810260396473407e-06, 'epoch': 8.41} 84%|████████▍ | 8414/10000 [13:15:51<2:25:44, 5.51s/it][2025-06-20 02:45:36,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:45:36,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.09 | bwd_microstep: 3328.60 | bwd_inner_microstep: 3327.64 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.29 [2025-06-20 02:45:36,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.09 | bwd: 3328.61 | bwd_inner: 3327.64 | bwd_allreduce: 0.93 | step: 7.30 84%|████████▍ | 8415/10000 [13:15:57<2:25:20, 5.50s/it] {'loss': 
0.0032, 'grad_norm': 0.8357073664665222, 'learning_rate': 2.5778440732612553e-06, 'epoch': 8.41} 84%|████████▍ | 8415/10000 [13:15:57<2:25:20, 5.50s/it][2025-06-20 02:45:41,827] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:45:41,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.86 | bwd_microstep: 3312.66 | bwd_inner_microstep: 3311.83 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.18 [2025-06-20 02:45:41,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.86 | bwd: 3312.67 | bwd_inner: 3311.83 | bwd_allreduce: 0.80 | step: 7.18 84%|████████▍ | 8416/10000 [13:16:02<2:24:54, 5.49s/it] {'loss': 0.0, 'grad_norm': 2.954720002890099e-05, 'learning_rate': 2.5746639343784674e-06, 'epoch': 8.42} 84%|████████▍ | 8416/10000 [13:16:02<2:24:54, 5.49s/it][2025-06-20 02:45:47,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:45:47,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.01 | bwd_microstep: 3359.67 | bwd_inner_microstep: 3358.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-20 02:45:47,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.01 | bwd: 3359.68 | bwd_inner: 3358.88 | bwd_allreduce: 0.76 | step: 6.74 84%|████████▍ | 8417/10000 [13:16:08<2:25:06, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0025448796804994345, 'learning_rate': 2.5714856233325616e-06, 'epoch': 8.42} 84%|████████▍ | 8417/10000 [13:16:08<2:25:06, 5.50s/it][2025-06-20 02:45:52,808] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:45:52,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.56 | bwd_microstep: 3311.27 | bwd_inner_microstep: 3310.48 | bwd_allreduce_microstep: 
0.74 | step_microstep: 6.62 [2025-06-20 02:45:52,809] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.56 | bwd: 3311.28 | bwd_inner: 3310.48 | bwd_allreduce: 0.76 | step: 6.62 84%|████████▍ | 8418/10000 [13:16:13<2:24:38, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.008645678870379925, 'learning_rate': 2.5683091404569236e-06, 'epoch': 8.42} 84%|████████▍ | 8418/10000 [13:16:13<2:24:38, 5.49s/it][2025-06-20 02:45:58,268] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:45:58,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.77 | bwd_microstep: 3316.34 | bwd_inner_microstep: 3315.44 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.92 [2025-06-20 02:45:58,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.78 | bwd: 3316.36 | bwd_inner: 3315.44 | bwd_allreduce: 0.87 | step: 6.93 84%|████████▍ | 8419/10000 [13:16:19<2:24:21, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002218744484707713, 'learning_rate': 2.5651344860847526e-06, 'epoch': 8.42} 84%|████████▍ | 8419/10000 [13:16:19<2:24:21, 5.48s/it][2025-06-20 02:46:03,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:46:03,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.92 | bwd_microstep: 3317.80 | bwd_inner_microstep: 3317.02 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 02:46:03,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.91 | bwd: 3317.81 | bwd_inner: 3317.02 | bwd_allreduce: 0.75 | step: 6.57 84%|████████▍ | 8420/10000 [13:16:24<2:24:07, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.016469625756144524, 'learning_rate': 2.5619616605490595e-06, 'epoch': 8.42} 84%|████████▍ | 8420/10000 [13:16:24<2:24:07, 5.47s/it][2025-06-20 02:46:09,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:46:09,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.26 | bwd_microstep: 3356.55 | bwd_inner_microstep: 3355.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 02:46:09,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.26 | bwd: 3356.56 | bwd_inner: 3355.76 | bwd_allreduce: 0.76 | step: 6.79 84%|████████▍ | 8421/10000 [13:16:30<2:24:22, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0055627115070819855, 'learning_rate': 2.5587906641826533e-06, 'epoch': 8.42} 84%|████████▍ | 8421/10000 [13:16:30<2:24:22, 5.49s/it][2025-06-20 02:46:14,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:46:14,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.80 | bwd_microstep: 3361.99 | bwd_inner_microstep: 3361.20 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-20 02:46:14,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.80 | bwd: 3362.01 | bwd_inner: 3361.20 | bwd_allreduce: 0.77 | step: 6.90 84%|████████▍ | 8422/10000 [13:16:35<2:24:39, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0001484481617808342, 'learning_rate': 2.555621497318164e-06, 'epoch': 8.42} 84%|████████▍ | 8422/10000 [13:16:35<2:24:39, 5.50s/it][2025-06-20 02:46:20,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 02:46:20,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.46 | bwd_microstep: 3311.87 | bwd_inner_microstep: 3310.79 | bwd_allreduce_microstep: 1.02 | step_microstep: 7.33 [2025-06-20 02:46:20,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.46 | bwd: 3311.89 | bwd_inner: 3310.79 | bwd_allreduce: 1.04 | step: 7.33 84%|████████▍ | 8423/10000 
[13:16:41<2:24:11, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006872450467199087, 'learning_rate': 2.552454160288014e-06, 'epoch': 8.42} 84%|████████▍ | 8423/10000 [13:16:41<2:24:11, 5.49s/it][2025-06-20 02:46:25,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:46:25,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.25 | bwd_microstep: 3313.40 | bwd_inner_microstep: 3312.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 02:46:25,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.25 | bwd: 3313.41 | bwd_inner: 3312.61 | bwd_allreduce: 0.76 | step: 6.85 84%|████████▍ | 8424/10000 [13:16:46<2:23:51, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0004280737484805286, 'learning_rate': 2.5492886534244466e-06, 'epoch': 8.42} 84%|████████▍ | 8424/10000 [13:16:46<2:23:51, 5.48s/it][2025-06-20 02:46:31,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:46:31,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.33 | bwd_microstep: 3362.89 | bwd_inner_microstep: 3362.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-20 02:46:31,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.33 | bwd: 3362.90 | bwd_inner: 3362.11 | bwd_allreduce: 0.75 | step: 6.56 84%|████████▍ | 8425/10000 [13:16:52<2:24:11, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0010366428177803755, 'learning_rate': 2.5461249770595074e-06, 'epoch': 8.43} 84%|████████▍ | 8425/10000 [13:16:52<2:24:11, 5.49s/it][2025-06-20 02:46:36,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:46:36,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.42 | bwd_microstep: 3321.30 | bwd_inner_microstep: 
3320.51 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 02:46:36,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.42 | bwd: 3321.31 | bwd_inner: 3320.51 | bwd_allreduce: 0.76 | step: 6.60 84%|████████▍ | 8426/10000 [13:16:57<2:23:52, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002436357084661722, 'learning_rate': 2.5429631315250514e-06, 'epoch': 8.43} 84%|████████▍ | 8426/10000 [13:16:57<2:23:52, 5.48s/it][2025-06-20 02:46:42,195] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:46:42,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.86 | bwd_microstep: 3353.24 | bwd_inner_microstep: 3352.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-20 02:46:42,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.86 | bwd: 3353.25 | bwd_inner: 3352.46 | bwd_allreduce: 0.75 | step: 6.53 84%|████████▍ | 8427/10000 [13:17:02<2:24:00, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00025369267677888274, 'learning_rate': 2.539803117152748e-06, 'epoch': 8.43} 84%|████████▍ | 8427/10000 [13:17:02<2:24:00, 5.49s/it][2025-06-20 02:46:47,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:46:47,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.32 | bwd_microstep: 3311.90 | bwd_inner_microstep: 3310.93 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.12 [2025-06-20 02:46:47,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.32 | bwd: 3311.92 | bwd_inner: 3310.93 | bwd_allreduce: 0.94 | step: 7.12 84%|████████▍ | 8428/10000 [13:17:08<2:23:35, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00682902941480279, 'learning_rate': 2.5366449342740572e-06, 'epoch': 8.43} 84%|████████▍ | 8428/10000 [13:17:08<2:23:35, 5.48s/it][2025-06-20 02:46:53,103] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:46:53,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.12 | bwd_microstep: 3311.57 | bwd_inner_microstep: 3310.75 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.18 [2025-06-20 02:46:53,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.12 | bwd: 3311.59 | bwd_inner: 3310.75 | bwd_allreduce: 0.79 | step: 7.19 84%|████████▍ | 8429/10000 [13:17:13<2:23:19, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00018941896269097924, 'learning_rate': 2.5334885832202626e-06, 'epoch': 8.43} 84%|████████▍ | 8429/10000 [13:17:13<2:23:19, 5.47s/it][2025-06-20 02:46:58,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:46:58,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.52 | bwd_microstep: 3312.87 | bwd_inner_microstep: 3312.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-20 02:46:58,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.52 | bwd: 3312.89 | bwd_inner: 3312.09 | bwd_allreduce: 0.76 | step: 6.81 84%|████████▍ | 8430/10000 [13:17:19<2:23:05, 5.47s/it] {'loss': 0.0, 'grad_norm': 8.436704956693575e-05, 'learning_rate': 2.530334064322453e-06, 'epoch': 8.43} 84%|████████▍ | 8430/10000 [13:17:19<2:23:05, 5.47s/it][2025-06-20 02:47:04,016] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:47:04,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.34 | bwd_microstep: 3312.31 | bwd_inner_microstep: 3311.46 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.98 [2025-06-20 02:47:04,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.34 | bwd: 3312.33 | bwd_inner: 3311.46 | bwd_allreduce: 0.81 | 
step: 6.98 84%|████████▍ | 8431/10000 [13:17:24<2:22:55, 5.47s/it] {'loss': 0.0, 'grad_norm': 1.6139269064296968e-05, 'learning_rate': 2.5271813779115182e-06, 'epoch': 8.43} 84%|████████▍ | 8431/10000 [13:17:24<2:22:55, 5.47s/it][2025-06-20 02:47:09,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:47:09,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.20 | bwd_microstep: 3362.97 | bwd_inner_microstep: 3362.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.59 [2025-06-20 02:47:09,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.20 | bwd: 3362.98 | bwd_inner: 3362.18 | bwd_allreduce: 0.76 | step: 6.59 84%|████████▍ | 8432/10000 [13:17:30<2:23:17, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.023335127159953117, 'learning_rate': 2.524030524318164e-06, 'epoch': 8.43} 84%|████████▍ | 8432/10000 [13:17:30<2:23:17, 5.48s/it][2025-06-20 02:47:14,992] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:47:14,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.72 | bwd_microstep: 3310.84 | bwd_inner_microstep: 3310.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 02:47:14,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.72 | bwd: 3310.85 | bwd_inner: 3310.04 | bwd_allreduce: 0.77 | step: 6.70 84%|████████▍ | 8433/10000 [13:17:35<2:22:55, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00020884005061816424, 'learning_rate': 2.5208815038729007e-06, 'epoch': 8.43} 84%|████████▍ | 8433/10000 [13:17:35<2:22:55, 5.47s/it][2025-06-20 02:47:20,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:47:20,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.34 | 
bwd_microstep: 3325.31 | bwd_inner_microstep: 3324.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 02:47:20,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.34 | bwd: 3325.32 | bwd_inner: 3324.51 | bwd_allreduce: 0.77 | step: 6.68 84%|████████▍ | 8434/10000 [13:17:41<2:22:50, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.0064150188118219376, 'learning_rate': 2.517734316906042e-06, 'epoch': 8.43} 84%|████████▍ | 8434/10000 [13:17:41<2:22:50, 5.47s/it][2025-06-20 02:47:26,000] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:47:26,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.65 | bwd_microstep: 3359.99 | bwd_inner_microstep: 3359.21 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:47:26,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.65 | bwd: 3360.01 | bwd_inner: 3359.21 | bwd_allreduce: 0.76 | step: 6.62 84%|████████▍ | 8435/10000 [13:17:46<2:23:14, 5.49s/it] {'loss': 0.0004, 'grad_norm': 0.07991907000541687, 'learning_rate': 2.5145889637477174e-06, 'epoch': 8.44} 84%|████████▍ | 8435/10000 [13:17:46<2:23:14, 5.49s/it][2025-06-20 02:47:31,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:47:31,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.68 | bwd_microstep: 3360.78 | bwd_inner_microstep: 3359.78 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.07 [2025-06-20 02:47:31,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.68 | bwd: 3360.80 | bwd_inner: 3359.78 | bwd_allreduce: 0.96 | step: 7.07 84%|████████▍ | 8436/10000 [13:17:52<2:23:24, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0017459274968132377, 'learning_rate': 2.511445444727858e-06, 'epoch': 8.44} 84%|████████▍ | 8436/10000 [13:17:52<2:23:24, 
5.50s/it][2025-06-20 02:47:37,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:47:37,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.54 | bwd_microstep: 3366.37 | bwd_inner_microstep: 3365.46 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.87 [2025-06-20 02:47:37,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.54 | bwd: 3366.39 | bwd_inner: 3365.46 | bwd_allreduce: 0.88 | step: 6.87 84%|████████▍ | 8437/10000 [13:17:57<2:23:29, 5.51s/it] {'loss': 0.0, 'grad_norm': 9.258586942451075e-05, 'learning_rate': 2.5083037601762094e-06, 'epoch': 8.44} 84%|████████▍ | 8437/10000 [13:17:57<2:23:29, 5.51s/it][2025-06-20 02:47:42,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:47:42,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.65 | bwd_microstep: 3317.94 | bwd_inner_microstep: 3317.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 02:47:42,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.65 | bwd: 3317.95 | bwd_inner: 3317.15 | bwd_allreduce: 0.76 | step: 6.65 84%|████████▍ | 8438/10000 [13:18:03<2:22:58, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.011341293342411518, 'learning_rate': 2.50516391042231e-06, 'epoch': 8.44} 84%|████████▍ | 8438/10000 [13:18:03<2:22:58, 5.49s/it][2025-06-20 02:47:47,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:47:47,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.57 | bwd_microstep: 3316.24 | bwd_inner_microstep: 3315.33 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.13 [2025-06-20 02:47:47,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.57 | bwd: 3316.25 | 
bwd_inner: 3315.33 | bwd_allreduce: 0.88 | step: 7.13 84%|████████▍ | 8439/10000 [13:18:08<2:22:39, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.055420540273189545, 'learning_rate': 2.502025895795519e-06, 'epoch': 8.44} 84%|████████▍ | 8439/10000 [13:18:08<2:22:39, 5.48s/it][2025-06-20 02:47:53,510] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:47:53,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.69 | bwd_microstep: 3368.37 | bwd_inner_microstep: 3367.41 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.27 [2025-06-20 02:47:53,511] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.69 | bwd: 3368.38 | bwd_inner: 3367.41 | bwd_allreduce: 0.93 | step: 7.27 84%|████████▍ | 8440/10000 [13:18:14<2:23:02, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.016368113458156586, 'learning_rate': 2.498889716625004e-06, 'epoch': 8.44} 84%|████████▍ | 8440/10000 [13:18:14<2:23:02, 5.50s/it][2025-06-20 02:47:58,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:47:58,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.12 | bwd_microstep: 3314.68 | bwd_inner_microstep: 3313.90 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.52 [2025-06-20 02:47:58,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.12 | bwd: 3314.69 | bwd_inner: 3313.90 | bwd_allreduce: 0.75 | step: 6.52 84%|████████▍ | 8441/10000 [13:18:19<2:22:38, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0008685251814313233, 'learning_rate': 2.49575537323973e-06, 'epoch': 8.44} 84%|████████▍ | 8441/10000 [13:18:19<2:22:38, 5.49s/it][2025-06-20 02:48:04,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:48:04,422] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2102.74 | bwd_microstep: 3309.29 | bwd_inner_microstep: 3308.52 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 02:48:04,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.74 | bwd: 3309.31 | bwd_inner: 3308.52 | bwd_allreduce: 0.75 | step: 6.59 84%|████████▍ | 8442/10000 [13:18:25<2:22:13, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00028044014470651746, 'learning_rate': 2.4926228659684813e-06, 'epoch': 8.44} 84%|████████▍ | 8442/10000 [13:18:25<2:22:13, 5.48s/it][2025-06-20 02:48:09,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:48:09,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.02 | bwd_microstep: 3314.39 | bwd_inner_microstep: 3313.55 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.81 [2025-06-20 02:48:09,878] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.02 | bwd: 3314.41 | bwd_inner: 3313.55 | bwd_allreduce: 0.81 | step: 6.80 84%|████████▍ | 8443/10000 [13:18:30<2:21:59, 5.47s/it] {'loss': 0.0005, 'grad_norm': 0.18623162806034088, 'learning_rate': 2.4894921951398377e-06, 'epoch': 8.44} 84%|████████▍ | 8443/10000 [13:18:30<2:21:59, 5.47s/it][2025-06-20 02:48:15,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:48:15,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.13 | bwd_microstep: 3379.68 | bwd_inner_microstep: 3378.69 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.24 [2025-06-20 02:48:15,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.13 | bwd: 3379.69 | bwd_inner: 3378.69 | bwd_allreduce: 0.95 | step: 7.25 84%|████████▍ | 8444/10000 [13:18:36<2:22:35, 5.50s/it] {'loss': 0.0, 'grad_norm': 5.122927905176766e-05, 'learning_rate': 2.4863633610821935e-06, 'epoch': 8.44} 84%|████████▍ | 
8444/10000 [13:18:36<2:22:35, 5.50s/it][2025-06-20 02:48:20,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:48:20,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.64 | bwd_microstep: 3325.00 | bwd_inner_microstep: 3324.18 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.77 [2025-06-20 02:48:20,915] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.63 | bwd: 3325.02 | bwd_inner: 3324.18 | bwd_allreduce: 0.79 | step: 6.78 84%|████████▍ | 8445/10000 [13:18:41<2:22:20, 5.49s/it] {'loss': 0.0008, 'grad_norm': 0.41692543029785156, 'learning_rate': 2.483236364123749e-06, 'epoch': 8.45} 84%|████████▍ | 8445/10000 [13:18:41<2:22:20, 5.49s/it][2025-06-20 02:48:26,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:48:26,369] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.21 | bwd_microstep: 3316.46 | bwd_inner_microstep: 3315.54 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.86 [2025-06-20 02:48:26,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.21 | bwd: 3316.47 | bwd_inner: 3315.55 | bwd_allreduce: 0.88 | step: 6.87 84%|████████▍ | 8446/10000 [13:18:47<2:21:55, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00015189488476607949, 'learning_rate': 2.4801112045925124e-06, 'epoch': 8.45} 84%|████████▍ | 8446/10000 [13:18:47<2:21:55, 5.48s/it][2025-06-20 02:48:31,828] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:48:31,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.45 | bwd_microstep: 3315.95 | bwd_inner_microstep: 3315.13 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.07 [2025-06-20 02:48:31,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2105.45 | bwd: 3315.96 | bwd_inner: 3315.13 | bwd_allreduce: 0.79 | step: 7.08 84%|████████▍ | 8447/10000 [13:18:52<2:21:42, 5.47s/it] {'loss': 0.0002, 'grad_norm': 0.03637533262372017, 'learning_rate': 2.476987882816302e-06, 'epoch': 8.45} 84%|████████▍ | 8447/10000 [13:18:52<2:21:42, 5.47s/it][2025-06-20 02:48:37,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:48:37,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.54 | bwd_microstep: 3368.33 | bwd_inner_microstep: 3367.50 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.85 [2025-06-20 02:48:37,375] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.54 | bwd: 3368.35 | bwd_inner: 3367.50 | bwd_allreduce: 0.80 | step: 6.86 84%|████████▍ | 8448/10000 [13:18:58<2:22:09, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0002673137350939214, 'learning_rate': 2.473866399122733e-06, 'epoch': 8.45} 84%|████████▍ | 8448/10000 [13:18:58<2:22:09, 5.50s/it][2025-06-20 02:48:42,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:48:42,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.40 | bwd_microstep: 3314.41 | bwd_inner_microstep: 3313.61 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-20 02:48:42,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.40 | bwd: 3314.42 | bwd_inner: 3313.61 | bwd_allreduce: 0.77 | step: 6.66 84%|████████▍ | 8449/10000 [13:19:03<2:21:44, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.08955588191747665, 'learning_rate': 2.4707467538392347e-06, 'epoch': 8.45} 84%|████████▍ | 8449/10000 [13:19:03<2:21:44, 5.48s/it][2025-06-20 02:48:48,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:48:48,381] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.92 | bwd_microstep: 3372.42 | bwd_inner_microstep: 3371.64 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-20 02:48:48,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.92 | bwd: 3372.43 | bwd_inner: 3371.64 | bwd_allreduce: 0.75 | step: 6.54 84%|████████▍ | 8450/10000 [13:19:09<2:22:09, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002152177505195141, 'learning_rate': 2.467628947293048e-06, 'epoch': 8.45} 84%|████████▍ | 8450/10000 [13:19:09<2:22:09, 5.50s/it][2025-06-20 02:48:53,854] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:48:53,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.14 | bwd_microstep: 3333.23 | bwd_inner_microstep: 3332.45 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.58 [2025-06-20 02:48:53,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.14 | bwd: 3333.25 | bwd_inner: 3332.45 | bwd_allreduce: 0.75 | step: 6.58 85%|████████▍ | 8451/10000 [13:19:14<2:21:50, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.03083307109773159, 'learning_rate': 2.4645129798112154e-06, 'epoch': 8.45} 85%|████████▍ | 8451/10000 [13:19:14<2:21:50, 5.49s/it][2025-06-20 02:48:59,395] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:48:59,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.93 | bwd_microstep: 3372.93 | bwd_inner_microstep: 3372.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-20 02:48:59,396] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.94 | bwd: 3372.95 | bwd_inner: 3372.13 | bwd_allreduce: 0.77 | step: 7.06 85%|████████▍ | 8452/10000 [13:19:20<2:22:07, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0008673274423927069, 'learning_rate': 2.4613988517205845e-06, 
'epoch': 8.45} 85%|████████▍ | 8452/10000 [13:19:20<2:22:07, 5.51s/it][2025-06-20 02:49:04,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:49:04,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.53 | bwd_microstep: 3362.94 | bwd_inner_microstep: 3362.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 02:49:04,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.53 | bwd: 3362.96 | bwd_inner: 3362.15 | bwd_allreduce: 0.76 | step: 6.70 85%|████████▍ | 8453/10000 [13:19:25<2:22:13, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.021897202357649803, 'learning_rate': 2.4582865633478158e-06, 'epoch': 8.45} 85%|████████▍ | 8453/10000 [13:19:25<2:22:13, 5.52s/it][2025-06-20 02:49:10,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-20 02:49:10,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.53 | bwd_microstep: 3375.72 | bwd_inner_microstep: 3374.91 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-20 02:49:10,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.53 | bwd: 3375.73 | bwd_inner: 3374.91 | bwd_allreduce: 0.77 | step: 7.24 85%|████████▍ | 8454/10000 [13:19:31<2:22:26, 5.53s/it] {'loss': 0.0, 'grad_norm': 3.108395685558207e-05, 'learning_rate': 2.455176115019373e-06, 'epoch': 8.45} 85%|████████▍ | 8454/10000 [13:19:31<2:22:26, 5.53s/it][2025-06-20 02:49:15,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.88 [2025-06-20 02:49:15,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.62 | bwd_microstep: 3322.79 | bwd_inner_microstep: 3321.96 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.93 [2025-06-20 02:49:15,955] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.62 | bwd: 3322.81 | bwd_inner: 3321.96 | bwd_allreduce: 0.79 | step: 6.93 85%|████████▍ | 8455/10000 [13:19:36<2:21:54, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.008812882006168365, 'learning_rate': 2.452067507061526e-06, 'epoch': 8.46} 85%|████████▍ | 8455/10000 [13:19:36<2:21:54, 5.51s/it][2025-06-20 02:49:21,432] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:49:21,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.48 | bwd_microstep: 3325.66 | bwd_inner_microstep: 3324.87 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 02:49:21,433] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.48 | bwd: 3325.67 | bwd_inner: 3324.87 | bwd_allreduce: 0.76 | step: 6.62 85%|████████▍ | 8456/10000 [13:19:42<2:21:32, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0011757572647184134, 'learning_rate': 2.448960739800357e-06, 'epoch': 8.46} 85%|████████▍ | 8456/10000 [13:19:42<2:21:32, 5.50s/it][2025-06-20 02:49:26,903] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:49:26,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.20 | bwd_microstep: 3324.88 | bwd_inner_microstep: 3324.07 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.86 [2025-06-20 02:49:26,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.20 | bwd: 3324.90 | bwd_inner: 3324.07 | bwd_allreduce: 0.79 | step: 6.86 85%|████████▍ | 8457/10000 [13:19:47<2:21:13, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00010103803651873022, 'learning_rate': 2.4458558135617504e-06, 'epoch': 8.46} 85%|████████▍ | 8457/10000 [13:19:47<2:21:13, 5.49s/it][2025-06-20 02:49:32,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | 
optimizer_step: 2.73 [2025-06-20 02:49:32,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.53 | bwd_microstep: 3323.22 | bwd_inner_microstep: 3322.41 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.74 [2025-06-20 02:49:32,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.53 | bwd: 3323.23 | bwd_inner: 3322.41 | bwd_allreduce: 0.78 | step: 6.75 85%|████████▍ | 8458/10000 [13:19:53<2:21:09, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005112923681735992, 'learning_rate': 2.4427527286714004e-06, 'epoch': 8.46} 85%|████████▍ | 8458/10000 [13:19:53<2:21:09, 5.49s/it][2025-06-20 02:49:37,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:49:37,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.63 | bwd_microstep: 3330.15 | bwd_inner_microstep: 3329.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 02:49:37,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.63 | bwd: 3330.16 | bwd_inner: 3329.35 | bwd_allreduce: 0.77 | step: 7.01 85%|████████▍ | 8459/10000 [13:19:58<2:20:58, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0034787519834935665, 'learning_rate': 2.4396514854547993e-06, 'epoch': 8.46} 85%|████████▍ | 8459/10000 [13:19:58<2:20:58, 5.49s/it][2025-06-20 02:49:43,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:49:43,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.64 | bwd_microstep: 3320.91 | bwd_inner_microstep: 3319.98 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.10 [2025-06-20 02:49:43,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.64 | bwd: 3320.92 | bwd_inner: 3319.98 | bwd_allreduce: 0.90 | step: 7.11 85%|████████▍ | 8460/10000 [13:20:04<2:20:43, 5.48s/it] {'loss': 0.0, 'grad_norm': 
0.0004191109328530729, 'learning_rate': 2.436552084237258e-06, 'epoch': 8.46} 85%|████████▍ | 8460/10000 [13:20:04<2:20:43, 5.48s/it][2025-06-20 02:49:48,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:49:48,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.62 | bwd_microstep: 3324.20 | bwd_inner_microstep: 3323.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 02:49:48,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.62 | bwd: 3324.22 | bwd_inner: 3323.40 | bwd_allreduce: 0.77 | step: 6.72 85%|████████▍ | 8461/10000 [13:20:09<2:20:33, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0019364956533536315, 'learning_rate': 2.433454525343888e-06, 'epoch': 8.46} 85%|████████▍ | 8461/10000 [13:20:09<2:20:33, 5.48s/it][2025-06-20 02:49:54,314] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:49:54,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.25 | bwd_microstep: 3328.63 | bwd_inner_microstep: 3327.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.79 [2025-06-20 02:49:54,315] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.25 | bwd: 3328.65 | bwd_inner: 3327.81 | bwd_allreduce: 0.79 | step: 6.79 85%|████████▍ | 8462/10000 [13:20:15<2:20:35, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00023979603429324925, 'learning_rate': 2.430358809099609e-06, 'epoch': 8.46} 85%|████████▍ | 8462/10000 [13:20:15<2:20:35, 5.48s/it][2025-06-20 02:49:59,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.68 | optimizer_step: 2.87 [2025-06-20 02:49:59,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.27 | bwd_microstep: 3322.64 | bwd_inner_microstep: 3321.85 | bwd_allreduce_microstep: 0.74 | 
step_microstep: 7.02 [2025-06-20 02:49:59,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.27 | bwd: 3322.65 | bwd_inner: 3321.85 | bwd_allreduce: 0.76 | step: 7.02 85%|████████▍ | 8463/10000 [13:20:20<2:20:24, 5.48s/it] {'loss': 0.0082, 'grad_norm': 5.30076789855957, 'learning_rate': 2.427264935829152e-06, 'epoch': 8.46} 85%|████████▍ | 8463/10000 [13:20:20<2:20:24, 5.48s/it][2025-06-20 02:50:05,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:50:05,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.73 | bwd_microstep: 3330.39 | bwd_inner_microstep: 3329.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 02:50:05,273] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.73 | bwd: 3330.40 | bwd_inner: 3329.61 | bwd_allreduce: 0.75 | step: 6.61 85%|████████▍ | 8464/10000 [13:20:26<2:20:19, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.16923895478248596, 'learning_rate': 2.4241729058570405e-06, 'epoch': 8.46} 85%|████████▍ | 8464/10000 [13:20:26<2:20:19, 5.48s/it][2025-06-20 02:50:10,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:50:10,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.14 | bwd_microstep: 3381.56 | bwd_inner_microstep: 3380.45 | bwd_allreduce_microstep: 1.04 | step_microstep: 8.06 [2025-06-20 02:50:10,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.14 | bwd: 3381.58 | bwd_inner: 3380.45 | bwd_allreduce: 1.07 | step: 8.07 85%|████████▍ | 8465/10000 [13:20:31<2:20:48, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.087321937084198, 'learning_rate': 2.4210827195076214e-06, 'epoch': 8.46} 85%|████████▍ | 8465/10000 [13:20:31<2:20:48, 5.50s/it][2025-06-20 02:50:16,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:50:16,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.73 | bwd_microstep: 3331.38 | bwd_inner_microstep: 3330.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-20 02:50:16,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.73 | bwd: 3331.39 | bwd_inner: 3330.57 | bwd_allreduce: 0.78 | step: 6.93 85%|████████▍ | 8466/10000 [13:20:37<2:20:36, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00033700827043503523, 'learning_rate': 2.4179943771050374e-06, 'epoch': 8.47} 85%|████████▍ | 8466/10000 [13:20:37<2:20:36, 5.50s/it][2025-06-20 02:50:21,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:50:21,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.05 | bwd_microstep: 3336.89 | bwd_inner_microstep: 3336.01 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.82 [2025-06-20 02:50:21,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.06 | bwd: 3336.90 | bwd_inner: 3336.01 | bwd_allreduce: 0.85 | step: 6.82 85%|████████▍ | 8467/10000 [13:20:42<2:20:25, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.011247268877923489, 'learning_rate': 2.414907878973243e-06, 'epoch': 8.47} 85%|████████▍ | 8467/10000 [13:20:42<2:20:25, 5.50s/it][2025-06-20 02:50:27,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.91 [2025-06-20 02:50:27,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.42 | bwd_microstep: 3383.11 | bwd_inner_microstep: 3382.31 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.15 [2025-06-20 02:50:27,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.42 | bwd: 3383.13 | bwd_inner: 3382.31 | bwd_allreduce: 0.77 | step: 7.15 85%|████████▍ | 8468/10000 
[13:20:48<2:20:51, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.04512793943285942, 'learning_rate': 2.4118232254359986e-06, 'epoch': 8.47} 85%|████████▍ | 8468/10000 [13:20:48<2:20:51, 5.52s/it][2025-06-20 02:50:32,860] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:50:32,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.69 | bwd_microstep: 3336.15 | bwd_inner_microstep: 3335.31 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.79 [2025-06-20 02:50:32,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.69 | bwd: 3336.17 | bwd_inner: 3335.31 | bwd_allreduce: 0.82 | step: 6.79 85%|████████▍ | 8469/10000 [13:20:53<2:20:33, 5.51s/it] {'loss': 0.0002, 'grad_norm': 0.0626564547419548, 'learning_rate': 2.408740416816866e-06, 'epoch': 8.47} 85%|████████▍ | 8469/10000 [13:20:53<2:20:33, 5.51s/it][2025-06-20 02:50:38,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 02:50:38,335] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.15 | bwd_microstep: 3328.75 | bwd_inner_microstep: 3327.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 02:50:38,336] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.15 | bwd: 3328.76 | bwd_inner: 3327.96 | bwd_allreduce: 0.75 | step: 6.58 85%|████████▍ | 8470/10000 [13:20:59<2:20:12, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00044695253018289804, 'learning_rate': 2.405659453439222e-06, 'epoch': 8.47} 85%|████████▍ | 8470/10000 [13:20:59<2:20:12, 5.50s/it][2025-06-20 02:50:43,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:50:43,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.76 | bwd_microstep: 3340.43 | 
bwd_inner_microstep: 3339.62 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.96 [2025-06-20 02:50:43,826] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.76 | bwd: 3340.45 | bwd_inner: 3339.62 | bwd_allreduce: 0.78 | step: 6.97 85%|████████▍ | 8471/10000 [13:21:04<2:20:03, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0018342130351811647, 'learning_rate': 2.4025803356262432e-06, 'epoch': 8.47} 85%|████████▍ | 8471/10000 [13:21:04<2:20:03, 5.50s/it][2025-06-20 02:50:49,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:50:49,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.08 | bwd_microstep: 3337.25 | bwd_inner_microstep: 3336.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 02:50:49,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.08 | bwd: 3337.26 | bwd_inner: 3336.46 | bwd_allreduce: 0.76 | step: 6.81 85%|████████▍ | 8472/10000 [13:21:10<2:19:57, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0044351015239953995, 'learning_rate': 2.399503063700914e-06, 'epoch': 8.47} 85%|████████▍ | 8472/10000 [13:21:10<2:19:57, 5.50s/it][2025-06-20 02:50:54,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:50:54,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.45 | bwd_microstep: 3377.76 | bwd_inner_microstep: 3376.97 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 02:50:54,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.45 | bwd: 3377.77 | bwd_inner: 3376.97 | bwd_allreduce: 0.76 | step: 6.63 85%|████████▍ | 8473/10000 [13:21:15<2:20:16, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00020379226771183312, 'learning_rate': 2.3964276379860317e-06, 'epoch': 8.47} 85%|████████▍ | 8473/10000 [13:21:15<2:20:16, 5.51s/it][2025-06-20 02:51:00,331] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:51:00,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.70 | bwd_microstep: 3319.88 | bwd_inner_microstep: 3319.04 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.83 [2025-06-20 02:51:00,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.70 | bwd: 3319.91 | bwd_inner: 3319.04 | bwd_allreduce: 0.81 | step: 6.83 85%|████████▍ | 8474/10000 [13:21:21<2:19:48, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.03921934962272644, 'learning_rate': 2.393354058804187e-06, 'epoch': 8.47} 85%|████████▍ | 8474/10000 [13:21:21<2:19:48, 5.50s/it][2025-06-20 02:51:05,804] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:51:05,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.12 | bwd_microstep: 3320.82 | bwd_inner_microstep: 3319.82 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.18 [2025-06-20 02:51:05,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.12 | bwd: 3320.83 | bwd_inner: 3319.82 | bwd_allreduce: 0.96 | step: 7.18 85%|████████▍ | 8475/10000 [13:21:26<2:19:31, 5.49s/it] {'loss': 0.0, 'grad_norm': 3.835970346699469e-05, 'learning_rate': 2.390282326477784e-06, 'epoch': 8.47} 85%|████████▍ | 8475/10000 [13:21:26<2:19:31, 5.49s/it][2025-06-20 02:51:11,383] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:51:11,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.51 | bwd_microstep: 3391.78 | bwd_inner_microstep: 3390.96 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.82 [2025-06-20 02:51:11,384] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.51 | bwd: 3391.79 | bwd_inner: 3390.96 | bwd_allreduce: 0.79 
| step: 6.83 85%|████████▍ | 8476/10000 [13:21:32<2:20:06, 5.52s/it] {'loss': 0.0, 'grad_norm': 5.110273923492059e-05, 'learning_rate': 2.3872124413290366e-06, 'epoch': 8.48} 85%|████████▍ | 8476/10000 [13:21:32<2:20:06, 5.52s/it][2025-06-20 02:51:16,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:51:16,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.36 | bwd_microstep: 3370.45 | bwd_inner_microstep: 3369.49 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.07 [2025-06-20 02:51:16,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.36 | bwd: 3370.46 | bwd_inner: 3369.49 | bwd_allreduce: 0.93 | step: 7.08 85%|████████▍ | 8477/10000 [13:21:37<2:20:15, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.001617438392713666, 'learning_rate': 2.38414440367996e-06, 'epoch': 8.48} 85%|████████▍ | 8477/10000 [13:21:37<2:20:15, 5.53s/it][2025-06-20 02:51:22,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-20 02:51:22,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.40 | bwd_microstep: 3320.42 | bwd_inner_microstep: 3319.64 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.51 [2025-06-20 02:51:22,403] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.40 | bwd: 3320.43 | bwd_inner: 3319.64 | bwd_allreduce: 0.75 | step: 6.53 85%|████████▍ | 8478/10000 [13:21:43<2:19:45, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0025234112981706858, 'learning_rate': 2.3810782138523837e-06, 'epoch': 8.48} 85%|████████▍ | 8478/10000 [13:21:43<2:19:45, 5.51s/it][2025-06-20 02:51:27,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:51:27,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.97 | 
bwd_microstep: 3324.90 | bwd_inner_microstep: 3324.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-20 02:51:27,877] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.97 | bwd: 3324.92 | bwd_inner: 3324.11 | bwd_allreduce: 0.76 | step: 6.75 85%|████████▍ | 8479/10000 [13:21:48<2:19:23, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0041482229717075825, 'learning_rate': 2.378013872167926e-06, 'epoch': 8.48} 85%|████████▍ | 8479/10000 [13:21:48<2:19:23, 5.50s/it][2025-06-20 02:51:33,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:51:33,366] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.39 | bwd_microstep: 3339.95 | bwd_inner_microstep: 3339.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-20 02:51:33,367] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.39 | bwd: 3339.96 | bwd_inner: 3339.14 | bwd_allreduce: 0.78 | step: 6.90 85%|████████▍ | 8480/10000 [13:21:54<2:19:14, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00013645678700413555, 'learning_rate': 2.3749513789480274e-06, 'epoch': 8.48} 85%|████████▍ | 8480/10000 [13:21:54<2:19:14, 5.50s/it][2025-06-20 02:51:38,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:51:38,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.45 | bwd_microstep: 3325.22 | bwd_inner_microstep: 3324.29 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.10 [2025-06-20 02:51:38,850] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.45 | bwd: 3325.23 | bwd_inner: 3324.29 | bwd_allreduce: 0.89 | step: 7.10 85%|████████▍ | 8481/10000 [13:21:59<2:19:02, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002961854450404644, 'learning_rate': 2.3718907345139264e-06, 'epoch': 8.48} 85%|████████▍ | 8481/10000 [13:21:59<2:19:02, 
5.49s/it][2025-06-20 02:51:44,311] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:51:44,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.14 | bwd_microstep: 3317.29 | bwd_inner_microstep: 3316.36 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.04 [2025-06-20 02:51:44,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.14 | bwd: 3317.31 | bwd_inner: 3316.36 | bwd_allreduce: 0.90 | step: 7.04 85%|████████▍ | 8482/10000 [13:22:05<2:18:43, 5.48s/it] {'loss': 0.005, 'grad_norm': 2.0231173038482666, 'learning_rate': 2.3688319391866755e-06, 'epoch': 8.48} 85%|████████▍ | 8482/10000 [13:22:05<2:18:43, 5.48s/it][2025-06-20 02:51:49,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:51:49,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.96 | bwd_microstep: 3312.08 | bwd_inner_microstep: 3311.31 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 02:51:49,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.96 | bwd: 3312.10 | bwd_inner: 3311.31 | bwd_allreduce: 0.75 | step: 6.60 85%|████████▍ | 8483/10000 [13:22:10<2:18:25, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0016765767941251397, 'learning_rate': 2.3657749932871265e-06, 'epoch': 8.48} 85%|████████▍ | 8483/10000 [13:22:10<2:18:25, 5.48s/it][2025-06-20 02:51:55,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:51:55,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.74 | bwd_microstep: 3315.84 | bwd_inner_microstep: 3314.98 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.11 [2025-06-20 02:51:55,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.74 | bwd: 3315.86 | 
bwd_inner: 3314.98 | bwd_allreduce: 0.82 | step: 7.11 85%|████████▍ | 8484/10000 [13:22:16<2:18:15, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.035188090056180954, 'learning_rate': 2.362719897135937e-06, 'epoch': 8.48} 85%|████████▍ | 8484/10000 [13:22:16<2:18:15, 5.47s/it][2025-06-20 02:52:00,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:52:00,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.09 | bwd_microstep: 3322.86 | bwd_inner_microstep: 3322.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.97 [2025-06-20 02:52:00,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.09 | bwd: 3322.88 | bwd_inner: 3322.06 | bwd_allreduce: 0.77 | step: 6.97 85%|████████▍ | 8485/10000 [13:22:21<2:18:09, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0008608823991380632, 'learning_rate': 2.3596666510535713e-06, 'epoch': 8.48} 85%|████████▍ | 8485/10000 [13:22:21<2:18:09, 5.47s/it][2025-06-20 02:52:06,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:52:06,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.68 | bwd_microstep: 3371.99 | bwd_inner_microstep: 3371.04 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.02 [2025-06-20 02:52:06,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.68 | bwd: 3372.01 | bwd_inner: 3371.04 | bwd_allreduce: 0.92 | step: 7.02 85%|████████▍ | 8486/10000 [13:22:27<2:18:37, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.058915138244628906, 'learning_rate': 2.3566152553603015e-06, 'epoch': 8.49} 85%|████████▍ | 8486/10000 [13:22:27<2:18:37, 5.49s/it][2025-06-20 02:52:11,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:52:11,719] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2111.44 | bwd_microstep: 3318.76 | bwd_inner_microstep: 3317.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 02:52:11,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.44 | bwd: 3318.77 | bwd_inner: 3317.96 | bwd_allreduce: 0.77 | step: 6.82 85%|████████▍ | 8487/10000 [13:22:32<2:18:21, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006333086173981428, 'learning_rate': 2.353565710376209e-06, 'epoch': 8.49} 85%|████████▍ | 8487/10000 [13:22:32<2:18:21, 5.49s/it][2025-06-20 02:52:17,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:52:17,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.18 | bwd_microstep: 3317.77 | bwd_inner_microstep: 3316.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-20 02:52:17,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.18 | bwd: 3317.78 | bwd_inner: 3316.98 | bwd_allreduce: 0.76 | step: 6.88 85%|████████▍ | 8488/10000 [13:22:37<2:18:03, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001397406100295484, 'learning_rate': 2.3505180164211726e-06, 'epoch': 8.49} 85%|████████▍ | 8488/10000 [13:22:37<2:18:03, 5.48s/it][2025-06-20 02:52:22,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:52:22,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.20 | bwd_microstep: 3322.45 | bwd_inner_microstep: 3321.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 02:52:22,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.20 | bwd: 3322.46 | bwd_inner: 3321.66 | bwd_allreduce: 0.76 | step: 6.71 85%|████████▍ | 8489/10000 [13:22:43<2:17:52, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0007221806445159018, 'learning_rate': 2.347472173814882e-06, 'epoch': 8.49} 85%|████████▍ | 
8489/10000 [13:22:43<2:17:52, 5.48s/it][2025-06-20 02:52:28,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:52:28,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.61 | bwd_microstep: 3366.68 | bwd_inner_microstep: 3365.83 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.77 [2025-06-20 02:52:28,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.61 | bwd: 3366.69 | bwd_inner: 3365.83 | bwd_allreduce: 0.82 | step: 6.77 85%|████████▍ | 8490/10000 [13:22:48<2:18:16, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0007195287034846842, 'learning_rate': 2.3444281828768323e-06, 'epoch': 8.49} 85%|████████▍ | 8490/10000 [13:22:48<2:18:16, 5.49s/it][2025-06-20 02:52:33,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:52:33,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.42 | bwd_microstep: 3366.31 | bwd_inner_microstep: 3365.54 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 02:52:33,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.42 | bwd: 3366.33 | bwd_inner: 3365.54 | bwd_allreduce: 0.75 | step: 6.58 85%|████████▍ | 8491/10000 [13:22:54<2:18:34, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00959843210875988, 'learning_rate': 2.3413860439263237e-06, 'epoch': 8.49} 85%|████████▍ | 8491/10000 [13:22:54<2:18:34, 5.51s/it][2025-06-20 02:52:39,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:52:39,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.14 | bwd_microstep: 3366.43 | bwd_inner_microstep: 3365.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-20 02:52:39,259] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2123.14 | bwd: 3366.45 | bwd_inner: 3365.64 | bwd_allreduce: 0.77 | step: 7.08 85%|████████▍ | 8492/10000 [13:23:00<2:18:36, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00013529826537705958, 'learning_rate': 2.338345757282463e-06, 'epoch': 8.49} 85%|████████▍ | 8492/10000 [13:23:00<2:18:36, 5.52s/it][2025-06-20 02:52:44,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:52:44,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.34 | bwd_microstep: 3323.37 | bwd_inner_microstep: 3322.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 02:52:44,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.34 | bwd: 3323.39 | bwd_inner: 3322.59 | bwd_allreduce: 0.76 | step: 6.69 85%|████████▍ | 8493/10000 [13:23:05<2:18:08, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.292868971824646, 'learning_rate': 2.3353073232641666e-06, 'epoch': 8.49} 85%|████████▍ | 8493/10000 [13:23:05<2:18:08, 5.50s/it][2025-06-20 02:52:50,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:52:50,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.29 | bwd_microstep: 3377.66 | bwd_inner_microstep: 3376.59 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.05 [2025-06-20 02:52:50,272] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.29 | bwd: 3377.68 | bwd_inner: 3376.59 | bwd_allreduce: 1.03 | step: 7.05 85%|████████▍ | 8494/10000 [13:23:11<2:18:24, 5.51s/it] {'loss': 0.0, 'grad_norm': 3.212939918739721e-05, 'learning_rate': 2.332270742190146e-06, 'epoch': 8.49} 85%|████████▍ | 8494/10000 [13:23:11<2:18:24, 5.51s/it][2025-06-20 02:52:55,741] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:52:55,742] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.56 | bwd_microstep: 3317.83 | bwd_inner_microstep: 3316.77 | bwd_allreduce_microstep: 1.00 | step_microstep: 6.92 [2025-06-20 02:52:55,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.56 | bwd: 3317.85 | bwd_inner: 3316.77 | bwd_allreduce: 1.03 | step: 6.93 85%|████████▍ | 8495/10000 [13:23:16<2:17:59, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0009540567407384515, 'learning_rate': 2.3292360143789215e-06, 'epoch': 8.49} 85%|████████▍ | 8495/10000 [13:23:16<2:17:59, 5.50s/it][2025-06-20 02:53:01,275] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:53:01,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.15 | bwd_microstep: 3363.90 | bwd_inner_microstep: 3363.06 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.20 [2025-06-20 02:53:01,276] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.15 | bwd: 3363.91 | bwd_inner: 3363.06 | bwd_allreduce: 0.81 | step: 7.20 85%|████████▍ | 8496/10000 [13:23:22<2:18:08, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0008139854180626571, 'learning_rate': 2.3262031401488307e-06, 'epoch': 8.5} 85%|████████▍ | 8496/10000 [13:23:22<2:18:08, 5.51s/it][2025-06-20 02:53:06,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:53:06,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.77 | bwd_microstep: 3320.17 | bwd_inner_microstep: 3319.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 02:53:06,734] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.77 | bwd: 3320.18 | bwd_inner: 3319.39 | bwd_allreduce: 0.75 | step: 6.68 85%|████████▍ | 8497/10000 [13:23:27<2:17:38, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.021983638405799866, 'learning_rate': 2.3231721198180024e-06, 
'epoch': 8.5} 85%|████████▍ | 8497/10000 [13:23:27<2:17:38, 5.49s/it][2025-06-20 02:53:12,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:53:12,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.14 | bwd_microstep: 3315.98 | bwd_inner_microstep: 3315.19 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 02:53:12,194] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.14 | bwd: 3315.99 | bwd_inner: 3315.19 | bwd_allreduce: 0.76 | step: 6.70 85%|████████▍ | 8498/10000 [13:23:32<2:17:17, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0099782794713974, 'learning_rate': 2.3201429537043806e-06, 'epoch': 8.5} 85%|████████▍ | 8498/10000 [13:23:33<2:17:17, 5.48s/it][2025-06-20 02:53:17,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:53:17,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.99 | bwd_microstep: 3313.30 | bwd_inner_microstep: 3312.38 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.85 [2025-06-20 02:53:17,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.99 | bwd: 3313.31 | bwd_inner: 3312.38 | bwd_allreduce: 0.89 | step: 6.86 85%|████████▍ | 8499/10000 [13:23:38<2:17:51, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.002061669947579503, 'learning_rate': 2.3171156421257048e-06, 'epoch': 8.5} 85%|████████▍ | 8499/10000 [13:23:38<2:17:51, 5.51s/it][2025-06-20 02:53:23,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:53:23,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.04 | bwd_microstep: 3319.02 | bwd_inner_microstep: 3318.24 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 02:53:23,234] [INFO] [logging.py:128:log_dist] 
[Rank 0] time (ms) | fwd: 2109.04 | bwd: 3319.04 | bwd_inner: 3318.24 | bwd_allreduce: 0.75 | step: 6.65 85%|████████▌ | 8500/10000 [13:23:44<2:17:26, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.006086994893848896, 'learning_rate': 2.3140901853995313e-06, 'epoch': 8.5} 85%|████████▌ | 8500/10000 [13:23:44<2:17:26, 5.50s/it][2025-06-20 02:53:28,692] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:53:28,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.64 | bwd_microstep: 3316.09 | bwd_inner_microstep: 3315.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-20 02:53:28,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.64 | bwd: 3316.10 | bwd_inner: 3315.29 | bwd_allreduce: 0.77 | step: 6.90 85%|████████▌ | 8501/10000 [13:23:49<2:17:03, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0030278153717517853, 'learning_rate': 2.3110665838432113e-06, 'epoch': 8.5} 85%|████████▌ | 8501/10000 [13:23:49<2:17:03, 5.49s/it][2025-06-20 02:53:34,261] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:53:34,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.67 | bwd_microstep: 3396.76 | bwd_inner_microstep: 3395.96 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.00 [2025-06-20 02:53:34,262] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.67 | bwd: 3396.77 | bwd_inner: 3395.96 | bwd_allreduce: 0.76 | step: 7.00 85%|████████▌ | 8502/10000 [13:23:55<2:17:35, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.005876572337001562, 'learning_rate': 2.308044837773913e-06, 'epoch': 8.5} 85%|████████▌ | 8502/10000 [13:23:55<2:17:35, 5.51s/it][2025-06-20 02:53:39,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 
02:53:39,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.31 | bwd_microstep: 3379.42 | bwd_inner_microstep: 3378.57 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.61 [2025-06-20 02:53:39,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.31 | bwd: 3379.43 | bwd_inner: 3378.57 | bwd_allreduce: 0.82 | step: 6.61 85%|████████▌ | 8503/10000 [13:24:00<2:17:47, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0005404026596806943, 'learning_rate': 2.3050249475085983e-06, 'epoch': 8.5} 85%|████████▌ | 8503/10000 [13:24:00<2:17:47, 5.52s/it][2025-06-20 02:53:45,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:53:45,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.31 | bwd_microstep: 3371.92 | bwd_inner_microstep: 3371.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.73 [2025-06-20 02:53:45,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.31 | bwd: 3371.93 | bwd_inner: 3371.11 | bwd_allreduce: 0.78 | step: 6.74 85%|████████▌ | 8504/10000 [13:24:06<2:17:50, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0038320079911500216, 'learning_rate': 2.302006913364041e-06, 'epoch': 8.5} 85%|████████▌ | 8504/10000 [13:24:06<2:17:50, 5.53s/it][2025-06-20 02:53:50,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:53:50,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.04 | bwd_microstep: 3373.80 | bwd_inner_microstep: 3372.90 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.96 [2025-06-20 02:53:50,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.04 | bwd: 3373.82 | bwd_inner: 3372.90 | bwd_allreduce: 0.86 | step: 6.96 85%|████████▌ | 8505/10000 [13:24:11<2:17:51, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.013211165554821491, 'learning_rate': 
2.298990735656821e-06, 'epoch': 8.51} 85%|████████▌ | 8505/10000 [13:24:11<2:17:51, 5.53s/it][2025-06-20 02:53:56,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:53:56,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.76 | bwd_microstep: 3319.93 | bwd_inner_microstep: 3319.10 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.68 [2025-06-20 02:53:56,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.77 | bwd: 3319.94 | bwd_inner: 3319.10 | bwd_allreduce: 0.80 | step: 6.69 85%|████████▌ | 8506/10000 [13:24:17<2:17:13, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0031608056742697954, 'learning_rate': 2.29597641470332e-06, 'epoch': 8.51} 85%|████████▌ | 8506/10000 [13:24:17<2:17:13, 5.51s/it][2025-06-20 02:54:01,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:54:01,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.98 | bwd_microstep: 3376.82 | bwd_inner_microstep: 3375.88 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.10 [2025-06-20 02:54:01,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.98 | bwd: 3376.84 | bwd_inner: 3375.88 | bwd_allreduce: 0.91 | step: 7.10 85%|████████▌ | 8507/10000 [13:24:22<2:17:21, 5.52s/it] {'loss': 0.0007, 'grad_norm': 0.3129526376724243, 'learning_rate': 2.2929639508197265e-06, 'epoch': 8.51} 85%|████████▌ | 8507/10000 [13:24:22<2:17:21, 5.52s/it][2025-06-20 02:54:07,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:54:07,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.22 | bwd_microstep: 3317.95 | bwd_inner_microstep: 3317.13 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.32 [2025-06-20 02:54:07,364] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.22 | bwd: 3317.96 | bwd_inner: 3317.13 | bwd_allreduce: 0.79 | step: 7.32 85%|████████▌ | 8508/10000 [13:24:28<2:16:51, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.006051390897482634, 'learning_rate': 2.289953344322038e-06, 'epoch': 8.51} 85%|████████▌ | 8508/10000 [13:24:28<2:16:51, 5.50s/it][2025-06-20 02:54:12,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:54:12,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.05 | bwd_microstep: 3322.29 | bwd_inner_microstep: 3321.29 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.05 [2025-06-20 02:54:12,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.05 | bwd: 3322.30 | bwd_inner: 3321.29 | bwd_allreduce: 0.96 | step: 7.05 85%|████████▌ | 8509/10000 [13:24:33<2:16:30, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00032404600642621517, 'learning_rate': 2.286944595526044e-06, 'epoch': 8.51} 85%|████████▌ | 8509/10000 [13:24:33<2:16:30, 5.49s/it][2025-06-20 02:54:18,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:54:18,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.14 | bwd_microstep: 3368.14 | bwd_inner_microstep: 3367.33 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-20 02:54:18,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.14 | bwd: 3368.15 | bwd_inner: 3367.34 | bwd_allreduce: 0.77 | step: 7.25 85%|████████▌ | 8510/10000 [13:24:39<2:16:41, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0026116797234863043, 'learning_rate': 2.2839377047473522e-06, 'epoch': 8.51} 85%|████████▌ | 8510/10000 [13:24:39<2:16:41, 5.50s/it][2025-06-20 02:54:23,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.72 [2025-06-20 02:54:23,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.21 | bwd_microstep: 3365.82 | bwd_inner_microstep: 3364.85 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.07 [2025-06-20 02:54:23,890] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.21 | bwd: 3365.83 | bwd_inner: 3364.85 | bwd_allreduce: 0.94 | step: 7.07 85%|████████▌ | 8511/10000 [13:24:44<2:16:45, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.010853891260921955, 'learning_rate': 2.2809326723013726e-06, 'epoch': 8.51} 85%|████████▌ | 8511/10000 [13:24:44<2:16:45, 5.51s/it][2025-06-20 02:54:29,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:54:29,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.71 | bwd_microstep: 3371.25 | bwd_inner_microstep: 3370.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 02:54:29,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.71 | bwd: 3371.26 | bwd_inner: 3370.45 | bwd_allreduce: 0.77 | step: 6.79 85%|████████▌ | 8512/10000 [13:24:50<2:16:50, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.009202186018228531, 'learning_rate': 2.2779294985033196e-06, 'epoch': 8.51} 85%|████████▌ | 8512/10000 [13:24:50<2:16:50, 5.52s/it][2025-06-20 02:54:34,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:54:34,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.06 | bwd_microstep: 3326.17 | bwd_inner_microstep: 3325.37 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.82 [2025-06-20 02:54:34,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.06 | bwd: 3326.18 | bwd_inner: 3325.37 | bwd_allreduce: 0.77 | step: 6.82 85%|████████▌ | 8513/10000 [13:24:55<2:16:24, 5.50s/it] {'loss': 0.0, 'grad_norm': 
0.0012803521240130067, 'learning_rate': 2.274928183668215e-06, 'epoch': 8.51} 85%|████████▌ | 8513/10000 [13:24:55<2:16:24, 5.50s/it][2025-06-20 02:54:40,345] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:54:40,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.86 | bwd_microstep: 3316.63 | bwd_inner_microstep: 3315.81 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.00 [2025-06-20 02:54:40,346] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.86 | bwd: 3316.65 | bwd_inner: 3315.81 | bwd_allreduce: 0.79 | step: 7.00 85%|████████▌ | 8514/10000 [13:25:01<2:15:56, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00028249999741092324, 'learning_rate': 2.271928728110877e-06, 'epoch': 8.51} 85%|████████▌ | 8514/10000 [13:25:01<2:15:56, 5.49s/it][2025-06-20 02:54:45,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:54:45,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2095.62 | bwd_microstep: 3303.52 | bwd_inner_microstep: 3302.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.10 [2025-06-20 02:54:45,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2095.62 | bwd: 3303.53 | bwd_inner: 3302.73 | bwd_allreduce: 0.76 | step: 7.11 85%|████████▌ | 8515/10000 [13:25:06<2:15:28, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0046617924235761166, 'learning_rate': 2.268931132145935e-06, 'epoch': 8.52} 85%|████████▌ | 8515/10000 [13:25:06<2:15:28, 5.47s/it][2025-06-20 02:54:51,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 02:54:51,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.04 | bwd_microstep: 3314.56 | bwd_inner_microstep: 3313.37 | bwd_allreduce_microstep: 1.09 | 
step_microstep: 8.11 [2025-06-20 02:54:51,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.05 | bwd: 3314.58 | bwd_inner: 3313.37 | bwd_allreduce: 1.12 | step: 8.11 85%|████████▌ | 8516/10000 [13:25:12<2:15:21, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.002305988920852542, 'learning_rate': 2.265935396087826e-06, 'epoch': 8.52} 85%|████████▌ | 8516/10000 [13:25:12<2:15:21, 5.47s/it][2025-06-20 02:54:56,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:54:56,729] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.99 | bwd_microstep: 3319.79 | bwd_inner_microstep: 3318.97 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.68 [2025-06-20 02:54:56,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.99 | bwd: 3319.80 | bwd_inner: 3318.97 | bwd_allreduce: 0.79 | step: 6.68 85%|████████▌ | 8517/10000 [13:25:17<2:15:15, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.01564602367579937, 'learning_rate': 2.2629415202507876e-06, 'epoch': 8.52} 85%|████████▌ | 8517/10000 [13:25:17<2:15:15, 5.47s/it][2025-06-20 02:55:02,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:55:02,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.73 | bwd_microstep: 3367.78 | bwd_inner_microstep: 3366.90 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.96 [2025-06-20 02:55:02,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.73 | bwd: 3367.80 | bwd_inner: 3366.90 | bwd_allreduce: 0.85 | step: 6.97 85%|████████▌ | 8518/10000 [13:25:23<2:15:37, 5.49s/it] {'loss': 0.0, 'grad_norm': 2.8745302188326605e-05, 'learning_rate': 2.2599495049488617e-06, 'epoch': 8.52} 85%|████████▌ | 8518/10000 [13:25:23<2:15:37, 5.49s/it][2025-06-20 02:55:07,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:55:07,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.42 | bwd_microstep: 3366.70 | bwd_inner_microstep: 3365.87 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.93 [2025-06-20 02:55:07,794] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.42 | bwd: 3366.72 | bwd_inner: 3365.87 | bwd_allreduce: 0.81 | step: 6.93 85%|████████▌ | 8519/10000 [13:25:28<2:15:49, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00013018616300541908, 'learning_rate': 2.256959350495904e-06, 'epoch': 8.52} 85%|████████▌ | 8519/10000 [13:25:28<2:15:49, 5.50s/it][2025-06-20 02:55:13,316] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:55:13,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.86 | bwd_microstep: 3362.72 | bwd_inner_microstep: 3361.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.89 [2025-06-20 02:55:13,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.86 | bwd: 3362.74 | bwd_inner: 3361.92 | bwd_allreduce: 0.78 | step: 6.89 85%|████████▌ | 8520/10000 [13:25:34<2:15:53, 5.51s/it] {'loss': 0.0, 'grad_norm': 7.630729669472203e-05, 'learning_rate': 2.2539710572055595e-06, 'epoch': 8.52} 85%|████████▌ | 8520/10000 [13:25:34<2:15:53, 5.51s/it][2025-06-20 02:55:18,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:55:18,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.57 | bwd_microstep: 3318.12 | bwd_inner_microstep: 3317.31 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.84 [2025-06-20 02:55:18,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.57 | bwd: 3318.13 | bwd_inner: 3317.31 | bwd_allreduce: 0.78 | step: 6.84 85%|████████▌ | 8521/10000 
[13:25:39<2:15:27, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.05982954055070877, 'learning_rate': 2.2509846253912883e-06, 'epoch': 8.52} 85%|████████▌ | 8521/10000 [13:25:39<2:15:27, 5.50s/it][2025-06-20 02:55:24,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:55:24,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.53 | bwd_microstep: 3359.95 | bwd_inner_microstep: 3359.12 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.89 [2025-06-20 02:55:24,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.53 | bwd: 3359.97 | bwd_inner: 3359.12 | bwd_allreduce: 0.80 | step: 6.89 85%|████████▌ | 8522/10000 [13:25:45<2:15:36, 5.51s/it] {'loss': 0.0119, 'grad_norm': 1.4748715162277222, 'learning_rate': 2.248000055366353e-06, 'epoch': 8.52} 85%|████████▌ | 8522/10000 [13:25:45<2:15:36, 5.51s/it][2025-06-20 02:55:29,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:55:29,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.09 | bwd_microstep: 3366.06 | bwd_inner_microstep: 3365.15 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.30 [2025-06-20 02:55:29,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.09 | bwd: 3366.07 | bwd_inner: 3365.15 | bwd_allreduce: 0.88 | step: 7.31 85%|████████▌ | 8523/10000 [13:25:50<2:15:46, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0007065460085868835, 'learning_rate': 2.245017347443823e-06, 'epoch': 8.52} 85%|████████▌ | 8523/10000 [13:25:50<2:15:46, 5.52s/it][2025-06-20 02:55:35,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:55:35,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.95 | bwd_microstep: 3308.48 | 
bwd_inner_microstep: 3307.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 02:55:35,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.96 | bwd: 3308.49 | bwd_inner: 3307.69 | bwd_allreduce: 0.75 | step: 6.60 85%|████████▌ | 8524/10000 [13:25:56<2:15:15, 5.50s/it] {'loss': 0.0008, 'grad_norm': 0.31124937534332275, 'learning_rate': 2.2420365019365707e-06, 'epoch': 8.52} 85%|████████▌ | 8524/10000 [13:25:56<2:15:15, 5.50s/it][2025-06-20 02:55:40,769] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-20 02:55:40,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.96 | bwd_microstep: 3317.99 | bwd_inner_microstep: 3316.95 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.77 [2025-06-20 02:55:40,770] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.96 | bwd: 3318.00 | bwd_inner: 3316.95 | bwd_allreduce: 1.00 | step: 7.77 85%|████████▌ | 8525/10000 [13:26:01<2:14:55, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.029632441699504852, 'learning_rate': 2.2390575191572703e-06, 'epoch': 8.53} 85%|████████▌ | 8525/10000 [13:26:01<2:14:55, 5.49s/it][2025-06-20 02:55:46,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.69 | optimizer_step: 2.73 [2025-06-20 02:55:46,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.43 | bwd_microstep: 3322.96 | bwd_inner_microstep: 3321.57 | bwd_allreduce_microstep: 1.30 | step_microstep: 8.08 [2025-06-20 02:55:46,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.43 | bwd: 3322.98 | bwd_inner: 3321.57 | bwd_allreduce: 1.34 | step: 8.09 85%|████████▌ | 8526/10000 [13:26:07<2:14:45, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00030441299895755947, 'learning_rate': 2.236080399418408e-06, 'epoch': 8.53} 85%|████████▌ | 8526/10000 [13:26:07<2:14:45, 5.49s/it][2025-06-20 02:55:51,779] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:55:51,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.79 | bwd_microstep: 3354.27 | bwd_inner_microstep: 3353.37 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.94 [2025-06-20 02:55:51,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.79 | bwd: 3354.29 | bwd_inner: 3353.37 | bwd_allreduce: 0.87 | step: 6.95 85%|████████▌ | 8527/10000 [13:26:12<2:14:59, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00024467502953484654, 'learning_rate': 2.2331051430322635e-06, 'epoch': 8.53} 85%|████████▌ | 8527/10000 [13:26:12<2:14:59, 5.50s/it][2025-06-20 02:55:57,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:55:57,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.89 | bwd_microstep: 3316.89 | bwd_inner_microstep: 3316.03 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.38 [2025-06-20 02:55:57,245] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.89 | bwd: 3316.91 | bwd_inner: 3316.03 | bwd_allreduce: 0.83 | step: 7.38 85%|████████▌ | 8528/10000 [13:26:18<2:14:39, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.001036647241562605, 'learning_rate': 2.2301317503109333e-06, 'epoch': 8.53} 85%|████████▌ | 8528/10000 [13:26:18<2:14:39, 5.49s/it][2025-06-20 02:56:02,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:56:02,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.32 | bwd_microstep: 3311.89 | bwd_inner_microstep: 3310.94 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.10 [2025-06-20 02:56:02,705] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.32 | bwd: 3311.90 | bwd_inner: 3310.94 | bwd_allreduce: 0.92 
| step: 7.11 85%|████████▌ | 8529/10000 [13:26:23<2:14:21, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0026517705991864204, 'learning_rate': 2.2271602215663136e-06, 'epoch': 8.53} 85%|████████▌ | 8529/10000 [13:26:23<2:14:21, 5.48s/it][2025-06-20 02:56:08,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:56:08,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.37 | bwd_microstep: 3315.44 | bwd_inner_microstep: 3314.60 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.85 [2025-06-20 02:56:08,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.37 | bwd: 3315.46 | bwd_inner: 3314.60 | bwd_allreduce: 0.80 | step: 6.85 85%|████████▌ | 8530/10000 [13:26:28<2:14:05, 5.47s/it] {'loss': 0.0015, 'grad_norm': 0.6443471312522888, 'learning_rate': 2.2241905571100954e-06, 'epoch': 8.53} 85%|████████▌ | 8530/10000 [13:26:28<2:14:05, 5.47s/it][2025-06-20 02:56:13,744] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:56:13,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.49 | bwd_microstep: 3400.09 | bwd_inner_microstep: 3399.22 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.28 [2025-06-20 02:56:13,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.49 | bwd: 3400.11 | bwd_inner: 3399.22 | bwd_allreduce: 0.85 | step: 7.28 85%|████████▌ | 8531/10000 [13:26:34<2:14:48, 5.51s/it] {'loss': 0.0, 'grad_norm': 7.688035111641511e-05, 'learning_rate': 2.221222757253787e-06, 'epoch': 8.53} 85%|████████▌ | 8531/10000 [13:26:34<2:14:48, 5.51s/it][2025-06-20 02:56:19,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:56:19,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.79 | 
bwd_microstep: 3373.71 | bwd_inner_microstep: 3372.88 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.77 [2025-06-20 02:56:19,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.79 | bwd: 3373.73 | bwd_inner: 3372.88 | bwd_allreduce: 0.80 | step: 6.77 85%|████████▌ | 8532/10000 [13:26:40<2:14:55, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0013054270530119538, 'learning_rate': 2.2182568223086998e-06, 'epoch': 8.53} 85%|████████▌ | 8532/10000 [13:26:40<2:14:55, 5.51s/it][2025-06-20 02:56:24,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:56:24,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.41 | bwd_microstep: 3364.26 | bwd_inner_microstep: 3363.48 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 02:56:24,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.41 | bwd: 3364.28 | bwd_inner: 3363.48 | bwd_allreduce: 0.75 | step: 6.67 85%|████████▌ | 8533/10000 [13:26:45<2:15:01, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0042078508995473385, 'learning_rate': 2.2152927525859424e-06, 'epoch': 8.53} 85%|████████▌ | 8533/10000 [13:26:45<2:15:01, 5.52s/it][2025-06-20 02:56:30,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:56:30,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.20 | bwd_microstep: 3360.06 | bwd_inner_microstep: 3359.27 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-20 02:56:30,348] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.20 | bwd: 3360.07 | bwd_inner: 3359.27 | bwd_allreduce: 0.76 | step: 6.74 85%|████████▌ | 8534/10000 [13:26:51<2:14:58, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.005662768613547087, 'learning_rate': 2.2123305483964396e-06, 'epoch': 8.53} 85%|████████▌ | 8534/10000 [13:26:51<2:14:58, 
5.52s/it][2025-06-20 02:56:35,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:56:35,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.28 | bwd_microstep: 3325.37 | bwd_inner_microstep: 3324.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 02:56:35,830] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.28 | bwd: 3325.38 | bwd_inner: 3324.57 | bwd_allreduce: 0.76 | step: 6.70 85%|████████▌ | 8535/10000 [13:26:56<2:14:33, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.004184652119874954, 'learning_rate': 2.209370210050901e-06, 'epoch': 8.54} 85%|████████▌ | 8535/10000 [13:26:56<2:14:33, 5.51s/it][2025-06-20 02:56:41,309] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:56:41,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.03 | bwd_microstep: 3324.52 | bwd_inner_microstep: 3323.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 02:56:41,310] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.03 | bwd: 3324.53 | bwd_inner: 3323.74 | bwd_allreduce: 0.75 | step: 6.61 85%|████████▌ | 8536/10000 [13:27:02<2:14:14, 5.50s/it] {'loss': 0.0, 'grad_norm': 3.611196734709665e-05, 'learning_rate': 2.2064117378598594e-06, 'epoch': 8.54} 85%|████████▌ | 8536/10000 [13:27:02<2:14:14, 5.50s/it][2025-06-20 02:56:46,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 02:56:46,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.33 | bwd_microstep: 3310.13 | bwd_inner_microstep: 3309.21 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.38 [2025-06-20 02:56:46,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.33 | bwd: 3310.15 | 
bwd_inner: 3309.21 | bwd_allreduce: 0.88 | step: 7.39 85%|████████▌ | 8537/10000 [13:27:07<2:13:45, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0009447264019399881, 'learning_rate': 2.203455132133645e-06, 'epoch': 8.54} 85%|████████▌ | 8537/10000 [13:27:07<2:13:45, 5.49s/it][2025-06-20 02:56:52,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:56:52,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.03 | bwd_microstep: 3366.00 | bwd_inner_microstep: 3365.12 | bwd_allreduce_microstep: 0.83 | step_microstep: 6.93 [2025-06-20 02:56:52,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.03 | bwd: 3366.01 | bwd_inner: 3365.12 | bwd_allreduce: 0.84 | step: 6.93 85%|████████▌ | 8538/10000 [13:27:13<2:13:56, 5.50s/it] {'loss': 0.0, 'grad_norm': 7.728767377557233e-05, 'learning_rate': 2.20050039318239e-06, 'epoch': 8.54} 85%|████████▌ | 8538/10000 [13:27:13<2:13:56, 5.50s/it][2025-06-20 02:56:57,753] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 02:56:57,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.32 | bwd_microstep: 3318.85 | bwd_inner_microstep: 3318.01 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.00 [2025-06-20 02:56:57,754] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.32 | bwd: 3318.87 | bwd_inner: 3318.01 | bwd_allreduce: 0.81 | step: 7.01 85%|████████▌ | 8539/10000 [13:27:18<2:13:40, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00012376622180454433, 'learning_rate': 2.197547521316035e-06, 'epoch': 8.54} 85%|████████▌ | 8539/10000 [13:27:18<2:13:40, 5.49s/it][2025-06-20 02:57:03,209] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:57:03,210] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2100.82 | bwd_microstep: 3313.64 | bwd_inner_microstep: 3312.73 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.96 [2025-06-20 02:57:03,210] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.82 | bwd: 3313.65 | bwd_inner: 3312.73 | bwd_allreduce: 0.88 | step: 6.97 85%|████████▌ | 8540/10000 [13:27:24<2:13:20, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00014914125495124608, 'learning_rate': 2.1945965168443184e-06, 'epoch': 8.54} 85%|████████▌ | 8540/10000 [13:27:24<2:13:20, 5.48s/it][2025-06-20 02:57:08,669] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:57:08,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.40 | bwd_microstep: 3314.25 | bwd_inner_microstep: 3313.41 | bwd_allreduce_microstep: 0.80 | step_microstep: 6.74 [2025-06-20 02:57:08,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.40 | bwd: 3314.27 | bwd_inner: 3313.41 | bwd_allreduce: 0.82 | step: 6.74 85%|████████▌ | 8541/10000 [13:27:29<2:13:06, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0001739875879138708, 'learning_rate': 2.1916473800767914e-06, 'epoch': 8.54} 85%|████████▌ | 8541/10000 [13:27:29<2:13:06, 5.47s/it][2025-06-20 02:57:14,140] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:57:14,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.88 | bwd_microstep: 3322.50 | bwd_inner_microstep: 3321.63 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.37 [2025-06-20 02:57:14,141] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.89 | bwd: 3322.52 | bwd_inner: 3321.63 | bwd_allreduce: 0.83 | step: 7.37 85%|████████▌ | 8542/10000 [13:27:34<2:13:00, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00021799522801302373, 'learning_rate': 2.1887001113228034e-06, 'epoch': 8.54} 85%|████████▌ | 
8542/10000 [13:27:34<2:13:00, 5.47s/it][2025-06-20 02:57:19,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:57:19,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.97 | bwd_microstep: 3364.06 | bwd_inner_microstep: 3363.23 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.78 [2025-06-20 02:57:19,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.97 | bwd: 3364.08 | bwd_inner: 3363.23 | bwd_allreduce: 0.80 | step: 6.79 85%|████████▌ | 8543/10000 [13:27:40<2:13:20, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.0402764268219471, 'learning_rate': 2.18575471089151e-06, 'epoch': 8.54} 85%|████████▌ | 8543/10000 [13:27:40<2:13:20, 5.49s/it][2025-06-20 02:57:25,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 02:57:25,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.02 | bwd_microstep: 3326.78 | bwd_inner_microstep: 3325.84 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.12 [2025-06-20 02:57:25,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.02 | bwd: 3326.79 | bwd_inner: 3325.84 | bwd_allreduce: 0.91 | step: 7.12 85%|████████▌ | 8544/10000 [13:27:45<2:13:05, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.014666462317109108, 'learning_rate': 2.1828111790918706e-06, 'epoch': 8.54} 85%|████████▌ | 8544/10000 [13:27:45<2:13:05, 5.48s/it][2025-06-20 02:57:30,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 02:57:30,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.27 | bwd_microstep: 3365.51 | bwd_inner_microstep: 3364.51 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.37 [2025-06-20 02:57:30,685] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2135.27 | bwd: 3365.52 | bwd_inner: 3364.51 | bwd_allreduce: 0.97 | step: 7.37 85%|████████▌ | 8545/10000 [13:27:51<2:13:25, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.1159026026725769, 'learning_rate': 2.1798695162326444e-06, 'epoch': 8.54} 85%|████████▌ | 8545/10000 [13:27:51<2:13:25, 5.50s/it][2025-06-20 02:57:36,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:57:36,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.65 | bwd_microstep: 3317.60 | bwd_inner_microstep: 3316.76 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.23 [2025-06-20 02:57:36,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.65 | bwd: 3317.62 | bwd_inner: 3316.76 | bwd_allreduce: 0.81 | step: 7.23 85%|████████▌ | 8546/10000 [13:27:56<2:13:03, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00011688595986925066, 'learning_rate': 2.1769297226223984e-06, 'epoch': 8.55} 85%|████████▌ | 8546/10000 [13:27:56<2:13:03, 5.49s/it][2025-06-20 02:57:41,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:57:41,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.16 | bwd_microstep: 3370.44 | bwd_inner_microstep: 3369.36 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.43 [2025-06-20 02:57:41,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.16 | bwd: 3370.46 | bwd_inner: 3369.36 | bwd_allreduce: 1.04 | step: 7.43 85%|████████▌ | 8547/10000 [13:28:02<2:13:23, 5.51s/it] {'loss': 0.0004, 'grad_norm': 0.09113450348377228, 'learning_rate': 2.173991798569506e-06, 'epoch': 8.55} 85%|████████▌ | 8547/10000 [13:28:02<2:13:23, 5.51s/it][2025-06-20 02:57:47,159] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 02:57:47,160] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.32 | bwd_microstep: 3314.54 | bwd_inner_microstep: 3313.48 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.52 [2025-06-20 02:57:47,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.32 | bwd: 3314.55 | bwd_inner: 3313.48 | bwd_allreduce: 1.03 | step: 7.52 85%|████████▌ | 8548/10000 [13:28:07<2:12:58, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0010132957249879837, 'learning_rate': 2.17105574438214e-06, 'epoch': 8.55} 85%|████████▌ | 8548/10000 [13:28:07<2:12:58, 5.49s/it][2025-06-20 02:57:52,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 02:57:52,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.78 | bwd_microstep: 3377.62 | bwd_inner_microstep: 3376.73 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.35 [2025-06-20 02:57:52,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.78 | bwd: 3377.63 | bwd_inner: 3376.73 | bwd_allreduce: 0.86 | step: 7.35 85%|████████▌ | 8549/10000 [13:28:13<2:13:18, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.008276957087218761, 'learning_rate': 2.168121560368286e-06, 'epoch': 8.55} 85%|████████▌ | 8549/10000 [13:28:13<2:13:18, 5.51s/it][2025-06-20 02:57:58,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:57:58,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.56 | bwd_microstep: 3368.24 | bwd_inner_microstep: 3367.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 02:57:58,256] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.56 | bwd: 3368.25 | bwd_inner: 3367.44 | bwd_allreduce: 0.76 | step: 6.65 86%|████████▌ | 8550/10000 [13:28:19<2:13:24, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0029768971726298332, 'learning_rate': 2.165189246835715e-06, 
'epoch': 8.55} 86%|████████▌ | 8550/10000 [13:28:19<2:13:24, 5.52s/it][2025-06-20 02:58:03,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 02:58:03,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2145.57 | bwd_microstep: 3395.40 | bwd_inner_microstep: 3394.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 02:58:03,834] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2145.57 | bwd: 3395.41 | bwd_inner: 3394.61 | bwd_allreduce: 0.76 | step: 6.64 86%|████████▌ | 8551/10000 [13:28:24<2:13:43, 5.54s/it] {'loss': 0.0, 'grad_norm': 9.05442502698861e-05, 'learning_rate': 2.1622588040920167e-06, 'epoch': 8.55} 86%|████████▌ | 8551/10000 [13:28:24<2:13:43, 5.54s/it][2025-06-20 02:58:09,295] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 02:58:09,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.58 | bwd_microstep: 3321.33 | bwd_inner_microstep: 3320.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.30 [2025-06-20 02:58:09,296] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.58 | bwd: 3321.35 | bwd_inner: 3320.52 | bwd_allreduce: 0.78 | step: 7.30 86%|████████▌ | 8552/10000 [13:28:30<2:13:06, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.006927613168954849, 'learning_rate': 2.1593302324445832e-06, 'epoch': 8.55} 86%|████████▌ | 8552/10000 [13:28:30<2:13:06, 5.52s/it][2025-06-20 02:58:14,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:58:14,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.04 | bwd_microstep: 3335.84 | bwd_inner_microstep: 3335.04 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 02:58:14,779] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.04 | bwd: 3335.85 | bwd_inner: 3335.04 | bwd_allreduce: 0.76 | step: 6.79 86%|████████▌ | 8553/10000 [13:28:35<2:12:46, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0005213928525336087, 'learning_rate': 2.1564035322006082e-06, 'epoch': 8.55} 86%|████████▌ | 8553/10000 [13:28:35<2:12:46, 5.51s/it][2025-06-20 02:58:20,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:58:20,312] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.59 | bwd_microstep: 3369.04 | bwd_inner_microstep: 3368.24 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.88 [2025-06-20 02:58:20,313] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.59 | bwd: 3369.06 | bwd_inner: 3368.24 | bwd_allreduce: 0.77 | step: 6.88 86%|████████▌ | 8554/10000 [13:28:41<2:12:53, 5.51s/it] {'loss': 0.0, 'grad_norm': 1.820536817831453e-05, 'learning_rate': 2.153478703667089e-06, 'epoch': 8.55} 86%|████████▌ | 8554/10000 [13:28:41<2:12:53, 5.51s/it][2025-06-20 02:58:25,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:58:25,848] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.53 | bwd_microstep: 3367.47 | bwd_inner_microstep: 3366.67 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-20 02:58:25,849] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.53 | bwd: 3367.49 | bwd_inner: 3366.67 | bwd_allreduce: 0.78 | step: 7.01 86%|████████▌ | 8555/10000 [13:28:46<2:12:57, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00037532043643295765, 'learning_rate': 2.1505557471508243e-06, 'epoch': 8.55} 86%|████████▌ | 8555/10000 [13:28:46<2:12:57, 5.52s/it][2025-06-20 02:58:31,317] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.72 [2025-06-20 02:58:31,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.69 | bwd_microstep: 3327.08 | bwd_inner_microstep: 3326.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 02:58:31,318] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.69 | bwd: 3327.10 | bwd_inner: 3326.29 | bwd_allreduce: 0.76 | step: 6.70 86%|████████▌ | 8556/10000 [13:28:52<2:12:29, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.009457183070480824, 'learning_rate': 2.14763466295842e-06, 'epoch': 8.56} 86%|████████▌ | 8556/10000 [13:28:52<2:12:29, 5.50s/it][2025-06-20 02:58:36,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:58:36,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.20 | bwd_microstep: 3372.70 | bwd_inner_microstep: 3371.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-20 02:58:36,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.20 | bwd: 3372.71 | bwd_inner: 3371.89 | bwd_allreduce: 0.78 | step: 7.07 86%|████████▌ | 8557/10000 [13:28:57<2:12:36, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00560465082526207, 'learning_rate': 2.144715451396282e-06, 'epoch': 8.56} 86%|████████▌ | 8557/10000 [13:28:57<2:12:36, 5.51s/it][2025-06-20 02:58:42,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:58:42,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.07 | bwd_microstep: 3376.82 | bwd_inner_microstep: 3376.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.96 [2025-06-20 02:58:42,397] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.07 | bwd: 3376.84 | bwd_inner: 3376.03 | bwd_allreduce: 0.76 | step: 6.97 86%|████████▌ | 8558/10000 [13:29:03<2:12:44, 5.52s/it] {'loss': 0.0, 'grad_norm': 
0.00020793637668248266, 'learning_rate': 2.1417981127706254e-06, 'epoch': 8.56} 86%|████████▌ | 8558/10000 [13:29:03<2:12:44, 5.52s/it][2025-06-20 02:58:47,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 02:58:47,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.49 | bwd_microstep: 3316.27 | bwd_inner_microstep: 3315.32 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.07 [2025-06-20 02:58:47,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.49 | bwd: 3316.28 | bwd_inner: 3315.32 | bwd_allreduce: 0.92 | step: 7.07 86%|████████▌ | 8559/10000 [13:29:08<2:12:11, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.02544521540403366, 'learning_rate': 2.1388826473874657e-06, 'epoch': 8.56} 86%|████████▌ | 8559/10000 [13:29:08<2:12:11, 5.50s/it][2025-06-20 02:58:53,423] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:58:53,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.66 | bwd_microstep: 3384.95 | bwd_inner_microstep: 3384.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 02:58:53,424] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.66 | bwd: 3384.96 | bwd_inner: 3384.16 | bwd_allreduce: 0.76 | step: 6.70 86%|████████▌ | 8560/10000 [13:29:14<2:12:32, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0010395842837169766, 'learning_rate': 2.135969055552618e-06, 'epoch': 8.56} 86%|████████▌ | 8560/10000 [13:29:14<2:12:32, 5.52s/it][2025-06-20 02:58:58,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:58:58,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.08 | bwd_microstep: 3375.80 | bwd_inner_microstep: 3375.00 | bwd_allreduce_microstep: 0.76 | 
step_microstep: 6.85 [2025-06-20 02:58:58,980] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.08 | bwd: 3375.82 | bwd_inner: 3375.00 | bwd_allreduce: 0.78 | step: 6.86 86%|████████▌ | 8561/10000 [13:29:19<2:12:41, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0038434225134551525, 'learning_rate': 2.133057337571709e-06, 'epoch': 8.56} 86%|████████▌ | 8561/10000 [13:29:19<2:12:41, 5.53s/it][2025-06-20 02:59:04,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:59:04,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.56 | bwd_microstep: 3325.18 | bwd_inner_microstep: 3324.36 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.27 [2025-06-20 02:59:04,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.56 | bwd: 3325.19 | bwd_inner: 3324.36 | bwd_allreduce: 0.79 | step: 7.27 86%|████████▌ | 8562/10000 [13:29:25<2:12:10, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0003340171533636749, 'learning_rate': 2.130147493750161e-06, 'epoch': 8.56} 86%|████████▌ | 8562/10000 [13:29:25<2:12:10, 5.51s/it][2025-06-20 02:59:10,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:59:10,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.03 | bwd_microstep: 3398.69 | bwd_inner_microstep: 3397.89 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-20 02:59:10,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.03 | bwd: 3398.71 | bwd_inner: 3397.89 | bwd_allreduce: 0.77 | step: 6.77 86%|████████▌ | 8563/10000 [13:29:30<2:12:31, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0007723302114754915, 'learning_rate': 2.1272395243932032e-06, 'epoch': 8.56} 86%|████████▌ | 8563/10000 [13:29:30<2:12:31, 5.53s/it][2025-06-20 02:59:15,576] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:59:15,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.29 | bwd_microstep: 3375.31 | bwd_inner_microstep: 3374.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-20 02:59:15,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.29 | bwd: 3375.32 | bwd_inner: 3374.51 | bwd_allreduce: 0.77 | step: 6.92 86%|████████▌ | 8564/10000 [13:29:36<2:12:32, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0009468455100432038, 'learning_rate': 2.124333429805876e-06, 'epoch': 8.56} 86%|████████▌ | 8564/10000 [13:29:36<2:12:32, 5.54s/it][2025-06-20 02:59:21,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:59:21,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.55 | bwd_microstep: 3323.82 | bwd_inner_microstep: 3323.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 02:59:21,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.55 | bwd: 3323.83 | bwd_inner: 3323.02 | bwd_allreduce: 0.77 | step: 6.76 86%|████████▌ | 8565/10000 [13:29:41<2:11:58, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0007025630911812186, 'learning_rate': 2.1214292102930023e-06, 'epoch': 8.56} 86%|████████▌ | 8565/10000 [13:29:41<2:11:58, 5.52s/it][2025-06-20 02:59:26,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 02:59:26,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.38 | bwd_microstep: 3326.08 | bwd_inner_microstep: 3325.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-20 02:59:26,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.38 | bwd: 3326.09 | bwd_inner: 3325.28 | bwd_allreduce: 0.76 | step: 6.84 86%|████████▌ | 8566/10000 
[13:29:47<2:11:33, 5.50s/it] {'loss': 0.0, 'grad_norm': 1.2334333405306097e-05, 'learning_rate': 2.1185268661592273e-06, 'epoch': 8.57} 86%|████████▌ | 8566/10000 [13:29:47<2:11:33, 5.50s/it][2025-06-20 02:59:32,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 02:59:32,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.23 | bwd_microstep: 3367.48 | bwd_inner_microstep: 3366.52 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.70 [2025-06-20 02:59:32,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.23 | bwd: 3367.49 | bwd_inner: 3366.52 | bwd_allreduce: 0.92 | step: 7.71 86%|████████▌ | 8567/10000 [13:29:52<2:11:41, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00015427956532221287, 'learning_rate': 2.115626397708994e-06, 'epoch': 8.57} 86%|████████▌ | 8567/10000 [13:29:52<2:11:41, 5.51s/it][2025-06-20 02:59:37,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 02:59:37,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.60 | bwd_microstep: 3322.93 | bwd_inner_microstep: 3322.14 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-20 02:59:37,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.60 | bwd: 3322.95 | bwd_inner: 3322.14 | bwd_allreduce: 0.76 | step: 6.74 86%|████████▌ | 8568/10000 [13:29:58<2:11:20, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.023085935041308403, 'learning_rate': 2.1127278052465485e-06, 'epoch': 8.57} 86%|████████▌ | 8568/10000 [13:29:58<2:11:20, 5.50s/it][2025-06-20 02:59:43,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:59:43,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.08 | bwd_microstep: 3320.36 | 
bwd_inner_microstep: 3319.48 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.09 [2025-06-20 02:59:43,003] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.08 | bwd: 3320.37 | bwd_inner: 3319.48 | bwd_allreduce: 0.85 | step: 7.09 86%|████████▌ | 8569/10000 [13:30:03<2:10:59, 5.49s/it] {'loss': 0.0762, 'grad_norm': 16.622234344482422, 'learning_rate': 2.109831089075942e-06, 'epoch': 8.57} 86%|████████▌ | 8569/10000 [13:30:03<2:10:59, 5.49s/it][2025-06-20 02:59:48,546] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.90 [2025-06-20 02:59:48,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.70 | bwd_microstep: 3375.53 | bwd_inner_microstep: 3374.74 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.30 [2025-06-20 02:59:48,547] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.70 | bwd: 3375.54 | bwd_inner: 3374.74 | bwd_allreduce: 0.76 | step: 7.30 86%|████████▌ | 8570/10000 [13:30:09<2:11:16, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0002112326183123514, 'learning_rate': 2.1069362495010215e-06, 'epoch': 8.57} 86%|████████▌ | 8570/10000 [13:30:09<2:11:16, 5.51s/it][2025-06-20 02:59:54,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 02:59:54,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.07 | bwd_microstep: 3397.93 | bwd_inner_microstep: 3397.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-20 02:59:54,125] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.07 | bwd: 3397.94 | bwd_inner: 3397.13 | bwd_allreduce: 0.77 | step: 6.86 86%|████████▌ | 8571/10000 [13:30:14<2:11:40, 5.53s/it] {'loss': 0.0002, 'grad_norm': 0.043456416577100754, 'learning_rate': 2.1040432868254414e-06, 'epoch': 8.57} 86%|████████▌ | 8571/10000 [13:30:14<2:11:40, 5.53s/it][2025-06-20 02:59:59,678] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 02:59:59,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.27 | bwd_microstep: 3370.71 | bwd_inner_microstep: 3369.92 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-20 02:59:59,678] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.27 | bwd: 3370.72 | bwd_inner: 3369.92 | bwd_allreduce: 0.76 | step: 6.73 86%|████████▌ | 8572/10000 [13:30:20<2:11:45, 5.54s/it] {'loss': 0.0, 'grad_norm': 6.528302765218541e-05, 'learning_rate': 2.101152201352663e-06, 'epoch': 8.57} 86%|████████▌ | 8572/10000 [13:30:20<2:11:45, 5.54s/it][2025-06-20 03:00:05,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 03:00:05,217] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.62 | bwd_microstep: 3369.35 | bwd_inner_microstep: 3368.37 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.35 [2025-06-20 03:00:05,218] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.62 | bwd: 3369.36 | bwd_inner: 3368.37 | bwd_allreduce: 0.95 | step: 7.35 86%|████████▌ | 8573/10000 [13:30:26<2:11:41, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0011430385056883097, 'learning_rate': 2.0982629933859487e-06, 'epoch': 8.57} 86%|████████▌ | 8573/10000 [13:30:26<2:11:41, 5.54s/it][2025-06-20 03:00:10,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:00:10,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.11 | bwd_microstep: 3325.28 | bwd_inner_microstep: 3324.46 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.05 [2025-06-20 03:00:10,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.11 | bwd: 3325.29 | bwd_inner: 3324.46 | bwd_allreduce: 0.79 
| step: 7.05 86%|████████▌ | 8574/10000 [13:30:31<2:11:14, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.003011071588844061, 'learning_rate': 2.095375663228363e-06, 'epoch': 8.57} 86%|████████▌ | 8574/10000 [13:30:31<2:11:14, 5.52s/it][2025-06-20 03:00:16,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:00:16,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.11 | bwd_microstep: 3337.58 | bwd_inner_microstep: 3336.60 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.26 [2025-06-20 03:00:16,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.11 | bwd: 3337.59 | bwd_inner: 3336.60 | bwd_allreduce: 0.94 | step: 7.26 86%|████████▌ | 8575/10000 [13:30:37<2:10:59, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.003497035475447774, 'learning_rate': 2.0924902111827694e-06, 'epoch': 8.57} 86%|████████▌ | 8575/10000 [13:30:37<2:10:59, 5.52s/it][2025-06-20 03:00:21,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:00:21,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.07 | bwd_microstep: 3323.66 | bwd_inner_microstep: 3322.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.83 [2025-06-20 03:00:21,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.07 | bwd: 3323.68 | bwd_inner: 3322.87 | bwd_allreduce: 0.77 | step: 6.83 86%|████████▌ | 8576/10000 [13:30:42<2:10:34, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00297010806389153, 'learning_rate': 2.089606637551844e-06, 'epoch': 8.58} 86%|████████▌ | 8576/10000 [13:30:42<2:10:34, 5.50s/it][2025-06-20 03:00:27,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 03:00:27,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.06 | 
bwd_microstep: 3375.41 | bwd_inner_microstep: 3374.35 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.32 [2025-06-20 03:00:27,222] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.06 | bwd: 3375.43 | bwd_inner: 3374.35 | bwd_allreduce: 1.02 | step: 7.32 86%|████████▌ | 8577/10000 [13:30:48<2:10:49, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.007448812015354633, 'learning_rate': 2.0867249426380563e-06, 'epoch': 8.58} 86%|████████▌ | 8577/10000 [13:30:48<2:10:49, 5.52s/it][2025-06-20 03:00:32,764] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:00:32,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.95 | bwd_microstep: 3369.65 | bwd_inner_microstep: 3368.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 03:00:32,765] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.95 | bwd: 3369.66 | bwd_inner: 3368.86 | bwd_allreduce: 0.76 | step: 6.74 86%|████████▌ | 8578/10000 [13:30:53<2:10:54, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.002397222211584449, 'learning_rate': 2.083845126743684e-06, 'epoch': 8.58} 86%|████████▌ | 8578/10000 [13:30:53<2:10:54, 5.52s/it][2025-06-20 03:00:38,229] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:00:38,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.89 | bwd_microstep: 3315.51 | bwd_inner_microstep: 3314.68 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.84 [2025-06-20 03:00:38,230] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.90 | bwd: 3315.52 | bwd_inner: 3314.68 | bwd_allreduce: 0.79 | step: 6.85 86%|████████▌ | 8579/10000 [13:30:59<2:10:25, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.05444883182644844, 'learning_rate': 2.080967190170806e-06, 'epoch': 8.58} 86%|████████▌ | 8579/10000 [13:30:59<2:10:25, 
5.51s/it][2025-06-20 03:00:43,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 03:00:43,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.40 | bwd_microstep: 3370.72 | bwd_inner_microstep: 3369.77 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.58 [2025-06-20 03:00:43,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.40 | bwd: 3370.73 | bwd_inner: 3369.77 | bwd_allreduce: 0.92 | step: 7.59 86%|████████▌ | 8580/10000 [13:31:04<2:10:36, 5.52s/it] {'loss': 0.0, 'grad_norm': 2.9630497010657564e-05, 'learning_rate': 2.0780911332213094e-06, 'epoch': 8.58} 86%|████████▌ | 8580/10000 [13:31:04<2:10:36, 5.52s/it][2025-06-20 03:00:49,252] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:00:49,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.93 | bwd_microstep: 3322.45 | bwd_inner_microstep: 3321.66 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 03:00:49,253] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.93 | bwd: 3322.46 | bwd_inner: 3321.66 | bwd_allreduce: 0.75 | step: 6.68 86%|████████▌ | 8581/10000 [13:31:10<2:10:11, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00011503975110827014, 'learning_rate': 2.0752169561968705e-06, 'epoch': 8.58} 86%|████████▌ | 8581/10000 [13:31:10<2:10:11, 5.50s/it][2025-06-20 03:00:54,724] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:00:54,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.30 | bwd_microstep: 3325.81 | bwd_inner_microstep: 3325.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.04 [2025-06-20 03:00:54,725] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.30 | bwd: 3325.82 | 
bwd_inner: 3325.01 | bwd_allreduce: 0.77 | step: 7.05 86%|████████▌ | 8582/10000 [13:31:15<2:09:52, 5.50s/it] {'loss': 0.0032, 'grad_norm': 1.8276326656341553, 'learning_rate': 2.0723446593989817e-06, 'epoch': 8.58} 86%|████████▌ | 8582/10000 [13:31:15<2:09:52, 5.50s/it][2025-06-20 03:01:00,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:01:00,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.67 | bwd_microstep: 3317.03 | bwd_inner_microstep: 3316.23 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 03:01:00,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.67 | bwd: 3317.04 | bwd_inner: 3316.23 | bwd_allreduce: 0.77 | step: 6.82 86%|████████▌ | 8583/10000 [13:31:20<2:09:32, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.011388958431780338, 'learning_rate': 2.069474243128935e-06, 'epoch': 8.58} 86%|████████▌ | 8583/10000 [13:31:20<2:09:32, 5.49s/it][2025-06-20 03:01:05,663] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.71 | optimizer_step: 2.73 [2025-06-20 03:01:05,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.08 | bwd_microstep: 3325.14 | bwd_inner_microstep: 3324.34 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.31 [2025-06-20 03:01:05,664] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.08 | bwd: 3325.16 | bwd_inner: 3324.34 | bwd_allreduce: 0.78 | step: 7.32 86%|████████▌ | 8584/10000 [13:31:26<2:09:23, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.005186236929148436, 'learning_rate': 2.0666057076878186e-06, 'epoch': 8.58} 86%|████████▌ | 8584/10000 [13:31:26<2:09:23, 5.48s/it][2025-06-20 03:01:11,142] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:01:11,143] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2109.05 | bwd_microstep: 3331.86 | bwd_inner_microstep: 3331.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-20 03:01:11,143] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.05 | bwd: 3331.87 | bwd_inner: 3331.05 | bwd_allreduce: 0.78 | step: 7.13 86%|████████▌ | 8585/10000 [13:31:31<2:09:17, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0030142946634441614, 'learning_rate': 2.063739053376539e-06, 'epoch': 8.59} 86%|████████▌ | 8585/10000 [13:31:31<2:09:17, 5.48s/it][2025-06-20 03:01:16,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:01:16,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.98 | bwd_microstep: 3368.97 | bwd_inner_microstep: 3368.07 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.26 [2025-06-20 03:01:16,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.98 | bwd: 3368.98 | bwd_inner: 3368.07 | bwd_allreduce: 0.86 | step: 7.26 86%|████████▌ | 8586/10000 [13:31:37<2:09:42, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.05534292757511139, 'learning_rate': 2.0608742804957835e-06, 'epoch': 8.59} 86%|████████▌ | 8586/10000 [13:31:37<2:09:42, 5.50s/it][2025-06-20 03:01:22,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 03:01:22,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.26 | bwd_microstep: 3329.00 | bwd_inner_microstep: 3327.70 | bwd_allreduce_microstep: 1.24 | step_microstep: 8.23 [2025-06-20 03:01:22,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.26 | bwd: 3329.02 | bwd_inner: 3327.70 | bwd_allreduce: 1.26 | step: 8.25 86%|████████▌ | 8587/10000 [13:31:42<2:09:30, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0029997287783771753, 'learning_rate': 2.0580113893460576e-06, 'epoch': 8.59} 86%|████████▌ | 
8587/10000 [13:31:42<2:09:30, 5.50s/it][2025-06-20 03:01:27,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:01:27,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.65 | bwd_microstep: 3315.90 | bwd_inner_microstep: 3315.09 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.24 [2025-06-20 03:01:27,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.65 | bwd: 3315.91 | bwd_inner: 3315.09 | bwd_allreduce: 0.78 | step: 7.24 86%|████████▌ | 8588/10000 [13:31:48<2:09:12, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005225476808845997, 'learning_rate': 2.055150380227664e-06, 'epoch': 8.59} 86%|████████▌ | 8588/10000 [13:31:48<2:09:12, 5.49s/it][2025-06-20 03:01:33,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:01:33,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.89 | bwd_microstep: 3329.67 | bwd_inner_microstep: 3328.88 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 03:01:33,133] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.89 | bwd: 3329.69 | bwd_inner: 3328.88 | bwd_allreduce: 0.76 | step: 6.79 86%|████████▌ | 8589/10000 [13:31:53<2:09:00, 5.49s/it] {'loss': 0.0, 'grad_norm': 5.7982339058071375e-05, 'learning_rate': 2.0522912534407123e-06, 'epoch': 8.59} 86%|████████▌ | 8589/10000 [13:31:53<2:09:00, 5.49s/it][2025-06-20 03:01:38,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:01:38,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.56 | bwd_microstep: 3324.19 | bwd_inner_microstep: 3323.39 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 03:01:38,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2105.56 | bwd: 3324.20 | bwd_inner: 3323.39 | bwd_allreduce: 0.76 | step: 6.72 86%|████████▌ | 8590/10000 [13:31:59<2:08:48, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.010167851112782955, 'learning_rate': 2.049434009285114e-06, 'epoch': 8.59} 86%|████████▌ | 8590/10000 [13:31:59<2:08:48, 5.48s/it][2025-06-20 03:01:44,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:01:44,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.61 | bwd_microstep: 3370.60 | bwd_inner_microstep: 3369.79 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 03:01:44,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.61 | bwd: 3370.61 | bwd_inner: 3369.79 | bwd_allreduce: 0.78 | step: 6.82 86%|████████▌ | 8591/10000 [13:32:04<2:09:04, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.061999231576919556, 'learning_rate': 2.046578648060573e-06, 'epoch': 8.59} 86%|████████▌ | 8591/10000 [13:32:04<2:09:04, 5.50s/it][2025-06-20 03:01:49,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 03:01:49,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.37 | bwd_microstep: 3404.62 | bwd_inner_microstep: 3403.80 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.27 [2025-06-20 03:01:49,720] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.38 | bwd: 3404.63 | bwd_inner: 3403.81 | bwd_allreduce: 0.79 | step: 7.27 86%|████████▌ | 8592/10000 [13:32:10<2:09:36, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.002002516994252801, 'learning_rate': 2.043725170066608e-06, 'epoch': 8.59} 86%|████████▌ | 8592/10000 [13:32:10<2:09:36, 5.52s/it][2025-06-20 03:01:55,192] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:01:55,193] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.72 | bwd_microstep: 3320.25 | bwd_inner_microstep: 3319.38 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.11 [2025-06-20 03:01:55,193] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.72 | bwd: 3320.27 | bwd_inner: 3319.38 | bwd_allreduce: 0.82 | step: 7.11 86%|████████▌ | 8593/10000 [13:32:15<2:09:11, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0004897229955531657, 'learning_rate': 2.0408735756025333e-06, 'epoch': 8.59} 86%|████████▌ | 8593/10000 [13:32:15<2:09:11, 5.51s/it][2025-06-20 03:02:00,727] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:02:00,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.89 | bwd_microstep: 3362.75 | bwd_inner_microstep: 3361.94 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.18 [2025-06-20 03:02:00,728] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.89 | bwd: 3362.76 | bwd_inner: 3361.94 | bwd_allreduce: 0.78 | step: 7.19 86%|████████▌ | 8594/10000 [13:32:21<2:09:15, 5.52s/it] {'loss': 0.0008, 'grad_norm': 0.19190877676010132, 'learning_rate': 2.0380238649674713e-06, 'epoch': 8.59} 86%|████████▌ | 8594/10000 [13:32:21<2:09:15, 5.52s/it][2025-06-20 03:02:06,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:02:06,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.92 | bwd_microstep: 3316.91 | bwd_inner_microstep: 3316.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 03:02:06,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.92 | bwd: 3316.92 | bwd_inner: 3316.11 | bwd_allreduce: 0.77 | step: 6.82 86%|████████▌ | 8595/10000 [13:32:26<2:08:46, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.006431299727410078, 'learning_rate': 2.0351760384603426e-06, 
'epoch': 8.6} 86%|████████▌ | 8595/10000 [13:32:26<2:08:46, 5.50s/it][2025-06-20 03:02:11,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:02:11,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.51 | bwd_microstep: 3321.82 | bwd_inner_microstep: 3320.90 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.43 [2025-06-20 03:02:11,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.51 | bwd: 3321.84 | bwd_inner: 3320.90 | bwd_allreduce: 0.89 | step: 7.43 86%|████████▌ | 8596/10000 [13:32:32<2:08:25, 5.49s/it] {'loss': 0.0015, 'grad_norm': 0.3949330151081085, 'learning_rate': 2.0323300963798688e-06, 'epoch': 8.6} 86%|████████▌ | 8596/10000 [13:32:32<2:08:25, 5.49s/it][2025-06-20 03:02:17,184] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.60 | optimizer_step: 2.73 [2025-06-20 03:02:17,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.30 | bwd_microstep: 3367.73 | bwd_inner_microstep: 3366.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.24 [2025-06-20 03:02:17,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.30 | bwd: 3367.74 | bwd_inner: 3366.94 | bwd_allreduce: 0.76 | step: 7.24 86%|████████▌ | 8597/10000 [13:32:37<2:08:39, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005959774134680629, 'learning_rate': 2.0294860390245776e-06, 'epoch': 8.6} 86%|████████▌ | 8597/10000 [13:32:37<2:08:39, 5.50s/it][2025-06-20 03:02:22,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:02:22,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.05 | bwd_microstep: 3312.94 | bwd_inner_microstep: 3312.14 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.86 [2025-06-20 03:02:22,644] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.05 | bwd: 3312.95 | bwd_inner: 3312.14 | bwd_allreduce: 0.77 | step: 6.86 86%|████████▌ | 8598/10000 [13:32:43<2:08:15, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.00498990248888731, 'learning_rate': 2.0266438666927966e-06, 'epoch': 8.6} 86%|████████▌ | 8598/10000 [13:32:43<2:08:15, 5.49s/it][2025-06-20 03:02:28,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:02:28,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.74 | bwd_microstep: 3322.72 | bwd_inner_microstep: 3321.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-20 03:02:28,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.74 | bwd: 3322.73 | bwd_inner: 3321.92 | bwd_allreduce: 0.77 | step: 6.94 86%|████████▌ | 8599/10000 [13:32:48<2:07:59, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.06937888264656067, 'learning_rate': 2.0238035796826573e-06, 'epoch': 8.6} 86%|████████▌ | 8599/10000 [13:32:48<2:07:59, 5.48s/it][2025-06-20 03:02:33,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:02:33,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.03 | bwd_microstep: 3316.06 | bwd_inner_microstep: 3315.26 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 03:02:33,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.03 | bwd: 3316.08 | bwd_inner: 3315.26 | bwd_allreduce: 0.76 | step: 6.70 86%|████████▌ | 8600/10000 [13:32:54<2:07:43, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.006669072899967432, 'learning_rate': 2.020965178292096e-06, 'epoch': 8.6} 86%|████████▌ | 8600/10000 [13:32:54<2:07:43, 5.47s/it][2025-06-20 03:02:39,017] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | 
optimizer_step: 2.73 [2025-06-20 03:02:39,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.29 | bwd_microstep: 3314.58 | bwd_inner_microstep: 3313.59 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.19 [2025-06-20 03:02:39,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.29 | bwd: 3314.59 | bwd_inner: 3313.59 | bwd_allreduce: 0.96 | step: 7.19 86%|████████▌ | 8601/10000 [13:32:59<2:07:29, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00014895947242621332, 'learning_rate': 2.0181286628188414e-06, 'epoch': 8.6} 86%|████████▌ | 8601/10000 [13:32:59<2:07:29, 5.47s/it][2025-06-20 03:02:44,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 03:02:44,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.77 | bwd_microstep: 3313.93 | bwd_inner_microstep: 3313.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 03:02:44,481] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.77 | bwd: 3313.94 | bwd_inner: 3313.15 | bwd_allreduce: 0.75 | step: 6.67 86%|████████▌ | 8602/10000 [13:33:05<2:07:21, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.004011108074337244, 'learning_rate': 2.015294033560431e-06, 'epoch': 8.6} 86%|████████▌ | 8602/10000 [13:33:05<2:07:21, 5.47s/it][2025-06-20 03:02:49,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:02:49,936] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.53 | bwd_microstep: 3314.67 | bwd_inner_microstep: 3313.89 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 03:02:49,937] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.53 | bwd: 3314.68 | bwd_inner: 3313.88 | bwd_allreduce: 0.75 | step: 6.58 86%|████████▌ | 8603/10000 [13:33:10<2:07:11, 5.46s/it] {'loss': 0.0, 'grad_norm': 
0.0006338250241242349, 'learning_rate': 2.0124612908142093e-06, 'epoch': 8.6} 86%|████████▌ | 8603/10000 [13:33:10<2:07:11, 5.46s/it][2025-06-20 03:02:55,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:02:55,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.63 | bwd_microstep: 3309.26 | bwd_inner_microstep: 3308.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 03:02:55,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.63 | bwd: 3309.27 | bwd_inner: 3308.47 | bwd_allreduce: 0.75 | step: 6.62 86%|████████▌ | 8604/10000 [13:33:16<2:06:58, 5.46s/it] {'loss': 0.0, 'grad_norm': 1.9321396393934265e-05, 'learning_rate': 2.0096304348773168e-06, 'epoch': 8.6} 86%|████████▌ | 8604/10000 [13:33:16<2:06:58, 5.46s/it][2025-06-20 03:03:00,831] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:03:00,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.43 | bwd_microstep: 3312.41 | bwd_inner_microstep: 3311.61 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.04 [2025-06-20 03:03:00,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.43 | bwd: 3312.42 | bwd_inner: 3311.61 | bwd_allreduce: 0.77 | step: 7.04 86%|████████▌ | 8605/10000 [13:33:21<2:06:50, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.029775705188512802, 'learning_rate': 2.0068014660466974e-06, 'epoch': 8.61} 86%|████████▌ | 8605/10000 [13:33:21<2:06:50, 5.46s/it][2025-06-20 03:03:06,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:03:06,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.37 | bwd_microstep: 3357.39 | bwd_inner_microstep: 3356.60 | bwd_allreduce_microstep: 0.74 | 
step_microstep: 6.58 [2025-06-20 03:03:06,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.37 | bwd: 3357.40 | bwd_inner: 3356.60 | bwd_allreduce: 0.76 | step: 6.58 86%|████████▌ | 8606/10000 [13:33:27<2:07:16, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.006717579904943705, 'learning_rate': 2.0039743846190918e-06, 'epoch': 8.61} 86%|████████▌ | 8606/10000 [13:33:27<2:07:16, 5.48s/it][2025-06-20 03:03:11,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:03:11,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.82 | bwd_microstep: 3319.84 | bwd_inner_microstep: 3319.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.80 [2025-06-20 03:03:11,822] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.82 | bwd: 3319.85 | bwd_inner: 3319.05 | bwd_allreduce: 0.75 | step: 6.80 86%|████████▌ | 8607/10000 [13:33:32<2:07:02, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0009933231631293893, 'learning_rate': 2.0011491908910496e-06, 'epoch': 8.61} 86%|████████▌ | 8607/10000 [13:33:32<2:07:02, 5.47s/it][2025-06-20 03:03:17,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:03:17,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.51 | bwd_microstep: 3366.72 | bwd_inner_microstep: 3365.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 03:03:17,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.51 | bwd: 3366.74 | bwd_inner: 3365.93 | bwd_allreduce: 0.76 | step: 6.65 86%|████████▌ | 8608/10000 [13:33:38<2:07:23, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.01076021883636713, 'learning_rate': 1.9983258851589247e-06, 'epoch': 8.61} 86%|████████▌ | 8608/10000 [13:33:38<2:07:23, 5.49s/it][2025-06-20 03:03:22,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 03:03:22,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.52 | bwd_microstep: 3365.60 | bwd_inner_microstep: 3364.82 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.58 [2025-06-20 03:03:22,889] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.52 | bwd: 3365.61 | bwd_inner: 3364.82 | bwd_allreduce: 0.75 | step: 6.58 86%|████████▌ | 8609/10000 [13:33:43<2:07:35, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.007486959919333458, 'learning_rate': 1.995504467718865e-06, 'epoch': 8.61} 86%|████████▌ | 8609/10000 [13:33:43<2:07:35, 5.50s/it][2025-06-20 03:03:28,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:03:28,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.21 | bwd_microstep: 3366.63 | bwd_inner_microstep: 3365.70 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.41 [2025-06-20 03:03:28,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.21 | bwd: 3366.65 | bwd_inner: 3365.70 | bwd_allreduce: 0.91 | step: 7.41 86%|████████▌ | 8610/10000 [13:33:49<2:07:40, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00019028405949939042, 'learning_rate': 1.9926849388668245e-06, 'epoch': 8.61} 86%|████████▌ | 8610/10000 [13:33:49<2:07:40, 5.51s/it][2025-06-20 03:03:33,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 03:03:33,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.76 | bwd_microstep: 3329.00 | bwd_inner_microstep: 3327.95 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.19 [2025-06-20 03:03:33,899] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.76 | bwd: 3329.02 | bwd_inner: 3327.95 | bwd_allreduce: 1.01 | step: 7.19 86%|████████▌ | 8611/10000 
[13:33:54<2:07:23, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0019511869177222252, 'learning_rate': 1.98986729889856e-06, 'epoch': 8.61} 86%|████████▌ | 8611/10000 [13:33:54<2:07:23, 5.50s/it][2025-06-20 03:03:39,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:03:39,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.67 | bwd_microstep: 3317.64 | bwd_inner_microstep: 3316.82 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.87 [2025-06-20 03:03:39,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.67 | bwd: 3317.66 | bwd_inner: 3316.82 | bwd_allreduce: 0.79 | step: 6.87 86%|████████▌ | 8612/10000 [13:34:00<2:07:01, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.000213915147469379, 'learning_rate': 1.9870515481096285e-06, 'epoch': 8.61} 86%|████████▌ | 8612/10000 [13:34:00<2:07:01, 5.49s/it][2025-06-20 03:03:44,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:03:44,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.45 | bwd_microstep: 3313.25 | bwd_inner_microstep: 3312.35 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.26 [2025-06-20 03:03:44,821] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.45 | bwd: 3313.26 | bwd_inner: 3312.35 | bwd_allreduce: 0.87 | step: 7.26 86%|████████▌ | 8613/10000 [13:34:05<2:06:41, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00022958015324547887, 'learning_rate': 1.9842376867953895e-06, 'epoch': 8.61} 86%|████████▌ | 8613/10000 [13:34:05<2:06:41, 5.48s/it][2025-06-20 03:03:50,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 03:03:50,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.06 | bwd_microstep: 3373.98 | bwd_inner_microstep: 
3372.88 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.23 [2025-06-20 03:03:50,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.06 | bwd: 3373.99 | bwd_inner: 3372.88 | bwd_allreduce: 1.06 | step: 7.23 86%|████████▌ | 8614/10000 [13:34:11<2:06:59, 5.50s/it] {'loss': 0.0021, 'grad_norm': 0.5276440382003784, 'learning_rate': 1.981425715251002e-06, 'epoch': 8.61} 86%|████████▌ | 8614/10000 [13:34:11<2:06:59, 5.50s/it][2025-06-20 03:03:55,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:03:55,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.46 | bwd_microstep: 3314.20 | bwd_inner_microstep: 3313.41 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 03:03:55,819] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.46 | bwd: 3314.22 | bwd_inner: 3313.41 | bwd_allreduce: 0.77 | step: 6.71 86%|████████▌ | 8615/10000 [13:34:16<2:06:38, 5.49s/it] {'loss': 0.0, 'grad_norm': 8.814354077912867e-05, 'learning_rate': 1.9786156337714347e-06, 'epoch': 8.62} 86%|████████▌ | 8615/10000 [13:34:16<2:06:38, 5.49s/it][2025-06-20 03:04:01,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:04:01,341] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.38 | bwd_microstep: 3360.85 | bwd_inner_microstep: 3359.94 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.00 [2025-06-20 03:04:01,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.38 | bwd: 3360.86 | bwd_inner: 3359.94 | bwd_allreduce: 0.88 | step: 7.00 86%|████████▌ | 8616/10000 [13:34:22<2:06:48, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005161183886229992, 'learning_rate': 1.9758074426514428e-06, 'epoch': 8.62} 86%|████████▌ | 8616/10000 [13:34:22<2:06:48, 5.50s/it][2025-06-20 03:04:06,811] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:04:06,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.47 | bwd_microstep: 3319.04 | bwd_inner_microstep: 3318.12 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.05 [2025-06-20 03:04:06,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.47 | bwd: 3319.05 | bwd_inner: 3318.12 | bwd_allreduce: 0.89 | step: 7.06 86%|████████▌ | 8617/10000 [13:34:27<2:06:31, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.000922402017749846, 'learning_rate': 1.9730011421855977e-06, 'epoch': 8.62} 86%|████████▌ | 8617/10000 [13:34:27<2:06:31, 5.49s/it][2025-06-20 03:04:12,352] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:04:12,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.98 | bwd_microstep: 3372.93 | bwd_inner_microstep: 3372.11 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 03:04:12,353] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.98 | bwd: 3372.95 | bwd_inner: 3372.11 | bwd_allreduce: 0.79 | step: 6.81 86%|████████▌ | 8618/10000 [13:34:33<2:06:48, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.1269969791173935, 'learning_rate': 1.970196732668268e-06, 'epoch': 8.62} 86%|████████▌ | 8618/10000 [13:34:33<2:06:48, 5.51s/it][2025-06-20 03:04:17,894] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.77 [2025-06-20 03:04:17,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.43 | bwd_microstep: 3367.44 | bwd_inner_microstep: 3366.52 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.59 [2025-06-20 03:04:17,895] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.43 | bwd: 3367.46 | bwd_inner: 3366.52 | bwd_allreduce: 0.89 | step: 
7.59 86%|████████▌ | 8619/10000 [13:34:38<2:06:57, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.01808060146868229, 'learning_rate': 1.967394214393621e-06, 'epoch': 8.62} 86%|████████▌ | 8619/10000 [13:34:38<2:06:57, 5.52s/it][2025-06-20 03:04:23,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:04:23,355] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.27 | bwd_microstep: 3314.15 | bwd_inner_microstep: 3313.38 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 03:04:23,356] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.27 | bwd: 3314.17 | bwd_inner: 3313.38 | bwd_allreduce: 0.75 | step: 6.63 86%|████████▌ | 8620/10000 [13:34:44<2:06:28, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.008143425919115543, 'learning_rate': 1.9645935876556343e-06, 'epoch': 8.62} 86%|████████▌ | 8620/10000 [13:34:44<2:06:28, 5.50s/it][2025-06-20 03:04:28,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:04:28,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.36 | bwd_microstep: 3309.24 | bwd_inner_microstep: 3308.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 03:04:28,799] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.36 | bwd: 3309.25 | bwd_inner: 3308.45 | bwd_allreduce: 0.76 | step: 6.58 86%|████████▌ | 8621/10000 [13:34:49<2:06:00, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0028345612809062004, 'learning_rate': 1.961794852748069e-06, 'epoch': 8.62} 86%|████████▌ | 8621/10000 [13:34:49<2:06:00, 5.48s/it][2025-06-20 03:04:34,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-20 03:04:34,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.03 | 
bwd_microstep: 3308.93 | bwd_inner_microstep: 3308.15 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-20 03:04:34,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.03 | bwd: 3308.94 | bwd_inner: 3308.15 | bwd_allreduce: 0.75 | step: 6.56 86%|████████▌ | 8622/10000 [13:34:55<2:05:39, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.002790992148220539, 'learning_rate': 1.958998009964508e-06, 'epoch': 8.62} 86%|████████▌ | 8622/10000 [13:34:55<2:05:39, 5.47s/it][2025-06-20 03:04:39,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:04:39,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.01 | bwd_microstep: 3317.66 | bwd_inner_microstep: 3316.81 | bwd_allreduce_microstep: 0.81 | step_microstep: 6.72 [2025-06-20 03:04:39,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.01 | bwd: 3317.68 | bwd_inner: 3316.81 | bwd_allreduce: 0.83 | step: 6.72 86%|████████▌ | 8623/10000 [13:35:00<2:05:29, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00025587648269720376, 'learning_rate': 1.9562030595983252e-06, 'epoch': 8.62} 86%|████████▌ | 8623/10000 [13:35:00<2:05:29, 5.47s/it][2025-06-20 03:04:45,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:04:45,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.67 | bwd_microstep: 3309.34 | bwd_inner_microstep: 3308.47 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.96 [2025-06-20 03:04:45,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.67 | bwd: 3309.35 | bwd_inner: 3308.47 | bwd_allreduce: 0.84 | step: 6.97 86%|████████▌ | 8624/10000 [13:35:05<2:05:19, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.024776641279459, 'learning_rate': 1.953410001942695e-06, 'epoch': 8.62} 86%|████████▌ | 8624/10000 [13:35:05<2:05:19, 
5.46s/it][2025-06-20 03:04:50,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:04:50,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.18 | bwd_microstep: 3362.66 | bwd_inner_microstep: 3361.73 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.76 [2025-06-20 03:04:50,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.18 | bwd: 3362.68 | bwd_inner: 3361.73 | bwd_allreduce: 0.91 | step: 6.77 86%|████████▋ | 8625/10000 [13:35:11<2:05:40, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001119021326303482, 'learning_rate': 1.9506188372906056e-06, 'epoch': 8.62} 86%|████████▋ | 8625/10000 [13:35:11<2:05:40, 5.48s/it][2025-06-20 03:04:56,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:04:56,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.21 | bwd_microstep: 3307.27 | bwd_inner_microstep: 3306.47 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 03:04:56,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.21 | bwd: 3307.28 | bwd_inner: 3306.47 | bwd_allreduce: 0.77 | step: 6.78 86%|████████▋ | 8626/10000 [13:35:16<2:05:18, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.02471606805920601, 'learning_rate': 1.9478295659348247e-06, 'epoch': 8.63} 86%|████████▋ | 8626/10000 [13:35:16<2:05:18, 5.47s/it][2025-06-20 03:05:01,672] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:05:01,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.19 | bwd_microstep: 3370.56 | bwd_inner_microstep: 3369.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.79 [2025-06-20 03:05:01,673] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.19 | bwd: 3370.58 | 
bwd_inner: 3369.78 | bwd_allreduce: 0.76 | step: 6.79 86%|████████▋ | 8627/10000 [13:35:22<2:05:40, 5.49s/it] {'loss': 0.0, 'grad_norm': 8.905556023819372e-05, 'learning_rate': 1.9450421881679405e-06, 'epoch': 8.63} 86%|████████▋ | 8627/10000 [13:35:22<2:05:40, 5.49s/it][2025-06-20 03:05:07,112] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:05:07,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.68 | bwd_microstep: 3306.25 | bwd_inner_microstep: 3305.46 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.73 [2025-06-20 03:05:07,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.68 | bwd: 3306.27 | bwd_inner: 3305.46 | bwd_allreduce: 0.76 | step: 6.73 86%|████████▋ | 8628/10000 [13:35:27<2:05:13, 5.48s/it] {'loss': 0.0, 'grad_norm': 3.143608046229929e-05, 'learning_rate': 1.942256704282335e-06, 'epoch': 8.63} 86%|████████▋ | 8628/10000 [13:35:27<2:05:13, 5.48s/it][2025-06-20 03:05:12,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:05:12,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.00 | bwd_microstep: 3310.55 | bwd_inner_microstep: 3309.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-20 03:05:12,560] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.00 | bwd: 3310.57 | bwd_inner: 3309.76 | bwd_allreduce: 0.76 | step: 6.69 86%|████████▋ | 8629/10000 [13:35:33<2:04:56, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0008960075792856514, 'learning_rate': 1.9394731145701916e-06, 'epoch': 8.63} 86%|████████▋ | 8629/10000 [13:35:33<2:04:56, 5.47s/it][2025-06-20 03:05:18,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:05:18,007] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2098.19 | bwd_microstep: 3309.28 | bwd_inner_microstep: 3308.47 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.72 [2025-06-20 03:05:18,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.19 | bwd: 3309.29 | bwd_inner: 3308.47 | bwd_allreduce: 0.77 | step: 6.72 86%|████████▋ | 8630/10000 [13:35:38<2:04:42, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.018570417538285255, 'learning_rate': 1.9366914193234975e-06, 'epoch': 8.63} 86%|████████▋ | 8630/10000 [13:35:38<2:04:42, 5.46s/it][2025-06-20 03:05:23,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:05:23,543] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.93 | bwd_microstep: 3366.93 | bwd_inner_microstep: 3365.99 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.19 [2025-06-20 03:05:23,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.93 | bwd: 3366.95 | bwd_inner: 3365.99 | bwd_allreduce: 0.92 | step: 7.20 86%|████████▋ | 8631/10000 [13:35:44<2:05:07, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.020881934091448784, 'learning_rate': 1.933911618834039e-06, 'epoch': 8.63} 86%|████████▋ | 8631/10000 [13:35:44<2:05:07, 5.48s/it][2025-06-20 03:05:28,997] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:05:28,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.37 | bwd_microstep: 3313.34 | bwd_inner_microstep: 3312.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 03:05:28,998] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.37 | bwd: 3313.35 | bwd_inner: 3312.55 | bwd_allreduce: 0.75 | step: 6.58 86%|████████▋ | 8632/10000 [13:35:49<2:04:49, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.006200047675520182, 'learning_rate': 1.931133713393405e-06, 'epoch': 8.63} 86%|████████▋ | 
8632/10000 [13:35:49<2:04:49, 5.47s/it][2025-06-20 03:05:34,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:05:34,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.24 | bwd_microstep: 3307.51 | bwd_inner_microstep: 3306.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 03:05:34,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.24 | bwd: 3307.53 | bwd_inner: 3306.72 | bwd_allreduce: 0.77 | step: 6.79 86%|████████▋ | 8633/10000 [13:35:55<2:04:30, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0003278081421740353, 'learning_rate': 1.9283577032929845e-06, 'epoch': 8.63} 86%|████████▋ | 8633/10000 [13:35:55<2:04:30, 5.47s/it][2025-06-20 03:05:39,891] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:05:39,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.53 | bwd_microstep: 3308.92 | bwd_inner_microstep: 3308.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 03:05:39,892] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.53 | bwd: 3308.93 | bwd_inner: 3308.13 | bwd_allreduce: 0.76 | step: 6.64 86%|████████▋ | 8634/10000 [13:36:00<2:04:20, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0007990672020241618, 'learning_rate': 1.925583588823967e-06, 'epoch': 8.63} 86%|████████▋ | 8634/10000 [13:36:00<2:04:20, 5.46s/it][2025-06-20 03:05:45,339] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 03:05:45,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.81 | bwd_microstep: 3311.94 | bwd_inner_microstep: 3311.12 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.27 [2025-06-20 03:05:45,340] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2097.81 | bwd: 3311.96 | bwd_inner: 3311.12 | bwd_allreduce: 0.79 | step: 7.27 86%|████████▋ | 8635/10000 [13:36:06<2:04:09, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0007025700761005282, 'learning_rate': 1.9228113702773486e-06, 'epoch': 8.63} 86%|████████▋ | 8635/10000 [13:36:06<2:04:09, 5.46s/it][2025-06-20 03:05:50,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:05:50,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.55 | bwd_microstep: 3356.95 | bwd_inner_microstep: 3356.01 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.15 [2025-06-20 03:05:50,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.55 | bwd: 3356.97 | bwd_inner: 3356.01 | bwd_allreduce: 0.92 | step: 7.16 86%|████████▋ | 8636/10000 [13:36:11<2:04:28, 5.48s/it] {'loss': 0.0, 'grad_norm': 7.35749417799525e-05, 'learning_rate': 1.9200410479439168e-06, 'epoch': 8.64} 86%|████████▋ | 8636/10000 [13:36:11<2:04:28, 5.48s/it][2025-06-20 03:05:56,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:05:56,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.65 | bwd_microstep: 3352.72 | bwd_inner_microstep: 3351.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 03:05:56,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.65 | bwd: 3352.73 | bwd_inner: 3351.92 | bwd_allreduce: 0.77 | step: 6.64 86%|████████▋ | 8637/10000 [13:36:17<2:04:38, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.018069321289658546, 'learning_rate': 1.917272622114266e-06, 'epoch': 8.64} 86%|████████▋ | 8637/10000 [13:36:17<2:04:38, 5.49s/it][2025-06-20 03:06:01,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:06:01,815] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.32 | bwd_microstep: 3307.25 | bwd_inner_microstep: 3306.28 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.17 [2025-06-20 03:06:01,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.32 | bwd: 3307.27 | bwd_inner: 3306.28 | bwd_allreduce: 0.92 | step: 7.17 86%|████████▋ | 8638/10000 [13:36:22<2:04:16, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.000923184328712523, 'learning_rate': 1.914506093078792e-06, 'epoch': 8.64} 86%|████████▋ | 8638/10000 [13:36:22<2:04:16, 5.47s/it][2025-06-20 03:06:07,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:06:07,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2096.39 | bwd_microstep: 3311.95 | bwd_inner_microstep: 3311.14 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-20 03:06:07,269] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2096.39 | bwd: 3311.96 | bwd_inner: 3311.14 | bwd_allreduce: 0.78 | step: 6.99 86%|████████▋ | 8639/10000 [13:36:28<2:04:01, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0003340046969242394, 'learning_rate': 1.9117414611276917e-06, 'epoch': 8.64} 86%|████████▋ | 8639/10000 [13:36:28<2:04:01, 5.47s/it][2025-06-20 03:06:12,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.72 [2025-06-20 03:06:12,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.82 | bwd_microstep: 3308.11 | bwd_inner_microstep: 3307.33 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 03:06:12,730] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.82 | bwd: 3308.12 | bwd_inner: 3307.33 | bwd_allreduce: 0.75 | step: 6.62 86%|████████▋ | 8640/10000 [13:36:33<2:03:52, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00023976867669261992, 'learning_rate': 1.908978726550963e-06, 
'epoch': 8.64} 86%|████████▋ | 8640/10000 [13:36:33<2:03:52, 5.47s/it][2025-06-20 03:06:18,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:06:18,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.29 | bwd_microstep: 3314.42 | bwd_inner_microstep: 3313.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 03:06:18,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.29 | bwd: 3314.43 | bwd_inner: 3313.63 | bwd_allreduce: 0.76 | step: 6.58 86%|████████▋ | 8641/10000 [13:36:38<2:03:44, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0002912786731030792, 'learning_rate': 1.9062178896384043e-06, 'epoch': 8.64} 86%|████████▋ | 8641/10000 [13:36:38<2:03:44, 5.46s/it][2025-06-20 03:06:23,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:06:23,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.93 | bwd_microstep: 3311.58 | bwd_inner_microstep: 3310.61 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.09 [2025-06-20 03:06:23,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.93 | bwd: 3311.60 | bwd_inner: 3310.61 | bwd_allreduce: 0.94 | step: 7.09 86%|████████▋ | 8642/10000 [13:36:44<2:03:33, 5.46s/it] {'loss': 0.0, 'grad_norm': 6.067697540856898e-05, 'learning_rate': 1.903458950679613e-06, 'epoch': 8.64} 86%|████████▋ | 8642/10000 [13:36:44<2:03:33, 5.46s/it][2025-06-20 03:06:29,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:06:29,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.82 | bwd_microstep: 3305.50 | bwd_inner_microstep: 3304.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.00 [2025-06-20 03:06:29,084] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.82 | bwd: 3305.51 | bwd_inner: 3304.72 | bwd_allreduce: 0.76 | step: 7.01 86%|████████▋ | 8643/10000 [13:36:49<2:03:22, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.0011524480069056153, 'learning_rate': 1.9007019099639867e-06, 'epoch': 8.64} 86%|████████▋ | 8643/10000 [13:36:49<2:03:22, 5.46s/it][2025-06-20 03:06:34,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:06:34,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.35 | bwd_microstep: 3316.09 | bwd_inner_microstep: 3315.30 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-20 03:06:34,538] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.35 | bwd: 3316.10 | bwd_inner: 3315.30 | bwd_allreduce: 0.76 | step: 7.00 86%|████████▋ | 8644/10000 [13:36:55<2:03:16, 5.45s/it] {'loss': 0.0, 'grad_norm': 0.0016024300130084157, 'learning_rate': 1.8979467677807295e-06, 'epoch': 8.64} 86%|████████▋ | 8644/10000 [13:36:55<2:03:16, 5.45s/it][2025-06-20 03:06:39,983] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 03:06:39,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.06 | bwd_microstep: 3308.13 | bwd_inner_microstep: 3307.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 03:06:39,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.06 | bwd: 3308.14 | bwd_inner: 3307.34 | bwd_allreduce: 0.76 | step: 6.69 86%|████████▋ | 8645/10000 [13:37:00<2:03:07, 5.45s/it] {'loss': 0.0, 'grad_norm': 0.006698965560644865, 'learning_rate': 1.8951935244188414e-06, 'epoch': 8.64} 86%|████████▋ | 8645/10000 [13:37:00<2:03:07, 5.45s/it][2025-06-20 03:06:45,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.72 [2025-06-20 03:06:45,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.73 | bwd_microstep: 3358.66 | bwd_inner_microstep: 3357.84 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.00 [2025-06-20 03:06:45,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.73 | bwd: 3358.67 | bwd_inner: 3357.84 | bwd_allreduce: 0.79 | step: 7.01 86%|████████▋ | 8646/10000 [13:37:06<2:03:30, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0005118099506944418, 'learning_rate': 1.8924421801671245e-06, 'epoch': 8.65} 86%|████████▋ | 8646/10000 [13:37:06<2:03:30, 5.47s/it][2025-06-20 03:06:50,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:06:50,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.36 | bwd_microstep: 3328.95 | bwd_inner_microstep: 3328.13 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.76 [2025-06-20 03:06:50,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.36 | bwd: 3328.96 | bwd_inner: 3328.13 | bwd_allreduce: 0.79 | step: 6.76 86%|████████▋ | 8647/10000 [13:37:11<2:03:29, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00018180732149630785, 'learning_rate': 1.8896927353141858e-06, 'epoch': 8.65} 86%|████████▋ | 8647/10000 [13:37:11<2:03:29, 5.48s/it][2025-06-20 03:06:56,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:06:56,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.44 | bwd_microstep: 3318.51 | bwd_inner_microstep: 3317.73 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 03:06:56,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.44 | bwd: 3318.52 | bwd_inner: 3317.73 | bwd_allreduce: 0.75 | step: 6.61 86%|████████▋ | 8648/10000 [13:37:17<2:03:20, 5.47s/it] {'loss': 0.0, 'grad_norm': 
9.29498128243722e-05, 'learning_rate': 1.8869451901484238e-06, 'epoch': 8.65} 86%|████████▋ | 8648/10000 [13:37:17<2:03:20, 5.47s/it][2025-06-20 03:07:01,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:07:01,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.34 | bwd_microstep: 3367.78 | bwd_inner_microstep: 3366.98 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 03:07:01,986] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.34 | bwd: 3367.79 | bwd_inner: 3366.98 | bwd_allreduce: 0.77 | step: 6.79 86%|████████▋ | 8649/10000 [13:37:22<2:03:37, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0026109858881682158, 'learning_rate': 1.8841995449580476e-06, 'epoch': 8.65} 86%|████████▋ | 8649/10000 [13:37:22<2:03:37, 5.49s/it][2025-06-20 03:07:07,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:07:07,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.94 | bwd_microstep: 3367.90 | bwd_inner_microstep: 3367.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 03:07:07,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.94 | bwd: 3367.92 | bwd_inner: 3367.11 | bwd_allreduce: 0.76 | step: 6.80 86%|████████▋ | 8650/10000 [13:37:28<2:03:47, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0008209418156184256, 'learning_rate': 1.8814558000310623e-06, 'epoch': 8.65} 86%|████████▋ | 8650/10000 [13:37:28<2:03:47, 5.50s/it][2025-06-20 03:07:13,059] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 03:07:13,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.39 | bwd_microstep: 3369.29 | bwd_inner_microstep: 3368.25 | bwd_allreduce_microstep: 0.99 | 
step_microstep: 7.21 [2025-06-20 03:07:13,060] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.39 | bwd: 3369.30 | bwd_inner: 3368.25 | bwd_allreduce: 1.01 | step: 7.22 87%|████████▋ | 8651/10000 [13:37:33<2:03:59, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.008793270215392113, 'learning_rate': 1.8787139556552757e-06, 'epoch': 8.65} 87%|████████▋ | 8651/10000 [13:37:33<2:03:59, 5.52s/it][2025-06-20 03:07:18,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:07:18,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.62 | bwd_microstep: 3318.45 | bwd_inner_microstep: 3317.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 03:07:18,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.62 | bwd: 3318.46 | bwd_inner: 3317.67 | bwd_allreduce: 0.76 | step: 6.68 87%|████████▋ | 8652/10000 [13:37:39<2:03:36, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.00927593745291233, 'learning_rate': 1.8759740121182868e-06, 'epoch': 8.65} 87%|████████▋ | 8652/10000 [13:37:39<2:03:36, 5.50s/it][2025-06-20 03:07:23,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:07:23,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.41 | bwd_microstep: 3321.84 | bwd_inner_microstep: 3321.05 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 03:07:23,993] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.41 | bwd: 3321.85 | bwd_inner: 3321.05 | bwd_allreduce: 0.76 | step: 6.63 87%|████████▋ | 8653/10000 [13:37:44<2:03:14, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006078826729208231, 'learning_rate': 1.8732359697075097e-06, 'epoch': 8.65} 87%|████████▋ | 8653/10000 [13:37:44<2:03:14, 5.49s/it][2025-06-20 03:07:29,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:07:29,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.63 | bwd_microstep: 3317.08 | bwd_inner_microstep: 3316.30 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 03:07:29,452] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.63 | bwd: 3317.09 | bwd_inner: 3316.30 | bwd_allreduce: 0.75 | step: 6.55 87%|████████▋ | 8654/10000 [13:37:50<2:02:56, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00045788814895786345, 'learning_rate': 1.8704998287101483e-06, 'epoch': 8.65} 87%|████████▋ | 8654/10000 [13:37:50<2:02:56, 5.48s/it][2025-06-20 03:07:34,988] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:07:34,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.98 | bwd_microstep: 3363.95 | bwd_inner_microstep: 3363.13 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.71 [2025-06-20 03:07:34,989] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.98 | bwd: 3363.97 | bwd_inner: 3363.13 | bwd_allreduce: 0.79 | step: 6.72 87%|████████▋ | 8655/10000 [13:37:55<2:03:15, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.04892569035291672, 'learning_rate': 1.8677655894132152e-06, 'epoch': 8.65} 87%|████████▋ | 8655/10000 [13:37:55<2:03:15, 5.50s/it][2025-06-20 03:07:40,456] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 03:07:40,457] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.87 | bwd_microstep: 3321.87 | bwd_inner_microstep: 3320.74 | bwd_allreduce_microstep: 1.06 | step_microstep: 8.30 [2025-06-20 03:07:40,458] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.88 | bwd: 3321.89 | bwd_inner: 3320.74 | bwd_allreduce: 1.09 | step: 8.31 87%|████████▋ | 8656/10000 
[13:38:01<2:02:59, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.15978024899959564, 'learning_rate': 1.8650332521035185e-06, 'epoch': 8.66} 87%|████████▋ | 8656/10000 [13:38:01<2:02:59, 5.49s/it][2025-06-20 03:07:46,054] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:07:46,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.33 | bwd_microstep: 3398.94 | bwd_inner_microstep: 3398.06 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.41 [2025-06-20 03:07:46,055] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.33 | bwd: 3398.95 | bwd_inner: 3398.06 | bwd_allreduce: 0.84 | step: 7.42 87%|████████▋ | 8657/10000 [13:38:06<2:03:35, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0023626284673810005, 'learning_rate': 1.862302817067665e-06, 'epoch': 8.66} 87%|████████▋ | 8657/10000 [13:38:06<2:03:35, 5.52s/it][2025-06-20 03:07:51,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 03:07:51,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.04 | bwd_microstep: 3370.43 | bwd_inner_microstep: 3369.64 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 03:07:51,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.04 | bwd: 3370.44 | bwd_inner: 3369.64 | bwd_allreduce: 0.75 | step: 6.65 87%|████████▋ | 8658/10000 [13:38:12<2:03:40, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0006897725979797542, 'learning_rate': 1.8595742845920672e-06, 'epoch': 8.66} 87%|████████▋ | 8658/10000 [13:38:12<2:03:40, 5.53s/it][2025-06-20 03:07:57,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:07:57,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.42 | bwd_microstep: 3324.66 | 
bwd_inner_microstep: 3323.83 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.83 [2025-06-20 03:07:57,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.42 | bwd: 3324.67 | bwd_inner: 3323.83 | bwd_allreduce: 0.80 | step: 6.84 87%|████████▋ | 8659/10000 [13:38:17<2:03:14, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.001110583427362144, 'learning_rate': 1.8568476549629344e-06, 'epoch': 8.66} 87%|████████▋ | 8659/10000 [13:38:17<2:03:14, 5.51s/it][2025-06-20 03:08:02,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:08:02,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.04 | bwd_microstep: 3316.34 | bwd_inner_microstep: 3315.37 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.01 [2025-06-20 03:08:02,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.04 | bwd: 3316.36 | bwd_inner: 3315.37 | bwd_allreduce: 0.94 | step: 7.01 87%|████████▋ | 8660/10000 [13:38:23<2:02:53, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.06964976340532303, 'learning_rate': 1.8541229284662755e-06, 'epoch': 8.66} 87%|████████▋ | 8660/10000 [13:38:23<2:02:53, 5.50s/it][2025-06-20 03:08:08,049] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:08:08,050] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.69 | bwd_microstep: 3323.87 | bwd_inner_microstep: 3322.93 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.50 [2025-06-20 03:08:08,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.69 | bwd: 3323.88 | bwd_inner: 3322.93 | bwd_allreduce: 0.91 | step: 7.50 87%|████████▋ | 8661/10000 [13:38:28<2:02:44, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005367583595216274, 'learning_rate': 1.8514001053879083e-06, 'epoch': 8.66} 87%|████████▋ | 8661/10000 [13:38:28<2:02:44, 5.50s/it][2025-06-20 03:08:13,544] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:08:13,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.21 | bwd_microstep: 3330.11 | bwd_inner_microstep: 3329.18 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.22 [2025-06-20 03:08:13,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.21 | bwd: 3330.12 | bwd_inner: 3329.18 | bwd_allreduce: 0.90 | step: 7.22 87%|████████▋ | 8662/10000 [13:38:34<2:02:36, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002951835747808218, 'learning_rate': 1.8486791860134378e-06, 'epoch': 8.66} 87%|████████▋ | 8662/10000 [13:38:34<2:02:36, 5.50s/it][2025-06-20 03:08:19,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:08:19,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.93 | bwd_microstep: 3332.39 | bwd_inner_microstep: 3331.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 03:08:19,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.93 | bwd: 3332.40 | bwd_inner: 3331.61 | bwd_allreduce: 0.75 | step: 6.56 87%|████████▋ | 8663/10000 [13:38:39<2:02:28, 5.50s/it] {'loss': 0.0, 'grad_norm': 9.347302693640813e-05, 'learning_rate': 1.845960170628276e-06, 'epoch': 8.66} 87%|████████▋ | 8663/10000 [13:38:39<2:02:28, 5.50s/it][2025-06-20 03:08:24,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:08:24,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.85 | bwd_microstep: 3319.03 | bwd_inner_microstep: 3318.25 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 03:08:24,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.85 | bwd: 3319.05 | bwd_inner: 3318.25 | bwd_allreduce: 0.76 | 
step: 6.68 87%|████████▋ | 8664/10000 [13:38:45<2:02:10, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.09335201978683472, 'learning_rate': 1.8432430595176365e-06, 'epoch': 8.66} 87%|████████▋ | 8664/10000 [13:38:45<2:02:10, 5.49s/it][2025-06-20 03:08:30,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:08:30,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.84 | bwd_microstep: 3375.20 | bwd_inner_microstep: 3374.25 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.12 [2025-06-20 03:08:30,052] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.84 | bwd: 3375.22 | bwd_inner: 3374.25 | bwd_allreduce: 0.92 | step: 7.12 87%|████████▋ | 8665/10000 [13:38:50<2:02:31, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.033798281103372574, 'learning_rate': 1.8405278529665339e-06, 'epoch': 8.66} 87%|████████▋ | 8665/10000 [13:38:50<2:02:31, 5.51s/it][2025-06-20 03:08:35,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:08:35,610] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.76 | bwd_microstep: 3373.95 | bwd_inner_microstep: 3373.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.08 [2025-06-20 03:08:35,611] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.76 | bwd: 3373.97 | bwd_inner: 3373.15 | bwd_allreduce: 0.77 | step: 7.08 87%|████████▋ | 8666/10000 [13:38:56<2:02:45, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0008012483594939113, 'learning_rate': 1.8378145512597778e-06, 'epoch': 8.67} 87%|████████▋ | 8666/10000 [13:38:56<2:02:45, 5.52s/it][2025-06-20 03:08:41,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:08:41,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.00 | 
bwd_microstep: 3378.86 | bwd_inner_microstep: 3377.89 | bwd_allreduce_microstep: 0.93 | step_microstep: 6.66 [2025-06-20 03:08:41,165] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.00 | bwd: 3378.88 | bwd_inner: 3377.89 | bwd_allreduce: 0.95 | step: 6.67 87%|████████▋ | 8667/10000 [13:39:01<2:02:52, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.00021528340585064143, 'learning_rate': 1.8351031546819808e-06, 'epoch': 8.67} 87%|████████▋ | 8667/10000 [13:39:01<2:02:52, 5.53s/it][2025-06-20 03:08:46,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:08:46,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.37 | bwd_microstep: 3324.20 | bwd_inner_microstep: 3323.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 03:08:46,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.37 | bwd: 3324.21 | bwd_inner: 3323.42 | bwd_allreduce: 0.75 | step: 6.63 87%|████████▋ | 8668/10000 [13:39:07<2:02:27, 5.52s/it] {'loss': 0.0032, 'grad_norm': 3.043766736984253, 'learning_rate': 1.8323936635175577e-06, 'epoch': 8.67} 87%|████████▋ | 8668/10000 [13:39:07<2:02:27, 5.52s/it][2025-06-20 03:08:52,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.74 [2025-06-20 03:08:52,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.42 | bwd_microstep: 3323.16 | bwd_inner_microstep: 3322.31 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.22 [2025-06-20 03:08:52,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.42 | bwd: 3323.18 | bwd_inner: 3322.31 | bwd_allreduce: 0.82 | step: 7.22 87%|████████▋ | 8669/10000 [13:39:12<2:02:02, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0014814577298238873, 'learning_rate': 1.829686078050723e-06, 'epoch': 8.67} 87%|████████▋ | 8669/10000 [13:39:12<2:02:02, 
5.50s/it][2025-06-20 03:08:57,670] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:08:57,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.14 | bwd_microstep: 3376.93 | bwd_inner_microstep: 3376.02 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.16 [2025-06-20 03:08:57,671] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.14 | bwd: 3376.95 | bwd_inner: 3376.02 | bwd_allreduce: 0.89 | step: 7.17 87%|████████▋ | 8670/10000 [13:39:18<2:02:19, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.02155439741909504, 'learning_rate': 1.8269803985654854e-06, 'epoch': 8.67} 87%|████████▋ | 8670/10000 [13:39:18<2:02:19, 5.52s/it][2025-06-20 03:09:03,137] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:09:03,138] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.93 | bwd_microstep: 3317.90 | bwd_inner_microstep: 3316.94 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.55 [2025-06-20 03:09:03,139] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.93 | bwd: 3317.92 | bwd_inner: 3316.94 | bwd_allreduce: 0.94 | step: 7.56 87%|████████▋ | 8671/10000 [13:39:23<2:01:53, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0001342741452390328, 'learning_rate': 1.8242766253456645e-06, 'epoch': 8.67} 87%|████████▋ | 8671/10000 [13:39:23<2:01:53, 5.50s/it][2025-06-20 03:09:08,627] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 03:09:08,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.13 | bwd_microstep: 3334.07 | bwd_inner_microstep: 3332.94 | bwd_allreduce_microstep: 1.07 | step_microstep: 7.65 [2025-06-20 03:09:08,628] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.13 | bwd: 3334.09 | 
bwd_inner: 3332.94 | bwd_allreduce: 1.09 | step: 7.66 87%|████████▋ | 8672/10000 [13:39:29<2:01:43, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0008025002316571772, 'learning_rate': 1.8215747586748644e-06, 'epoch': 8.67} 87%|████████▋ | 8672/10000 [13:39:29<2:01:43, 5.50s/it][2025-06-20 03:09:14,176] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:09:14,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.02 | bwd_microstep: 3366.35 | bwd_inner_microstep: 3365.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 03:09:14,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.02 | bwd: 3366.36 | bwd_inner: 3365.56 | bwd_allreduce: 0.76 | step: 6.72 87%|████████▋ | 8673/10000 [13:39:34<2:01:56, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.010625528171658516, 'learning_rate': 1.818874798836505e-06, 'epoch': 8.67} 87%|████████▋ | 8673/10000 [13:39:34<2:01:56, 5.51s/it][2025-06-20 03:09:19,721] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:09:19,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.34 | bwd_microstep: 3377.38 | bwd_inner_microstep: 3376.59 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.65 [2025-06-20 03:09:19,722] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.34 | bwd: 3377.40 | bwd_inner: 3376.59 | bwd_allreduce: 0.76 | step: 6.66 87%|████████▋ | 8674/10000 [13:39:40<2:02:03, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0009546224027872086, 'learning_rate': 1.816176746113798e-06, 'epoch': 8.67} 87%|████████▋ | 8674/10000 [13:39:40<2:02:03, 5.52s/it][2025-06-20 03:09:25,304] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:09:25,305] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2139.27 | bwd_microstep: 3406.43 | bwd_inner_microstep: 3405.59 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.08 [2025-06-20 03:09:25,305] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.27 | bwd: 3406.45 | bwd_inner: 3405.59 | bwd_allreduce: 0.80 | step: 7.08 87%|████████▋ | 8675/10000 [13:39:46<2:02:22, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0007831751718185842, 'learning_rate': 1.8134806007897566e-06, 'epoch': 8.68} 87%|████████▋ | 8675/10000 [13:39:46<2:02:22, 5.54s/it][2025-06-20 03:09:30,781] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:09:30,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.18 | bwd_microstep: 3328.68 | bwd_inner_microstep: 3327.89 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-20 03:09:30,782] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.18 | bwd: 3328.69 | bwd_inner: 3327.89 | bwd_allreduce: 0.76 | step: 6.78 87%|████████▋ | 8676/10000 [13:39:51<2:01:50, 5.52s/it] {'loss': 0.0, 'grad_norm': 1.605999204912223e-05, 'learning_rate': 1.810786363147199e-06, 'epoch': 8.68} 87%|████████▋ | 8676/10000 [13:39:51<2:01:50, 5.52s/it][2025-06-20 03:09:36,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:09:36,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.94 | bwd_microstep: 3379.31 | bwd_inner_microstep: 3378.37 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.53 [2025-06-20 03:09:36,331] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.94 | bwd: 3379.33 | bwd_inner: 3378.37 | bwd_allreduce: 0.92 | step: 7.54 87%|████████▋ | 8677/10000 [13:39:57<2:01:56, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.005151542369276285, 'learning_rate': 1.808094033468728e-06, 'epoch': 8.68} 87%|████████▋ | 8677/10000 
[13:39:57<2:01:56, 5.53s/it][2025-06-20 03:09:41,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:09:41,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.87 | bwd_microstep: 3383.96 | bwd_inner_microstep: 3383.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 03:09:41,893] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.87 | bwd: 3383.97 | bwd_inner: 3383.15 | bwd_allreduce: 0.78 | step: 6.81 87%|████████▋ | 8678/10000 [13:40:02<2:02:03, 5.54s/it] {'loss': 0.0002, 'grad_norm': 0.03274790570139885, 'learning_rate': 1.8054036120367603e-06, 'epoch': 8.68} 87%|████████▋ | 8678/10000 [13:40:02<2:02:03, 5.54s/it][2025-06-20 03:09:47,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:09:47,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.71 | bwd_microstep: 3320.36 | bwd_inner_microstep: 3319.53 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.18 [2025-06-20 03:09:47,372] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.71 | bwd: 3320.38 | bwd_inner: 3319.53 | bwd_allreduce: 0.80 | step: 7.18 87%|████████▋ | 8679/10000 [13:40:08<2:01:33, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0025060027837753296, 'learning_rate': 1.8027150991335118e-06, 'epoch': 8.68} 87%|████████▋ | 8679/10000 [13:40:08<2:01:33, 5.52s/it][2025-06-20 03:09:52,906] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:09:52,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.82 | bwd_microstep: 3368.42 | bwd_inner_microstep: 3367.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 03:09:52,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.82 | 
bwd: 3368.43 | bwd_inner: 3367.62 | bwd_allreduce: 0.77 | step: 6.81 87%|████████▋ | 8680/10000 [13:40:13<2:01:33, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.003424367168918252, 'learning_rate': 1.800028495040993e-06, 'epoch': 8.68} 87%|████████▋ | 8680/10000 [13:40:13<2:01:33, 5.53s/it][2025-06-20 03:09:58,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:09:58,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.77 | bwd_microstep: 3325.56 | bwd_inner_microstep: 3324.58 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.10 [2025-06-20 03:09:58,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.77 | bwd: 3325.58 | bwd_inner: 3324.58 | bwd_allreduce: 0.95 | step: 7.11 87%|████████▋ | 8681/10000 [13:40:19<2:01:07, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0035688027273863554, 'learning_rate': 1.7973438000410159e-06, 'epoch': 8.68} 87%|████████▋ | 8681/10000 [13:40:19<2:01:07, 5.51s/it][2025-06-20 03:10:03,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:10:03,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.01 | bwd_microstep: 3325.59 | bwd_inner_microstep: 3324.60 | bwd_allreduce_microstep: 0.94 | step_microstep: 6.65 [2025-06-20 03:10:03,855] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.01 | bwd: 3325.60 | bwd_inner: 3324.60 | bwd_allreduce: 0.96 | step: 6.65 87%|████████▋ | 8682/10000 [13:40:24<2:00:47, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0015624853549525142, 'learning_rate': 1.7946610144151933e-06, 'epoch': 8.68} 87%|████████▋ | 8682/10000 [13:40:24<2:00:47, 5.50s/it][2025-06-20 03:10:09,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:10:09,402] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.86 | bwd_microstep: 3372.26 | bwd_inner_microstep: 3371.44 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.77 [2025-06-20 03:10:09,402] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.86 | bwd: 3372.27 | bwd_inner: 3371.44 | bwd_allreduce: 0.78 | step: 6.78 87%|████████▋ | 8683/10000 [13:40:30<2:01:01, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0016891624545678496, 'learning_rate': 1.7919801384449353e-06, 'epoch': 8.68} 87%|████████▋ | 8683/10000 [13:40:30<2:01:01, 5.51s/it][2025-06-20 03:10:14,879] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:10:14,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.93 | bwd_microstep: 3325.62 | bwd_inner_microstep: 3324.82 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-20 03:10:14,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.93 | bwd: 3325.63 | bwd_inner: 3324.82 | bwd_allreduce: 0.76 | step: 6.93 87%|████████▋ | 8684/10000 [13:40:35<2:00:41, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0015177360037341714, 'learning_rate': 1.7893011724114572e-06, 'epoch': 8.68} 87%|████████▋ | 8684/10000 [13:40:35<2:00:41, 5.50s/it][2025-06-20 03:10:20,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:10:20,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.34 | bwd_microstep: 3325.95 | bwd_inner_microstep: 3325.15 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-20 03:10:20,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.34 | bwd: 3325.97 | bwd_inner: 3325.15 | bwd_allreduce: 0.77 | step: 7.00 87%|████████▋ | 8685/10000 [13:40:41<2:00:25, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00015589629765599966, 'learning_rate': 1.7866241165957655e-06, 
'epoch': 8.69} 87%|████████▋ | 8685/10000 [13:40:41<2:00:25, 5.49s/it][2025-06-20 03:10:25,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:10:25,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.39 | bwd_microstep: 3317.81 | bwd_inner_microstep: 3316.88 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.01 [2025-06-20 03:10:25,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.39 | bwd: 3317.82 | bwd_inner: 3316.88 | bwd_allreduce: 0.90 | step: 7.01 87%|████████▋ | 8686/10000 [13:40:46<2:00:07, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0002609382790978998, 'learning_rate': 1.7839489712786773e-06, 'epoch': 8.69} 87%|████████▋ | 8686/10000 [13:40:46<2:00:07, 5.48s/it][2025-06-20 03:10:31,280] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:10:31,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.69 | bwd_microstep: 3319.62 | bwd_inner_microstep: 3318.81 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-20 03:10:31,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.69 | bwd: 3319.63 | bwd_inner: 3318.81 | bwd_allreduce: 0.78 | step: 6.94 87%|████████▋ | 8687/10000 [13:40:52<1:59:53, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.007363283075392246, 'learning_rate': 1.7812757367407973e-06, 'epoch': 8.69} 87%|████████▋ | 8687/10000 [13:40:52<1:59:53, 5.48s/it][2025-06-20 03:10:36,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:10:36,836] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.36 | bwd_microstep: 3382.08 | bwd_inner_microstep: 3381.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-20 03:10:36,836] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.36 | bwd: 3382.10 | bwd_inner: 3381.28 | bwd_allreduce: 0.77 | step: 6.96 87%|████████▋ | 8688/10000 [13:40:57<2:00:17, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0007730807410553098, 'learning_rate': 1.7786044132625347e-06, 'epoch': 8.69} 87%|████████▋ | 8688/10000 [13:40:57<2:00:17, 5.50s/it][2025-06-20 03:10:42,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 03:10:42,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.09 | bwd_microstep: 3373.52 | bwd_inner_microstep: 3372.52 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.27 [2025-06-20 03:10:42,382] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.09 | bwd: 3373.53 | bwd_inner: 3372.52 | bwd_allreduce: 0.97 | step: 7.27 87%|████████▋ | 8689/10000 [13:41:03<2:00:30, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00022645237913820893, 'learning_rate': 1.7759350011241005e-06, 'epoch': 8.69} 87%|████████▋ | 8689/10000 [13:41:03<2:00:30, 5.52s/it][2025-06-20 03:10:47,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:10:47,846] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.96 | bwd_microstep: 3315.40 | bwd_inner_microstep: 3314.60 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-20 03:10:47,847] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.96 | bwd: 3315.41 | bwd_inner: 3314.60 | bwd_allreduce: 0.77 | step: 6.92 87%|████████▋ | 8690/10000 [13:41:08<2:00:04, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00023400505597237498, 'learning_rate': 1.7732675006055046e-06, 'epoch': 8.69} 87%|████████▋ | 8690/10000 [13:41:08<2:00:04, 5.50s/it][2025-06-20 03:10:53,321] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | 
optimizer_step: 2.73 [2025-06-20 03:10:53,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.09 | bwd_microstep: 3320.73 | bwd_inner_microstep: 3319.79 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.41 [2025-06-20 03:10:53,322] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.09 | bwd: 3320.74 | bwd_inner: 3319.79 | bwd_allreduce: 0.92 | step: 7.41 87%|████████▋ | 8691/10000 [13:41:14<1:59:49, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0002297841856488958, 'learning_rate': 1.7706019119865603e-06, 'epoch': 8.69} 87%|████████▋ | 8691/10000 [13:41:14<1:59:49, 5.49s/it][2025-06-20 03:10:58,866] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-20 03:10:58,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.21 | bwd_microstep: 3376.33 | bwd_inner_microstep: 3375.20 | bwd_allreduce_microstep: 1.04 | step_microstep: 7.94 [2025-06-20 03:10:58,867] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.21 | bwd: 3376.35 | bwd_inner: 3375.20 | bwd_allreduce: 1.09 | step: 7.95 87%|████████▋ | 8692/10000 [13:41:19<2:00:06, 5.51s/it] {'loss': 0.0, 'grad_norm': 8.949205948738381e-05, 'learning_rate': 1.7679382355468643e-06, 'epoch': 8.69} 87%|████████▋ | 8692/10000 [13:41:19<2:00:06, 5.51s/it][2025-06-20 03:11:04,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:11:04,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.99 | bwd_microstep: 3328.86 | bwd_inner_microstep: 3328.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 03:11:04,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.99 | bwd: 3328.87 | bwd_inner: 3328.06 | bwd_allreduce: 0.77 | step: 6.69 87%|████████▋ | 8693/10000 [13:41:25<1:59:51, 5.50s/it] {'loss': 0.0001, 'grad_norm': 
0.006576893851161003, 'learning_rate': 1.765276471565831e-06, 'epoch': 8.69} 87%|████████▋ | 8693/10000 [13:41:25<1:59:51, 5.50s/it][2025-06-20 03:11:09,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:11:09,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.54 | bwd_microstep: 3317.55 | bwd_inner_microstep: 3316.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-20 03:11:09,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.54 | bwd: 3317.57 | bwd_inner: 3316.76 | bwd_allreduce: 0.76 | step: 6.68 87%|████████▋ | 8694/10000 [13:41:30<1:59:29, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0009880716679617763, 'learning_rate': 1.7626166203226658e-06, 'epoch': 8.69} 87%|████████▋ | 8694/10000 [13:41:30<1:59:29, 5.49s/it][2025-06-20 03:11:15,278] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:11:15,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.24 | bwd_microstep: 3318.29 | bwd_inner_microstep: 3317.46 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.95 [2025-06-20 03:11:15,279] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.24 | bwd: 3318.30 | bwd_inner: 3317.46 | bwd_allreduce: 0.79 | step: 6.95 87%|████████▋ | 8695/10000 [13:41:36<1:59:13, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0011782344663515687, 'learning_rate': 1.7599586820963743e-06, 'epoch': 8.7} 87%|████████▋ | 8695/10000 [13:41:36<1:59:13, 5.48s/it][2025-06-20 03:11:20,742] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:11:20,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.48 | bwd_microstep: 3316.09 | bwd_inner_microstep: 3315.28 | bwd_allreduce_microstep: 0.77 | 
step_microstep: 6.94 [2025-06-20 03:11:20,743] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.48 | bwd: 3316.10 | bwd_inner: 3315.28 | bwd_allreduce: 0.78 | step: 6.95 87%|████████▋ | 8696/10000 [13:41:41<1:59:00, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0010001155314967036, 'learning_rate': 1.757302657165767e-06, 'epoch': 8.7} 87%|████████▋ | 8696/10000 [13:41:41<1:59:00, 5.48s/it][2025-06-20 03:11:26,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:11:26,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.46 | bwd_microstep: 3318.64 | bwd_inner_microstep: 3317.83 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.83 [2025-06-20 03:11:26,204] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.46 | bwd: 3318.66 | bwd_inner: 3317.83 | bwd_allreduce: 0.79 | step: 6.83 87%|████████▋ | 8697/10000 [13:41:47<1:58:49, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00021591875702142715, 'learning_rate': 1.7546485458094386e-06, 'epoch': 8.7} 87%|████████▋ | 8697/10000 [13:41:47<1:58:49, 5.47s/it][2025-06-20 03:11:31,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:11:31,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.96 | bwd_microstep: 3361.70 | bwd_inner_microstep: 3360.91 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 03:11:31,740] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.96 | bwd: 3361.72 | bwd_inner: 3360.91 | bwd_allreduce: 0.76 | step: 6.56 87%|████████▋ | 8698/10000 [13:41:52<1:59:08, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.040623269975185394, 'learning_rate': 1.7519963483058e-06, 'epoch': 8.7} 87%|████████▋ | 8698/10000 [13:41:52<1:59:08, 5.49s/it][2025-06-20 03:11:37,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:11:37,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.58 | bwd_microstep: 3309.21 | bwd_inner_microstep: 3308.38 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.81 [2025-06-20 03:11:37,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.59 | bwd: 3309.23 | bwd_inner: 3308.38 | bwd_allreduce: 0.80 | step: 6.82 87%|████████▋ | 8699/10000 [13:41:57<1:58:47, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0009723390685394406, 'learning_rate': 1.749346064933053e-06, 'epoch': 8.7} 87%|████████▋ | 8699/10000 [13:41:57<1:58:47, 5.48s/it][2025-06-20 03:11:42,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:11:42,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.57 | bwd_microstep: 3316.58 | bwd_inner_microstep: 3315.80 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 03:11:42,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.57 | bwd: 3316.59 | bwd_inner: 3315.80 | bwd_allreduce: 0.76 | step: 6.64 87%|████████▋ | 8700/10000 [13:42:03<1:58:34, 5.47s/it] {'loss': 0.0, 'grad_norm': 5.497999154613353e-05, 'learning_rate': 1.7466976959691994e-06, 'epoch': 8.7} 87%|████████▋ | 8700/10000 [13:42:03<1:58:34, 5.47s/it][2025-06-20 03:11:48,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:11:48,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.21 | bwd_microstep: 3367.48 | bwd_inner_microstep: 3366.49 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.03 [2025-06-20 03:11:48,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.21 | bwd: 3367.50 | bwd_inner: 3366.49 | bwd_allreduce: 0.96 | step: 7.03 87%|████████▋ | 8701/10000 
[13:42:08<1:58:56, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.06280532479286194, 'learning_rate': 1.7440512416920374e-06, 'epoch': 8.7} 87%|████████▋ | 8701/10000 [13:42:08<1:58:56, 5.49s/it][2025-06-20 03:11:53,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 03:11:53,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.25 | bwd_microstep: 3324.37 | bwd_inner_microstep: 3323.58 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.85 [2025-06-20 03:11:53,674] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.25 | bwd: 3324.39 | bwd_inner: 3323.58 | bwd_allreduce: 0.76 | step: 6.86 87%|████████▋ | 8702/10000 [13:42:14<1:58:46, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.003862418932840228, 'learning_rate': 1.741406702379178e-06, 'epoch': 8.7} 87%|████████▋ | 8702/10000 [13:42:14<1:58:46, 5.49s/it][2025-06-20 03:11:59,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:11:59,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.15 | bwd_microstep: 3363.89 | bwd_inner_microstep: 3363.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-20 03:11:59,203] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.15 | bwd: 3363.91 | bwd_inner: 3363.10 | bwd_allreduce: 0.76 | step: 6.74 87%|████████▋ | 8703/10000 [13:42:20<1:58:56, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002625528024509549, 'learning_rate': 1.7387640783080128e-06, 'epoch': 8.7} 87%|████████▋ | 8703/10000 [13:42:20<1:58:56, 5.50s/it][2025-06-20 03:12:04,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:12:04,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.50 | bwd_microstep: 3306.15 | bwd_inner_microstep: 
3305.34 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.79 [2025-06-20 03:12:04,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.50 | bwd: 3306.17 | bwd_inner: 3305.34 | bwd_allreduce: 0.78 | step: 6.79 87%|████████▋ | 8704/10000 [13:42:25<1:58:27, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.003222087165340781, 'learning_rate': 1.7361233697557422e-06, 'epoch': 8.7} 87%|████████▋ | 8704/10000 [13:42:25<1:58:27, 5.48s/it][2025-06-20 03:12:10,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:12:10,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.89 | bwd_microstep: 3318.97 | bwd_inner_microstep: 3318.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 03:12:10,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.89 | bwd: 3318.99 | bwd_inner: 3318.19 | bwd_allreduce: 0.75 | step: 6.61 87%|████████▋ | 8705/10000 [13:42:30<1:58:15, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0038150325417518616, 'learning_rate': 1.7334845769993646e-06, 'epoch': 8.71} 87%|████████▋ | 8705/10000 [13:42:30<1:58:15, 5.48s/it][2025-06-20 03:12:15,580] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:12:15,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.24 | bwd_microstep: 3325.82 | bwd_inner_microstep: 3324.92 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.20 [2025-06-20 03:12:15,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.24 | bwd: 3325.83 | bwd_inner: 3324.92 | bwd_allreduce: 0.87 | step: 7.20 87%|████████▋ | 8706/10000 [13:42:36<1:58:05, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.04781521484255791, 'learning_rate': 1.7308477003156786e-06, 'epoch': 8.71} 87%|████████▋ | 8706/10000 [13:42:36<1:58:05, 5.48s/it][2025-06-20 03:12:21,117] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:12:21,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.77 | bwd_microstep: 3367.39 | bwd_inner_microstep: 3366.61 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 03:12:21,117] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.77 | bwd: 3367.40 | bwd_inner: 3366.61 | bwd_allreduce: 0.75 | step: 6.58 87%|████████▋ | 8707/10000 [13:42:41<1:58:23, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00010180289973504841, 'learning_rate': 1.7282127399812809e-06, 'epoch': 8.71} 87%|████████▋ | 8707/10000 [13:42:41<1:58:23, 5.49s/it][2025-06-20 03:12:26,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:12:26,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.62 | bwd_microstep: 3315.80 | bwd_inner_microstep: 3315.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-20 03:12:26,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.62 | bwd: 3315.82 | bwd_inner: 3315.00 | bwd_allreduce: 0.78 | step: 6.76 87%|████████▋ | 8708/10000 [13:42:47<1:58:02, 5.48s/it] {'loss': 0.0, 'grad_norm': 4.667608664021827e-05, 'learning_rate': 1.7255796962725634e-06, 'epoch': 8.71} 87%|████████▋ | 8708/10000 [13:42:47<1:58:02, 5.48s/it][2025-06-20 03:12:32,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:12:32,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.99 | bwd_microstep: 3316.73 | bwd_inner_microstep: 3315.91 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.88 [2025-06-20 03:12:32,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.99 | bwd: 3316.74 | bwd_inner: 3315.91 | bwd_allreduce: 0.79 | 
step: 6.89 87%|████████▋ | 8709/10000 [13:42:52<1:57:51, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0012084098998457193, 'learning_rate': 1.7229485694657188e-06, 'epoch': 8.71} 87%|████████▋ | 8709/10000 [13:42:52<1:57:51, 5.48s/it][2025-06-20 03:12:37,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:12:37,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.03 | bwd_microstep: 3363.38 | bwd_inner_microstep: 3362.58 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.77 [2025-06-20 03:12:37,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.03 | bwd: 3363.39 | bwd_inner: 3362.58 | bwd_allreduce: 0.77 | step: 6.77 87%|████████▋ | 8710/10000 [13:42:58<1:58:08, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00013119069626554847, 'learning_rate': 1.720319359836744e-06, 'epoch': 8.71} 87%|████████▋ | 8710/10000 [13:42:58<1:58:08, 5.49s/it][2025-06-20 03:12:43,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:12:43,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.39 | bwd_microstep: 3371.04 | bwd_inner_microstep: 3370.12 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.96 [2025-06-20 03:12:43,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.39 | bwd: 3371.06 | bwd_inner: 3370.12 | bwd_allreduce: 0.89 | step: 6.96 87%|████████▋ | 8711/10000 [13:43:03<1:58:19, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.014158111996948719, 'learning_rate': 1.7176920676614317e-06, 'epoch': 8.71} 87%|████████▋ | 8711/10000 [13:43:03<1:58:19, 5.51s/it][2025-06-20 03:12:48,637] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:12:48,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.58 | 
bwd_microstep: 3356.77 | bwd_inner_microstep: 3355.81 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.09 [2025-06-20 03:12:48,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.58 | bwd: 3356.78 | bwd_inner: 3355.81 | bwd_allreduce: 0.93 | step: 7.10 87%|████████▋ | 8712/10000 [13:43:09<1:58:21, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00200386019423604, 'learning_rate': 1.7150666932153726e-06, 'epoch': 8.71} 87%|████████▋ | 8712/10000 [13:43:09<1:58:21, 5.51s/it][2025-06-20 03:12:54,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:12:54,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.18 | bwd_microstep: 3387.14 | bwd_inner_microstep: 3386.33 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.66 [2025-06-20 03:12:54,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.18 | bwd: 3387.15 | bwd_inner: 3386.33 | bwd_allreduce: 0.78 | step: 6.66 87%|████████▋ | 8713/10000 [13:43:14<1:58:34, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.03408807888627052, 'learning_rate': 1.7124432367739508e-06, 'epoch': 8.71} 87%|████████▋ | 8713/10000 [13:43:14<1:58:34, 5.53s/it][2025-06-20 03:12:59,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 03:12:59,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.54 | bwd_microstep: 3313.85 | bwd_inner_microstep: 3312.88 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.11 [2025-06-20 03:12:59,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.54 | bwd: 3313.87 | bwd_inner: 3312.88 | bwd_allreduce: 0.93 | step: 7.11 87%|████████▋ | 8714/10000 [13:43:20<1:58:00, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00013631716137751937, 'learning_rate': 1.7098216986123572e-06, 'epoch': 8.71} 87%|████████▋ | 8714/10000 [13:43:20<1:58:00, 
5.51s/it][2025-06-20 03:13:05,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:13:05,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.23 | bwd_microstep: 3330.14 | bwd_inner_microstep: 3329.31 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.72 [2025-06-20 03:13:05,132] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.23 | bwd: 3330.16 | bwd_inner: 3329.31 | bwd_allreduce: 0.80 | step: 6.72 87%|████████▋ | 8715/10000 [13:43:25<1:57:44, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00030549755319952965, 'learning_rate': 1.707202079005581e-06, 'epoch': 8.71} 87%|████████▋ | 8715/10000 [13:43:25<1:57:44, 5.50s/it][2025-06-20 03:13:10,590] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:13:10,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.31 | bwd_microstep: 3318.94 | bwd_inner_microstep: 3318.02 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.07 [2025-06-20 03:13:10,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.32 | bwd: 3318.95 | bwd_inner: 3318.02 | bwd_allreduce: 0.89 | step: 7.07 87%|████████▋ | 8716/10000 [13:43:31<1:57:24, 5.49s/it] {'loss': 0.0, 'grad_norm': 8.285984222311527e-05, 'learning_rate': 1.7045843782284067e-06, 'epoch': 8.72} 87%|████████▋ | 8716/10000 [13:43:31<1:57:24, 5.49s/it][2025-06-20 03:13:16,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 03:13:16,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.36 | bwd_microstep: 3357.95 | bwd_inner_microstep: 3357.16 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.55 [2025-06-20 03:13:16,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.36 | bwd: 3357.96 | 
bwd_inner: 3357.16 | bwd_allreduce: 0.76 | step: 6.55 87%|████████▋ | 8717/10000 [13:43:36<1:57:35, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.006443412508815527, 'learning_rate': 1.701968596555421e-06, 'epoch': 8.72} 87%|████████▋ | 8717/10000 [13:43:36<1:57:35, 5.50s/it][2025-06-20 03:13:21,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:13:21,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2120.56 | bwd_microstep: 3367.31 | bwd_inner_microstep: 3366.53 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 03:13:21,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2120.56 | bwd: 3367.32 | bwd_inner: 3366.53 | bwd_allreduce: 0.76 | step: 6.60 87%|████████▋ | 8718/10000 [13:43:42<1:57:39, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.001629252452403307, 'learning_rate': 1.699354734261005e-06, 'epoch': 8.72} 87%|████████▋ | 8718/10000 [13:43:42<1:57:39, 5.51s/it][2025-06-20 03:13:27,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:13:27,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.60 | bwd_microstep: 3371.19 | bwd_inner_microstep: 3370.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 03:13:27,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.60 | bwd: 3371.21 | bwd_inner: 3370.40 | bwd_allreduce: 0.76 | step: 6.71 87%|████████▋ | 8719/10000 [13:43:47<1:57:45, 5.52s/it] {'loss': 0.0, 'grad_norm': 2.6100187824340537e-05, 'learning_rate': 1.6967427916193413e-06, 'epoch': 8.72} 87%|████████▋ | 8719/10000 [13:43:47<1:57:45, 5.52s/it][2025-06-20 03:13:32,650] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:13:32,651] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2112.94 | bwd_microstep: 3314.78 | bwd_inner_microstep: 3313.97 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.88 [2025-06-20 03:13:32,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.94 | bwd: 3314.80 | bwd_inner: 3313.97 | bwd_allreduce: 0.78 | step: 6.89 87%|████████▋ | 8720/10000 [13:43:53<1:57:22, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00022315877140499651, 'learning_rate': 1.6941327689044106e-06, 'epoch': 8.72} 87%|████████▋ | 8720/10000 [13:43:53<1:57:22, 5.50s/it][2025-06-20 03:13:38,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:13:38,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.41 | bwd_microstep: 3307.19 | bwd_inner_microstep: 3306.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 03:13:38,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.41 | bwd: 3307.20 | bwd_inner: 3306.40 | bwd_allreduce: 0.76 | step: 6.65 87%|████████▋ | 8721/10000 [13:43:58<1:56:57, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.007897678762674332, 'learning_rate': 1.6915246663899921e-06, 'epoch': 8.72} 87%|████████▋ | 8721/10000 [13:43:58<1:56:57, 5.49s/it][2025-06-20 03:13:43,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:13:43,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.09 | bwd_microstep: 3310.97 | bwd_inner_microstep: 3310.18 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 03:13:43,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.09 | bwd: 3310.98 | bwd_inner: 3310.18 | bwd_allreduce: 0.76 | step: 6.63 87%|████████▋ | 8722/10000 [13:44:04<1:56:39, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0001882334181573242, 'learning_rate': 1.6889184843496664e-06, 'epoch': 8.72} 87%|████████▋ | 
8722/10000 [13:44:04<1:56:39, 5.48s/it][2025-06-20 03:13:49,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:13:49,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.31 | bwd_microstep: 3323.22 | bwd_inner_microstep: 3322.44 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 03:13:49,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.31 | bwd: 3323.24 | bwd_inner: 3322.44 | bwd_allreduce: 0.75 | step: 6.70 87%|████████▋ | 8723/10000 [13:44:09<1:56:30, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0010710512287914753, 'learning_rate': 1.6863142230568064e-06, 'epoch': 8.72} 87%|████████▋ | 8723/10000 [13:44:09<1:56:30, 5.47s/it][2025-06-20 03:13:54,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:13:54,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.74 | bwd_microstep: 3367.59 | bwd_inner_microstep: 3366.79 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.66 [2025-06-20 03:13:54,553] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.74 | bwd: 3367.60 | bwd_inner: 3366.79 | bwd_allreduce: 0.77 | step: 6.67 87%|████████▋ | 8724/10000 [13:44:15<1:56:45, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.020474297925829887, 'learning_rate': 1.6837118827845867e-06, 'epoch': 8.72} 87%|████████▋ | 8724/10000 [13:44:15<1:56:45, 5.49s/it][2025-06-20 03:14:00,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:14:00,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.98 | bwd_microstep: 3374.34 | bwd_inner_microstep: 3373.56 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.53 [2025-06-20 03:14:00,098] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd: 2136.98 | bwd: 3374.35 | bwd_inner: 3373.56 | bwd_allreduce: 0.75 | step: 6.53 87%|████████▋ | 8725/10000 [13:44:20<1:57:01, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0011459270026534796, 'learning_rate': 1.6811114638059822e-06, 'epoch': 8.72} 87%|████████▋ | 8725/10000 [13:44:20<1:57:01, 5.51s/it][2025-06-20 03:14:05,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:14:05,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.93 | bwd_microstep: 3373.74 | bwd_inner_microstep: 3372.96 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-20 03:14:05,644] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.93 | bwd: 3373.76 | bwd_inner: 3372.96 | bwd_allreduce: 0.75 | step: 6.56 87%|████████▋ | 8726/10000 [13:44:26<1:57:10, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0003411724464967847, 'learning_rate': 1.6785129663937637e-06, 'epoch': 8.73} 87%|████████▋ | 8726/10000 [13:44:26<1:57:10, 5.52s/it][2025-06-20 03:14:11,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 03:14:11,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.16 | bwd_microstep: 3372.55 | bwd_inner_microstep: 3371.57 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.18 [2025-06-20 03:14:11,179] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.16 | bwd: 3372.57 | bwd_inner: 3371.57 | bwd_allreduce: 0.96 | step: 7.19 87%|████████▋ | 8727/10000 [13:44:31<1:57:11, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.015588460490107536, 'learning_rate': 1.6759163908205067e-06, 'epoch': 8.73} 87%|████████▋ | 8727/10000 [13:44:31<1:57:11, 5.52s/it][2025-06-20 03:14:16,635] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:14:16,636] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.35 | bwd_microstep: 3308.97 | bwd_inner_microstep: 3308.00 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.26 [2025-06-20 03:14:16,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.35 | bwd: 3308.99 | bwd_inner: 3308.00 | bwd_allreduce: 0.94 | step: 7.27 87%|████████▋ | 8728/10000 [13:44:37<1:56:41, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.004484727047383785, 'learning_rate': 1.6733217373585708e-06, 'epoch': 8.73} 87%|████████▋ | 8728/10000 [13:44:37<1:56:41, 5.50s/it][2025-06-20 03:14:22,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:14:22,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.56 | bwd_microstep: 3304.61 | bwd_inner_microstep: 3303.72 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.86 [2025-06-20 03:14:22,087] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.56 | bwd: 3304.63 | bwd_inner: 3303.72 | bwd_allreduce: 0.87 | step: 6.87 87%|████████▋ | 8729/10000 [13:44:42<1:56:14, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0005036498187109828, 'learning_rate': 1.6707290062801296e-06, 'epoch': 8.73} 87%|████████▋ | 8729/10000 [13:44:42<1:56:14, 5.49s/it][2025-06-20 03:14:27,617] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:14:27,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.93 | bwd_microstep: 3365.62 | bwd_inner_microstep: 3364.78 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.72 [2025-06-20 03:14:27,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.93 | bwd: 3365.64 | bwd_inner: 3364.78 | bwd_allreduce: 0.81 | step: 6.72 87%|████████▋ | 8730/10000 [13:44:48<1:56:25, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00011163416638737544, 'learning_rate': 1.6681381978571454e-06, 
'epoch': 8.73} 87%|████████▋ | 8730/10000 [13:44:48<1:56:25, 5.50s/it][2025-06-20 03:14:33,070] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:14:33,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.41 | bwd_microstep: 3314.94 | bwd_inner_microstep: 3313.89 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.39 [2025-06-20 03:14:33,071] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.41 | bwd: 3314.96 | bwd_inner: 3313.89 | bwd_allreduce: 1.01 | step: 7.40 87%|████████▋ | 8731/10000 [13:44:53<1:56:03, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.000314925069687888, 'learning_rate': 1.6655493123613853e-06, 'epoch': 8.73} 87%|████████▋ | 8731/10000 [13:44:53<1:56:03, 5.49s/it][2025-06-20 03:14:38,530] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:14:38,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.95 | bwd_microstep: 3307.40 | bwd_inner_microstep: 3306.59 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.95 [2025-06-20 03:14:38,531] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.95 | bwd: 3307.41 | bwd_inner: 3306.59 | bwd_allreduce: 0.78 | step: 6.95 87%|████████▋ | 8732/10000 [13:44:59<1:55:46, 5.48s/it] {'loss': 0.0, 'grad_norm': 3.6147801438346505e-05, 'learning_rate': 1.662962350064412e-06, 'epoch': 8.73} 87%|████████▋ | 8732/10000 [13:44:59<1:55:46, 5.48s/it][2025-06-20 03:14:44,056] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:14:44,057] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.61 | bwd_microstep: 3364.88 | bwd_inner_microstep: 3363.92 | bwd_allreduce_microstep: 0.91 | step_microstep: 6.95 [2025-06-20 03:14:44,057] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.61 | bwd: 3364.89 | bwd_inner: 3363.92 | bwd_allreduce: 0.93 | step: 6.95 87%|████████▋ | 8733/10000 [13:45:04<1:55:58, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0002660225727595389, 'learning_rate': 1.6603773112375798e-06, 'epoch': 8.73} 87%|████████▋ | 8733/10000 [13:45:04<1:55:58, 5.49s/it][2025-06-20 03:14:49,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:14:49,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.18 | bwd_microstep: 3365.87 | bwd_inner_microstep: 3365.07 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.90 [2025-06-20 03:14:49,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.18 | bwd: 3365.88 | bwd_inner: 3365.07 | bwd_allreduce: 0.77 | step: 6.91 87%|████████▋ | 8734/10000 [13:45:10<1:56:12, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.02373918704688549, 'learning_rate': 1.6577941961520517e-06, 'epoch': 8.73} 87%|████████▋ | 8734/10000 [13:45:10<1:56:12, 5.51s/it][2025-06-20 03:14:55,046] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:14:55,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.78 | bwd_microstep: 3310.33 | bwd_inner_microstep: 3309.55 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.70 [2025-06-20 03:14:55,047] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.78 | bwd: 3310.35 | bwd_inner: 3309.55 | bwd_allreduce: 0.75 | step: 6.70 87%|████████▋ | 8735/10000 [13:45:15<1:55:44, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.01317755226045847, 'learning_rate': 1.6552130050787818e-06, 'epoch': 8.73} 87%|████████▋ | 8735/10000 [13:45:15<1:55:44, 5.49s/it][2025-06-20 03:15:00,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.72 [2025-06-20 03:15:00,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.97 | bwd_microstep: 3367.08 | bwd_inner_microstep: 3366.29 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 03:15:00,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.97 | bwd: 3367.11 | bwd_inner: 3366.29 | bwd_allreduce: 0.76 | step: 6.72 87%|████████▋ | 8736/10000 [13:45:21<1:55:56, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0013137311907485127, 'learning_rate': 1.6526337382885249e-06, 'epoch': 8.74} 87%|████████▋ | 8736/10000 [13:45:21<1:55:56, 5.50s/it][2025-06-20 03:15:06,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:15:06,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.30 | bwd_microstep: 3312.57 | bwd_inner_microstep: 3311.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.90 [2025-06-20 03:15:06,039] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.30 | bwd: 3312.58 | bwd_inner: 3311.78 | bwd_allreduce: 0.76 | step: 6.90 87%|████████▋ | 8737/10000 [13:45:26<1:55:33, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0012451661750674248, 'learning_rate': 1.6500563960518423e-06, 'epoch': 8.74} 87%|████████▋ | 8737/10000 [13:45:26<1:55:33, 5.49s/it][2025-06-20 03:15:11,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:15:11,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.99 | bwd_microstep: 3309.21 | bwd_inner_microstep: 3308.38 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.04 [2025-06-20 03:15:11,498] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.99 | bwd: 3309.23 | bwd_inner: 3308.38 | bwd_allreduce: 0.80 | step: 7.04 87%|████████▋ | 8738/10000 [13:45:32<1:55:16, 5.48s/it] {'loss': 0.0, 'grad_norm': 
0.0028751043137162924, 'learning_rate': 1.6474809786390755e-06, 'epoch': 8.74} 87%|████████▋ | 8738/10000 [13:45:32<1:55:16, 5.48s/it][2025-06-20 03:15:17,025] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:15:17,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.24 | bwd_microstep: 3354.24 | bwd_inner_microstep: 3353.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.92 [2025-06-20 03:15:17,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.24 | bwd: 3354.25 | bwd_inner: 3353.44 | bwd_allreduce: 0.77 | step: 6.92 87%|████████▋ | 8739/10000 [13:45:37<1:55:28, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0006734977359883487, 'learning_rate': 1.6449074863203773e-06, 'epoch': 8.74} 87%|████████▋ | 8739/10000 [13:45:37<1:55:28, 5.49s/it][2025-06-20 03:15:22,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:15:22,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.33 | bwd_microstep: 3322.43 | bwd_inner_microstep: 3321.63 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-20 03:15:22,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.33 | bwd: 3322.44 | bwd_inner: 3321.63 | bwd_allreduce: 0.77 | step: 6.76 87%|████████▋ | 8740/10000 [13:45:43<1:55:14, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.06229640915989876, 'learning_rate': 1.6423359193656962e-06, 'epoch': 8.74} 87%|████████▋ | 8740/10000 [13:45:43<1:55:14, 5.49s/it][2025-06-20 03:15:27,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:15:27,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.96 | bwd_microstep: 3317.72 | bwd_inner_microstep: 3316.94 | bwd_allreduce_microstep: 0.74 | 
step_microstep: 6.59 [2025-06-20 03:15:27,965] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.96 | bwd: 3317.74 | bwd_inner: 3316.94 | bwd_allreduce: 0.76 | step: 6.59 87%|████████▋ | 8741/10000 [13:45:48<1:55:01, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0015136656584218144, 'learning_rate': 1.6397662780447742e-06, 'epoch': 8.74} 87%|████████▋ | 8741/10000 [13:45:48<1:55:01, 5.48s/it][2025-06-20 03:15:33,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:15:33,495] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.33 | bwd_microstep: 3364.73 | bwd_inner_microstep: 3363.94 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-20 03:15:33,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.33 | bwd: 3364.75 | bwd_inner: 3363.94 | bwd_allreduce: 0.76 | step: 6.79 87%|████████▋ | 8742/10000 [13:45:54<1:55:14, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0013677064562216401, 'learning_rate': 1.6371985626271625e-06, 'epoch': 8.74} 87%|████████▋ | 8742/10000 [13:45:54<1:55:14, 5.50s/it][2025-06-20 03:15:38,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:15:38,967] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.58 | bwd_microstep: 3323.16 | bwd_inner_microstep: 3322.31 | bwd_allreduce_microstep: 0.81 | step_microstep: 7.26 [2025-06-20 03:15:38,968] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.58 | bwd: 3323.17 | bwd_inner: 3322.31 | bwd_allreduce: 0.83 | step: 7.27 87%|████████▋ | 8743/10000 [13:45:59<1:55:00, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.024189695715904236, 'learning_rate': 1.6346327733821944e-06, 'epoch': 8.74} 87%|████████▋ | 8743/10000 [13:45:59<1:55:00, 5.49s/it][2025-06-20 03:15:44,428] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:15:44,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.88 | bwd_microstep: 3315.15 | bwd_inner_microstep: 3314.33 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.31 [2025-06-20 03:15:44,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.88 | bwd: 3315.16 | bwd_inner: 3314.33 | bwd_allreduce: 0.78 | step: 7.31 87%|████████▋ | 8744/10000 [13:46:05<1:54:43, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0017297903541475534, 'learning_rate': 1.6320689105790121e-06, 'epoch': 8.74} 87%|████████▋ | 8744/10000 [13:46:05<1:54:43, 5.48s/it][2025-06-20 03:15:49,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:15:49,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.29 | bwd_microstep: 3318.35 | bwd_inner_microstep: 3317.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 03:15:49,898] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.29 | bwd: 3318.36 | bwd_inner: 3317.57 | bwd_allreduce: 0.75 | step: 6.58 87%|████████▋ | 8745/10000 [13:46:10<1:54:33, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.03935931250452995, 'learning_rate': 1.6295069744865522e-06, 'epoch': 8.74} 87%|████████▋ | 8745/10000 [13:46:10<1:54:33, 5.48s/it][2025-06-20 03:15:55,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.74 [2025-06-20 03:15:55,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.40 | bwd_microstep: 3318.71 | bwd_inner_microstep: 3317.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-20 03:15:55,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.40 | bwd: 3318.72 | bwd_inner: 3317.93 | bwd_allreduce: 0.76 | step: 6.53 87%|████████▋ | 8746/10000 
[13:46:16<1:54:22, 5.47s/it] {'loss': 0.0001, 'grad_norm': 0.011515801772475243, 'learning_rate': 1.6269469653735503e-06, 'epoch': 8.75} 87%|████████▋ | 8746/10000 [13:46:16<1:54:22, 5.47s/it][2025-06-20 03:16:00,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:16:00,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.98 | bwd_microstep: 3326.24 | bwd_inner_microstep: 3325.45 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.63 [2025-06-20 03:16:00,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.98 | bwd: 3326.25 | bwd_inner: 3325.45 | bwd_allreduce: 0.76 | step: 6.63 87%|████████▋ | 8747/10000 [13:46:21<1:54:16, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00017414501053281128, 'learning_rate': 1.624388883508543e-06, 'epoch': 8.75} 87%|████████▋ | 8747/10000 [13:46:21<1:54:16, 5.47s/it][2025-06-20 03:16:06,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:16:06,370] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.78 | bwd_microstep: 3365.90 | bwd_inner_microstep: 3365.11 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.87 [2025-06-20 03:16:06,371] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.78 | bwd: 3365.92 | bwd_inner: 3365.11 | bwd_allreduce: 0.77 | step: 6.88 87%|████████▋ | 8748/10000 [13:46:27<1:54:36, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0011657618451863527, 'learning_rate': 1.6218327291598534e-06, 'epoch': 8.75} 87%|████████▋ | 8748/10000 [13:46:27<1:54:36, 5.49s/it][2025-06-20 03:16:11,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:16:11,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.16 | bwd_microstep: 3309.34 | 
bwd_inner_microstep: 3308.56 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 03:16:11,820] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.16 | bwd: 3309.35 | bwd_inner: 3308.56 | bwd_allreduce: 0.75 | step: 6.67 87%|████████▋ | 8749/10000 [13:46:32<1:54:14, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.005025967955589294, 'learning_rate': 1.6192785025956182e-06, 'epoch': 8.75} 87%|████████▋ | 8749/10000 [13:46:32<1:54:14, 5.48s/it][2025-06-20 03:16:17,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.82 [2025-06-20 03:16:17,281] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.68 | bwd_microstep: 3319.72 | bwd_inner_microstep: 3318.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 03:16:17,282] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.68 | bwd: 3319.74 | bwd_inner: 3318.93 | bwd_allreduce: 0.76 | step: 6.81 88%|████████▊ | 8750/10000 [13:46:38<1:54:02, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0017633151728659868, 'learning_rate': 1.6167262040837583e-06, 'epoch': 8.75} 88%|████████▊ | 8750/10000 [13:46:38<1:54:02, 5.47s/it][2025-06-20 03:16:22,823] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:16:22,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.41 | bwd_microstep: 3371.24 | bwd_inner_microstep: 3370.40 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.89 [2025-06-20 03:16:22,824] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.41 | bwd: 3371.26 | bwd_inner: 3370.40 | bwd_allreduce: 0.80 | step: 6.89 88%|████████▊ | 8751/10000 [13:46:43<1:54:22, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0016970053547993302, 'learning_rate': 1.6141758338919999e-06, 'epoch': 8.75} 88%|████████▊ | 8751/10000 [13:46:43<1:54:22, 5.49s/it][2025-06-20 03:16:28,360] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:16:28,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.04 | bwd_microstep: 3365.94 | bwd_inner_microstep: 3365.17 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.53 [2025-06-20 03:16:28,360] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.04 | bwd: 3365.96 | bwd_inner: 3365.17 | bwd_allreduce: 0.75 | step: 6.54 88%|████████▊ | 8752/10000 [13:46:49<1:54:32, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0001789247471606359, 'learning_rate': 1.6116273922878667e-06, 'epoch': 8.75} 88%|████████▊ | 8752/10000 [13:46:49<1:54:32, 5.51s/it][2025-06-20 03:16:33,811] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:16:33,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.75 | bwd_microstep: 3314.24 | bwd_inner_microstep: 3313.47 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-20 03:16:33,812] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.75 | bwd: 3314.26 | bwd_inner: 3313.47 | bwd_allreduce: 0.75 | step: 6.76 88%|████████▊ | 8753/10000 [13:46:54<1:54:06, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0040083820931613445, 'learning_rate': 1.6090808795386759e-06, 'epoch': 8.75} 88%|████████▊ | 8753/10000 [13:46:54<1:54:06, 5.49s/it][2025-06-20 03:16:39,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:16:39,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.79 | bwd_microstep: 3368.04 | bwd_inner_microstep: 3367.26 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 03:16:39,354] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.79 | bwd: 3368.06 | bwd_inner: 3367.26 | bwd_allreduce: 0.75 
| step: 6.57 88%|████████▊ | 8754/10000 [13:47:00<1:54:20, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.003091654274612665, 'learning_rate': 1.606536295911545e-06, 'epoch': 8.75} 88%|████████▊ | 8754/10000 [13:47:00<1:54:20, 5.51s/it][2025-06-20 03:16:44,832] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:16:44,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.71 | bwd_microstep: 3330.13 | bwd_inner_microstep: 3329.32 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.83 [2025-06-20 03:16:44,833] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.71 | bwd: 3330.15 | bwd_inner: 3329.32 | bwd_allreduce: 0.79 | step: 6.84 88%|████████▊ | 8755/10000 [13:47:05<1:54:04, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005349086131900549, 'learning_rate': 1.6039936416733915e-06, 'epoch': 8.76} 88%|████████▊ | 8755/10000 [13:47:05<1:54:04, 5.50s/it][2025-06-20 03:16:50,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:16:50,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.30 | bwd_microstep: 3316.20 | bwd_inner_microstep: 3315.42 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 03:16:50,293] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.30 | bwd: 3316.21 | bwd_inner: 3315.42 | bwd_allreduce: 0.76 | step: 6.62 88%|████████▊ | 8756/10000 [13:47:11<1:53:45, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.009562090039253235, 'learning_rate': 1.6014529170909265e-06, 'epoch': 8.76} 88%|████████▊ | 8756/10000 [13:47:11<1:53:45, 5.49s/it][2025-06-20 03:16:55,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:16:55,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.76 | 
bwd_microstep: 3319.76 | bwd_inner_microstep: 3318.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 03:16:55,760] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.77 | bwd: 3319.78 | bwd_inner: 3318.97 | bwd_allreduce: 0.77 | step: 6.73 88%|████████▊ | 8757/10000 [13:47:16<1:53:32, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00243873312138021, 'learning_rate': 1.5989141224306638e-06, 'epoch': 8.76} 88%|████████▊ | 8757/10000 [13:47:16<1:53:32, 5.48s/it][2025-06-20 03:17:01,234] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.72 [2025-06-20 03:17:01,235] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.76 | bwd_microstep: 3321.01 | bwd_inner_microstep: 3319.97 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.81 [2025-06-20 03:17:01,236] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.76 | bwd: 3321.02 | bwd_inner: 3319.97 | bwd_allreduce: 1.00 | step: 7.81 88%|████████▊ | 8758/10000 [13:47:22<1:53:25, 5.48s/it] {'loss': 0.0004, 'grad_norm': 0.062317606061697006, 'learning_rate': 1.5963772579589032e-06, 'epoch': 8.76} 88%|████████▊ | 8758/10000 [13:47:22<1:53:25, 5.48s/it][2025-06-20 03:17:06,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 03:17:06,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.59 | bwd_microstep: 3319.27 | bwd_inner_microstep: 3318.32 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.28 [2025-06-20 03:17:06,712] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.59 | bwd: 3319.28 | bwd_inner: 3318.32 | bwd_allreduce: 0.91 | step: 7.29 88%|████████▊ | 8759/10000 [13:47:27<1:53:18, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.018640389665961266, 'learning_rate': 1.5938423239417545e-06, 'epoch': 8.76} 88%|████████▊ | 8759/10000 [13:47:27<1:53:18, 
5.48s/it][2025-06-20 03:17:12,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:17:12,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.35 | bwd_microstep: 3374.19 | bwd_inner_microstep: 3373.31 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.08 [2025-06-20 03:17:12,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.34 | bwd: 3374.20 | bwd_inner: 3373.31 | bwd_allreduce: 0.84 | step: 7.08 88%|████████▊ | 8760/10000 [13:47:33<1:53:42, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.05643825978040695, 'learning_rate': 1.5913093206451202e-06, 'epoch': 8.76} 88%|████████▊ | 8760/10000 [13:47:33<1:53:42, 5.50s/it][2025-06-20 03:17:17,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:17:17,861] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.20 | bwd_microstep: 3401.09 | bwd_inner_microstep: 3400.23 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.04 [2025-06-20 03:17:17,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.20 | bwd: 3401.11 | bwd_inner: 3400.23 | bwd_allreduce: 0.83 | step: 7.04 88%|████████▊ | 8761/10000 [13:47:38<1:54:11, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0012408780166879296, 'learning_rate': 1.5887782483347015e-06, 'epoch': 8.76} 88%|████████▊ | 8761/10000 [13:47:38<1:54:11, 5.53s/it][2025-06-20 03:17:23,329] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:17:23,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.25 | bwd_microstep: 3317.71 | bwd_inner_microstep: 3316.81 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.07 [2025-06-20 03:17:23,330] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.25 | bwd: 3317.73 | 
bwd_inner: 3316.81 | bwd_allreduce: 0.86 | step: 7.07 88%|████████▊ | 8762/10000 [13:47:44<1:53:43, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0003738036029972136, 'learning_rate': 1.5862491072759946e-06, 'epoch': 8.76} 88%|████████▊ | 8762/10000 [13:47:44<1:53:43, 5.51s/it][2025-06-20 03:17:28,792] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:17:28,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.17 | bwd_microstep: 3317.24 | bwd_inner_microstep: 3316.43 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-20 03:17:28,793] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.18 | bwd: 3317.26 | bwd_inner: 3316.43 | bwd_allreduce: 0.78 | step: 6.96 88%|████████▊ | 8763/10000 [13:47:49<1:53:18, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.043866194784641266, 'learning_rate': 1.5837218977343006e-06, 'epoch': 8.76} 88%|████████▊ | 8763/10000 [13:47:49<1:53:18, 5.50s/it][2025-06-20 03:17:34,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:17:34,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.67 | bwd_microstep: 3322.98 | bwd_inner_microstep: 3321.96 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.41 [2025-06-20 03:17:34,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.67 | bwd: 3323.00 | bwd_inner: 3321.96 | bwd_allreduce: 0.99 | step: 7.41 88%|████████▊ | 8764/10000 [13:47:55<1:53:06, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002781693823635578, 'learning_rate': 1.5811966199747008e-06, 'epoch': 8.76} 88%|████████▊ | 8764/10000 [13:47:55<1:53:06, 5.49s/it][2025-06-20 03:17:39,813] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:17:39,813] [INFO] [logging.py:128:log_dist] [Rank 
0] time (ms) | fwd_microstep: 2134.38 | bwd_microstep: 3368.95 | bwd_inner_microstep: 3367.96 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.03 [2025-06-20 03:17:39,814] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.38 | bwd: 3368.97 | bwd_inner: 3367.96 | bwd_allreduce: 0.95 | step: 7.03 88%|████████▊ | 8765/10000 [13:48:00<1:53:20, 5.51s/it] {'loss': 0.0, 'grad_norm': 8.073366188909858e-05, 'learning_rate': 1.5786732742620926e-06, 'epoch': 8.77} 88%|████████▊ | 8765/10000 [13:48:00<1:53:20, 5.51s/it][2025-06-20 03:17:45,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:17:45,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.91 | bwd_microstep: 3378.89 | bwd_inner_microstep: 3378.11 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 03:17:45,368] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.91 | bwd: 3378.91 | bwd_inner: 3378.11 | bwd_allreduce: 0.75 | step: 6.59 88%|████████▊ | 8766/10000 [13:48:06<1:53:32, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.009096305817365646, 'learning_rate': 1.5761518608611637e-06, 'epoch': 8.77} 88%|████████▊ | 8766/10000 [13:48:06<1:53:32, 5.52s/it][2025-06-20 03:17:50,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:17:50,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.33 | bwd_microstep: 3319.15 | bwd_inner_microstep: 3318.26 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.93 [2025-06-20 03:17:50,829] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.33 | bwd: 3319.17 | bwd_inner: 3318.26 | bwd_allreduce: 0.87 | step: 6.93 88%|████████▊ | 8767/10000 [13:48:11<1:53:04, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.023406323045492172, 'learning_rate': 1.573632380036394e-06, 'epoch': 8.77} 88%|████████▊ | 
8767/10000 [13:48:11<1:53:04, 5.50s/it][2025-06-20 03:17:56,379] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:17:56,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.36 | bwd_microstep: 3381.15 | bwd_inner_microstep: 3380.35 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.70 [2025-06-20 03:17:56,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.36 | bwd: 3381.17 | bwd_inner: 3380.35 | bwd_allreduce: 0.77 | step: 6.71 88%|████████▊ | 8768/10000 [13:48:17<1:53:16, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.001020624884404242, 'learning_rate': 1.5711148320520742e-06, 'epoch': 8.77} 88%|████████▊ | 8768/10000 [13:48:17<1:53:16, 5.52s/it][2025-06-20 03:18:01,852] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.72 [2025-06-20 03:18:01,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.36 | bwd_microstep: 3327.09 | bwd_inner_microstep: 3326.16 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.28 [2025-06-20 03:18:01,853] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.36 | bwd: 3327.11 | bwd_inner: 3326.16 | bwd_allreduce: 0.90 | step: 7.29 88%|████████▊ | 8769/10000 [13:48:22<1:52:55, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001416131155565381, 'learning_rate': 1.568599217172273e-06, 'epoch': 8.77} 88%|████████▊ | 8769/10000 [13:48:22<1:52:55, 5.50s/it][2025-06-20 03:18:07,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 03:18:07,332] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.39 | bwd_microstep: 3323.43 | bwd_inner_microstep: 3322.65 | bwd_allreduce_microstep: 0.73 | step_microstep: 6.53 [2025-06-20 03:18:07,333] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2109.39 | bwd: 3323.44 | bwd_inner: 3322.65 | bwd_allreduce: 0.75 | step: 6.53 88%|████████▊ | 8770/10000 [13:48:28<1:52:40, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.05808648094534874, 'learning_rate': 1.5660855356608706e-06, 'epoch': 8.77} 88%|████████▊ | 8770/10000 [13:48:28<1:52:40, 5.50s/it][2025-06-20 03:18:12,884] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:18:12,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.54 | bwd_microstep: 3381.14 | bwd_inner_microstep: 3380.20 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.46 [2025-06-20 03:18:12,885] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.54 | bwd: 3381.16 | bwd_inner: 3380.20 | bwd_allreduce: 0.91 | step: 7.46 88%|████████▊ | 8771/10000 [13:48:33<1:52:56, 5.51s/it] {'loss': 0.0, 'grad_norm': 2.199242589995265e-05, 'learning_rate': 1.5635737877815383e-06, 'epoch': 8.77} 88%|████████▊ | 8771/10000 [13:48:33<1:52:56, 5.51s/it][2025-06-20 03:18:18,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:18:18,363] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.34 | bwd_microstep: 3322.26 | bwd_inner_microstep: 3321.33 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.07 [2025-06-20 03:18:18,364] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.34 | bwd: 3322.28 | bwd_inner: 3321.33 | bwd_allreduce: 0.89 | step: 7.08 88%|████████▊ | 8772/10000 [13:48:39<1:52:38, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0022367280907928944, 'learning_rate': 1.5610639737977518e-06, 'epoch': 8.77} 88%|████████▊ | 8772/10000 [13:48:39<1:52:38, 5.50s/it][2025-06-20 03:18:23,864] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:18:23,865] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.64 | bwd_microstep: 3329.38 | bwd_inner_microstep: 3328.60 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.77 [2025-06-20 03:18:23,865] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.64 | bwd: 3329.40 | bwd_inner: 3328.60 | bwd_allreduce: 0.76 | step: 6.78 88%|████████▊ | 8773/10000 [13:48:44<1:52:31, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0014392156153917313, 'learning_rate': 1.5585560939727806e-06, 'epoch': 8.77} 88%|████████▊ | 8773/10000 [13:48:44<1:52:31, 5.50s/it][2025-06-20 03:18:29,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:18:29,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.32 | bwd_microstep: 3324.29 | bwd_inner_microstep: 3323.47 | bwd_allreduce_microstep: 0.78 | step_microstep: 7.15 [2025-06-20 03:18:29,337] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.33 | bwd: 3324.31 | bwd_inner: 3323.47 | bwd_allreduce: 0.80 | step: 7.15 88%|████████▊ | 8774/10000 [13:48:50<1:52:14, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0007669389597140253, 'learning_rate': 1.5560501485696833e-06, 'epoch': 8.77} 88%|████████▊ | 8774/10000 [13:48:50<1:52:14, 5.49s/it][2025-06-20 03:18:34,805] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:18:34,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.56 | bwd_microstep: 3321.12 | bwd_inner_microstep: 3320.32 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 03:18:34,806] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.56 | bwd: 3321.13 | bwd_inner: 3320.32 | bwd_allreduce: 0.76 | step: 6.79 88%|████████▊ | 8775/10000 [13:48:55<1:51:59, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005210298113524914, 'learning_rate': 1.5535461378513227e-06, 
'epoch': 8.78} 88%|████████▊ | 8775/10000 [13:48:55<1:51:59, 5.49s/it][2025-06-20 03:18:40,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:18:40,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.11 | bwd_microstep: 3375.61 | bwd_inner_microstep: 3374.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.69 [2025-06-20 03:18:40,347] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.11 | bwd: 3375.62 | bwd_inner: 3374.81 | bwd_allreduce: 0.77 | step: 6.69 88%|████████▊ | 8776/10000 [13:49:01<1:52:14, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.010260039009153843, 'learning_rate': 1.5510440620803625e-06, 'epoch': 8.78} 88%|████████▊ | 8776/10000 [13:49:01<1:52:14, 5.50s/it][2025-06-20 03:18:45,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:18:45,815] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.51 | bwd_microstep: 3326.57 | bwd_inner_microstep: 3325.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-20 03:18:45,816] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.51 | bwd: 3326.58 | bwd_inner: 3325.78 | bwd_allreduce: 0.76 | step: 6.75 88%|████████▊ | 8777/10000 [13:49:06<1:51:56, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0002244712522951886, 'learning_rate': 1.5485439215192522e-06, 'epoch': 8.78} 88%|████████▊ | 8777/10000 [13:49:06<1:51:56, 5.49s/it][2025-06-20 03:18:51,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:18:51,289] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.56 | bwd_microstep: 3325.26 | bwd_inner_microstep: 3324.46 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 03:18:51,289] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.56 | bwd: 3325.27 | bwd_inner: 3324.46 | bwd_allreduce: 0.77 | step: 6.72 88%|████████▊ | 8778/10000 [13:49:12<1:51:44, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00019443796190898865, 'learning_rate': 1.5460457164302556e-06, 'epoch': 8.78} 88%|████████▊ | 8778/10000 [13:49:12<1:51:44, 5.49s/it][2025-06-20 03:18:56,755] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:18:56,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.16 | bwd_microstep: 3323.03 | bwd_inner_microstep: 3320.69 | bwd_allreduce_microstep: 2.29 | step_microstep: 7.35 [2025-06-20 03:18:56,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.16 | bwd: 3323.05 | bwd_inner: 3320.69 | bwd_allreduce: 2.31 | step: 7.35 88%|████████▊ | 8779/10000 [13:49:17<1:51:32, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002595601137727499, 'learning_rate': 1.5435494470754097e-06, 'epoch': 8.78} 88%|████████▊ | 8779/10000 [13:49:17<1:51:32, 5.48s/it][2025-06-20 03:19:02,238] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:19:02,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.40 | bwd_microstep: 3333.86 | bwd_inner_microstep: 3333.02 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.19 [2025-06-20 03:19:02,239] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.40 | bwd: 3333.88 | bwd_inner: 3333.02 | bwd_allreduce: 0.81 | step: 7.19 88%|████████▊ | 8780/10000 [13:49:23<1:51:27, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0014881533570587635, 'learning_rate': 1.5410551137165696e-06, 'epoch': 8.78} 88%|████████▊ | 8780/10000 [13:49:23<1:51:27, 5.48s/it][2025-06-20 03:19:07,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | 
optimizer_step: 2.73 [2025-06-20 03:19:07,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.80 | bwd_microstep: 3322.16 | bwd_inner_microstep: 3321.37 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.76 [2025-06-20 03:19:07,711] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.80 | bwd: 3322.17 | bwd_inner: 3321.37 | bwd_allreduce: 0.76 | step: 6.76 88%|████████▊ | 8781/10000 [13:49:28<1:51:18, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002233039354905486, 'learning_rate': 1.538562716615377e-06, 'epoch': 8.78} 88%|████████▊ | 8781/10000 [13:49:28<1:51:18, 5.48s/it][2025-06-20 03:19:13,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:19:13,177] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.98 | bwd_microstep: 3323.65 | bwd_inner_microstep: 3322.84 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-20 03:19:13,178] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.98 | bwd: 3323.66 | bwd_inner: 3322.84 | bwd_allreduce: 0.77 | step: 6.76 88%|████████▊ | 8782/10000 [13:49:33<1:51:08, 5.47s/it] {'loss': 0.0004, 'grad_norm': 0.08789212256669998, 'learning_rate': 1.5360722560332718e-06, 'epoch': 8.78} 88%|████████▊ | 8782/10000 [13:49:33<1:51:08, 5.47s/it][2025-06-20 03:19:18,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:19:18,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.09 | bwd_microstep: 3334.36 | bwd_inner_microstep: 3333.56 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-20 03:19:18,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.09 | bwd: 3334.38 | bwd_inner: 3333.56 | bwd_allreduce: 0.78 | step: 6.78 88%|████████▊ | 8783/10000 [13:49:39<1:51:03, 5.48s/it] {'loss': 0.0002, 'grad_norm': 
0.05246160924434662, 'learning_rate': 1.5335837322314984e-06, 'epoch': 8.78} 88%|████████▊ | 8783/10000 [13:49:39<1:51:03, 5.48s/it][2025-06-20 03:19:24,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:19:24,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.03 | bwd_microstep: 3371.42 | bwd_inner_microstep: 3370.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.84 [2025-06-20 03:19:24,191] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.03 | bwd: 3371.44 | bwd_inner: 3370.62 | bwd_allreduce: 0.77 | step: 6.84 88%|████████▊ | 8784/10000 [13:49:44<1:51:20, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.04540364071726799, 'learning_rate': 1.5310971454710833e-06, 'epoch': 8.78} 88%|████████▊ | 8784/10000 [13:49:44<1:51:20, 5.49s/it][2025-06-20 03:19:29,647] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:19:29,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.89 | bwd_microstep: 3314.46 | bwd_inner_microstep: 3313.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.01 [2025-06-20 03:19:29,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.89 | bwd: 3314.48 | bwd_inner: 3313.65 | bwd_allreduce: 0.78 | step: 7.01 88%|████████▊ | 8785/10000 [13:49:50<1:51:01, 5.48s/it] {'loss': 0.0, 'grad_norm': 4.1290641092928126e-05, 'learning_rate': 1.5286124960128602e-06, 'epoch': 8.79} 88%|████████▊ | 8785/10000 [13:49:50<1:51:01, 5.48s/it][2025-06-20 03:19:35,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:19:35,185] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.52 | bwd_microstep: 3373.35 | bwd_inner_microstep: 3372.55 | bwd_allreduce_microstep: 0.75 | 
step_microstep: 6.87 [2025-06-20 03:19:35,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.52 | bwd: 3373.36 | bwd_inner: 3372.55 | bwd_allreduce: 0.77 | step: 6.87 88%|████████▊ | 8786/10000 [13:49:55<1:51:15, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.008057520724833012, 'learning_rate': 1.526129784117456e-06, 'epoch': 8.79} 88%|████████▊ | 8786/10000 [13:49:55<1:51:15, 5.50s/it][2025-06-20 03:19:40,659] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:19:40,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.93 | bwd_microstep: 3326.14 | bwd_inner_microstep: 3325.26 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.01 [2025-06-20 03:19:40,660] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.93 | bwd: 3326.15 | bwd_inner: 3325.26 | bwd_allreduce: 0.85 | step: 7.02 88%|████████▊ | 8787/10000 [13:50:01<1:51:01, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.060119472444057465, 'learning_rate': 1.5236490100453006e-06, 'epoch': 8.79} 88%|████████▊ | 8787/10000 [13:50:01<1:51:01, 5.49s/it][2025-06-20 03:19:46,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:19:46,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.54 | bwd_microstep: 3321.27 | bwd_inner_microstep: 3320.42 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.89 [2025-06-20 03:19:46,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.54 | bwd: 3321.29 | bwd_inner: 3320.42 | bwd_allreduce: 0.81 | step: 6.89 88%|████████▊ | 8788/10000 [13:50:06<1:50:50, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.037491608411073685, 'learning_rate': 1.52117017405661e-06, 'epoch': 8.79} 88%|████████▊ | 8788/10000 [13:50:06<1:50:50, 5.49s/it][2025-06-20 03:19:51,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:19:51,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.09 | bwd_microstep: 3374.81 | bwd_inner_microstep: 3374.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.54 [2025-06-20 03:19:51,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.09 | bwd: 3374.83 | bwd_inner: 3374.03 | bwd_allreduce: 0.75 | step: 6.55 88%|████████▊ | 8789/10000 [13:50:12<1:51:06, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0012536152498796582, 'learning_rate': 1.518693276411407e-06, 'epoch': 8.79} 88%|████████▊ | 8789/10000 [13:50:12<1:51:06, 5.51s/it][2025-06-20 03:19:57,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:19:57,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.42 | bwd_microstep: 3332.19 | bwd_inner_microstep: 3331.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 03:19:57,161] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.42 | bwd: 3332.21 | bwd_inner: 3331.40 | bwd_allreduce: 0.76 | step: 6.68 88%|████████▊ | 8790/10000 [13:50:17<1:50:50, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0016918302280828357, 'learning_rate': 1.5162183173695044e-06, 'epoch': 8.79} 88%|████████▊ | 8790/10000 [13:50:17<1:50:50, 5.50s/it][2025-06-20 03:20:02,713] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 03:20:02,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2144.77 | bwd_microstep: 3372.71 | bwd_inner_microstep: 3371.72 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.18 [2025-06-20 03:20:02,714] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2144.77 | bwd: 3372.73 | bwd_inner: 3371.72 | bwd_allreduce: 0.96 | step: 7.18 88%|████████▊ | 8791/10000 
[13:50:23<1:51:06, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0005521805724129081, 'learning_rate': 1.5137452971905141e-06, 'epoch': 8.79} 88%|████████▊ | 8791/10000 [13:50:23<1:51:06, 5.51s/it][2025-06-20 03:20:08,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 03:20:08,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.83 | bwd_microstep: 3371.98 | bwd_inner_microstep: 3370.95 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.36 [2025-06-20 03:20:08,251] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.83 | bwd: 3372.00 | bwd_inner: 3370.95 | bwd_allreduce: 1.00 | step: 7.38 88%|████████▊ | 8792/10000 [13:50:29<1:51:09, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.00126155826728791, 'learning_rate': 1.5112742161338444e-06, 'epoch': 8.79} 88%|████████▊ | 8792/10000 [13:50:29<1:51:09, 5.52s/it][2025-06-20 03:20:13,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:20:13,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.98 | bwd_microstep: 3371.97 | bwd_inner_microstep: 3371.14 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.69 [2025-06-20 03:20:13,791] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.98 | bwd: 3371.99 | bwd_inner: 3371.14 | bwd_allreduce: 0.81 | step: 6.70 88%|████████▊ | 8793/10000 [13:50:34<1:51:10, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0014614290557801723, 'learning_rate': 1.5088050744587057e-06, 'epoch': 8.79} 88%|████████▊ | 8793/10000 [13:50:34<1:51:10, 5.53s/it][2025-06-20 03:20:19,257] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:20:19,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.02 | bwd_microstep: 3320.34 | bwd_inner_microstep: 
3319.52 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-20 03:20:19,258] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.02 | bwd: 3320.35 | bwd_inner: 3319.52 | bwd_allreduce: 0.78 | step: 6.96 88%|████████▊ | 8794/10000 [13:50:40<1:50:43, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0017912241164594889, 'learning_rate': 1.5063378724240885e-06, 'epoch': 8.79} 88%|████████▊ | 8794/10000 [13:50:40<1:50:43, 5.51s/it][2025-06-20 03:20:24,738] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:20:24,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.39 | bwd_microstep: 3334.88 | bwd_inner_microstep: 3334.07 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.94 [2025-06-20 03:20:24,739] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.39 | bwd: 3334.90 | bwd_inner: 3334.07 | bwd_allreduce: 0.79 | step: 6.94 88%|████████▊ | 8795/10000 [13:50:45<1:50:27, 5.50s/it] {'loss': 0.0, 'grad_norm': 2.045706241915468e-05, 'learning_rate': 1.5038726102887991e-06, 'epoch': 8.79} 88%|████████▊ | 8795/10000 [13:50:45<1:50:27, 5.50s/it][2025-06-20 03:20:30,205] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:20:30,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.33 | bwd_microstep: 3328.88 | bwd_inner_microstep: 3328.08 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-20 03:20:30,206] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.33 | bwd: 3328.89 | bwd_inner: 3328.08 | bwd_allreduce: 0.77 | step: 6.90 88%|████████▊ | 8796/10000 [13:50:51<1:50:10, 5.49s/it] {'loss': 0.0, 'grad_norm': 2.076850250887219e-05, 'learning_rate': 1.5014092883114283e-06, 'epoch': 8.8} 88%|████████▊ | 8796/10000 [13:50:51<1:50:10, 5.49s/it][2025-06-20 03:20:35,665] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:20:35,665] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.88 | bwd_microstep: 3318.66 | bwd_inner_microstep: 3317.83 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.87 [2025-06-20 03:20:35,666] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.88 | bwd: 3318.68 | bwd_inner: 3317.83 | bwd_allreduce: 0.80 | step: 6.88 88%|████████▊ | 8797/10000 [13:50:56<1:49:53, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0003014930116478354, 'learning_rate': 1.4989479067503699e-06, 'epoch': 8.8} 88%|████████▊ | 8797/10000 [13:50:56<1:49:53, 5.48s/it][2025-06-20 03:20:41,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:20:41,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.24 | bwd_microstep: 3315.74 | bwd_inner_microstep: 3314.94 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 03:20:41,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.24 | bwd: 3315.76 | bwd_inner: 3314.94 | bwd_allreduce: 0.77 | step: 6.81 88%|████████▊ | 8798/10000 [13:51:01<1:49:39, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.007731608580797911, 'learning_rate': 1.4964884658638124e-06, 'epoch': 8.8} 88%|████████▊ | 8798/10000 [13:51:01<1:49:39, 5.47s/it][2025-06-20 03:20:46,696] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:20:46,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.68 | bwd_microstep: 3404.24 | bwd_inner_microstep: 3403.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 03:20:46,697] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.68 | bwd: 3404.26 | bwd_inner: 3403.45 | bwd_allreduce: 0.76 | step: 
6.72 88%|████████▊ | 8799/10000 [13:51:07<1:50:10, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001903710886836052, 'learning_rate': 1.4940309659097341e-06, 'epoch': 8.8} 88%|████████▊ | 8799/10000 [13:51:07<1:50:10, 5.50s/it][2025-06-20 03:20:52,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:20:52,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2123.26 | bwd_microstep: 3363.06 | bwd_inner_microstep: 3362.24 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.06 [2025-06-20 03:20:52,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.26 | bwd: 3363.07 | bwd_inner: 3362.24 | bwd_allreduce: 0.79 | step: 7.06 88%|████████▊ | 8800/10000 [13:51:13<1:50:12, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.03832368552684784, 'learning_rate': 1.4915754071459176e-06, 'epoch': 8.8} 88%|████████▊ | 8800/10000 [13:51:13<1:50:12, 5.51s/it][2025-06-20 03:20:57,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 03:20:57,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.59 | bwd_microstep: 3333.75 | bwd_inner_microstep: 3332.95 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-20 03:20:57,700] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.59 | bwd: 3333.76 | bwd_inner: 3332.95 | bwd_allreduce: 0.77 | step: 7.01 88%|████████▊ | 8801/10000 [13:51:18<1:49:55, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0014938676031306386, 'learning_rate': 1.4891217898299415e-06, 'epoch': 8.8} 88%|████████▊ | 8801/10000 [13:51:18<1:49:55, 5.50s/it][2025-06-20 03:21:03,219] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.82 [2025-06-20 03:21:03,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.79 | bwd_microstep: 
3356.64 | bwd_inner_microstep: 3355.83 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-20 03:21:03,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.79 | bwd: 3356.65 | bwd_inner: 3355.83 | bwd_allreduce: 0.77 | step: 6.95 88%|████████▊ | 8802/10000 [13:51:24<1:49:56, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0002055270888376981, 'learning_rate': 1.4866701142191796e-06, 'epoch': 8.8} 88%|████████▊ | 8802/10000 [13:51:24<1:49:56, 5.51s/it][2025-06-20 03:21:08,756] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:21:08,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.84 | bwd_microstep: 3371.80 | bwd_inner_microstep: 3370.87 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.01 [2025-06-20 03:21:08,757] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.84 | bwd: 3371.82 | bwd_inner: 3370.87 | bwd_allreduce: 0.91 | step: 7.01 88%|████████▊ | 8803/10000 [13:51:29<1:50:02, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0022620712406933308, 'learning_rate': 1.4842203805708e-06, 'epoch': 8.8} 88%|████████▊ | 8803/10000 [13:51:29<1:50:02, 5.52s/it][2025-06-20 03:21:14,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:21:14,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.78 | bwd_microstep: 3319.77 | bwd_inner_microstep: 3318.84 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.02 [2025-06-20 03:21:14,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.78 | bwd: 3319.78 | bwd_inner: 3318.84 | bwd_allreduce: 0.90 | step: 7.03 88%|████████▊ | 8804/10000 [13:51:35<1:49:39, 5.50s/it] {'loss': 0.0, 'grad_norm': 7.348762301262468e-05, 'learning_rate': 1.4817725891417678e-06, 'epoch': 8.8} 88%|████████▊ | 8804/10000 [13:51:35<1:49:39, 5.50s/it][2025-06-20 03:21:19,688] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:21:19,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.05 | bwd_microstep: 3314.92 | bwd_inner_microstep: 3313.89 | bwd_allreduce_microstep: 0.98 | step_microstep: 7.56 [2025-06-20 03:21:19,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.05 | bwd: 3314.94 | bwd_inner: 3313.89 | bwd_allreduce: 1.00 | step: 7.56 88%|████████▊ | 8805/10000 [13:51:40<1:49:21, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.002189223188906908, 'learning_rate': 1.47932674018884e-06, 'epoch': 8.8} 88%|████████▊ | 8805/10000 [13:51:40<1:49:21, 5.49s/it][2025-06-20 03:21:25,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:21:25,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.40 | bwd_microstep: 3378.44 | bwd_inner_microstep: 3377.65 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 03:21:25,244] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.40 | bwd: 3378.45 | bwd_inner: 3377.65 | bwd_allreduce: 0.76 | step: 6.64 88%|████████▊ | 8806/10000 [13:51:46<1:49:38, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0030169938690960407, 'learning_rate': 1.4768828339685848e-06, 'epoch': 8.81} 88%|████████▊ | 8806/10000 [13:51:46<1:49:38, 5.51s/it][2025-06-20 03:21:30,701] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:21:30,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.65 | bwd_microstep: 3315.55 | bwd_inner_microstep: 3314.74 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.88 [2025-06-20 03:21:30,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.65 | bwd: 3315.57 | bwd_inner: 3314.74 | bwd_allreduce: 0.78 | 
step: 6.88 88%|████████▊ | 8807/10000 [13:51:51<1:49:15, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.13194632530212402, 'learning_rate': 1.4744408707373503e-06, 'epoch': 8.81} 88%|████████▊ | 8807/10000 [13:51:51<1:49:15, 5.49s/it][2025-06-20 03:21:36,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:21:36,247] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.54 | bwd_microstep: 3375.28 | bwd_inner_microstep: 3374.42 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.00 [2025-06-20 03:21:36,248] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.54 | bwd: 3375.29 | bwd_inner: 3374.42 | bwd_allreduce: 0.82 | step: 7.01 88%|████████▊ | 8808/10000 [13:51:57<1:49:27, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.002297816565260291, 'learning_rate': 1.4720008507512917e-06, 'epoch': 8.81} 88%|████████▊ | 8808/10000 [13:51:57<1:49:27, 5.51s/it][2025-06-20 03:21:41,715] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:21:41,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.29 | bwd_microstep: 3324.66 | bwd_inner_microstep: 3323.86 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.04 [2025-06-20 03:21:41,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.29 | bwd: 3324.67 | bwd_inner: 3323.86 | bwd_allreduce: 0.77 | step: 7.04 88%|████████▊ | 8809/10000 [13:52:02<1:49:07, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00020249004592187703, 'learning_rate': 1.469562774266351e-06, 'epoch': 8.81} 88%|████████▊ | 8809/10000 [13:52:02<1:49:07, 5.50s/it][2025-06-20 03:21:47,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:21:47,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.29 | 
bwd_microstep: 3320.88 | bwd_inner_microstep: 3320.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 03:21:47,183] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.29 | bwd: 3320.89 | bwd_inner: 3320.09 | bwd_allreduce: 0.76 | step: 6.64 88%|████████▊ | 8810/10000 [13:52:07<1:48:50, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00014416832709684968, 'learning_rate': 1.4671266415382724e-06, 'epoch': 8.81} 88%|████████▊ | 8810/10000 [13:52:07<1:48:50, 5.49s/it][2025-06-20 03:21:52,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 03:21:52,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.47 | bwd_microstep: 3312.10 | bwd_inner_microstep: 3311.32 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.52 [2025-06-20 03:21:52,633] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.47 | bwd: 3312.11 | bwd_inner: 3311.32 | bwd_allreduce: 0.75 | step: 6.53 88%|████████▊ | 8811/10000 [13:52:13<1:48:31, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.000787728582508862, 'learning_rate': 1.464692452822596e-06, 'epoch': 8.81} 88%|████████▊ | 8811/10000 [13:52:13<1:48:31, 5.48s/it][2025-06-20 03:21:58,181] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 03:21:58,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.42 | bwd_microstep: 3374.15 | bwd_inner_microstep: 3373.20 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.11 [2025-06-20 03:21:58,182] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.42 | bwd: 3374.17 | bwd_inner: 3373.20 | bwd_allreduce: 0.92 | step: 7.12 88%|████████▊ | 8812/10000 [13:52:18<1:48:52, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.002043886808678508, 'learning_rate': 1.4622602083746552e-06, 'epoch': 8.81} 88%|████████▊ | 8812/10000 [13:52:18<1:48:52, 
5.50s/it][2025-06-20 03:22:03,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 03:22:03,718] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.58 | bwd_microstep: 3362.43 | bwd_inner_microstep: 3361.23 | bwd_allreduce_microstep: 1.12 | step_microstep: 8.45 [2025-06-20 03:22:03,719] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.59 | bwd: 3362.45 | bwd_inner: 3361.23 | bwd_allreduce: 1.16 | step: 8.46 88%|████████▊ | 8813/10000 [13:52:24<1:49:02, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00031697165104560554, 'learning_rate': 1.4598299084495859e-06, 'epoch': 8.81} 88%|████████▊ | 8813/10000 [13:52:24<1:49:02, 5.51s/it][2025-06-20 03:22:09,270] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:22:09,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.24 | bwd_microstep: 3367.17 | bwd_inner_microstep: 3366.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.60 [2025-06-20 03:22:09,271] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.24 | bwd: 3367.18 | bwd_inner: 3366.39 | bwd_allreduce: 0.75 | step: 6.60 88%|████████▊ | 8814/10000 [13:52:30<1:49:09, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.00940940622240305, 'learning_rate': 1.4574015533023067e-06, 'epoch': 8.81} 88%|████████▊ | 8814/10000 [13:52:30<1:49:09, 5.52s/it][2025-06-20 03:22:14,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:22:14,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.20 | bwd_microstep: 3313.05 | bwd_inner_microstep: 3312.22 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.95 [2025-06-20 03:22:14,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.20 | bwd: 3313.07 | 
bwd_inner: 3312.22 | bwd_allreduce: 0.80 | step: 6.95 88%|████████▊ | 8815/10000 [13:52:35<1:48:42, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.006856831256300211, 'learning_rate': 1.4549751431875492e-06, 'epoch': 8.81} 88%|████████▊ | 8815/10000 [13:52:35<1:48:42, 5.50s/it][2025-06-20 03:22:20,201] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 03:22:20,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.26 | bwd_microstep: 3322.93 | bwd_inner_microstep: 3321.87 | bwd_allreduce_microstep: 0.99 | step_microstep: 7.23 [2025-06-20 03:22:20,202] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.26 | bwd: 3322.95 | bwd_inner: 3321.87 | bwd_allreduce: 1.02 | step: 7.24 88%|████████▊ | 8816/10000 [13:52:41<1:48:25, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00018437218386679888, 'learning_rate': 1.4525506783598254e-06, 'epoch': 8.82} 88%|████████▊ | 8816/10000 [13:52:41<1:48:25, 5.49s/it][2025-06-20 03:22:25,749] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:22:25,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.69 | bwd_microstep: 3373.82 | bwd_inner_microstep: 3373.03 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-20 03:22:25,750] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.69 | bwd: 3373.83 | bwd_inner: 3373.03 | bwd_allreduce: 0.77 | step: 6.92 88%|████████▊ | 8817/10000 [13:52:46<1:48:37, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0031522302888333797, 'learning_rate': 1.4501281590734563e-06, 'epoch': 8.82} 88%|████████▊ | 8817/10000 [13:52:46<1:48:37, 5.51s/it][2025-06-20 03:22:31,224] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:22:31,225] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2109.84 | bwd_microstep: 3326.71 | bwd_inner_microstep: 3325.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.10 [2025-06-20 03:22:31,225] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.84 | bwd: 3326.73 | bwd_inner: 3325.92 | bwd_allreduce: 0.77 | step: 7.10 88%|████████▊ | 8818/10000 [13:52:52<1:48:20, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.02986948937177658, 'learning_rate': 1.4477075855825495e-06, 'epoch': 8.82} 88%|████████▊ | 8818/10000 [13:52:52<1:48:20, 5.50s/it][2025-06-20 03:22:36,681] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:22:36,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.17 | bwd_microstep: 3310.47 | bwd_inner_microstep: 3309.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 03:22:36,682] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.17 | bwd: 3310.49 | bwd_inner: 3309.68 | bwd_allreduce: 0.76 | step: 6.68 88%|████████▊ | 8819/10000 [13:52:57<1:47:59, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.014319208450615406, 'learning_rate': 1.445288958141018e-06, 'epoch': 8.82} 88%|████████▊ | 8819/10000 [13:52:57<1:47:59, 5.49s/it][2025-06-20 03:22:42,147] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:22:42,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.63 | bwd_microstep: 3318.63 | bwd_inner_microstep: 3317.78 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.52 [2025-06-20 03:22:42,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.63 | bwd: 3318.65 | bwd_inner: 3317.78 | bwd_allreduce: 0.81 | step: 7.52 88%|████████▊ | 8820/10000 [13:53:02<1:47:47, 5.48s/it] {'loss': 0.0, 'grad_norm': 2.987982406921219e-05, 'learning_rate': 1.442872277002556e-06, 'epoch': 8.82} 88%|████████▊ | 
8820/10000 [13:53:02<1:47:47, 5.48s/it][2025-06-20 03:22:47,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:22:47,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.14 | bwd_microstep: 3326.17 | bwd_inner_microstep: 3325.39 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.69 [2025-06-20 03:22:47,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.14 | bwd: 3326.19 | bwd_inner: 3325.39 | bwd_allreduce: 0.76 | step: 6.69 88%|████████▊ | 8821/10000 [13:53:08<1:47:39, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0002959111297968775, 'learning_rate': 1.4404575424206657e-06, 'epoch': 8.82} 88%|████████▊ | 8821/10000 [13:53:08<1:47:39, 5.48s/it][2025-06-20 03:22:53,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:22:53,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.91 | bwd_microstep: 3313.41 | bwd_inner_microstep: 3312.44 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.26 [2025-06-20 03:22:53,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.91 | bwd: 3313.42 | bwd_inner: 3312.44 | bwd_allreduce: 0.94 | step: 7.26 88%|████████▊ | 8822/10000 [13:53:13<1:47:26, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0012731830356642604, 'learning_rate': 1.4380447546486398e-06, 'epoch': 8.82} 88%|████████▊ | 8822/10000 [13:53:13<1:47:26, 5.47s/it][2025-06-20 03:22:58,642] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:22:58,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.64 | bwd_microstep: 3386.81 | bwd_inner_microstep: 3386.03 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.58 [2025-06-20 03:22:58,643] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2131.64 | bwd: 3386.83 | bwd_inner: 3386.03 | bwd_allreduce: 0.76 | step: 6.59 88%|████████▊ | 8823/10000 [13:53:19<1:47:52, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0012326042633503675, 'learning_rate': 1.4356339139395736e-06, 'epoch': 8.82} 88%|████████▊ | 8823/10000 [13:53:19<1:47:52, 5.50s/it][2025-06-20 03:23:04,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:23:04,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.65 | bwd_microstep: 3307.68 | bwd_inner_microstep: 3306.90 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 03:23:04,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.65 | bwd: 3307.69 | bwd_inner: 3306.90 | bwd_allreduce: 0.75 | step: 6.65 88%|████████▊ | 8824/10000 [13:53:24<1:47:28, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.06933095306158066, 'learning_rate': 1.433225020546347e-06, 'epoch': 8.82} 88%|████████▊ | 8824/10000 [13:53:24<1:47:28, 5.48s/it][2025-06-20 03:23:09,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:23:09,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.01 | bwd_microstep: 3352.43 | bwd_inner_microstep: 3351.51 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.21 [2025-06-20 03:23:09,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.01 | bwd: 3352.45 | bwd_inner: 3351.51 | bwd_allreduce: 0.86 | step: 7.21 88%|████████▊ | 8825/10000 [13:53:30<1:47:33, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.07917336374521255, 'learning_rate': 1.4308180747216471e-06, 'epoch': 8.82} 88%|████████▊ | 8825/10000 [13:53:30<1:47:33, 5.49s/it][2025-06-20 03:23:15,121] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:23:15,122] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.95 | bwd_microstep: 3355.46 | bwd_inner_microstep: 3354.65 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.99 [2025-06-20 03:23:15,122] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.96 | bwd: 3355.47 | bwd_inner: 3354.65 | bwd_allreduce: 0.78 | step: 6.99 88%|████████▊ | 8826/10000 [13:53:35<1:47:37, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.004365806933492422, 'learning_rate': 1.4284130767179472e-06, 'epoch': 8.83} 88%|████████▊ | 8826/10000 [13:53:35<1:47:37, 5.50s/it][2025-06-20 03:23:20,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:23:20,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.36 | bwd_microstep: 3350.42 | bwd_inner_microstep: 3349.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 03:23:20,636] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.36 | bwd: 3350.44 | bwd_inner: 3349.62 | bwd_allreduce: 0.77 | step: 6.64 88%|████████▊ | 8827/10000 [13:53:41<1:47:36, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.10021248459815979, 'learning_rate': 1.4260100267875233e-06, 'epoch': 8.83} 88%|████████▊ | 8827/10000 [13:53:41<1:47:36, 5.50s/it][2025-06-20 03:23:26,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:23:26,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2099.82 | bwd_microstep: 3310.06 | bwd_inner_microstep: 3309.25 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.10 [2025-06-20 03:23:26,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2099.82 | bwd: 3310.07 | bwd_inner: 3309.25 | bwd_allreduce: 0.78 | step: 7.10 88%|████████▊ | 8828/10000 [13:53:46<1:47:11, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.014413394965231419, 'learning_rate': 1.4236089251824425e-06, 
'epoch': 8.83} 88%|████████▊ | 8828/10000 [13:53:46<1:47:11, 5.49s/it][2025-06-20 03:23:31,536] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:23:31,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.04 | bwd_microstep: 3310.59 | bwd_inner_microstep: 3309.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 03:23:31,537] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.04 | bwd: 3310.61 | bwd_inner: 3309.81 | bwd_allreduce: 0.76 | step: 6.80 88%|████████▊ | 8829/10000 [13:53:52<1:46:53, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0038225390017032623, 'learning_rate': 1.4212097721545726e-06, 'epoch': 8.83} 88%|████████▊ | 8829/10000 [13:53:52<1:46:53, 5.48s/it][2025-06-20 03:23:37,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.80 [2025-06-20 03:23:37,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.91 | bwd_microstep: 3367.00 | bwd_inner_microstep: 3366.20 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.76 [2025-06-20 03:23:37,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.91 | bwd: 3367.01 | bwd_inner: 3366.20 | bwd_allreduce: 0.77 | step: 6.76 88%|████████▊ | 8830/10000 [13:53:57<1:47:06, 5.49s/it] {'loss': 0.0064, 'grad_norm': 1.9249011278152466, 'learning_rate': 1.4188125679555675e-06, 'epoch': 8.83} 88%|████████▊ | 8830/10000 [13:53:57<1:47:06, 5.49s/it][2025-06-20 03:23:42,519] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 03:23:42,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.84 | bwd_microstep: 3314.66 | bwd_inner_microstep: 3313.87 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.79 [2025-06-20 03:23:42,520] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.84 | bwd: 3314.68 | bwd_inner: 3313.87 | bwd_allreduce: 0.77 | step: 6.80 88%|████████▊ | 8831/10000 [13:54:03<1:46:47, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.009270275942981243, 'learning_rate': 1.4164173128368864e-06, 'epoch': 8.83} 88%|████████▊ | 8831/10000 [13:54:03<1:46:47, 5.48s/it][2025-06-20 03:23:47,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:23:47,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.09 | bwd_microstep: 3309.13 | bwd_inner_microstep: 3308.33 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-20 03:23:47,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.09 | bwd: 3309.15 | bwd_inner: 3308.33 | bwd_allreduce: 0.78 | step: 7.27 88%|████████▊ | 8832/10000 [13:54:08<1:46:32, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0013250283664092422, 'learning_rate': 1.4140240070497813e-06, 'epoch': 8.83} 88%|████████▊ | 8832/10000 [13:54:08<1:46:32, 5.47s/it][2025-06-20 03:23:53,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:23:53,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.82 | bwd_microstep: 3318.60 | bwd_inner_microstep: 3317.80 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.80 [2025-06-20 03:23:53,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.82 | bwd: 3318.61 | bwd_inner: 3317.80 | bwd_allreduce: 0.77 | step: 6.80 88%|████████▊ | 8833/10000 [13:54:14<1:46:22, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00013125230907462537, 'learning_rate': 1.411632650845296e-06, 'epoch': 8.83} 88%|████████▊ | 8833/10000 [13:54:14<1:46:22, 5.47s/it][2025-06-20 03:23:58,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.72 [2025-06-20 03:23:58,962] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.00 | bwd_microstep: 3363.55 | bwd_inner_microstep: 3362.76 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 03:23:58,963] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.00 | bwd: 3363.56 | bwd_inner: 3362.76 | bwd_allreduce: 0.76 | step: 6.66 88%|████████▊ | 8834/10000 [13:54:19<1:46:37, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.07954955101013184, 'learning_rate': 1.4092432444742787e-06, 'epoch': 8.83} 88%|████████▊ | 8834/10000 [13:54:19<1:46:37, 5.49s/it][2025-06-20 03:24:04,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:24:04,421] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.91 | bwd_microstep: 3314.02 | bwd_inner_microstep: 3313.18 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.40 [2025-06-20 03:24:04,422] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.91 | bwd: 3314.03 | bwd_inner: 3313.18 | bwd_allreduce: 0.81 | step: 7.40 88%|████████▊ | 8835/10000 [13:54:25<1:46:22, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0001715902762953192, 'learning_rate': 1.406855788187358e-06, 'epoch': 8.84} 88%|████████▊ | 8835/10000 [13:54:25<1:46:22, 5.48s/it][2025-06-20 03:24:09,939] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:24:09,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.91 | bwd_microstep: 3354.63 | bwd_inner_microstep: 3353.84 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.76 [2025-06-20 03:24:09,940] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.91 | bwd: 3354.64 | bwd_inner: 3353.84 | bwd_allreduce: 0.76 | step: 6.77 88%|████████▊ | 8836/10000 [13:54:30<1:46:30, 5.49s/it] {'loss': 0.0, 'grad_norm': 
0.0016978109488263726, 'learning_rate': 1.4044702822349731e-06, 'epoch': 8.84} 88%|████████▊ | 8836/10000 [13:54:30<1:46:30, 5.49s/it][2025-06-20 03:24:15,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:24:15,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.05 | bwd_microstep: 3319.32 | bwd_inner_microstep: 3318.51 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.93 [2025-06-20 03:24:15,400] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.05 | bwd: 3319.33 | bwd_inner: 3318.51 | bwd_allreduce: 0.78 | step: 6.93 88%|████████▊ | 8837/10000 [13:54:36<1:46:15, 5.48s/it] {'loss': 0.0, 'grad_norm': 9.699368820292875e-05, 'learning_rate': 1.402086726867351e-06, 'epoch': 8.84} 88%|████████▊ | 8837/10000 [13:54:36<1:46:15, 5.48s/it][2025-06-20 03:24:20,862] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:24:20,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.35 | bwd_microstep: 3318.96 | bwd_inner_microstep: 3318.17 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 03:24:20,863] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.35 | bwd: 3318.98 | bwd_inner: 3318.17 | bwd_allreduce: 0.77 | step: 7.01 88%|████████▊ | 8838/10000 [13:54:41<1:46:02, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0012545377248898149, 'learning_rate': 1.399705122334516e-06, 'epoch': 8.84} 88%|████████▊ | 8838/10000 [13:54:41<1:46:02, 5.48s/it][2025-06-20 03:24:26,319] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:24:26,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.31 | bwd_microstep: 3316.96 | bwd_inner_microstep: 3316.01 | bwd_allreduce_microstep: 0.90 | 
step_microstep: 7.05 [2025-06-20 03:24:26,320] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.31 | bwd: 3316.97 | bwd_inner: 3316.01 | bwd_allreduce: 0.92 | step: 7.05 88%|████████▊ | 8839/10000 [13:54:47<1:45:50, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00030493453959934413, 'learning_rate': 1.397325468886288e-06, 'epoch': 8.84} 88%|████████▊ | 8839/10000 [13:54:47<1:45:50, 5.47s/it][2025-06-20 03:24:31,779] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 03:24:31,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.67 | bwd_microstep: 3317.73 | bwd_inner_microstep: 3316.93 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 03:24:31,780] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.67 | bwd: 3317.74 | bwd_inner: 3316.93 | bwd_allreduce: 0.76 | step: 6.78 88%|████████▊ | 8840/10000 [13:54:52<1:45:41, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.0058912248350679874, 'learning_rate': 1.3949477667722766e-06, 'epoch': 8.84} 88%|████████▊ | 8840/10000 [13:54:52<1:45:41, 5.47s/it][2025-06-20 03:24:37,240] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:24:37,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.10 | bwd_microstep: 3312.38 | bwd_inner_microstep: 3311.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.87 [2025-06-20 03:24:37,241] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.10 | bwd: 3312.40 | bwd_inner: 3311.57 | bwd_allreduce: 0.78 | step: 6.87 88%|████████▊ | 8841/10000 [13:54:58<1:45:34, 5.47s/it] {'loss': 0.0004, 'grad_norm': 0.11051981151103973, 'learning_rate': 1.3925720162418977e-06, 'epoch': 8.84} 88%|████████▊ | 8841/10000 [13:54:58<1:45:34, 5.47s/it][2025-06-20 03:24:42,698] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:24:42,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.63 | bwd_microstep: 3314.93 | bwd_inner_microstep: 3314.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.96 [2025-06-20 03:24:42,699] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.63 | bwd: 3314.95 | bwd_inner: 3314.13 | bwd_allreduce: 0.78 | step: 6.96 88%|████████▊ | 8842/10000 [13:55:03<1:45:26, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.015352881513535976, 'learning_rate': 1.3901982175443561e-06, 'epoch': 8.84} 88%|████████▊ | 8842/10000 [13:55:03<1:45:26, 5.46s/it][2025-06-20 03:24:48,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:24:48,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.91 | bwd_microstep: 3316.83 | bwd_inner_microstep: 3315.90 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.29 [2025-06-20 03:24:48,160] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.91 | bwd: 3316.85 | bwd_inner: 3315.90 | bwd_allreduce: 0.90 | step: 7.29 88%|████████▊ | 8843/10000 [13:55:08<1:45:20, 5.46s/it] {'loss': 0.0001, 'grad_norm': 0.03511134907603264, 'learning_rate': 1.3878263709286488e-06, 'epoch': 8.84} 88%|████████▊ | 8843/10000 [13:55:08<1:45:20, 5.46s/it][2025-06-20 03:24:53,686] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 03:24:53,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.27 | bwd_microstep: 3362.36 | bwd_inner_microstep: 3361.20 | bwd_allreduce_microstep: 1.09 | step_microstep: 7.92 [2025-06-20 03:24:53,687] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.27 | bwd: 3362.38 | bwd_inner: 3361.20 | bwd_allreduce: 1.12 | step: 7.92 88%|████████▊ | 8844/10000 
[13:55:14<1:45:37, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00028315550298430026, 'learning_rate': 1.3854564766435786e-06, 'epoch': 8.84} 88%|████████▊ | 8844/10000 [13:55:14<1:45:37, 5.48s/it][2025-06-20 03:24:59,148] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:24:59,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.60 | bwd_microstep: 3313.51 | bwd_inner_microstep: 3312.72 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 03:24:59,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.60 | bwd: 3313.52 | bwd_inner: 3312.72 | bwd_allreduce: 0.76 | step: 6.69 88%|████████▊ | 8845/10000 [13:55:19<1:45:24, 5.48s/it] {'loss': 0.0, 'grad_norm': 5.2927498472854495e-05, 'learning_rate': 1.3830885349377265e-06, 'epoch': 8.85} 88%|████████▊ | 8845/10000 [13:55:19<1:45:24, 5.48s/it][2025-06-20 03:25:04,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:25:04,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.93 | bwd_microstep: 3359.83 | bwd_inner_microstep: 3359.02 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.81 [2025-06-20 03:25:04,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.93 | bwd: 3359.84 | bwd_inner: 3359.02 | bwd_allreduce: 0.78 | step: 6.81 88%|████████▊ | 8846/10000 [13:55:25<1:45:39, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0026322295889258385, 'learning_rate': 1.3807225460594831e-06, 'epoch': 8.85} 88%|████████▊ | 8846/10000 [13:55:25<1:45:39, 5.49s/it][2025-06-20 03:25:10,220] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:25:10,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.08 | bwd_microstep: 3375.72 | 
bwd_inner_microstep: 3374.80 | bwd_allreduce_microstep: 0.87 | step_microstep: 6.99 [2025-06-20 03:25:10,221] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.09 | bwd: 3375.73 | bwd_inner: 3374.80 | bwd_allreduce: 0.89 | step: 7.00 88%|████████▊ | 8847/10000 [13:55:31<1:45:49, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0005423171678557992, 'learning_rate': 1.3783585102570297e-06, 'epoch': 8.85} 88%|████████▊ | 8847/10000 [13:55:31<1:45:49, 5.51s/it][2025-06-20 03:25:15,693] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:25:15,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.65 | bwd_microstep: 3324.06 | bwd_inner_microstep: 3323.13 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.02 [2025-06-20 03:25:15,694] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.65 | bwd: 3324.07 | bwd_inner: 3323.13 | bwd_allreduce: 0.90 | step: 7.02 88%|████████▊ | 8848/10000 [13:55:36<1:45:32, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0017153470544144511, 'learning_rate': 1.3759964277783433e-06, 'epoch': 8.85} 88%|████████▊ | 8848/10000 [13:55:36<1:45:32, 5.50s/it][2025-06-20 03:25:21,154] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:25:21,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.58 | bwd_microstep: 3318.26 | bwd_inner_microstep: 3317.28 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.07 [2025-06-20 03:25:21,155] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.58 | bwd: 3318.28 | bwd_inner: 3317.28 | bwd_allreduce: 0.95 | step: 7.08 88%|████████▊ | 8849/10000 [13:55:41<1:45:14, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.06198178231716156, 'learning_rate': 1.3736362988711992e-06, 'epoch': 8.85} 88%|████████▊ | 8849/10000 [13:55:41<1:45:14, 5.49s/it][2025-06-20 03:25:26,618] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:25:26,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.57 | bwd_microstep: 3314.93 | bwd_inner_microstep: 3314.13 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-20 03:25:26,618] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.57 | bwd: 3314.95 | bwd_inner: 3314.13 | bwd_allreduce: 0.77 | step: 7.07 88%|████████▊ | 8850/10000 [13:55:47<1:45:01, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.03348138928413391, 'learning_rate': 1.371278123783155e-06, 'epoch': 8.85} 88%|████████▊ | 8850/10000 [13:55:47<1:45:01, 5.48s/it][2025-06-20 03:25:32,101] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 03:25:32,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.61 | bwd_microstep: 3334.89 | bwd_inner_microstep: 3333.92 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.87 [2025-06-20 03:25:32,102] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.61 | bwd: 3334.91 | bwd_inner: 3333.92 | bwd_allreduce: 0.94 | step: 7.89 89%|████████▊ | 8851/10000 [13:55:52<1:44:57, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0019774215761572123, 'learning_rate': 1.3689219027615796e-06, 'epoch': 8.85} 89%|████████▊ | 8851/10000 [13:55:52<1:44:57, 5.48s/it][2025-06-20 03:25:37,571] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:25:37,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.47 | bwd_microstep: 3320.23 | bwd_inner_microstep: 3319.44 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.78 [2025-06-20 03:25:37,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.47 | bwd: 3320.25 | bwd_inner: 3319.44 | bwd_allreduce: 0.76 
| step: 6.79 89%|████████▊ | 8852/10000 [13:55:58<1:44:47, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0006938420119695365, 'learning_rate': 1.366567636053624e-06, 'epoch': 8.85} 89%|████████▊ | 8852/10000 [13:55:58<1:44:47, 5.48s/it][2025-06-20 03:25:43,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:25:43,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.40 | bwd_microstep: 3376.53 | bwd_inner_microstep: 3375.59 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.14 [2025-06-20 03:25:43,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.40 | bwd: 3376.54 | bwd_inner: 3375.59 | bwd_allreduce: 0.90 | step: 7.14 89%|████████▊ | 8853/10000 [13:56:03<1:45:03, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001002349890768528, 'learning_rate': 1.364215323906244e-06, 'epoch': 8.85} 89%|████████▊ | 8853/10000 [13:56:03<1:45:03, 5.50s/it][2025-06-20 03:25:48,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:25:48,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.02 | bwd_microstep: 3321.36 | bwd_inner_microstep: 3320.53 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.79 [2025-06-20 03:25:48,574] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.02 | bwd: 3321.37 | bwd_inner: 3320.53 | bwd_allreduce: 0.80 | step: 6.80 89%|████████▊ | 8854/10000 [13:56:09<1:44:46, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.007269003428518772, 'learning_rate': 1.3618649665661888e-06, 'epoch': 8.85} 89%|████████▊ | 8854/10000 [13:56:09<1:44:46, 5.49s/it][2025-06-20 03:25:54,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:25:54,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.36 | 
bwd_microstep: 3322.34 | bwd_inner_microstep: 3321.54 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.91 [2025-06-20 03:25:54,038] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.36 | bwd: 3322.36 | bwd_inner: 3321.54 | bwd_allreduce: 0.77 | step: 6.92 89%|████████▊ | 8855/10000 [13:56:14<1:44:34, 5.48s/it] {'loss': 0.002, 'grad_norm': 1.406177282333374, 'learning_rate': 1.3595165642799946e-06, 'epoch': 8.86} 89%|████████▊ | 8855/10000 [13:56:14<1:44:34, 5.48s/it][2025-06-20 03:25:59,513] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:25:59,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.64 | bwd_microstep: 3330.10 | bwd_inner_microstep: 3329.30 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.98 [2025-06-20 03:25:59,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.64 | bwd: 3330.12 | bwd_inner: 3329.30 | bwd_allreduce: 0.78 | step: 6.98 89%|████████▊ | 8856/10000 [13:56:20<1:44:27, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0010359185980632901, 'learning_rate': 1.3571701172939977e-06, 'epoch': 8.86} 89%|████████▊ | 8856/10000 [13:56:20<1:44:27, 5.48s/it][2025-06-20 03:26:05,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:26:05,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.32 | bwd_microstep: 3368.84 | bwd_inner_microstep: 3368.04 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.78 [2025-06-20 03:26:05,051] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.32 | bwd: 3368.85 | bwd_inner: 3368.04 | bwd_allreduce: 0.77 | step: 6.79 89%|████████▊ | 8857/10000 [13:56:25<1:44:41, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0003286685678176582, 'learning_rate': 1.3548256258543302e-06, 'epoch': 8.86} 89%|████████▊ | 8857/10000 [13:56:25<1:44:41, 
5.50s/it][2025-06-20 03:26:10,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:26:10,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.84 | bwd_microstep: 3320.77 | bwd_inner_microstep: 3319.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.74 [2025-06-20 03:26:10,533] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.84 | bwd: 3320.78 | bwd_inner: 3319.97 | bwd_allreduce: 0.76 | step: 6.74 89%|████████▊ | 8858/10000 [13:56:31<1:44:31, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0009564536157995462, 'learning_rate': 1.352483090206922e-06, 'epoch': 8.86} 89%|████████▊ | 8858/10000 [13:56:31<1:44:31, 5.49s/it][2025-06-20 03:26:16,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:26:16,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.17 | bwd_microstep: 3367.43 | bwd_inner_microstep: 3366.51 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.37 [2025-06-20 03:26:16,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.17 | bwd: 3367.46 | bwd_inner: 3366.51 | bwd_allreduce: 0.89 | step: 7.37 89%|████████▊ | 8859/10000 [13:56:36<1:44:40, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.05592210963368416, 'learning_rate': 1.350142510597492e-06, 'epoch': 8.86} 89%|████████▊ | 8859/10000 [13:56:36<1:44:40, 5.50s/it][2025-06-20 03:26:21,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.73 [2025-06-20 03:26:21,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.76 | bwd_microstep: 3378.89 | bwd_inner_microstep: 3378.10 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.18 [2025-06-20 03:26:21,615] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.76 | bwd: 3378.91 | 
bwd_inner: 3378.10 | bwd_allreduce: 0.77 | step: 7.18 89%|████████▊ | 8860/10000 [13:56:42<1:44:49, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.007704215589910746, 'learning_rate': 1.3478038872715548e-06, 'epoch': 8.86} 89%|████████▊ | 8860/10000 [13:56:42<1:44:49, 5.52s/it][2025-06-20 03:26:27,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:26:27,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.98 | bwd_microstep: 3328.37 | bwd_inner_microstep: 3327.48 | bwd_allreduce_microstep: 0.85 | step_microstep: 6.97 [2025-06-20 03:26:27,091] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.98 | bwd: 3328.39 | bwd_inner: 3327.48 | bwd_allreduce: 0.87 | step: 6.97 89%|████████▊ | 8861/10000 [13:56:47<1:44:30, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005668725934810936, 'learning_rate': 1.3454672204744212e-06, 'epoch': 8.86} 89%|████████▊ | 8861/10000 [13:56:47<1:44:30, 5.50s/it][2025-06-20 03:26:32,548] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:26:32,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.82 | bwd_microstep: 3314.68 | bwd_inner_microstep: 3313.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.68 [2025-06-20 03:26:32,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.82 | bwd: 3314.70 | bwd_inner: 3313.89 | bwd_allreduce: 0.76 | step: 6.69 89%|████████▊ | 8862/10000 [13:56:53<1:44:08, 5.49s/it] {'loss': 0.0002, 'grad_norm': 0.04432375356554985, 'learning_rate': 1.3431325104511995e-06, 'epoch': 8.86} 89%|████████▊ | 8862/10000 [13:56:53<1:44:08, 5.49s/it][2025-06-20 03:26:38,013] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:26:38,014] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2105.87 | bwd_microstep: 3322.36 | bwd_inner_microstep: 3321.56 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-20 03:26:38,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.87 | bwd: 3322.38 | bwd_inner: 3321.56 | bwd_allreduce: 0.77 | step: 6.87 89%|████████▊ | 8863/10000 [13:56:58<1:43:54, 5.48s/it] {'loss': 0.0002, 'grad_norm': 0.028352634981274605, 'learning_rate': 1.3407997574467891e-06, 'epoch': 8.86} 89%|████████▊ | 8863/10000 [13:56:58<1:43:54, 5.48s/it][2025-06-20 03:26:43,484] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 03:26:43,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.51 | bwd_microstep: 3324.47 | bwd_inner_microstep: 3323.49 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.50 [2025-06-20 03:26:43,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.51 | bwd: 3324.48 | bwd_inner: 3323.49 | bwd_allreduce: 0.95 | step: 7.51 89%|████████▊ | 8864/10000 [13:57:04<1:43:45, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.012219138443470001, 'learning_rate': 1.3384689617058854e-06, 'epoch': 8.86} 89%|████████▊ | 8864/10000 [13:57:04<1:43:45, 5.48s/it][2025-06-20 03:26:49,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 03:26:49,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.99 | bwd_microstep: 3369.52 | bwd_inner_microstep: 3368.66 | bwd_allreduce_microstep: 0.82 | step_microstep: 6.80 [2025-06-20 03:26:49,024] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.99 | bwd: 3369.54 | bwd_inner: 3368.66 | bwd_allreduce: 0.84 | step: 6.80 89%|████████▊ | 8865/10000 [13:57:09<1:43:59, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.051064085215330124, 'learning_rate': 1.3361401234729753e-06, 'epoch': 8.87} 89%|████████▊ | 
8865/10000 [13:57:09<1:43:59, 5.50s/it][2025-06-20 03:26:54,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.67 | optimizer_step: 2.73 [2025-06-20 03:26:54,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.57 | bwd_microstep: 3323.31 | bwd_inner_microstep: 3322.52 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.82 [2025-06-20 03:26:54,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.57 | bwd: 3323.32 | bwd_inner: 3322.52 | bwd_allreduce: 0.76 | step: 6.82 89%|████████▊ | 8866/10000 [13:57:15<1:43:45, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.04908732697367668, 'learning_rate': 1.3338132429923434e-06, 'epoch': 8.87} 89%|████████▊ | 8866/10000 [13:57:15<1:43:45, 5.49s/it][2025-06-20 03:26:59,975] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:26:59,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.62 | bwd_microstep: 3329.72 | bwd_inner_microstep: 3328.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-20 03:26:59,976] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.62 | bwd: 3329.73 | bwd_inner: 3328.92 | bwd_allreduce: 0.76 | step: 6.63 89%|████████▊ | 8867/10000 [13:57:20<1:43:36, 5.49s/it] {'loss': 0.0, 'grad_norm': 9.667372069088742e-05, 'learning_rate': 1.3314883205080697e-06, 'epoch': 8.87} 89%|████████▊ | 8867/10000 [13:57:20<1:43:36, 5.49s/it][2025-06-20 03:27:05,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:27:05,448] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.02 | bwd_microstep: 3324.66 | bwd_inner_microstep: 3323.88 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 03:27:05,449] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2109.02 | bwd: 3324.68 | bwd_inner: 3323.88 | bwd_allreduce: 0.76 | step: 6.62 89%|████████▊ | 8868/10000 [13:57:26<1:43:26, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.001150983851402998, 'learning_rate': 1.3291653562640282e-06, 'epoch': 8.87} 89%|████████▊ | 8868/10000 [13:57:26<1:43:26, 5.48s/it][2025-06-20 03:27:10,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:27:10,984] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.11 | bwd_microstep: 3370.42 | bwd_inner_microstep: 3369.56 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.29 [2025-06-20 03:27:10,985] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.11 | bwd: 3370.44 | bwd_inner: 3369.56 | bwd_allreduce: 0.82 | step: 7.11 89%|████████▊ | 8869/10000 [13:57:31<1:43:39, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.009749154560267925, 'learning_rate': 1.3268443505038909e-06, 'epoch': 8.87} 89%|████████▊ | 8869/10000 [13:57:31<1:43:39, 5.50s/it][2025-06-20 03:27:16,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 03:27:16,462] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.94 | bwd_microstep: 3324.75 | bwd_inner_microstep: 3323.67 | bwd_allreduce_microstep: 1.01 | step_microstep: 7.33 [2025-06-20 03:27:16,463] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.94 | bwd: 3324.77 | bwd_inner: 3323.67 | bwd_allreduce: 1.04 | step: 7.32 89%|████████▊ | 8870/10000 [13:57:37<1:43:27, 5.49s/it] {'loss': 0.0, 'grad_norm': 9.146227239398286e-05, 'learning_rate': 1.3245253034711114e-06, 'epoch': 8.87} 89%|████████▊ | 8870/10000 [13:57:37<1:43:27, 5.49s/it][2025-06-20 03:27:21,953] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:27:21,953] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.68 | bwd_microstep: 3329.47 | bwd_inner_microstep: 3328.65 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.23 [2025-06-20 03:27:21,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.68 | bwd: 3329.48 | bwd_inner: 3328.65 | bwd_allreduce: 0.78 | step: 7.23 89%|████████▊ | 8871/10000 [13:57:42<1:43:20, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0022137267515063286, 'learning_rate': 1.3222082154089533e-06, 'epoch': 8.87} 89%|████████▊ | 8871/10000 [13:57:42<1:43:20, 5.49s/it][2025-06-20 03:27:27,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:27:27,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.34 | bwd_microstep: 3334.30 | bwd_inner_microstep: 3333.33 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.12 [2025-06-20 03:27:27,438] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.34 | bwd: 3334.31 | bwd_inner: 3333.33 | bwd_allreduce: 0.93 | step: 7.12 89%|████████▊ | 8872/10000 [13:57:48<1:43:12, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.03744121640920639, 'learning_rate': 1.3198930865604664e-06, 'epoch': 8.87} 89%|████████▊ | 8872/10000 [13:57:48<1:43:12, 5.49s/it][2025-06-20 03:27:33,006] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:27:33,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2142.40 | bwd_microstep: 3384.36 | bwd_inner_microstep: 3383.52 | bwd_allreduce_microstep: 0.79 | step_microstep: 6.94 [2025-06-20 03:27:33,007] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2142.40 | bwd: 3384.37 | bwd_inner: 3383.52 | bwd_allreduce: 0.81 | step: 6.94 89%|████████▊ | 8873/10000 [13:57:53<1:43:33, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.006585438270121813, 'learning_rate': 1.3175799171684988e-06, 
'epoch': 8.87} 89%|████████▊ | 8873/10000 [13:57:53<1:43:33, 5.51s/it][2025-06-20 03:27:38,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:27:38,493] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.55 | bwd_microstep: 3327.60 | bwd_inner_microstep: 3326.81 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.01 [2025-06-20 03:27:38,494] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.54 | bwd: 3327.61 | bwd_inner: 3326.81 | bwd_allreduce: 0.76 | step: 7.01 89%|████████▊ | 8874/10000 [13:57:59<1:43:18, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.007990416139364243, 'learning_rate': 1.315268707475692e-06, 'epoch': 8.87} 89%|████████▊ | 8874/10000 [13:57:59<1:43:18, 5.51s/it][2025-06-20 03:27:43,978] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 03:27:43,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.38 | bwd_microstep: 3335.67 | bwd_inner_microstep: 3334.66 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.03 [2025-06-20 03:27:43,979] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.38 | bwd: 3335.69 | bwd_inner: 3334.66 | bwd_allreduce: 0.98 | step: 7.03 89%|████████▉ | 8875/10000 [13:58:04<1:43:06, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0002660969039425254, 'learning_rate': 1.3129594577244742e-06, 'epoch': 8.88} 89%|████████▉ | 8875/10000 [13:58:04<1:43:06, 5.50s/it][2025-06-20 03:27:49,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:27:49,466] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.04 | bwd_microstep: 3334.69 | bwd_inner_microstep: 3333.78 | bwd_allreduce_microstep: 0.86 | step_microstep: 6.89 [2025-06-20 03:27:49,466] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.04 | bwd: 3334.70 | bwd_inner: 3333.78 | bwd_allreduce: 0.88 | step: 6.90 89%|████████▉ | 8876/10000 [13:58:10<1:42:56, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0006444866303354502, 'learning_rate': 1.3106521681570828e-06, 'epoch': 8.88} 89%|████████▉ | 8876/10000 [13:58:10<1:42:56, 5.50s/it][2025-06-20 03:27:55,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:27:55,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2141.45 | bwd_microstep: 3385.05 | bwd_inner_microstep: 3384.08 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.25 [2025-06-20 03:27:55,029] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2141.45 | bwd: 3385.07 | bwd_inner: 3384.08 | bwd_allreduce: 0.93 | step: 7.26 89%|████████▉ | 8877/10000 [13:58:15<1:43:15, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.018296191468834877, 'learning_rate': 1.3083468390155373e-06, 'epoch': 8.88} 89%|████████▉ | 8877/10000 [13:58:15<1:43:15, 5.52s/it][2025-06-20 03:28:00,520] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:28:00,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.76 | bwd_microstep: 3330.52 | bwd_inner_microstep: 3329.72 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.78 [2025-06-20 03:28:00,521] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.76 | bwd: 3330.53 | bwd_inner: 3329.72 | bwd_allreduce: 0.77 | step: 6.79 89%|████████▉ | 8878/10000 [13:58:21<1:43:00, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0007990147569216788, 'learning_rate': 1.3060434705416602e-06, 'epoch': 8.88} 89%|████████▉ | 8878/10000 [13:58:21<1:43:00, 5.51s/it][2025-06-20 03:28:06,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | 
optimizer_step: 2.73 [2025-06-20 03:28:06,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.97 | bwd_microstep: 3374.18 | bwd_inner_microstep: 3373.25 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.43 [2025-06-20 03:28:06,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.97 | bwd: 3374.20 | bwd_inner: 3373.25 | bwd_allreduce: 0.91 | step: 7.43 89%|████████▉ | 8879/10000 [13:58:26<1:43:07, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0007545933476649225, 'learning_rate': 1.3037420629770604e-06, 'epoch': 8.88} 89%|████████▉ | 8879/10000 [13:58:26<1:43:07, 5.52s/it][2025-06-20 03:28:11,613] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:28:11,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.33 | bwd_microstep: 3374.41 | bwd_inner_microstep: 3373.63 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 03:28:11,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.33 | bwd: 3374.43 | bwd_inner: 3373.63 | bwd_allreduce: 0.75 | step: 6.66 89%|████████▉ | 8880/10000 [13:58:32<1:43:11, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0004755358095280826, 'learning_rate': 1.301442616563149e-06, 'epoch': 8.88} 89%|████████▉ | 8880/10000 [13:58:32<1:43:11, 5.53s/it][2025-06-20 03:28:17,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:28:17,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.49 | bwd_microstep: 3330.99 | bwd_inner_microstep: 3330.02 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.20 [2025-06-20 03:28:17,090] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.49 | bwd: 3331.01 | bwd_inner: 3330.02 | bwd_allreduce: 0.93 | step: 7.20 89%|████████▉ | 8881/10000 [13:58:37<1:42:48, 5.51s/it] {'loss': 0.0001, 'grad_norm': 
0.015942325815558434, 'learning_rate': 1.2991451315411218e-06, 'epoch': 8.88} 89%|████████▉ | 8881/10000 [13:58:37<1:42:48, 5.51s/it][2025-06-20 03:28:22,564] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:28:22,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.38 | bwd_microstep: 3324.24 | bwd_inner_microstep: 3323.44 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.91 [2025-06-20 03:28:22,565] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.38 | bwd: 3324.25 | bwd_inner: 3323.44 | bwd_allreduce: 0.77 | step: 6.91 89%|████████▉ | 8882/10000 [13:58:43<1:42:30, 5.50s/it] {'loss': 0.0, 'grad_norm': 1.7900407328852452e-05, 'learning_rate': 1.2968496081519776e-06, 'epoch': 8.88} 89%|████████▉ | 8882/10000 [13:58:43<1:42:30, 5.50s/it][2025-06-20 03:28:28,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:28:28,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.85 | bwd_microstep: 3324.29 | bwd_inner_microstep: 3323.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-20 03:28:28,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.85 | bwd: 3324.31 | bwd_inner: 3323.50 | bwd_allreduce: 0.76 | step: 6.67 89%|████████▉ | 8883/10000 [13:58:48<1:42:13, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005480906460434198, 'learning_rate': 1.294556046636506e-06, 'epoch': 8.88} 89%|████████▉ | 8883/10000 [13:58:48<1:42:13, 5.49s/it][2025-06-20 03:28:33,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:28:33,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.37 | bwd_microstep: 3406.48 | bwd_inner_microstep: 3405.69 | bwd_allreduce_microstep: 0.74 | 
step_microstep: 6.68 [2025-06-20 03:28:33,614] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.37 | bwd: 3406.49 | bwd_inner: 3405.69 | bwd_allreduce: 0.76 | step: 6.68 89%|████████▉ | 8884/10000 [13:58:54<1:42:38, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0007556392229162157, 'learning_rate': 1.2922644472352875e-06, 'epoch': 8.88} 89%|████████▉ | 8884/10000 [13:58:54<1:42:38, 5.52s/it][2025-06-20 03:28:39,094] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:28:39,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.18 | bwd_microstep: 3333.47 | bwd_inner_microstep: 3332.47 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.58 [2025-06-20 03:28:39,095] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.18 | bwd: 3333.49 | bwd_inner: 3332.47 | bwd_allreduce: 0.96 | step: 7.58 89%|████████▉ | 8885/10000 [13:58:59<1:42:20, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0011107954196631908, 'learning_rate': 1.2899748101887099e-06, 'epoch': 8.88} 89%|████████▉ | 8885/10000 [13:58:59<1:42:20, 5.51s/it][2025-06-20 03:28:44,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:28:44,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.93 | bwd_microstep: 3314.52 | bwd_inner_microstep: 3313.73 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.67 [2025-06-20 03:28:44,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.94 | bwd: 3314.53 | bwd_inner: 3313.73 | bwd_allreduce: 0.76 | step: 6.67 89%|████████▉ | 8886/10000 [13:59:05<1:42:03, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0010764151811599731, 'learning_rate': 1.2876871357369324e-06, 'epoch': 8.89} 89%|████████▉ | 8886/10000 [13:59:05<1:42:03, 5.50s/it][2025-06-20 03:28:50,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.87 [2025-06-20 03:28:50,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.69 | bwd_microstep: 3325.92 | bwd_inner_microstep: 3325.12 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-20 03:28:50,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.69 | bwd: 3325.93 | bwd_inner: 3325.12 | bwd_allreduce: 0.77 | step: 7.06 89%|████████▉ | 8887/10000 [13:59:10<1:41:49, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.013992391526699066, 'learning_rate': 1.2854014241199298e-06, 'epoch': 8.89} 89%|████████▉ | 8887/10000 [13:59:10<1:41:49, 5.49s/it][2025-06-20 03:28:55,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:28:55,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.69 | bwd_microstep: 3321.81 | bwd_inner_microstep: 3321.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.64 [2025-06-20 03:28:55,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.69 | bwd: 3321.83 | bwd_inner: 3321.02 | bwd_allreduce: 0.76 | step: 6.65 89%|████████▉ | 8888/10000 [13:59:16<1:41:35, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.002734377747401595, 'learning_rate': 1.283117675577461e-06, 'epoch': 8.89} 89%|████████▉ | 8888/10000 [13:59:16<1:41:35, 5.48s/it][2025-06-20 03:29:01,036] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:29:01,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.54 | bwd_microstep: 3374.15 | bwd_inner_microstep: 3373.35 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-20 03:29:01,037] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.54 | bwd: 3374.17 | bwd_inner: 3373.35 | bwd_allreduce: 0.77 | step: 6.93 89%|████████▉ | 8889/10000 
[13:59:21<1:41:47, 5.50s/it] {'loss': 0.0002, 'grad_norm': 0.07717560976743698, 'learning_rate': 1.2808358903490793e-06, 'epoch': 8.89} 89%|████████▉ | 8889/10000 [13:59:21<1:41:47, 5.50s/it][2025-06-20 03:29:06,569] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:29:06,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.45 | bwd_microstep: 3362.70 | bwd_inner_microstep: 3361.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.05 [2025-06-20 03:29:06,570] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.45 | bwd: 3362.71 | bwd_inner: 3361.90 | bwd_allreduce: 0.77 | step: 7.05 89%|████████▉ | 8890/10000 [13:59:27<1:41:53, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0004624938010238111, 'learning_rate': 1.2785560686741372e-06, 'epoch': 8.89} 89%|████████▉ | 8890/10000 [13:59:27<1:41:53, 5.51s/it][2025-06-20 03:29:12,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:29:12,103] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.25 | bwd_microstep: 3369.41 | bwd_inner_microstep: 3368.62 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.66 [2025-06-20 03:29:12,104] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.25 | bwd: 3369.43 | bwd_inner: 3368.62 | bwd_allreduce: 0.76 | step: 6.66 89%|████████▉ | 8891/10000 [13:59:32<1:41:56, 5.52s/it] {'loss': 0.0003, 'grad_norm': 0.10468258708715439, 'learning_rate': 1.2762782107917726e-06, 'epoch': 8.89} 89%|████████▉ | 8891/10000 [13:59:32<1:41:56, 5.52s/it][2025-06-20 03:29:17,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 3.02 [2025-06-20 03:29:17,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.77 | bwd_microstep: 3381.06 | 
bwd_inner_microstep: 3380.14 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.23 [2025-06-20 03:29:17,651] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.77 | bwd: 3381.07 | bwd_inner: 3380.14 | bwd_allreduce: 0.89 | step: 7.23 89%|████████▉ | 8892/10000 [13:59:38<1:42:02, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0004028816765639931, 'learning_rate': 1.2740023169409233e-06, 'epoch': 8.89} 89%|████████▉ | 8892/10000 [13:59:38<1:42:02, 5.53s/it][2025-06-20 03:29:23,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:29:23,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.66 | bwd_microstep: 3314.66 | bwd_inner_microstep: 3313.71 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.19 [2025-06-20 03:29:23,109] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.66 | bwd: 3314.67 | bwd_inner: 3313.71 | bwd_allreduce: 0.92 | step: 7.19 89%|████████▉ | 8893/10000 [13:59:43<1:41:33, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00012020311260130256, 'learning_rate': 1.2717283873603225e-06, 'epoch': 8.89} 89%|████████▉ | 8893/10000 [13:59:43<1:41:33, 5.50s/it][2025-06-20 03:29:28,572] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:29:28,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.61 | bwd_microstep: 3315.85 | bwd_inner_microstep: 3315.06 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 03:29:28,573] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.61 | bwd: 3315.86 | bwd_inner: 3315.06 | bwd_allreduce: 0.76 | step: 6.72 89%|████████▉ | 8894/10000 [13:59:49<1:41:15, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00412350706756115, 'learning_rate': 1.269456422288491e-06, 'epoch': 8.89} 89%|████████▉ | 8894/10000 [13:59:49<1:41:15, 5.49s/it][2025-06-20 03:29:34,112] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 03:29:34,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.36 | bwd_microstep: 3368.18 | bwd_inner_microstep: 3367.25 | bwd_allreduce_microstep: 0.88 | step_microstep: 6.95 [2025-06-20 03:29:34,113] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.36 | bwd: 3368.20 | bwd_inner: 3367.25 | bwd_allreduce: 0.90 | step: 6.95 89%|████████▉ | 8895/10000 [13:59:54<1:41:24, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00018532411195337772, 'learning_rate': 1.2671864219637508e-06, 'epoch': 8.89} 89%|████████▉ | 8895/10000 [13:59:54<1:41:24, 5.51s/it][2025-06-20 03:29:39,577] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:29:39,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.77 | bwd_microstep: 3316.08 | bwd_inner_microstep: 3315.28 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.07 [2025-06-20 03:29:39,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.77 | bwd: 3316.09 | bwd_inner: 3315.28 | bwd_allreduce: 0.77 | step: 7.08 89%|████████▉ | 8896/10000 [14:00:00<1:41:05, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.028993548825383186, 'learning_rate': 1.2649183866242143e-06, 'epoch': 8.9} 89%|████████▉ | 8896/10000 [14:00:00<1:41:05, 5.49s/it][2025-06-20 03:29:45,040] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:29:45,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.29 | bwd_microstep: 3322.13 | bwd_inner_microstep: 3321.35 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 03:29:45,041] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.29 | bwd: 3322.15 | bwd_inner: 3321.35 | bwd_allreduce: 0.76 | step: 
6.63 89%|████████▉ | 8897/10000 [14:00:05<1:40:49, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.010863618925213814, 'learning_rate': 1.2626523165077863e-06, 'epoch': 8.9} 89%|████████▉ | 8897/10000 [14:00:05<1:40:49, 5.48s/it][2025-06-20 03:29:50,501] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:29:50,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.75 | bwd_microstep: 3321.31 | bwd_inner_microstep: 3320.51 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 03:29:50,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.75 | bwd: 3321.32 | bwd_inner: 3320.51 | bwd_allreduce: 0.77 | step: 6.78 89%|████████▉ | 8898/10000 [14:00:11<1:40:36, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0013535774778574705, 'learning_rate': 1.260388211852166e-06, 'epoch': 8.9} 89%|████████▉ | 8898/10000 [14:00:11<1:40:36, 5.48s/it][2025-06-20 03:29:56,033] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:29:56,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.66 | bwd_microstep: 3364.09 | bwd_inner_microstep: 3363.18 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.00 [2025-06-20 03:29:56,034] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.66 | bwd: 3364.11 | bwd_inner: 3363.18 | bwd_allreduce: 0.89 | step: 7.00 89%|████████▉ | 8899/10000 [14:00:16<1:40:48, 5.49s/it] {'loss': 0.0, 'grad_norm': 5.252546179690398e-05, 'learning_rate': 1.2581260728948496e-06, 'epoch': 8.9} 89%|████████▉ | 8899/10000 [14:00:16<1:40:48, 5.49s/it][2025-06-20 03:30:01,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:30:01,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.46 | bwd_microstep: 
3315.99 | bwd_inner_microstep: 3315.19 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.14 [2025-06-20 03:30:01,504] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.46 | bwd: 3316.01 | bwd_inner: 3315.19 | bwd_allreduce: 0.78 | step: 7.14 89%|████████▉ | 8900/10000 [14:00:22<1:40:35, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0002473387576173991, 'learning_rate': 1.2558658998731298e-06, 'epoch': 8.9} 89%|████████▉ | 8900/10000 [14:00:22<1:40:35, 5.49s/it][2025-06-20 03:30:07,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:30:07,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.54 | bwd_microstep: 3360.53 | bwd_inner_microstep: 3359.75 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.63 [2025-06-20 03:30:07,028] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.54 | bwd: 3360.54 | bwd_inner: 3359.75 | bwd_allreduce: 0.75 | step: 6.63 89%|████████▉ | 8901/10000 [14:00:27<1:40:42, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.013575181365013123, 'learning_rate': 1.2536076930240771e-06, 'epoch': 8.9} 89%|████████▉ | 8901/10000 [14:00:27<1:40:42, 5.50s/it][2025-06-20 03:30:12,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:30:12,485] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.41 | bwd_microstep: 3314.58 | bwd_inner_microstep: 3313.74 | bwd_allreduce_microstep: 0.78 | step_microstep: 6.96 [2025-06-20 03:30:12,486] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.41 | bwd: 3314.60 | bwd_inner: 3313.74 | bwd_allreduce: 0.80 | step: 6.97 89%|████████▉ | 8902/10000 [14:00:33<1:40:24, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0006626769318245351, 'learning_rate': 1.2513514525845727e-06, 'epoch': 8.9} 89%|████████▉ | 8902/10000 [14:00:33<1:40:24, 5.49s/it][2025-06-20 
03:30:18,018] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:30:18,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.98 | bwd_microstep: 3365.48 | bwd_inner_microstep: 3364.51 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.00 [2025-06-20 03:30:18,019] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.98 | bwd: 3365.49 | bwd_inner: 3364.51 | bwd_allreduce: 0.93 | step: 7.01 89%|████████▉ | 8903/10000 [14:00:38<1:40:33, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0002757575421128422, 'learning_rate': 1.2490971787912876e-06, 'epoch': 8.9} 89%|████████▉ | 8903/10000 [14:00:38<1:40:33, 5.50s/it][2025-06-20 03:30:23,561] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:30:23,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.58 | bwd_microstep: 3366.93 | bwd_inner_microstep: 3365.92 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.69 [2025-06-20 03:30:23,562] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.58 | bwd: 3366.94 | bwd_inner: 3365.92 | bwd_allreduce: 0.98 | step: 7.70 89%|████████▉ | 8904/10000 [14:00:44<1:40:42, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00020201770530547947, 'learning_rate': 1.2468448718806814e-06, 'epoch': 8.9} 89%|████████▉ | 8904/10000 [14:00:44<1:40:42, 5.51s/it][2025-06-20 03:30:29,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:30:29,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.36 | bwd_microstep: 3366.19 | bwd_inner_microstep: 3365.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 03:30:29,096] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.36 | bwd: 3366.20 | bwd_inner: 3365.40 | 
bwd_allreduce: 0.76 | step: 6.61 89%|████████▉ | 8905/10000 [14:00:49<1:40:43, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0031279721297323704, 'learning_rate': 1.2445945320890164e-06, 'epoch': 8.9} 89%|████████▉ | 8905/10000 [14:00:49<1:40:43, 5.52s/it][2025-06-20 03:30:34,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:30:34,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.05 | bwd_microstep: 3316.65 | bwd_inner_microstep: 3315.79 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.00 [2025-06-20 03:30:34,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.05 | bwd: 3316.67 | bwd_inner: 3315.79 | bwd_allreduce: 0.82 | step: 7.01 89%|████████▉ | 8906/10000 [14:00:55<1:40:17, 5.50s/it] {'loss': 0.0007, 'grad_norm': 0.4024524986743927, 'learning_rate': 1.2423461596523346e-06, 'epoch': 8.91} 89%|████████▉ | 8906/10000 [14:00:55<1:40:17, 5.50s/it][2025-06-20 03:30:40,083] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.49 | optimizer_step: 2.73 [2025-06-20 03:30:40,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.46 | bwd_microstep: 3365.52 | bwd_inner_microstep: 3364.74 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.56 [2025-06-20 03:30:40,084] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.46 | bwd: 3365.53 | bwd_inner: 3364.74 | bwd_allreduce: 0.75 | step: 6.56 89%|████████▉ | 8907/10000 [14:01:00<1:40:21, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0043058255687355995, 'learning_rate': 1.240099754806483e-06, 'epoch': 8.91} 89%|████████▉ | 8907/10000 [14:01:00<1:40:21, 5.51s/it][2025-06-20 03:30:45,622] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:30:45,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2137.33 | bwd_microstep: 3362.71 | bwd_inner_microstep: 3361.75 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.32 [2025-06-20 03:30:45,623] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.33 | bwd: 3362.73 | bwd_inner: 3361.75 | bwd_allreduce: 0.93 | step: 7.32 89%|████████▉ | 8908/10000 [14:01:06<1:40:26, 5.52s/it] {'loss': 0.0008, 'grad_norm': 0.19943338632583618, 'learning_rate': 1.2378553177871e-06, 'epoch': 8.91} 89%|████████▉ | 8908/10000 [14:01:06<1:40:26, 5.52s/it][2025-06-20 03:30:51,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 03:30:51,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.48 | bwd_microstep: 3368.58 | bwd_inner_microstep: 3367.68 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.49 [2025-06-20 03:30:51,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.48 | bwd: 3368.60 | bwd_inner: 3367.68 | bwd_allreduce: 0.87 | step: 7.50 89%|████████▉ | 8909/10000 [14:01:11<1:40:29, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.001801396836526692, 'learning_rate': 1.235612848829617e-06, 'epoch': 8.91} 89%|████████▉ | 8909/10000 [14:01:11<1:40:29, 5.53s/it][2025-06-20 03:30:56,707] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:30:56,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.46 | bwd_microstep: 3368.74 | bwd_inner_microstep: 3367.69 | bwd_allreduce_microstep: 0.97 | step_microstep: 7.23 [2025-06-20 03:30:56,708] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.46 | bwd: 3368.76 | bwd_inner: 3367.69 | bwd_allreduce: 1.00 | step: 7.23 89%|████████▉ | 8910/10000 [14:01:17<1:40:28, 5.53s/it] {'loss': 0.0, 'grad_norm': 3.401592039153911e-05, 'learning_rate': 1.2333723481692594e-06, 'epoch': 8.91} 89%|████████▉ | 8910/10000 
[14:01:17<1:40:28, 5.53s/it][2025-06-20 03:31:02,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:31:02,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.15 | bwd_microstep: 3356.88 | bwd_inner_microstep: 3356.09 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 03:31:02,237] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.15 | bwd: 3356.90 | bwd_inner: 3356.09 | bwd_allreduce: 0.76 | step: 6.64 89%|████████▉ | 8911/10000 [14:01:23<1:40:21, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0001088857461581938, 'learning_rate': 1.2311338160410413e-06, 'epoch': 8.91} 89%|████████▉ | 8911/10000 [14:01:23<1:40:21, 5.53s/it][2025-06-20 03:31:07,766] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 03:31:07,767] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.13 | bwd_microstep: 3364.72 | bwd_inner_microstep: 3363.62 | bwd_allreduce_microstep: 1.02 | step_microstep: 8.02 [2025-06-20 03:31:07,768] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.13 | bwd: 3364.74 | bwd_inner: 3363.62 | bwd_allreduce: 1.06 | step: 8.02 89%|████████▉ | 8912/10000 [14:01:28<1:40:18, 5.53s/it] {'loss': 0.0078, 'grad_norm': 5.262414455413818, 'learning_rate': 1.2288972526797772e-06, 'epoch': 8.91} 89%|████████▉ | 8912/10000 [14:01:28<1:40:18, 5.53s/it][2025-06-20 03:31:13,306] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:31:13,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.13 | bwd_microstep: 3356.61 | bwd_inner_microstep: 3355.68 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.40 [2025-06-20 03:31:13,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.13 | 
bwd: 3356.62 | bwd_inner: 3355.68 | bwd_allreduce: 0.90 | step: 7.40 89%|████████▉ | 8913/10000 [14:01:34<1:40:14, 5.53s/it] {'loss': 0.005, 'grad_norm': 1.368406891822815, 'learning_rate': 1.2266626583200725e-06, 'epoch': 8.91} 89%|████████▉ | 8913/10000 [14:01:34<1:40:14, 5.53s/it][2025-06-20 03:31:18,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:31:18,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.75 | bwd_microstep: 3313.63 | bwd_inner_microstep: 3312.82 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.75 [2025-06-20 03:31:18,778] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.75 | bwd: 3313.65 | bwd_inner: 3312.82 | bwd_allreduce: 0.79 | step: 6.75 89%|████████▉ | 8914/10000 [14:01:39<1:39:48, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00028658166411332786, 'learning_rate': 1.2244300331963266e-06, 'epoch': 8.91} 89%|████████▉ | 8914/10000 [14:01:39<1:39:48, 5.51s/it][2025-06-20 03:31:24,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:31:24,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.92 | bwd_microstep: 3310.97 | bwd_inner_microstep: 3310.18 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 03:31:24,233] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.92 | bwd: 3310.99 | bwd_inner: 3310.18 | bwd_allreduce: 0.77 | step: 6.77 89%|████████▉ | 8915/10000 [14:01:45<1:39:23, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.011769491247832775, 'learning_rate': 1.222199377542732e-06, 'epoch': 8.91} 89%|████████▉ | 8915/10000 [14:01:45<1:39:23, 5.50s/it][2025-06-20 03:31:29,758] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.72 [2025-06-20 03:31:29,759] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.88 | bwd_microstep: 3357.04 | bwd_inner_microstep: 3356.06 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.25 [2025-06-20 03:31:29,759] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.88 | bwd: 3357.05 | bwd_inner: 3356.06 | bwd_allreduce: 0.95 | step: 7.26 89%|████████▉ | 8916/10000 [14:01:50<1:39:28, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0003652571758721024, 'learning_rate': 1.2199706915932685e-06, 'epoch': 8.92} 89%|████████▉ | 8916/10000 [14:01:50<1:39:28, 5.51s/it][2025-06-20 03:31:35,227] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:31:35,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.90 | bwd_microstep: 3315.42 | bwd_inner_microstep: 3314.53 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.36 [2025-06-20 03:31:35,228] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.90 | bwd: 3315.44 | bwd_inner: 3314.53 | bwd_allreduce: 0.86 | step: 7.37 89%|████████▉ | 8917/10000 [14:01:56<1:39:11, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0019082825165241957, 'learning_rate': 1.2177439755817178e-06, 'epoch': 8.92} 89%|████████▉ | 8917/10000 [14:01:56<1:39:11, 5.50s/it][2025-06-20 03:31:40,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:31:40,789] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2136.95 | bwd_microstep: 3376.20 | bwd_inner_microstep: 3375.39 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.97 [2025-06-20 03:31:40,790] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2136.95 | bwd: 3376.21 | bwd_inner: 3375.39 | bwd_allreduce: 0.78 | step: 6.98 89%|████████▉ | 8918/10000 [14:02:01<1:39:26, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00023999436234589666, 'learning_rate': 1.215519229741655e-06, 
'epoch': 8.92} 89%|████████▉ | 8918/10000 [14:02:01<1:39:26, 5.51s/it][2025-06-20 03:31:46,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:31:46,242] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.71 | bwd_microstep: 3306.85 | bwd_inner_microstep: 3306.05 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 03:31:46,243] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.72 | bwd: 3306.86 | bwd_inner: 3306.05 | bwd_allreduce: 0.76 | step: 6.72 89%|████████▉ | 8919/10000 [14:02:07<1:39:00, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0018346315482631326, 'learning_rate': 1.213296454306443e-06, 'epoch': 8.92} 89%|████████▉ | 8919/10000 [14:02:07<1:39:00, 5.50s/it][2025-06-20 03:31:51,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:31:51,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.32 | bwd_microstep: 3375.51 | bwd_inner_microstep: 3374.70 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.19 [2025-06-20 03:31:51,788] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.32 | bwd: 3375.52 | bwd_inner: 3374.70 | bwd_allreduce: 0.77 | step: 7.19 89%|████████▉ | 8920/10000 [14:02:12<1:39:11, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0027195357251912355, 'learning_rate': 1.2110756495092434e-06, 'epoch': 8.92} 89%|████████▉ | 8920/10000 [14:02:12<1:39:11, 5.51s/it][2025-06-20 03:31:57,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:31:57,250] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.77 | bwd_microstep: 3318.52 | bwd_inner_microstep: 3317.63 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.04 [2025-06-20 03:31:57,251] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.77 | bwd: 3318.54 | bwd_inner: 3317.63 | bwd_allreduce: 0.86 | step: 7.05 89%|████████▉ | 8921/10000 [14:02:18<1:38:50, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00222779787145555, 'learning_rate': 1.2088568155830038e-06, 'epoch': 8.92} 89%|████████▉ | 8921/10000 [14:02:18<1:38:50, 5.50s/it][2025-06-20 03:32:02,786] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:32:02,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.07 | bwd_microstep: 3360.47 | bwd_inner_microstep: 3359.51 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.44 [2025-06-20 03:32:02,787] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.07 | bwd: 3360.49 | bwd_inner: 3359.51 | bwd_allreduce: 0.93 | step: 7.44 89%|████████▉ | 8922/10000 [14:02:23<1:38:58, 5.51s/it] {'loss': 0.0012, 'grad_norm': 0.28448212146759033, 'learning_rate': 1.2066399527604712e-06, 'epoch': 8.92} 89%|████████▉ | 8922/10000 [14:02:23<1:38:58, 5.51s/it][2025-06-20 03:32:08,263] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.66 | optimizer_step: 2.73 [2025-06-20 03:32:08,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.39 | bwd_microstep: 3315.32 | bwd_inner_microstep: 3314.37 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.51 [2025-06-20 03:32:08,264] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.39 | bwd: 3315.34 | bwd_inner: 3314.36 | bwd_allreduce: 0.92 | step: 7.51 89%|████████▉ | 8923/10000 [14:02:29<1:38:42, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.12323421239852905, 'learning_rate': 1.2044250612741859e-06, 'epoch': 8.92} 89%|████████▉ | 8923/10000 [14:02:29<1:38:42, 5.50s/it][2025-06-20 03:32:13,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | 
optimizer_step: 2.72 [2025-06-20 03:32:13,716] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.77 | bwd_microstep: 3306.35 | bwd_inner_microstep: 3305.56 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.74 [2025-06-20 03:32:13,717] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.77 | bwd: 3306.36 | bwd_inner: 3305.56 | bwd_allreduce: 0.76 | step: 6.75 89%|████████▉ | 8924/10000 [14:02:34<1:38:21, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00017545615264680237, 'learning_rate': 1.202212141356478e-06, 'epoch': 8.92} 89%|████████▉ | 8924/10000 [14:02:34<1:38:21, 5.48s/it][2025-06-20 03:32:19,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:32:19,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.35 | bwd_microstep: 3312.28 | bwd_inner_microstep: 3311.43 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.34 [2025-06-20 03:32:19,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.35 | bwd: 3312.29 | bwd_inner: 3311.43 | bwd_allreduce: 0.82 | step: 7.34 89%|████████▉ | 8925/10000 [14:02:39<1:38:05, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.00011365519458195195, 'learning_rate': 1.2000011932394773e-06, 'epoch': 8.93} 89%|████████▉ | 8925/10000 [14:02:39<1:38:05, 5.47s/it][2025-06-20 03:32:24,631] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:32:24,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.96 | bwd_microstep: 3317.43 | bwd_inner_microstep: 3316.64 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 03:32:24,632] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.96 | bwd: 3317.45 | bwd_inner: 3316.64 | bwd_allreduce: 0.77 | step: 6.77 89%|████████▉ | 8926/10000 [14:02:45<1:37:56, 5.47s/it] {'loss': 0.0, 'grad_norm': 
0.0002720166521612555, 'learning_rate': 1.1977922171550937e-06, 'epoch': 8.93} 89%|████████▉ | 8926/10000 [14:02:45<1:37:56, 5.47s/it][2025-06-20 03:32:30,081] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:32:30,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.45 | bwd_microstep: 3306.97 | bwd_inner_microstep: 3306.16 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.08 [2025-06-20 03:32:30,082] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.45 | bwd: 3306.98 | bwd_inner: 3306.17 | bwd_allreduce: 0.77 | step: 7.08 89%|████████▉ | 8927/10000 [14:02:50<1:37:44, 5.47s/it] {'loss': 0.0003, 'grad_norm': 0.08469607681035995, 'learning_rate': 1.195585213335042e-06, 'epoch': 8.93} 89%|████████▉ | 8927/10000 [14:02:50<1:37:44, 5.47s/it][2025-06-20 03:32:35,596] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:32:35,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.20 | bwd_microstep: 3353.43 | bwd_inner_microstep: 3352.64 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.61 [2025-06-20 03:32:35,597] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.20 | bwd: 3353.44 | bwd_inner: 3352.64 | bwd_allreduce: 0.75 | step: 6.61 89%|████████▉ | 8928/10000 [14:02:56<1:37:54, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0008285496733151376, 'learning_rate': 1.1933801820108282e-06, 'epoch': 8.93} 89%|████████▉ | 8928/10000 [14:02:56<1:37:54, 5.48s/it][2025-06-20 03:32:41,166] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:32:41,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.86 | bwd_microstep: 3397.30 | bwd_inner_microstep: 3396.41 | bwd_allreduce_microstep: 0.84 | 
step_microstep: 7.23 [2025-06-20 03:32:41,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.87 | bwd: 3397.32 | bwd_inner: 3396.41 | bwd_allreduce: 0.87 | step: 7.23 89%|████████▉ | 8929/10000 [14:03:01<1:38:18, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.00011359514610376209, 'learning_rate': 1.1911771234137515e-06, 'epoch': 8.93} 89%|████████▉ | 8929/10000 [14:03:01<1:38:18, 5.51s/it][2025-06-20 03:32:46,702] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:32:46,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.10 | bwd_microstep: 3367.28 | bwd_inner_microstep: 3366.43 | bwd_allreduce_microstep: 0.79 | step_microstep: 7.26 [2025-06-20 03:32:46,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.10 | bwd: 3367.30 | bwd_inner: 3366.43 | bwd_allreduce: 0.81 | step: 7.26 89%|████████▉ | 8930/10000 [14:03:07<1:38:22, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0023365879897028208, 'learning_rate': 1.188976037774896e-06, 'epoch': 8.93} 89%|████████▉ | 8930/10000 [14:03:07<1:38:22, 5.52s/it][2025-06-20 03:32:52,157] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:32:52,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.19 | bwd_microstep: 3316.56 | bwd_inner_microstep: 3315.76 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-20 03:32:52,158] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.19 | bwd: 3316.58 | bwd_inner: 3315.76 | bwd_allreduce: 0.77 | step: 6.95 89%|████████▉ | 8931/10000 [14:03:12<1:37:56, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.08681172132492065, 'learning_rate': 1.1867769253251527e-06, 'epoch': 8.93} 89%|████████▉ | 8931/10000 [14:03:12<1:37:56, 5.50s/it][2025-06-20 03:32:57,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:32:57,620] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.78 | bwd_microstep: 3319.30 | bwd_inner_microstep: 3318.50 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.13 [2025-06-20 03:32:57,621] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.78 | bwd: 3319.31 | bwd_inner: 3318.50 | bwd_allreduce: 0.77 | step: 7.13 89%|████████▉ | 8932/10000 [14:03:18<1:37:39, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.01366016361862421, 'learning_rate': 1.1845797862951946e-06, 'epoch': 8.93} 89%|████████▉ | 8932/10000 [14:03:18<1:37:39, 5.49s/it][2025-06-20 03:33:03,149] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:33:03,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.42 | bwd_microstep: 3361.18 | bwd_inner_microstep: 3360.33 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.38 [2025-06-20 03:33:03,150] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.42 | bwd: 3361.20 | bwd_inner: 3360.33 | bwd_allreduce: 0.82 | step: 7.38 89%|████████▉ | 8933/10000 [14:03:23<1:37:48, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.024440689012408257, 'learning_rate': 1.182384620915491e-06, 'epoch': 8.93} 89%|████████▉ | 8933/10000 [14:03:23<1:37:48, 5.50s/it][2025-06-20 03:33:08,598] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.72 [2025-06-20 03:33:08,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2100.49 | bwd_microstep: 3306.97 | bwd_inner_microstep: 3306.15 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.22 [2025-06-20 03:33:08,599] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2100.49 | bwd: 3306.98 | bwd_inner: 3306.15 | bwd_allreduce: 0.79 | step: 7.22 89%|████████▉ | 8934/10000 
[14:03:29<1:37:26, 5.48s/it] {'loss': 0.0003, 'grad_norm': 0.21968841552734375, 'learning_rate': 1.180191429416304e-06, 'epoch': 8.93} 89%|████████▉ | 8934/10000 [14:03:29<1:37:26, 5.48s/it][2025-06-20 03:33:14,134] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:33:14,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2132.70 | bwd_microstep: 3367.53 | bwd_inner_microstep: 3366.72 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.85 [2025-06-20 03:33:14,135] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2132.70 | bwd: 3367.55 | bwd_inner: 3366.72 | bwd_allreduce: 0.79 | step: 6.85 89%|████████▉ | 8935/10000 [14:03:34<1:37:37, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.012427865527570248, 'learning_rate': 1.1780002120276968e-06, 'epoch': 8.94} 89%|████████▉ | 8935/10000 [14:03:34<1:37:37, 5.50s/it][2025-06-20 03:33:19,600] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:33:19,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.16 | bwd_microstep: 3322.14 | bwd_inner_microstep: 3321.19 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.54 [2025-06-20 03:33:19,601] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.16 | bwd: 3322.16 | bwd_inner: 3321.19 | bwd_allreduce: 0.92 | step: 7.54 89%|████████▉ | 8936/10000 [14:03:40<1:37:21, 5.49s/it] {'loss': 0.0005, 'grad_norm': 0.08424276858568192, 'learning_rate': 1.1758109689795072e-06, 'epoch': 8.94} 89%|████████▉ | 8936/10000 [14:03:40<1:37:21, 5.49s/it][2025-06-20 03:33:25,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:33:25,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.69 | bwd_microstep: 3314.96 | 
bwd_inner_microstep: 3314.02 | bwd_allreduce_microstep: 0.89 | step_microstep: 6.87 [2025-06-20 03:33:25,061] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.69 | bwd: 3314.98 | bwd_inner: 3314.02 | bwd_allreduce: 0.91 | step: 6.87 89%|████████▉ | 8937/10000 [14:03:45<1:37:05, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0015031726798042655, 'learning_rate': 1.1736237005013807e-06, 'epoch': 8.94} 89%|████████▉ | 8937/10000 [14:03:45<1:37:05, 5.48s/it][2025-06-20 03:33:30,514] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:33:30,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.68 | bwd_microstep: 3311.99 | bwd_inner_microstep: 3311.10 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.54 [2025-06-20 03:33:30,515] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.68 | bwd: 3312.01 | bwd_inner: 3311.10 | bwd_allreduce: 0.85 | step: 7.54 89%|████████▉ | 8938/10000 [14:03:51<1:36:53, 5.47s/it] {'loss': 0.001, 'grad_norm': 0.2684032917022705, 'learning_rate': 1.1714384068227514e-06, 'epoch': 8.94} 89%|████████▉ | 8938/10000 [14:03:51<1:36:53, 5.47s/it][2025-06-20 03:33:35,972] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.79 [2025-06-20 03:33:35,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.93 | bwd_microstep: 3312.42 | bwd_inner_microstep: 3311.46 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.34 [2025-06-20 03:33:35,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.93 | bwd: 3312.44 | bwd_inner: 3311.46 | bwd_allreduce: 0.93 | step: 7.34 89%|████████▉ | 8939/10000 [14:03:56<1:36:41, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.011755832470953465, 'learning_rate': 1.1692550881728492e-06, 'epoch': 8.94} 89%|████████▉ | 8939/10000 [14:03:56<1:36:41, 5.47s/it][2025-06-20 03:33:41,428] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:33:41,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.89 | bwd_microstep: 3309.56 | bwd_inner_microstep: 3308.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-20 03:33:41,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.89 | bwd: 3309.57 | bwd_inner: 3308.75 | bwd_allreduce: 0.78 | step: 6.86 89%|████████▉ | 8940/10000 [14:04:02<1:36:32, 5.46s/it] {'loss': 0.0, 'grad_norm': 0.00344000943005085, 'learning_rate': 1.1670737447806913e-06, 'epoch': 8.94} 89%|████████▉ | 8940/10000 [14:04:02<1:36:32, 5.46s/it][2025-06-20 03:33:46,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:33:46,969] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.00 | bwd_microstep: 3377.40 | bwd_inner_microstep: 3376.47 | bwd_allreduce_microstep: 0.88 | step_microstep: 7.40 [2025-06-20 03:33:46,970] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.00 | bwd: 3377.41 | bwd_inner: 3376.47 | bwd_allreduce: 0.89 | step: 7.41 89%|████████▉ | 8941/10000 [14:04:07<1:36:51, 5.49s/it] {'loss': 0.0, 'grad_norm': 9.130519174505025e-05, 'learning_rate': 1.1648943768750943e-06, 'epoch': 8.94} 89%|████████▉ | 8941/10000 [14:04:07<1:36:51, 5.49s/it][2025-06-20 03:33:52,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:33:52,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.45 | bwd_microstep: 3323.57 | bwd_inner_microstep: 3322.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 03:33:52,442] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.45 | bwd: 3323.58 | bwd_inner: 3322.78 | bwd_allreduce: 0.76 | step: 
6.73 89%|████████▉ | 8942/10000 [14:04:13<1:36:40, 5.48s/it] {'loss': 0.0, 'grad_norm': 8.564987365389243e-05, 'learning_rate': 1.1627169846846575e-06, 'epoch': 8.94} 89%|████████▉ | 8942/10000 [14:04:13<1:36:40, 5.48s/it][2025-06-20 03:33:57,904] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.78 [2025-06-20 03:33:57,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.13 | bwd_microstep: 3321.98 | bwd_inner_microstep: 3321.02 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.54 [2025-06-20 03:33:57,905] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.13 | bwd: 3322.00 | bwd_inner: 3321.02 | bwd_allreduce: 0.93 | step: 7.55 89%|████████▉ | 8943/10000 [14:04:18<1:36:29, 5.48s/it] {'loss': 0.0, 'grad_norm': 7.739882858004421e-05, 'learning_rate': 1.160541568437783e-06, 'epoch': 8.94} 89%|████████▉ | 8943/10000 [14:04:18<1:36:29, 5.48s/it][2025-06-20 03:34:03,380] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:34:03,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.07 | bwd_microstep: 3320.07 | bwd_inner_microstep: 3319.29 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.81 [2025-06-20 03:34:03,381] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.07 | bwd: 3320.09 | bwd_inner: 3319.29 | bwd_allreduce: 0.76 | step: 6.81 89%|████████▉ | 8944/10000 [14:04:24<1:36:23, 5.48s/it] {'loss': 0.0, 'grad_norm': 3.858064883388579e-05, 'learning_rate': 1.158368128362659e-06, 'epoch': 8.94} 89%|████████▉ | 8944/10000 [14:04:24<1:36:23, 5.48s/it][2025-06-20 03:34:08,930] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:34:08,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.44 | 
bwd_microstep: 3375.95 | bwd_inner_microstep: 3375.03 | bwd_allreduce_microstep: 0.87 | step_microstep: 7.50 [2025-06-20 03:34:08,931] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.44 | bwd: 3375.97 | bwd_inner: 3375.03 | bwd_allreduce: 0.89 | step: 7.51 89%|████████▉ | 8945/10000 [14:04:29<1:36:41, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.010862820781767368, 'learning_rate': 1.1561966646872747e-06, 'epoch': 8.95} 89%|████████▉ | 8945/10000 [14:04:29<1:36:41, 5.50s/it][2025-06-20 03:34:14,389] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:34:14,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.40 | bwd_microstep: 3313.23 | bwd_inner_microstep: 3312.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 03:34:14,390] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.41 | bwd: 3313.24 | bwd_inner: 3312.45 | bwd_allreduce: 0.75 | step: 6.68 89%|████████▉ | 8946/10000 [14:04:35<1:36:22, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.043598756194114685, 'learning_rate': 1.1540271776394007e-06, 'epoch': 8.95} 89%|████████▉ | 8946/10000 [14:04:35<1:36:22, 5.49s/it][2025-06-20 03:34:19,856] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 03:34:19,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.42 | bwd_microstep: 3324.50 | bwd_inner_microstep: 3323.28 | bwd_allreduce_microstep: 1.15 | step_microstep: 8.21 [2025-06-20 03:34:19,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.42 | bwd: 3324.53 | bwd_inner: 3323.28 | bwd_allreduce: 1.18 | step: 8.22 89%|████████▉ | 8947/10000 [14:04:40<1:36:12, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.00047910932335071266, 'learning_rate': 1.1518596674466109e-06, 'epoch': 8.95} 89%|████████▉ | 8947/10000 [14:04:40<1:36:12, 
5.48s/it][2025-06-20 03:34:25,342] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:34:25,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.63 | bwd_microstep: 3325.52 | bwd_inner_microstep: 3324.72 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.75 [2025-06-20 03:34:25,343] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.63 | bwd: 3325.54 | bwd_inner: 3324.72 | bwd_allreduce: 0.78 | step: 6.76 89%|████████▉ | 8948/10000 [14:04:46<1:36:06, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0013280875282362103, 'learning_rate': 1.1496941343362656e-06, 'epoch': 8.95} 89%|████████▉ | 8948/10000 [14:04:46<1:36:06, 5.48s/it][2025-06-20 03:34:30,817] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:34:30,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.73 | bwd_microstep: 3327.74 | bwd_inner_microstep: 3326.80 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.39 [2025-06-20 03:34:30,818] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.73 | bwd: 3327.76 | bwd_inner: 3326.80 | bwd_allreduce: 0.91 | step: 7.40 89%|████████▉ | 8949/10000 [14:04:51<1:35:59, 5.48s/it] {'loss': 0.0001, 'grad_norm': 0.02236703410744667, 'learning_rate': 1.1475305785355184e-06, 'epoch': 8.95} 89%|████████▉ | 8949/10000 [14:04:51<1:35:59, 5.48s/it][2025-06-20 03:34:36,361] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:34:36,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.06 | bwd_microstep: 3367.09 | bwd_inner_microstep: 3366.12 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.69 [2025-06-20 03:34:36,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.06 | bwd: 3367.12 | 
bwd_inner: 3366.12 | bwd_allreduce: 0.93 | step: 7.69 90%|████████▉ | 8950/10000 [14:04:57<1:36:15, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.001149871852248907, 'learning_rate': 1.1453690002713147e-06, 'epoch': 8.95} 90%|████████▉ | 8950/10000 [14:04:57<1:36:15, 5.50s/it][2025-06-20 03:34:41,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:34:41,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.92 | bwd_microstep: 3366.47 | bwd_inner_microstep: 3365.69 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.57 [2025-06-20 03:34:41,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.92 | bwd: 3366.48 | bwd_inner: 3365.69 | bwd_allreduce: 0.75 | step: 6.57 90%|████████▉ | 8951/10000 [14:05:02<1:36:22, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.002734401496127248, 'learning_rate': 1.1432093997703998e-06, 'epoch': 8.95} 90%|████████▉ | 8951/10000 [14:05:02<1:36:22, 5.51s/it][2025-06-20 03:34:47,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 2.73 [2025-06-20 03:34:47,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.37 | bwd_microstep: 3372.54 | bwd_inner_microstep: 3371.53 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.31 [2025-06-20 03:34:47,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.37 | bwd: 3372.56 | bwd_inner: 3371.53 | bwd_allreduce: 0.98 | step: 7.32 90%|████████▉ | 8952/10000 [14:05:08<1:36:27, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.017539696767926216, 'learning_rate': 1.1410517772592989e-06, 'epoch': 8.95} 90%|████████▉ | 8952/10000 [14:05:08<1:36:27, 5.52s/it][2025-06-20 03:34:53,014] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:34:53,014] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2140.81 | bwd_microstep: 3375.65 | bwd_inner_microstep: 3374.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.93 [2025-06-20 03:34:53,015] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.81 | bwd: 3375.66 | bwd_inner: 3374.85 | bwd_allreduce: 0.77 | step: 6.94 90%|████████▉ | 8953/10000 [14:05:13<1:36:34, 5.53s/it] {'loss': 0.0001, 'grad_norm': 0.015284971334040165, 'learning_rate': 1.13889613296434e-06, 'epoch': 8.95} 90%|████████▉ | 8953/10000 [14:05:13<1:36:34, 5.53s/it][2025-06-20 03:34:58,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:34:58,496] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2117.05 | bwd_microstep: 3329.00 | bwd_inner_microstep: 3328.22 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.71 [2025-06-20 03:34:58,497] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2117.05 | bwd: 3329.02 | bwd_inner: 3328.22 | bwd_allreduce: 0.76 | step: 6.71 90%|████████▉ | 8954/10000 [14:05:19<1:36:12, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0003911840030923486, 'learning_rate': 1.1367424671116377e-06, 'epoch': 8.95} 90%|████████▉ | 8954/10000 [14:05:19<1:36:12, 5.52s/it][2025-06-20 03:35:03,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:35:03,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.26 | bwd_microstep: 3327.34 | bwd_inner_microstep: 3326.38 | bwd_allreduce_microstep: 0.92 | step_microstep: 6.84 [2025-06-20 03:35:03,966] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.26 | bwd: 3327.35 | bwd_inner: 3326.38 | bwd_allreduce: 0.93 | step: 6.84 90%|████████▉ | 8955/10000 [14:05:24<1:35:51, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0009473999380134046, 'learning_rate': 1.134590779927105e-06, 'epoch': 8.96} 90%|████████▉ | 
8955/10000 [14:05:24<1:35:51, 5.50s/it][2025-06-20 03:35:09,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:35:09,434] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.92 | bwd_microstep: 3325.92 | bwd_inner_microstep: 3325.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 03:35:09,435] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.92 | bwd: 3325.93 | bwd_inner: 3325.13 | bwd_allreduce: 0.76 | step: 6.65 90%|████████▉ | 8956/10000 [14:05:30<1:35:34, 5.49s/it] {'loss': 0.0, 'grad_norm': 5.7098528486676514e-05, 'learning_rate': 1.1324410716364453e-06, 'epoch': 8.96} 90%|████████▉ | 8956/10000 [14:05:30<1:35:34, 5.49s/it][2025-06-20 03:35:14,913] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:35:14,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2110.62 | bwd_microstep: 3331.54 | bwd_inner_microstep: 3330.58 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.45 [2025-06-20 03:35:14,914] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2110.62 | bwd: 3331.56 | bwd_inner: 3330.58 | bwd_allreduce: 0.93 | step: 7.45 90%|████████▉ | 8957/10000 [14:05:35<1:35:25, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005689479876309633, 'learning_rate': 1.1302933424651475e-06, 'epoch': 8.96} 90%|████████▉ | 8957/10000 [14:05:35<1:35:25, 5.49s/it][2025-06-20 03:35:20,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.70 | optimizer_step: 2.72 [2025-06-20 03:35:20,398] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.24 | bwd_microstep: 3328.02 | bwd_inner_microstep: 3327.02 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.19 [2025-06-20 03:35:20,399] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 
2111.24 | bwd: 3328.03 | bwd_inner: 3327.02 | bwd_allreduce: 0.97 | step: 7.20 90%|████████▉ | 8958/10000 [14:05:41<1:35:18, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.001357857952825725, 'learning_rate': 1.1281475926385022e-06, 'epoch': 8.96} 90%|████████▉ | 8958/10000 [14:05:41<1:35:18, 5.49s/it][2025-06-20 03:35:25,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:35:25,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.72 | bwd_microstep: 3367.18 | bwd_inner_microstep: 3366.17 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.24 [2025-06-20 03:35:25,946] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.72 | bwd: 3367.19 | bwd_inner: 3366.17 | bwd_allreduce: 0.98 | step: 7.23 90%|████████▉ | 8959/10000 [14:05:46<1:35:31, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.000252127880230546, 'learning_rate': 1.126003822381585e-06, 'epoch': 8.96} 90%|████████▉ | 8959/10000 [14:05:46<1:35:31, 5.51s/it][2025-06-20 03:35:31,429] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 03:35:31,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.84 | bwd_microstep: 3328.28 | bwd_inner_microstep: 3327.47 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.37 [2025-06-20 03:35:31,430] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.84 | bwd: 3328.30 | bwd_inner: 3327.47 | bwd_allreduce: 0.79 | step: 7.37 90%|████████▉ | 8960/10000 [14:05:52<1:35:19, 5.50s/it] {'loss': 0.0, 'grad_norm': 3.911094245268032e-05, 'learning_rate': 1.1238620319192717e-06, 'epoch': 8.96} 90%|████████▉ | 8960/10000 [14:05:52<1:35:19, 5.50s/it][2025-06-20 03:35:36,907] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:35:36,907] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.71 | bwd_microstep: 3331.59 | bwd_inner_microstep: 3330.79 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.72 [2025-06-20 03:35:36,908] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.71 | bwd: 3331.61 | bwd_inner: 3330.79 | bwd_allreduce: 0.77 | step: 6.72 90%|████████▉ | 8961/10000 [14:05:57<1:35:06, 5.49s/it] {'loss': 0.0003, 'grad_norm': 0.08913148194551468, 'learning_rate': 1.121722221476227e-06, 'epoch': 8.96} 90%|████████▉ | 8961/10000 [14:05:57<1:35:06, 5.49s/it][2025-06-20 03:35:42,450] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:35:42,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2131.59 | bwd_microstep: 3376.01 | bwd_inner_microstep: 3375.04 | bwd_allreduce_microstep: 0.91 | step_microstep: 7.06 [2025-06-20 03:35:42,451] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2131.59 | bwd: 3376.02 | bwd_inner: 3375.04 | bwd_allreduce: 0.93 | step: 7.07 90%|████████▉ | 8962/10000 [14:06:03<1:35:16, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.14914150536060333, 'learning_rate': 1.1195843912769e-06, 'epoch': 8.96} 90%|████████▉ | 8962/10000 [14:06:03<1:35:16, 5.51s/it][2025-06-20 03:35:47,926] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:35:47,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.38 | bwd_microstep: 3326.17 | bwd_inner_microstep: 3325.36 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.16 [2025-06-20 03:35:47,927] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.38 | bwd: 3326.19 | bwd_inner: 3325.36 | bwd_allreduce: 0.78 | step: 7.16 90%|████████▉ | 8963/10000 [14:06:08<1:35:01, 5.50s/it] {'loss': 0.0001, 'grad_norm': 0.0245752464979887, 'learning_rate': 1.117448541545545e-06, 
'epoch': 8.96} 90%|████████▉ | 8963/10000 [14:06:08<1:35:01, 5.50s/it][2025-06-20 03:35:53,416] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:35:53,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.30 | bwd_microstep: 3339.71 | bwd_inner_microstep: 3338.92 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.77 [2025-06-20 03:35:53,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.30 | bwd: 3339.72 | bwd_inner: 3338.92 | bwd_allreduce: 0.76 | step: 6.78 90%|████████▉ | 8964/10000 [14:06:14<1:34:53, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0016013955464586616, 'learning_rate': 1.1153146725062002e-06, 'epoch': 8.96} 90%|████████▉ | 8964/10000 [14:06:14<1:34:53, 5.50s/it][2025-06-20 03:35:58,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:35:58,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.01 | bwd_microstep: 3328.46 | bwd_inner_microstep: 3327.68 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 03:35:58,888] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.01 | bwd: 3328.48 | bwd_inner: 3327.68 | bwd_allreduce: 0.76 | step: 6.68 90%|████████▉ | 8965/10000 [14:06:19<1:34:40, 5.49s/it] {'loss': 0.0001, 'grad_norm': 0.02898312546312809, 'learning_rate': 1.1131827843827026e-06, 'epoch': 8.96} 90%|████████▉ | 8965/10000 [14:06:19<1:34:40, 5.49s/it][2025-06-20 03:36:04,439] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:36:04,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.64 | bwd_microstep: 3383.45 | bwd_inner_microstep: 3382.64 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.06 [2025-06-20 03:36:04,440] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.65 | bwd: 3383.46 | bwd_inner: 3382.64 | bwd_allreduce: 0.78 | step: 7.06 90%|████████▉ | 8966/10000 [14:06:25<1:34:54, 5.51s/it] {'loss': 0.0016, 'grad_norm': 0.41478583216667175, 'learning_rate': 1.1110528773986706e-06, 'epoch': 8.97} 90%|████████▉ | 8966/10000 [14:06:25<1:34:54, 5.51s/it][2025-06-20 03:36:09,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:36:09,990] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2130.39 | bwd_microstep: 3378.10 | bwd_inner_microstep: 3377.20 | bwd_allreduce_microstep: 0.85 | step_microstep: 7.32 [2025-06-20 03:36:09,991] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2130.39 | bwd: 3378.12 | bwd_inner: 3377.20 | bwd_allreduce: 0.87 | step: 7.32 90%|████████▉ | 8967/10000 [14:06:30<1:35:02, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0005397807690314949, 'learning_rate': 1.108924951777528e-06, 'epoch': 8.97} 90%|████████▉ | 8967/10000 [14:06:30<1:35:02, 5.52s/it][2025-06-20 03:36:15,558] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:36:15,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.94 | bwd_microstep: 3381.76 | bwd_inner_microstep: 3380.97 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.73 [2025-06-20 03:36:15,559] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.94 | bwd: 3381.78 | bwd_inner: 3380.97 | bwd_allreduce: 0.76 | step: 6.74 90%|████████▉ | 8968/10000 [14:06:36<1:35:11, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.004109051078557968, 'learning_rate': 1.1067990077424805e-06, 'epoch': 8.97} 90%|████████▉ | 8968/10000 [14:06:36<1:35:11, 5.53s/it][2025-06-20 03:36:21,114] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | 
optimizer_step: 2.73 [2025-06-20 03:36:21,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.29 | bwd_microstep: 3377.52 | bwd_inner_microstep: 3376.52 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.33 [2025-06-20 03:36:21,115] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.29 | bwd: 3377.54 | bwd_inner: 3376.52 | bwd_allreduce: 0.97 | step: 7.34 90%|████████▉ | 8969/10000 [14:06:41<1:35:13, 5.54s/it] {'loss': 0.0003, 'grad_norm': 0.11227856576442719, 'learning_rate': 1.1046750455165323e-06, 'epoch': 8.97} 90%|████████▉ | 8969/10000 [14:06:41<1:35:13, 5.54s/it][2025-06-20 03:36:26,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:36:26,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.18 | bwd_microstep: 3328.45 | bwd_inner_microstep: 3327.66 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.99 [2025-06-20 03:36:26,591] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.18 | bwd: 3328.47 | bwd_inner: 3327.66 | bwd_allreduce: 0.77 | step: 7.00 90%|████████▉ | 8970/10000 [14:06:47<1:34:47, 5.52s/it] {'loss': 0.0, 'grad_norm': 3.827766704489477e-05, 'learning_rate': 1.102553065322476e-06, 'epoch': 8.97} 90%|████████▉ | 8970/10000 [14:06:47<1:34:47, 5.52s/it][2025-06-20 03:36:32,065] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:36:32,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.75 | bwd_microstep: 3324.74 | bwd_inner_microstep: 3323.93 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.71 [2025-06-20 03:36:32,066] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.75 | bwd: 3324.75 | bwd_inner: 3323.93 | bwd_allreduce: 0.77 | step: 6.71 90%|████████▉ | 8971/10000 [14:06:52<1:34:27, 5.51s/it] {'loss': 0.0, 'grad_norm': 
0.0001057279368978925, 'learning_rate': 1.1004330673829e-06, 'epoch': 8.97} 90%|████████▉ | 8971/10000 [14:06:52<1:34:27, 5.51s/it][2025-06-20 03:36:37,540] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:36:37,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.72 | bwd_microstep: 3328.94 | bwd_inner_microstep: 3327.93 | bwd_allreduce_microstep: 0.95 | step_microstep: 7.29 [2025-06-20 03:36:37,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.72 | bwd: 3328.95 | bwd_inner: 3327.93 | bwd_allreduce: 0.97 | step: 7.30 90%|████████▉ | 8972/10000 [14:06:58<1:34:11, 5.50s/it] {'loss': 0.0003, 'grad_norm': 0.1114509254693985, 'learning_rate': 1.0983150519201757e-06, 'epoch': 8.97} 90%|████████▉ | 8972/10000 [14:06:58<1:34:11, 5.50s/it][2025-06-20 03:36:43,026] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:36:43,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.52 | bwd_microstep: 3327.04 | bwd_inner_microstep: 3326.05 | bwd_allreduce_microstep: 0.93 | step_microstep: 7.22 [2025-06-20 03:36:43,027] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.52 | bwd: 3327.05 | bwd_inner: 3326.05 | bwd_allreduce: 0.96 | step: 7.23 90%|████████▉ | 8973/10000 [14:07:03<1:34:02, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0005319517804309726, 'learning_rate': 1.0961990191564763e-06, 'epoch': 8.97} 90%|████████▉ | 8973/10000 [14:07:03<1:34:02, 5.49s/it][2025-06-20 03:36:48,502] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:36:48,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.95 | bwd_microstep: 3321.75 | bwd_inner_microstep: 3320.90 | bwd_allreduce_microstep: 0.81 | 
step_microstep: 7.09 [2025-06-20 03:36:48,503] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.94 | bwd: 3321.77 | bwd_inner: 3320.90 | bwd_allreduce: 0.83 | step: 7.09 90%|████████▉ | 8974/10000 [14:07:09<1:33:51, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.008746849372982979, 'learning_rate': 1.0940849693137667e-06, 'epoch': 8.97} 90%|████████▉ | 8974/10000 [14:07:09<1:33:51, 5.49s/it][2025-06-20 03:36:54,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:36:54,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2129.65 | bwd_microstep: 3371.35 | bwd_inner_microstep: 3370.57 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 03:36:54,048] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2129.65 | bwd: 3371.37 | bwd_inner: 3370.57 | bwd_allreduce: 0.75 | step: 6.72 90%|████████▉ | 8975/10000 [14:07:14<1:34:03, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.009714450687170029, 'learning_rate': 1.0919729026137982e-06, 'epoch': 8.97} 90%|████████▉ | 8975/10000 [14:07:14<1:34:03, 5.51s/it][2025-06-20 03:36:59,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.87 [2025-06-20 03:36:59,523] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2109.44 | bwd_microstep: 3327.97 | bwd_inner_microstep: 3327.19 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.03 [2025-06-20 03:36:59,524] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2109.44 | bwd: 3327.99 | bwd_inner: 3327.19 | bwd_allreduce: 0.76 | step: 7.03 90%|████████▉ | 8976/10000 [14:07:20<1:33:48, 5.50s/it] {'loss': 0.0, 'grad_norm': 4.5333254092838615e-05, 'learning_rate': 1.0898628192781201e-06, 'epoch': 8.98} 90%|████████▉ | 8976/10000 [14:07:20<1:33:48, 5.50s/it][2025-06-20 03:37:05,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | 
optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:37:05,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2128.82 | bwd_microstep: 3375.86 | bwd_inner_microstep: 3375.05 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-20 03:37:05,064] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2128.82 | bwd: 3375.88 | bwd_inner: 3375.05 | bwd_allreduce: 0.78 | step: 7.23 90%|████████▉ | 8977/10000 [14:07:25<1:33:56, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.018094785511493683, 'learning_rate': 1.0877547195280646e-06, 'epoch': 8.98} 90%|████████▉ | 8977/10000 [14:07:25<1:33:56, 5.51s/it][2025-06-20 03:37:10,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:37:10,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2105.95 | bwd_microstep: 3319.82 | bwd_inner_microstep: 3319.02 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.98 [2025-06-20 03:37:10,532] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2105.95 | bwd: 3319.84 | bwd_inner: 3319.02 | bwd_allreduce: 0.77 | step: 6.98 90%|████████▉ | 8978/10000 [14:07:31<1:33:38, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.010345500893890858, 'learning_rate': 1.0856486035847634e-06, 'epoch': 8.98} 90%|████████▉ | 8978/10000 [14:07:31<1:33:38, 5.50s/it][2025-06-20 03:37:16,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:37:16,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.54 | bwd_microstep: 3369.46 | bwd_inner_microstep: 3368.65 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.11 [2025-06-20 03:37:16,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.54 | bwd: 3369.48 | bwd_inner: 3368.65 | bwd_allreduce: 0.78 | step: 7.12 90%|████████▉ | 8979/10000 
[14:07:36<1:33:45, 5.51s/it] {'loss': 0.0005, 'grad_norm': 0.1254713386297226, 'learning_rate': 1.0835444716691424e-06, 'epoch': 8.98} 90%|████████▉ | 8979/10000 [14:07:36<1:33:45, 5.51s/it][2025-06-20 03:37:21,541] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.87 [2025-06-20 03:37:21,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.77 | bwd_microstep: 3323.35 | bwd_inner_microstep: 3322.55 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.41 [2025-06-20 03:37:21,542] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.77 | bwd: 3323.37 | bwd_inner: 3322.55 | bwd_allreduce: 0.77 | step: 7.41 90%|████████▉ | 8980/10000 [14:07:42<1:33:28, 5.50s/it] {'loss': 0.0, 'grad_norm': 6.742722325725481e-05, 'learning_rate': 1.081442324001909e-06, 'epoch': 8.98} 90%|████████▉ | 8980/10000 [14:07:42<1:33:28, 5.50s/it][2025-06-20 03:37:27,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:37:27,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.96 | bwd_microstep: 3325.12 | bwd_inner_microstep: 3324.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.68 [2025-06-20 03:37:27,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.96 | bwd: 3325.14 | bwd_inner: 3324.34 | bwd_allreduce: 0.76 | step: 6.68 90%|████████▉ | 8981/10000 [14:07:47<1:33:16, 5.49s/it] {'loss': 0.0007, 'grad_norm': 0.22667352855205536, 'learning_rate': 1.0793421608035736e-06, 'epoch': 8.98} 90%|████████▉ | 8981/10000 [14:07:47<1:33:16, 5.49s/it][2025-06-20 03:37:32,555] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:37:32,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2127.84 | bwd_microstep: 3368.73 | 
bwd_inner_microstep: 3367.92 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.27 [2025-06-20 03:37:32,556] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2127.84 | bwd: 3368.74 | bwd_inner: 3367.92 | bwd_allreduce: 0.78 | step: 7.27 90%|████████▉ | 8982/10000 [14:07:53<1:33:24, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0025003128685057163, 'learning_rate': 1.0772439822944337e-06, 'epoch': 8.98} 90%|████████▉ | 8982/10000 [14:07:53<1:33:24, 5.51s/it][2025-06-20 03:37:38,020] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.73 [2025-06-20 03:37:38,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2107.87 | bwd_microstep: 3315.93 | bwd_inner_microstep: 3314.95 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.30 [2025-06-20 03:37:38,021] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2107.87 | bwd: 3315.94 | bwd_inner: 3314.95 | bwd_allreduce: 0.95 | step: 7.30 90%|████████▉ | 8983/10000 [14:07:58<1:33:06, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.006770341191440821, 'learning_rate': 1.0751477886945749e-06, 'epoch': 8.98} 90%|████████▉ | 8983/10000 [14:07:58<1:33:06, 5.49s/it][2025-06-20 03:37:43,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:37:43,602] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2143.33 | bwd_microstep: 3395.81 | bwd_inner_microstep: 3395.01 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.89 [2025-06-20 03:37:43,603] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2143.33 | bwd: 3395.82 | bwd_inner: 3395.01 | bwd_allreduce: 0.76 | step: 6.89 90%|████████▉ | 8984/10000 [14:08:04<1:33:27, 5.52s/it] {'loss': 0.0001, 'grad_norm': 0.009119978174567223, 'learning_rate': 1.0730535802238796e-06, 'epoch': 8.98} 90%|████████▉ | 8984/10000 [14:08:04<1:33:27, 5.52s/it][2025-06-20 03:37:49,068] 
[INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:37:49,068] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2108.46 | bwd_microstep: 3315.92 | bwd_inner_microstep: 3315.13 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 03:37:49,069] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2108.46 | bwd: 3315.94 | bwd_inner: 3315.13 | bwd_allreduce: 0.76 | step: 6.66 90%|████████▉ | 8985/10000 [14:08:09<1:33:05, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.0005270623951219022, 'learning_rate': 1.0709613571020227e-06, 'epoch': 8.98} 90%|████████▉ | 8985/10000 [14:08:09<1:33:05, 5.50s/it][2025-06-20 03:37:54,645] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.72 [2025-06-20 03:37:54,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2139.02 | bwd_microstep: 3400.88 | bwd_inner_microstep: 3400.00 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.24 [2025-06-20 03:37:54,646] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2139.02 | bwd: 3400.89 | bwd_inner: 3400.00 | bwd_allreduce: 0.85 | step: 7.24 90%|████████▉ | 8986/10000 [14:08:15<1:33:23, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0008177408599294722, 'learning_rate': 1.068871119548469e-06, 'epoch': 8.99} 90%|████████▉ | 8986/10000 [14:08:15<1:33:23, 5.53s/it][2025-06-20 03:38:00,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:38:00,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.01 | bwd_microstep: 3373.13 | bwd_inner_microstep: 3372.34 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 03:38:00,188] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.01 | bwd: 3373.14 | bwd_inner: 3372.34 | bwd_allreduce: 0.76 
| step: 6.65 90%|████████▉ | 8987/10000 [14:08:20<1:33:22, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.0016465920489281416, 'learning_rate': 1.0667828677824699e-06, 'epoch': 8.99} 90%|████████▉ | 8987/10000 [14:08:20<1:33:22, 5.53s/it][2025-06-20 03:38:05,648] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:38:05,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2101.82 | bwd_microstep: 3322.08 | bwd_inner_microstep: 3321.11 | bwd_allreduce_microstep: 0.92 | step_microstep: 7.18 [2025-06-20 03:38:05,649] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2101.82 | bwd: 3322.10 | bwd_inner: 3321.11 | bwd_allreduce: 0.94 | step: 7.18 90%|████████▉ | 8988/10000 [14:08:26<1:32:55, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.02540220506489277, 'learning_rate': 1.0646966020230765e-06, 'epoch': 8.99} 90%|████████▉ | 8988/10000 [14:08:26<1:32:55, 5.51s/it][2025-06-20 03:38:11,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:38:11,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2104.13 | bwd_microstep: 3327.72 | bwd_inner_microstep: 3326.93 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.83 [2025-06-20 03:38:11,118] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2104.13 | bwd: 3327.73 | bwd_inner: 3326.93 | bwd_allreduce: 0.76 | step: 6.84 90%|████████▉ | 8989/10000 [14:08:31<1:32:38, 5.50s/it] {'loss': 0.002, 'grad_norm': 0.5849658846855164, 'learning_rate': 1.0626123224891272e-06, 'epoch': 8.99} 90%|████████▉ | 8989/10000 [14:08:31<1:32:38, 5.50s/it][2025-06-20 03:38:16,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:38:16,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2124.02 | 
bwd_microstep: 3372.58 | bwd_inner_microstep: 3371.69 | bwd_allreduce_microstep: 0.84 | step_microstep: 7.47 [2025-06-20 03:38:16,656] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2124.02 | bwd: 3372.59 | bwd_inner: 3371.69 | bwd_allreduce: 0.86 | step: 7.48 90%|████████▉ | 8990/10000 [14:08:37<1:32:44, 5.51s/it] {'loss': 0.0001, 'grad_norm': 0.033224985003471375, 'learning_rate': 1.060530029399256e-06, 'epoch': 8.99} 90%|████████▉ | 8990/10000 [14:08:37<1:32:44, 5.51s/it][2025-06-20 03:38:22,196] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:38:22,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.83 | bwd_microstep: 3363.53 | bwd_inner_microstep: 3362.59 | bwd_allreduce_microstep: 0.89 | step_microstep: 7.10 [2025-06-20 03:38:22,197] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.83 | bwd: 3363.55 | bwd_inner: 3362.59 | bwd_allreduce: 0.91 | step: 7.10 90%|████████▉ | 8991/10000 [14:08:42<1:32:48, 5.52s/it] {'loss': 0.0004, 'grad_norm': 0.11565851420164108, 'learning_rate': 1.0584497229718838e-06, 'epoch': 8.99} 90%|████████▉ | 8991/10000 [14:08:42<1:32:48, 5.52s/it][2025-06-20 03:38:27,654] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:38:27,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.74 | bwd_microstep: 3316.68 | bwd_inner_microstep: 3315.89 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.71 [2025-06-20 03:38:27,655] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.74 | bwd: 3316.69 | bwd_inner: 3315.89 | bwd_allreduce: 0.76 | step: 6.72 90%|████████▉ | 8992/10000 [14:08:48<1:32:24, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.005182490684092045, 'learning_rate': 1.0563714034252248e-06, 'epoch': 8.99} 90%|████████▉ | 8992/10000 [14:08:48<1:32:24, 
5.50s/it][2025-06-20 03:38:33,110] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:38:33,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.01 | bwd_microstep: 3315.96 | bwd_inner_microstep: 3315.15 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.12 [2025-06-20 03:38:33,111] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.01 | bwd: 3315.97 | bwd_inner: 3315.15 | bwd_allreduce: 0.78 | step: 7.12 90%|████████▉ | 8993/10000 [14:08:53<1:32:05, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.00019037035235669464, 'learning_rate': 1.0542950709772847e-06, 'epoch': 8.99} 90%|████████▉ | 8993/10000 [14:08:53<1:32:05, 5.49s/it][2025-06-20 03:38:38,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:38:38,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2103.00 | bwd_microstep: 3327.55 | bwd_inner_microstep: 3326.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.62 [2025-06-20 03:38:38,578] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2103.00 | bwd: 3327.56 | bwd_inner: 3326.76 | bwd_allreduce: 0.76 | step: 6.63 90%|████████▉ | 8994/10000 [14:08:59<1:31:54, 5.48s/it] {'loss': 0.0, 'grad_norm': 1.915279972308781e-05, 'learning_rate': 1.0522207258458627e-06, 'epoch': 8.99} 90%|████████▉ | 8994/10000 [14:08:59<1:31:54, 5.48s/it][2025-06-20 03:38:44,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:38:44,030] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2098.14 | bwd_microstep: 3318.24 | bwd_inner_microstep: 3317.45 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 03:38:44,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2098.14 | bwd: 3318.25 | 
bwd_inner: 3317.45 | bwd_allreduce: 0.76 | step: 6.62 90%|████████▉ | 8995/10000 [14:09:04<1:31:39, 5.47s/it] {'loss': 0.0, 'grad_norm': 0.000249512551818043, 'learning_rate': 1.050148368248547e-06, 'epoch': 8.99} 90%|████████▉ | 8995/10000 [14:09:04<1:31:39, 5.47s/it][2025-06-20 03:38:49,566] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.79 [2025-06-20 03:38:49,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.00 | bwd_microstep: 3374.15 | bwd_inner_microstep: 3373.28 | bwd_allreduce_microstep: 0.82 | step_microstep: 7.41 [2025-06-20 03:38:49,567] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.00 | bwd: 3374.16 | bwd_inner: 3373.28 | bwd_allreduce: 0.84 | step: 7.41 90%|████████▉ | 8996/10000 [14:09:10<1:31:54, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0004168431623838842, 'learning_rate': 1.0480779984027212e-06, 'epoch': 9.0} 90%|████████▉ | 8996/10000 [14:09:10<1:31:54, 5.49s/it][2025-06-20 03:38:55,031] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:38:55,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.98 | bwd_microstep: 3320.48 | bwd_inner_microstep: 3319.69 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 03:38:55,032] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.98 | bwd: 3320.50 | bwd_inner: 3319.69 | bwd_allreduce: 0.76 | step: 6.76 90%|████████▉ | 8997/10000 [14:09:15<1:31:39, 5.48s/it] {'loss': 0.0, 'grad_norm': 0.0007903202204033732, 'learning_rate': 1.0460096165255496e-06, 'epoch': 9.0} 90%|████████▉ | 8997/10000 [14:09:15<1:31:39, 5.48s/it][2025-06-20 03:39:00,551] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:39:00,552] [INFO] [logging.py:128:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 2123.25 | bwd_microstep: 3359.85 | bwd_inner_microstep: 3359.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.66 [2025-06-20 03:39:00,552] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2123.25 | bwd: 3359.87 | bwd_inner: 3359.07 | bwd_allreduce: 0.75 | step: 6.67 90%|████████▉ | 8998/10000 [14:09:21<1:31:45, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.0005722219939343631, 'learning_rate': 1.0439432228340051e-06, 'epoch': 9.0} 90%|████████▉ | 8998/10000 [14:09:21<1:31:45, 5.49s/it][2025-06-20 03:39:06,001] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:39:06,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.32 | bwd_microstep: 3310.05 | bwd_inner_microstep: 3309.24 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.02 [2025-06-20 03:39:06,002] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.32 | bwd: 3310.07 | bwd_inner: 3309.24 | bwd_allreduce: 0.78 | step: 7.02 90%|████████▉ | 8999/10000 [14:09:26<1:31:26, 5.48s/it] {'loss': 0.0, 'grad_norm': 4.4903008529217914e-05, 'learning_rate': 1.0418788175448347e-06, 'epoch': 9.0} 90%|████████▉ | 8999/10000 [14:09:26<1:31:26, 5.48s/it]petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. petrel_client is not installed. If you read data locally instead of from ceph, ignore it. Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
[2025-06-20 03:39:13,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:39:13,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2093.67 | bwd_microstep: 3312.19 | bwd_inner_microstep: 3311.40 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.94 [2025-06-20 03:39:13,077] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2093.67 | bwd: 3312.20 | bwd_inner: 3311.40 | bwd_allreduce: 0.77 | step: 6.95 90%|█████████ | 9000/10000 [14:09:33<1:39:19, 5.96s/it] {'loss': 0.0, 'grad_norm': 0.0002638440055307001, 'learning_rate': 1.0398164008745916e-06, 'epoch': 9.0} 90%|█████████ | 9000/10000 [14:09:33<1:39:19, 5.96s/it]evaluate! [INFO|trainer.py:3910] 2025-06-20 03:39:23,244 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-20 03:39:23,248 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-20 03:39:23,249 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-20 03:40:23,738 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-20 03:40:23,741 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-20 03:40:23,741 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-20 03:40:23,742 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json evaluate! 
[INFO|trainer.py:3910] 2025-06-20 03:40:39,501 >> Saving model checkpoint to weights/st1/mos0_st1 [INFO|configuration_utils.py:420] 2025-06-20 03:40:39,506 >> Configuration saved in weights/st1/mos0_st1/config.json [INFO|configuration_utils.py:909] 2025-06-20 03:40:39,506 >> Configuration saved in weights/st1/mos0_st1/generation_config.json [INFO|modeling_utils.py:2996] 2025-06-20 03:41:40,602 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at weights/st1/mos0_st1/model.safetensors.index.json. [INFO|tokenization_utils_base.py:2491] 2025-06-20 03:41:40,606 >> tokenizer config file saved in weights/st1/mos0_st1/tokenizer_config.json [INFO|tokenization_utils_base.py:2500] 2025-06-20 03:41:40,606 >> Special tokens file saved in weights/st1/mos0_st1/special_tokens_map.json [INFO|tokenization_utils_base.py:2553] 2025-06-20 03:41:40,607 >> added tokens file saved in weights/st1/mos0_st1/added_tokens.json [2025-06-20 03:41:45,620] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 03:41:51,673] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 03:41:57,479] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto 
detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 03:42:03,504] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect) /home/wangjiarui/.conda/envs/intern25/lib/python3.9/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning) [2025-06-20 03:42:22,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:42:22,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2078.36 | bwd_microstep: 3269.14 | bwd_inner_microstep: 3268.32 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.14 [2025-06-20 03:42:22,362] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2078.30 | bwd: 3269.15 | bwd_inner: 3268.32 | bwd_allreduce: 0.79 | step: 7.14 90%|█████████ | 9001/10000 [14:12:43<16:54:56, 60.96s/it] {'loss': 0.0, 'grad_norm': 6.63836399326101e-05, 'learning_rate': 1.0377559730396114e-06, 'epoch': 9.0} 90%|█████████ | 9001/10000 [14:12:43<16:54:56, 60.96s/it][2025-06-20 03:42:27,880] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.57 | optimizer_step: 2.73 [2025-06-20 03:42:27,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.52 | bwd_microstep: 3357.68 | bwd_inner_microstep: 3356.76 | bwd_allreduce_microstep: 0.84 | step_microstep: 8.13 [2025-06-20 03:42:27,882] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.52 | bwd: 3357.71 | bwd_inner: 3356.76 | bwd_allreduce: 
0.88 | step: 8.13 90%|█████████ | 9002/10000 [14:12:48<12:17:18, 44.33s/it] {'loss': 0.0, 'grad_norm': 0.004730382468551397, 'learning_rate': 1.0356975342560217e-06, 'epoch': 9.0} 90%|█████████ | 9002/10000 [14:12:48<12:17:18, 44.33s/it][2025-06-20 03:42:33,307] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:42:33,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2097.45 | bwd_microstep: 3280.05 | bwd_inner_microstep: 3279.24 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.28 [2025-06-20 03:42:33,308] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2097.45 | bwd: 3280.07 | bwd_inner: 3279.24 | bwd_allreduce: 0.79 | step: 7.28 90%|█████████ | 9003/10000 [14:12:54<9:02:38, 32.66s/it] {'loss': 0.0, 'grad_norm': 0.003144819289445877, 'learning_rate': 1.0336410847397449e-06, 'epoch': 9.0} 90%|█████████ | 9003/10000 [14:12:54<9:02:38, 32.66s/it][2025-06-20 03:42:38,732] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:42:38,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2094.13 | bwd_microstep: 3286.56 | bwd_inner_microstep: 3285.75 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.32 [2025-06-20 03:42:38,733] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2094.13 | bwd: 3286.57 | bwd_inner: 3285.75 | bwd_allreduce: 0.78 | step: 7.32 90%|█████████ | 9004/10000 [14:12:59<6:46:28, 24.49s/it] {'loss': 0.0, 'grad_norm': 8.237641304731369e-05, 'learning_rate': 1.0315866247064932e-06, 'epoch': 9.0} 90%|█████████ | 9004/10000 [14:12:59<6:46:28, 24.49s/it][2025-06-20 03:42:44,167] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.58 | optimizer_step: 2.73 [2025-06-20 03:42:44,168] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2089.09 | bwd_microstep: 3300.96 | bwd_inner_microstep: 3300.03 | bwd_allreduce_microstep: 0.85 | step_microstep: 8.16 [2025-06-20 03:42:44,169] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2089.09 | bwd: 3300.98 | bwd_inner: 3300.03 | bwd_allreduce: 0.89 | step: 8.18 90%|█████████ | 9005/10000 [14:13:04<5:11:18, 18.77s/it] {'loss': 0.0, 'grad_norm': 0.0030509592033922672, 'learning_rate': 1.0295341543717696e-06, 'epoch': 9.01} 90%|█████████ | 9005/10000 [14:13:04<5:11:18, 18.77s/it][2025-06-20 03:42:49,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:42:49,683] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2126.20 | bwd_microstep: 3345.68 | bwd_inner_microstep: 3344.87 | bwd_allreduce_microstep: 0.77 | step_microstep: 7.10 [2025-06-20 03:42:49,684] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2126.20 | bwd: 3345.70 | bwd_inner: 3344.87 | bwd_allreduce: 0.78 | step: 7.10 90%|█████████ | 9006/10000 [14:13:10<4:05:04, 14.79s/it] {'loss': 0.0001, 'grad_norm': 0.03095971792936325, 'learning_rate': 1.0274836739508687e-06, 'epoch': 9.01} 90%|█████████ | 9006/10000 [14:13:10<4:05:04, 14.79s/it][2025-06-20 03:42:55,199] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:42:55,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2116.87 | bwd_microstep: 3360.87 | bwd_inner_microstep: 3359.96 | bwd_allreduce_microstep: 0.86 | step_microstep: 7.45 [2025-06-20 03:42:55,200] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2116.87 | bwd: 3360.89 | bwd_inner: 3359.96 | bwd_allreduce: 0.88 | step: 7.46 90%|█████████ | 9007/10000 [14:13:15<3:18:46, 12.01s/it] {'loss': 0.0001, 'grad_norm': 0.035169295966625214, 'learning_rate': 1.0254351836588783e-06, 'epoch': 9.01} 90%|█████████ | 9007/10000 
[14:13:16<3:18:46, 12.01s/it][2025-06-20 03:43:00,689] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:43:00,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2119.09 | bwd_microstep: 3330.79 | bwd_inner_microstep: 3329.99 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.90 [2025-06-20 03:43:00,690] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2119.09 | bwd: 3330.80 | bwd_inner: 3329.99 | bwd_allreduce: 0.76 | step: 6.90 90%|█████████ | 9008/10000 [14:13:21<2:46:13, 10.05s/it] {'loss': 0.0, 'grad_norm': 0.0006914135883562267, 'learning_rate': 1.023388683710671e-06, 'epoch': 9.01} 90%|█████████ | 9008/10000 [14:13:21<2:46:13, 10.05s/it][2025-06-20 03:43:06,189] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.77 [2025-06-20 03:43:06,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.93 | bwd_microstep: 3345.81 | bwd_inner_microstep: 3345.00 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.13 [2025-06-20 03:43:06,190] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.93 | bwd: 3345.82 | bwd_inner: 3345.00 | bwd_allreduce: 0.78 | step: 7.14 90%|█████████ | 9009/10000 [14:13:26<2:23:29, 8.69s/it] {'loss': 0.0, 'grad_norm': 0.00015559584426227957, 'learning_rate': 1.0213441743209173e-06, 'epoch': 9.01} 90%|█████████ | 9009/10000 [14:13:26<2:23:29, 8.69s/it][2025-06-20 03:43:11,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:43:11,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2085.90 | bwd_microstep: 3291.17 | bwd_inner_microstep: 3290.29 | bwd_allreduce_microstep: 0.80 | step_microstep: 7.49 [2025-06-20 03:43:11,607] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2085.90 
| bwd: 3291.19 | bwd_inner: 3290.29 | bwd_allreduce: 0.83 | step: 7.50 90%|█████████ | 9010/10000 [14:13:32<2:07:10, 7.71s/it] {'loss': 0.0, 'grad_norm': 7.716899563092738e-05, 'learning_rate': 1.019301655704077e-06, 'epoch': 9.01} 90%|█████████ | 9010/10000 [14:13:32<2:07:10, 7.71s/it][2025-06-20 03:43:17,107] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:43:17,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.41 | bwd_microstep: 3346.46 | bwd_inner_microstep: 3345.67 | bwd_allreduce_microstep: 0.74 | step_microstep: 7.13 [2025-06-20 03:43:17,108] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.41 | bwd: 3346.47 | bwd_inner: 3345.67 | bwd_allreduce: 0.76 | step: 7.13 90%|█████████ | 9011/10000 [14:13:37<1:56:08, 7.05s/it] {'loss': 0.0, 'grad_norm': 0.0019349943613633513, 'learning_rate': 1.0172611280744004e-06, 'epoch': 9.01} 90%|█████████ | 9011/10000 [14:13:37<1:56:08, 7.05s/it][2025-06-20 03:43:22,534] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:43:22,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2088.61 | bwd_microstep: 3295.02 | bwd_inner_microstep: 3294.22 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.21 [2025-06-20 03:43:22,535] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2088.61 | bwd: 3295.04 | bwd_inner: 3294.22 | bwd_allreduce: 0.77 | step: 7.21 90%|█████████ | 9012/10000 [14:13:43<1:48:00, 6.56s/it] {'loss': 0.0002, 'grad_norm': 0.1177409365773201, 'learning_rate': 1.0152225916459346e-06, 'epoch': 9.01} 90%|█████████ | 9012/10000 [14:13:43<1:48:00, 6.56s/it][2025-06-20 03:43:28,044] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.72 [2025-06-20 03:43:28,044] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2121.17 | bwd_microstep: 3348.57 | bwd_inner_microstep: 3347.78 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.62 [2025-06-20 03:43:28,045] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2121.17 | bwd: 3348.59 | bwd_inner: 3347.78 | bwd_allreduce: 0.76 | step: 6.62 90%|█████████ | 9013/10000 [14:13:48<1:42:42, 6.24s/it] {'loss': 0.0, 'grad_norm': 0.0005829287110827863, 'learning_rate': 1.013186046632504e-06, 'epoch': 9.01} 90%|█████████ | 9013/10000 [14:13:48<1:42:42, 6.24s/it][h264 @ 0x2f41e440] Reference 5 >= 5 [h264 @ 0x2f41e440] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x30b7ed40] left block unavailable for requested intra mode [h264 @ 0x30b7ed40] error while decoding MB 0 25, bytestream 45493 [h264 @ 0x30a049c0] Reference 5 >= 5 [h264 @ 0x30a049c0] error while decoding MB 15 42, bytestream 9292 [h264 @ 0x30a049c0] left block unavailable for requested intra mode [h264 @ 0x30a049c0] error while decoding MB 0 25, bytestream 45493 [2025-06-20 03:43:33,544] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:43:33,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.06 | bwd_microstep: 3347.67 | bwd_inner_microstep: 3346.86 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.23 [2025-06-20 03:43:33,545] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.06 | bwd: 3347.69 | bwd_inner: 3346.86 | bwd_allreduce: 0.78 | step: 7.23 90%|█████████ | 9014/10000 [14:13:54<1:38:57, 6.02s/it] {'loss': 0.0, 'grad_norm': 4.839798930333927e-05, 'learning_rate': 1.0111514932477352e-06, 'epoch': 9.01} 90%|█████████ | 9014/10000 [14:13:54<1:38:57, 6.02s/it][2025-06-20 03:43:39,062] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:43:39,063] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2122.98 | bwd_microstep: 3357.19 | bwd_inner_microstep: 3356.40 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.64 [2025-06-20 03:43:39,063] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2122.98 | bwd: 3357.20 | bwd_inner: 3356.40 | bwd_allreduce: 0.76 | step: 6.64 90%|█████████ | 9015/10000 [14:13:59<1:36:21, 5.87s/it] {'loss': 0.0, 'grad_norm': 0.007119529880583286, 'learning_rate': 1.0091189317050465e-06, 'epoch': 9.02} 90%|█████████ | 9015/10000 [14:13:59<1:36:21, 5.87s/it][2025-06-20 03:43:44,581] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.54 | optimizer_step: 2.73 [2025-06-20 03:43:44,582] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2125.14 | bwd_microstep: 3356.86 | bwd_inner_microstep: 3355.96 | bwd_allreduce_microstep: 0.83 | step_microstep: 7.64 [2025-06-20 03:43:44,583] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2125.14 | bwd: 3356.88 | bwd_inner: 3355.96 | bwd_allreduce: 0.86 | step: 7.64 90%|█████████ | 9016/10000 [14:14:05<1:34:33, 5.77s/it] {'loss': 0.0, 'grad_norm': 0.0005384592805057764, 'learning_rate': 1.0070883622176409e-06, 'epoch': 9.02} 90%|█████████ | 9016/10000 [14:14:05<1:34:33, 5.77s/it][2025-06-20 03:43:50,123] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.55 | optimizer_step: 2.72 [2025-06-20 03:43:50,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.05 | bwd_microstep: 3365.30 | bwd_inner_microstep: 3364.23 | bwd_allreduce_microstep: 1.00 | step_microstep: 7.29 [2025-06-20 03:43:50,124] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.05 | bwd: 3365.31 | bwd_inner: 3364.23 | bwd_allreduce: 1.03 | step: 7.30 90%|█████████ | 9017/10000 [14:14:10<1:33:21, 5.70s/it] {'loss': 0.0001, 'grad_norm': 0.02468426711857319, 'learning_rate': 1.0050597849985188e-06, 
'epoch': 9.02} 90%|█████████ | 9017/10000 [14:14:10<1:33:21, 5.70s/it][2025-06-20 03:43:55,605] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:43:55,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.91 | bwd_microstep: 3323.56 | bwd_inner_microstep: 3322.76 | bwd_allreduce_microstep: 0.75 | step_microstep: 7.03 [2025-06-20 03:43:55,606] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.91 | bwd: 3323.57 | bwd_inner: 3322.76 | bwd_allreduce: 0.77 | step: 7.03 90%|█████████ | 9018/10000 [14:14:16<1:32:11, 5.63s/it] {'loss': 0.0, 'grad_norm': 0.006328308954834938, 'learning_rate': 1.003033200260466e-06, 'epoch': 9.02} 90%|█████████ | 9018/10000 [14:14:16<1:32:11, 5.63s/it][2025-06-20 03:44:01,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:44:01,075] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.00 | bwd_microstep: 3315.41 | bwd_inner_microstep: 3314.42 | bwd_allreduce_microstep: 0.94 | step_microstep: 7.10 [2025-06-20 03:44:01,076] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.00 | bwd: 3315.42 | bwd_inner: 3314.42 | bwd_allreduce: 0.96 | step: 7.11 90%|█████████ | 9019/10000 [14:14:21<1:31:17, 5.58s/it] {'loss': 0.0, 'grad_norm': 0.0008812045562081039, 'learning_rate': 1.0010086082160653e-06, 'epoch': 9.02} 90%|█████████ | 9019/10000 [14:14:21<1:31:17, 5.58s/it][2025-06-20 03:44:06,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.62 | optimizer_step: 2.72 [2025-06-20 03:44:06,549] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2113.90 | bwd_microstep: 3317.96 | bwd_inner_microstep: 3316.87 | bwd_allreduce_microstep: 1.03 | step_microstep: 7.70 [2025-06-20 03:44:06,550] [INFO] 
[logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2113.90 | bwd: 3317.98 | bwd_inner: 3316.87 | bwd_allreduce: 1.05 | step: 7.70 90%|█████████ | 9020/10000 [14:14:27<1:30:40, 5.55s/it] {'loss': 0.0, 'grad_norm': 0.00952841155230999, 'learning_rate': 9.989860090776827e-07, 'epoch': 9.02} 90%|█████████ | 9020/10000 [14:14:27<1:30:40, 5.55s/it][2025-06-20 03:44:12,088] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:44:12,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2134.08 | bwd_microstep: 3359.65 | bwd_inner_microstep: 3358.85 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.75 [2025-06-20 03:44:12,089] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2134.08 | bwd: 3359.67 | bwd_inner: 3358.85 | bwd_allreduce: 0.77 | step: 6.75 90%|█████████ | 9021/10000 [14:14:32<1:30:30, 5.55s/it] {'loss': 0.0, 'grad_norm': 0.0020448348950594664, 'learning_rate': 9.969654030574794e-07, 'epoch': 9.02} 90%|█████████ | 9021/10000 [14:14:32<1:30:30, 5.55s/it][2025-06-20 03:44:17,638] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.64 | optimizer_step: 2.72 [2025-06-20 03:44:17,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2133.65 | bwd_microstep: 3375.72 | bwd_inner_microstep: 3374.50 | bwd_allreduce_microstep: 1.13 | step_microstep: 8.56 [2025-06-20 03:44:17,639] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2133.65 | bwd: 3375.74 | bwd_inner: 3374.50 | bwd_allreduce: 1.17 | step: 8.57 90%|█████████ | 9022/10000 [14:14:38<1:30:27, 5.55s/it] {'loss': 0.0, 'grad_norm': 9.843061707215384e-05, 'learning_rate': 9.949467903674148e-07, 'epoch': 9.02} 90%|█████████ | 9022/10000 [14:14:38<1:30:27, 5.55s/it][2025-06-20 03:44:23,211] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.56 | optimizer_step: 
2.72 [2025-06-20 03:44:23,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.97 | bwd_microstep: 3380.67 | bwd_inner_microstep: 3379.65 | bwd_allreduce_microstep: 0.96 | step_microstep: 7.83 [2025-06-20 03:44:23,213] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.97 | bwd: 3380.68 | bwd_inner: 3379.65 | bwd_allreduce: 0.98 | step: 7.84 90%|█████████ | 9023/10000 [14:14:44<1:30:28, 5.56s/it] {'loss': 0.0, 'grad_norm': 0.00011228278890484944, 'learning_rate': 9.929301712192241e-07, 'epoch': 9.02} 90%|█████████ | 9023/10000 [14:14:44<1:30:28, 5.56s/it][2025-06-20 03:44:28,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.72 [2025-06-20 03:44:28,703] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.70 | bwd_microstep: 3331.22 | bwd_inner_microstep: 3330.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 7.00 [2025-06-20 03:44:28,704] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.70 | bwd: 3331.23 | bwd_inner: 3330.40 | bwd_allreduce: 0.78 | step: 7.00 90%|█████████ | 9024/10000 [14:14:49<1:30:03, 5.54s/it] {'loss': 0.0006, 'grad_norm': 0.16086164116859436, 'learning_rate': 9.909155458244424e-07, 'epoch': 9.02} 90%|█████████ | 9024/10000 [14:14:49<1:30:03, 5.54s/it][2025-06-20 03:44:34,186] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.74 [2025-06-20 03:44:34,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.80 | bwd_microstep: 3330.20 | bwd_inner_microstep: 3329.26 | bwd_allreduce_microstep: 0.90 | step_microstep: 7.22 [2025-06-20 03:44:34,187] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.80 | bwd: 3330.22 | bwd_inner: 3329.26 | bwd_allreduce: 0.91 | step: 7.22 90%|█████████ | 9025/10000 [14:14:54<1:29:42, 5.52s/it] {'loss': 0.0002, 'grad_norm': 0.035917919129133224, 
'learning_rate': 9.889029143943963e-07, 'epoch': 9.03} 90%|█████████ | 9025/10000 [14:14:54<1:29:42, 5.52s/it][2025-06-20 03:44:39,745] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:44:39,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2138.24 | bwd_microstep: 3380.34 | bwd_inner_microstep: 3379.22 | bwd_allreduce_microstep: 1.05 | step_microstep: 7.58 [2025-06-20 03:44:39,746] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2138.25 | bwd: 3380.36 | bwd_inner: 3379.22 | bwd_allreduce: 1.08 | step: 7.58 90%|█████████ | 9026/10000 [14:15:00<1:29:47, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.001162980799563229, 'learning_rate': 9.868922771402012e-07, 'epoch': 9.03} 90%|█████████ | 9026/10000 [14:15:00<1:29:47, 5.53s/it][2025-06-20 03:44:45,299] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:44:45,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2135.21 | bwd_microstep: 3377.69 | bwd_inner_microstep: 3376.90 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.81 [2025-06-20 03:44:45,300] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2135.21 | bwd: 3377.71 | bwd_inner: 3376.90 | bwd_allreduce: 0.76 | step: 6.81 90%|█████████ | 9027/10000 [14:15:06<1:29:48, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.00054510886548087, 'learning_rate': 9.848836342727686e-07, 'epoch': 9.03} 90%|█████████ | 9027/10000 [14:15:06<1:29:48, 5.54s/it][2025-06-20 03:44:50,776] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:44:50,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2118.73 | bwd_microstep: 3316.91 | bwd_inner_microstep: 3316.10 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.85 [2025-06-20 
03:44:50,777] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2118.73 | bwd: 3316.93 | bwd_inner: 3316.10 | bwd_allreduce: 0.78 | step: 6.86 90%|█████████ | 9028/10000 [14:15:11<1:29:25, 5.52s/it] {'loss': 0.0, 'grad_norm': 0.0001551799214212224, 'learning_rate': 9.828769860027854e-07, 'epoch': 9.03} 90%|█████████ | 9028/10000 [14:15:11<1:29:25, 5.52s/it][2025-06-20 03:44:56,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:44:56,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2114.91 | bwd_microstep: 3325.24 | bwd_inner_microstep: 3324.42 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.95 [2025-06-20 03:44:56,255] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2114.91 | bwd: 3325.26 | bwd_inner: 3324.42 | bwd_allreduce: 0.79 | step: 6.96 90%|█████████ | 9029/10000 [14:15:17<1:29:08, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.002774226013571024, 'learning_rate': 9.808723325407477e-07, 'epoch': 9.03} 90%|█████████ | 9029/10000 [14:15:17<1:29:08, 5.51s/it][2025-06-20 03:45:01,857] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.53 | optimizer_step: 2.73 [2025-06-20 03:45:01,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2152.18 | bwd_microstep: 3410.62 | bwd_inner_microstep: 3409.73 | bwd_allreduce_microstep: 0.84 | step_microstep: 6.85 [2025-06-20 03:45:01,858] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2152.18 | bwd: 3410.63 | bwd_inner: 3409.73 | bwd_allreduce: 0.86 | step: 6.86 90%|█████████ | 9030/10000 [14:15:22<1:29:29, 5.54s/it] {'loss': 0.0, 'grad_norm': 0.0034402268938720226, 'learning_rate': 9.788696740969295e-07, 'epoch': 9.03} 90%|█████████ | 9030/10000 [14:15:22<1:29:29, 5.54s/it][2025-06-20 03:45:07,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 
1.50 | optimizer_step: 2.73 [2025-06-20 03:45:07,417] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.21 | bwd_microstep: 3378.60 | bwd_inner_microstep: 3377.82 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.67 [2025-06-20 03:45:07,418] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.21 | bwd: 3378.62 | bwd_inner: 3377.82 | bwd_allreduce: 0.76 | step: 6.67 90%|█████████ | 9031/10000 [14:15:28<1:29:31, 5.54s/it] {'loss': 0.0001, 'grad_norm': 0.013527477160096169, 'learning_rate': 9.768690108814027e-07, 'epoch': 9.03} 90%|█████████ | 9031/10000 [14:15:28<1:29:31, 5.54s/it][2025-06-20 03:45:12,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:45:12,973] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2140.08 | bwd_microstep: 3377.07 | bwd_inner_microstep: 3376.27 | bwd_allreduce_microstep: 0.75 | step_microstep: 6.95 [2025-06-20 03:45:12,974] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2140.08 | bwd: 3377.08 | bwd_inner: 3376.27 | bwd_allreduce: 0.77 | step: 6.95 90%|█████████ | 9032/10000 [14:15:33<1:29:29, 5.55s/it] {'loss': 0.0, 'grad_norm': 0.0001039356502587907, 'learning_rate': 9.748703431040308e-07, 'epoch': 9.03} 90%|█████████ | 9032/10000 [14:15:33<1:29:29, 5.55s/it][2025-06-20 03:45:18,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:45:18,440] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2106.63 | bwd_microstep: 3322.85 | bwd_inner_microstep: 3322.07 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.72 [2025-06-20 03:45:18,441] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2106.63 | bwd: 3322.87 | bwd_inner: 3322.07 | bwd_allreduce: 0.76 | step: 6.72 90%|█████████ | 9033/10000 [14:15:39<1:29:00, 5.52s/it] {'loss': 0.0, 'grad_norm': 
0.00965158175677061, 'learning_rate': 9.728736709744612e-07, 'epoch': 9.03} 90%|█████████ | 9033/10000 [14:15:39<1:29:00, 5.52s/it][2025-06-20 03:45:23,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:45:23,918] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2102.63 | bwd_microstep: 3330.81 | bwd_inner_microstep: 3330.00 | bwd_allreduce_microstep: 0.77 | step_microstep: 6.69 [2025-06-20 03:45:23,919] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2102.63 | bwd: 3330.82 | bwd_inner: 3330.00 | bwd_allreduce: 0.79 | step: 6.69 90%|█████████ | 9034/10000 [14:15:44<1:28:41, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.003576320828869939, 'learning_rate': 9.708789947021335e-07, 'epoch': 9.03} 90%|█████████ | 9034/10000 [14:15:44<1:28:41, 5.51s/it][2025-06-20 03:45:29,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:45:29,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.63 | bwd_microstep: 3323.76 | bwd_inner_microstep: 3322.98 | bwd_allreduce_microstep: 0.74 | step_microstep: 6.65 [2025-06-20 03:45:29,393] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.63 | bwd: 3323.77 | bwd_inner: 3322.98 | bwd_allreduce: 0.75 | step: 6.65 90%|█████████ | 9035/10000 [14:15:50<1:28:26, 5.50s/it] {'loss': 0.0, 'grad_norm': 0.00010389853559900075, 'learning_rate': 9.688863144962824e-07, 'epoch': 9.04} 90%|█████████ | 9035/10000 [14:15:50<1:28:26, 5.50s/it][2025-06-20 03:45:34,873] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.73 [2025-06-20 03:45:34,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2115.57 | bwd_microstep: 3324.83 | bwd_inner_microstep: 3324.01 | bwd_allreduce_microstep: 0.77 | step_microstep: 
7.31 [2025-06-20 03:45:34,874] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2115.57 | bwd: 3324.85 | bwd_inner: 3324.01 | bwd_allreduce: 0.79 | step: 7.32 90%|█████████ | 9036/10000 [14:15:55<1:28:16, 5.49s/it] {'loss': 0.0, 'grad_norm': 0.005214829929172993, 'learning_rate': 9.668956305659315e-07, 'epoch': 9.04} 90%|█████████ | 9036/10000 [14:15:55<1:28:16, 5.49s/it][2025-06-20 03:45:40,357] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.59 | optimizer_step: 2.73 [2025-06-20 03:45:40,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2111.59 | bwd_microstep: 3323.11 | bwd_inner_microstep: 3322.05 | bwd_allreduce_microstep: 0.99 | step_microstep: 8.07 [2025-06-20 03:45:40,358] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2111.59 | bwd: 3323.14 | bwd_inner: 3322.05 | bwd_allreduce: 1.02 | step: 8.07 90%|█████████ | 9037/10000 [14:16:01<1:28:08, 5.49s/it] {'loss': 0.0, 'grad_norm': 8.976535900728777e-05, 'learning_rate': 9.649069431198922e-07, 'epoch': 9.04} 90%|█████████ | 9037/10000 [14:16:01<1:28:08, 5.49s/it][2025-06-20 03:45:45,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.52 | optimizer_step: 2.72 [2025-06-20 03:45:45,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2146.19 | bwd_microstep: 3399.21 | bwd_inner_microstep: 3398.40 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.86 [2025-06-20 03:45:45,947] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2146.19 | bwd: 3399.23 | bwd_inner: 3398.40 | bwd_allreduce: 0.78 | step: 6.86 90%|█████████ | 9038/10000 [14:16:06<1:28:30, 5.52s/it] {'loss': 0.0, 'grad_norm': 8.024470298551023e-05, 'learning_rate': 9.629202523667703e-07, 'epoch': 9.04} 90%|█████████ | 9038/10000 [14:16:06<1:28:30, 5.52s/it][2025-06-20 03:45:51,489] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | 
optimizer_gradients: 1.51 | optimizer_step: 2.73 [2025-06-20 03:45:51,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2137.77 | bwd_microstep: 3367.37 | bwd_inner_microstep: 3366.57 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.94 [2025-06-20 03:45:51,490] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2137.77 | bwd: 3367.38 | bwd_inner: 3366.57 | bwd_allreduce: 0.77 | step: 6.94 90%|█████████ | 9039/10000 [14:16:12<1:28:31, 5.53s/it] {'loss': 0.0, 'grad_norm': 0.005891505628824234, 'learning_rate': 9.609355585149615e-07, 'epoch': 9.04} 90%|█████████ | 9039/10000 [14:16:12<1:28:31, 5.53s/it][2025-06-20 03:45:56,954] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 0.00 | optimizer_gradients: 1.50 | optimizer_step: 2.73 [2025-06-20 03:45:56,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 2112.97 | bwd_microstep: 3311.98 | bwd_inner_microstep: 3311.18 | bwd_allreduce_microstep: 0.76 | step_microstep: 6.92 [2025-06-20 03:45:56,955] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 2112.97 | bwd: 3312.00 | bwd_inner: 3311.18 | bwd_allreduce: 0.77 | step: 6.92 90%|█████████ | 9040/10000 [14:16:17<1:28:07, 5.51s/it] {'loss': 0.0, 'grad_norm': 0.0002587404160294682, 'learning_rate': 9.589528617726485e-07, 'epoch': 9.04} 90%|█████████ | 9040/10000 [14:16:17<1:28:07, 5.51s/it]